
Aggregate functions

Aggregate functions summarize the values from each event to create a single, meaningful value. Common aggregate functions include Average, Count, Minimum, Maximum, Standard Deviation, Sum, and Variance.

Most aggregate functions are used with numeric fields. However, there are some functions that you can use with either alphabetic string fields or numeric fields. The function descriptions indicate which functions you can use with alphabetic strings.

For an overview, see Overview of SPL2 stats functions.

avg(<value>)

This function returns the average, or mean, of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The following example returns the average of the values in the size field for each distinct value in the host field.

... | stats avg(size) BY host


The following example returns the average thruput of each host for each 5 minute time span.

... | bin _time span=5m | stats avg(thruput) BY host


The following example displays a timechart of the average of the cpu_seconds field by processor, rounded to 2 decimal places.

... | timechart eval(round(avg(cpu_seconds),2)) BY processor

When you use an eval expression with the timechart command, you must also use a BY clause.

count(<value>) or c(<value>)

This function returns the number of occurrences in a field.

Usage

To use this function, you can specify count(<value>), or the abbreviation c(<value>).

This function processes field values as strings.

To indicate a specific field value to match, use the format <field>=<value>.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

Several of these examples use an eval expression with the count function. See Using eval expressions in stats functions.


The following example returns the count of events where the status field has the value "404".

...| stats count(eval(status="404")) AS count_status BY sourcetype


The following example separates the search results into 10 bins and counts the values in the _raw field for each bin.

... | bin size bins=10 | stats count(_raw) BY size


You can use search literals in functions that accept predicate expressions. Search literals are enclosed in backtick characters ( ` ). The following search uses a search literal to count the occurrences of the value 500 in your events. The results are organized by the host field:

... | stats count(`500`) by host

For more information, see Search literals in expressions.


The following example uses the timechart command to count the events where the action field contains the value purchase.

| from my_dataset where sourcetype="access_*" | timechart count(eval(action="purchase")) BY productName usenull=f useother=f


distinct_count(<value>) or dc(<value>)

This function returns the count of distinct values in a field.

Usage

To use this function, you can specify distinct_count(), or the abbreviation dc().

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic examples

The following example removes duplicate results with the same host value and returns the total count of the remaining results.

... | stats dc(host)


The following example generates the distinct count of the values in the device field. The results are returned in a field called numdevices.

...| stats dc(device) AS numdevices


The following example counts the distinct client IP addresses for each host and category, and then bins the count for each result into five minute spans.

... | stats dc(clientip) BY host, categoryId | bin _time span=5m

Extended example

The following search counts the number of different customers who purchased something from the Buttercup Games online store yesterday. The search organizes the count by the type of product that customers purchased, such as accessories, t-shirts, and types of games.

| from my_dataset where sourcetype="access_*" AND action="purchase" | stats dc(clientip) BY categoryId

  • This example first searches for purchase events, action=purchase.
  • These results are piped into the stats command and the dc() function counts the number of distinct users who made purchases.
  • The BY clause is used to organize the distinct count based on the different category of products, the categoryId.


The results look like this:

categoryId dc(clientip)
ACCESSORIES 37
ARCADE 58
NULL 8
SHOOTER 31
SIMULATION 34
SPORTS 13
STRATEGY 74
TEE 38

estdc(<value>)

This function returns the estimated count of the distinct values in a field.

Usage

This function processes field values as strings.

The string values 1.0 and 1 are considered distinct values and counted separately.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The following example removes duplicate results with the same host value and returns the estimated total count of the remaining results.

... | stats estdc(host)

The results look like this:

estdc(host)
6
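
Because this function processes field values as strings, numerically equivalent values in different string forms are counted separately. Here is a sketch that you can try with a dataset literal (the field name n is only for illustration), where the values 1 and 1.0 produce an estimated distinct count of 2:

FROM [{n: "1"}, {n: "1.0"}, {n: "1"}] | stats estdc(n)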

The following example generates the estimated distinct count of the values in the devices field. The results are returned in a field called numdevices.

...| stats estdc(devices) AS numdevices


The following example estimates the distinct count for the values in the source field for each sourcetype. The results are organized into 5 minute time spans.

... | stats estdc(source) BY sourcetype | bin _time span=5m

estdc_error(<value>)

This function returns the theoretical error of the estimated count of the distinct values in a field.

The error represents this ratio:

absolute_value(estimate_distinct_count - real_distinct_count) / real_distinct_count

Usage

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Example

The following example determines the error ratio for the estimated distinct count of the values in the host field.

... | stats estdc_error(host)
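
To see how close the estimate is for your data, you can compare the exact distinct count, the estimated distinct count, and the error ratio in a single search, as in this sketch:

... | stats dc(host), estdc(host), estdc_error(host)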

exactperc(<value>,<percentile>)

This function returns an exact percentile based on the values in a numeric field.

The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.

Usage

There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

See the perc function.

max(<value>)

This function returns the maximum value in a field.

Usage

This function processes field values as numbers if possible, otherwise processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If the values in the field are non-numeric, the maximum value is found using lexicographical ordering.

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

  • Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
  • Uppercase letters are sorted before lowercase letters.
  • Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.
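
For example, the following sketch uses a dataset literal with hypothetical host values. Because the values are non-numeric, lexicographical ordering applies: www9 sorts after www10, so max returns www9 and min returns www1:

FROM [{host: "www1"}, {host: "www10"}, {host: "www9"}] | stats max(host), min(host)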

Basic example

This example returns the maximum value of the size field.

... | stats max(size)

Extended example

Calculate aggregate statistics for the magnitudes of earthquakes in an area

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Search for earthquakes in and around California. Calculate the number of earthquakes that were recorded. Use statistical functions to calculate the minimum, maximum, range (the difference between the min and max), and average magnitudes of the recent earthquakes. List the values by magnitude type.

source=all_month.csv place=*California* | stats count, max(mag), min(mag), range(mag), avg(mag) BY magType

The results look like this:

magType count max(mag) min(mag) range(mag) avg(mag)
H 123 2.8 0.0 2.8 0.549593
MbLg 1 0 0 0 0.0000000
Md 1565 3.2 0.1 3.1 1.056486
Me 2 2.0 1.6 0.4 1.800000
Ml 1202 4.3 -0.4 4.7 1.226622
Mw 6 4.9 3.0 1.9 3.650000
ml 10 1.56 0.19 1.37 0.934000

mean(<value>)

This function returns the arithmetic mean of the values in a field.

The mean values should be exactly the same as the values calculated using the avg() function.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example

The following example returns the mean of the values in the kbps field.

...| stats mean(kbps)
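
Because the mean is the same calculation as the average, you can confirm that the two functions return identical values by running them together, as in this sketch:

...| stats avg(kbps), mean(kbps)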

Extended example

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.

source=usgs place=*California* | stats count, mean(mag), stdev(mag), var(mag) BY magType

The results look like this:

magType count mean(mag) stdev(mag) var(mag)
H 123 0.549593 0.356985 0.127438
MbLg 1 0.000000 0.000000 0.000000
Md 1565 1.056486 0.580042 0.336449
Me 2 1.800000 0.346410 0.120000
Ml 1202 1.226622 0.629664 0.396476
Mw 6 3.650000 0.716240 0.513000
ml 10 0.934000 0.560401 0.314049

median(<value>)

This function returns the middle-most value of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If you have an even number of search results, the median calculation is the average of the two middle-most values.

Basic example

Consider the following list of values, which shows the number of different customers who purchased something from the Buttercup Games online store yesterday. The values are organized by the type of product (accessories, t-shirts, and types of games) that customers purchased.

categoryId count
ACCESSORIES 37
ARCADE 58
NULL 8
SIMULATION 34
SPORTS 13
STRATEGY 74
TEE 38

When the list is sorted, the middle-most value is 37.

categoryId count
NULL 8
SPORTS 13
SIMULATION 34
ACCESSORIES 37
TEE 38
ARCADE 58
STRATEGY 74

The middle-most value is returned when there are an odd number of results. When there are an even number of results, the average of the two middle-most numbers is returned.
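
To try this yourself, you can build a dataset literal from the list above, as in the following sketch. Because there are seven values, an odd number, the search returns the middle-most value:

FROM [{categoryId: "NULL", count: 8}, {categoryId: "SPORTS", count: 13}, {categoryId: "SIMULATION", count: 34}, {categoryId: "ACCESSORIES", count: 37}, {categoryId: "TEE", count: 38}, {categoryId: "ARCADE", count: 58}, {categoryId: "STRATEGY", count: 74}] | stats median(count)

The results look like this:

median(count)
37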

min(<value>)

This function returns the minimum value in a field.

Usage

This function processes field values as numbers if possible, otherwise processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If the values in the field are non-numeric, the minimum value is found using lexicographical ordering.

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

  • Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
  • Uppercase letters are sorted before lowercase letters.
  • Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.

Basic examples

The following example returns the minimum size and maximum size of the HotBucketRoller component in the _internal index.

index=_internal component=HotBucketRoller | stats min(size), max(size)


The following example returns a list of processors and calculates the minimum cpu_seconds and the maximum cpu_seconds.

index=_internal | chart min(cpu_seconds), max(cpu_seconds) BY processor

Extended example

See the Extended example for the max() function. That example includes the min() function.

mode(<value>)

This function returns the most frequent value in a field.

Usage

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The mode returns the most frequent value. Consider the following data:

firstname surname age
Claudia Garcia 32
David Mayer 45
Alex Garcia 29
Wei Zhang 45
Javier Garcia 37

When you search for the mode in the age field, the value 45 is returned.

...| stats mode(age)


You can also use mode with fields that contain string values. When you search for the mode in the surname field, the value Garcia is returned.

...| stats mode(surname)
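
To try these searches yourself, you can build a dataset literal from the sample data, as in this sketch, which returns both modes in one search:

FROM [{firstname: "Claudia", surname: "Garcia", age: 32}, {firstname: "David", surname: "Mayer", age: 45}, {firstname: "Alex", surname: "Garcia", age: 29}, {firstname: "Wei", surname: "Zhang", age: 45}, {firstname: "Javier", surname: "Garcia", age: 37}] | stats mode(age), mode(surname)

The results look like this:

mode(age) mode(surname)
45 Garcia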


Here's another set of sample data:

_time host sourcetype
04-06-2020 17:06:23.000 PM www1 access_combined
04-06-2020 10:34:19.000 AM www1 access_combined
04-03-2020 13:52:18.000 PM www2 access_combined
04-02-2020 07:39:59.000 AM www3 access_combined
04-01-2020 19:35:58.000 PM www1 access_combined

If you run a search that looks for the mode in the host field, the value www1 is returned because it is the most common value in the host field. For example:

... |stats mode(host)

The results look something like this:

mode(host)
www1

perc(<value>,<percentile>)

This function returns an approximate percentile based on the values in a numeric field.

There are three percentile functions. These functions return the nth percentile of the values in a numeric field. You can think of this as an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the values in the field are below the estimate and 5% of the values in the field are above the estimate.

Valid percentiles are floating point numbers between 0 and 100, such as 99.95.

Differences between the percentile functions:

Function Description
perc(<value>,<percentile>) Use the perc function to calculate an approximate threshold, such that the specified percentage of the values in a field fall below that threshold. The perc function returns a single number that represents the lower end of the approximate values for the percentile requested.
upperperc(<value>,<percentile>) When there are more than 1000 values, the upperperc function gives the approximate upper bound for the percentile requested. Otherwise the upperperc function returns the same percentile as the perc function.
exactperc(<value>,<percentile>) The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.

The percentile functions process field values as strings.

The perc and upperperc functions give approximate values for the integer percentile requested. The approximation algorithm that is used, which is based on dynamic compression of a radix tree, provides a strict bound of the actual value for any percentile.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

In the search results, the field names use this format: perc<percentile>(<value>), regardless of which syntax you use to specify the function.

Valid syntax

To use this function, you can specify perc() using one of the following syntaxes:

Syntax Example
perc(<value>,<percentile>) ... | stats perc(host, 95)
perc<percentile>(<value>) ... | stats perc95(host)

Differences between Splunk and Excel percentile algorithms

If there are fewer than 1000 distinct values, the Splunk percentile functions use the nearest rank algorithm. See http://en.wikipedia.org/wiki/Percentile#Nearest_rank. Excel uses the NIST interpolated algorithm, which means that you can get a value for a percentile that does not exist in the actual data. That is not possible with the nearest rank approach.


Splunk algorithm with more than 1000 distinct values

If there are more than 1000 distinct values for the field, the percentiles are approximated using a custom radix-tree digest-based algorithm. This algorithm is much faster and uses much less memory (a constant amount) than an exact computation, which uses memory in linear relation to the number of distinct values. By default this approach limits the approximation error to less than 1% of rank error. That means if you ask for the 95th percentile, the number you get back is between the 94th and 96th percentiles.

To get exact percentiles, even when there are more than 1000 distinct values, use the exactperc function instead of the perc function.
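
To compare the three percentile functions on the same data, you can run them in a single search. The following sketch uses a hypothetical numeric field called response_time:

... | stats perc(response_time, 95), upperperc(response_time, 95), exactperc(response_time, 95)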

Basic examples

Consider this list of values {10,9,8,7,6,5,4,3,2,1} in the myfield field.

The following example returns the 90th percentile for the values:

...| stats perc(myfield, 90)

The results look like this:

perc90(myfield)
9.1

The following example returns several percentiles:

...| stats perc(myfield,95), perc(myfield,50), perc(myfield,25)

The results look like this:

perc95(myfield) perc50(myfield) perc25(myfield)
9.55 5.5 3.25

Extended example

Consider the following set of data, which shows the number of visitors for each hour a store is open:

hour visitors
0800 0
0900 212
1000 367
1100 489
1200 624
1300 609
1400 492
1500 513
1600 367
1700 337

Let's use a dataset literal to create a temporary dataset from this data and then run the streamstats command to create a cumulative total for the visitors.

| FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total'

The results from this search look like this:

hour visitors visitors total
0800 0 0
0900 212 212
1000 367 579
1100 489 1068
1200 624 1692
1300 609 2301
1400 492 2793
1500 513 3306
1600 367 3673
1700 337 4010

Let's add the stats command with the perc function to determine the 50th and 95th percentiles.

| FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total' | stats perc50('visitors total') perc95('visitors total')

The results from this search look like this:

perc50(visitors total) perc95(visitors total)
1996.5 3858.35

The perc50 estimates the 50th percentile, when 50% of the visitors had arrived. You can see from the data that the 50th percentile was reached between visitor number 1996 and 1997, which was sometime between 1200 and 1300 hours. The perc95 estimates the 95th percentile, when 95% of the visitors had arrived. The 95th percentile was reached with visitor 3858, which occurred between 1600 and 1700 hours.

range(<value>)

Returns the difference between the maximum and minimum values in a field.

Usage

The values in the field must be numeric.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example

This example uses events that list the numeric sales for each product and quarter, for example:

products quarter sales quota
ProductA QTR1 1200 1000
ProductB QTR1 1400 1550
ProductC QTR1 1650 1275
ProductA QTR2 1425 1300
ProductB QTR2 1175 1425
ProductC QTR2 1550 1450
ProductA QTR3 1300 1400
ProductB QTR3 1250 1125
ProductC QTR3 1375 1475
ProductA QTR4 1550 1300
ProductB QTR4 1700 1225
ProductC QTR4 1625 1350

It is easiest to understand the range if you also determine the min and max values. To determine the range of sales by quarter, run this search:

... | stats sum(sales), min(sales), max(sales), range(sales) BY quarter

The results look something like this:

quarter sum(sales) min(sales) max(sales) range(sales)
QTR1 4250 1200 1650 450
QTR2 4150 1175 1550 375
QTR3 3925 1250 1375 125
QTR4 4875 1550 1700 150

The range(sales) is max(sales) minus min(sales).

Extended example

See the Extended example for the max() function. That example includes the range() function.

stdev(<value>)

This function returns the sample standard deviation of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

A sample standard deviation is different from a population standard deviation. A sample standard deviation is calculated from a sample of a larger population, by dividing the sum of the squared deviations from the mean by 1 less than the number of data points.

For information about creating a population standard deviation, see the stdevp function.

Basic example

This example returns the standard deviation of fields whose names end in delay, which matches both delay and xdelay. Because a wildcard is used, you must enclose the field name in single quotation marks.

... | stats stdev('*delay')

Extended example

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.

source=usgs place=*California* | stats count(), mean(mag), stdev(mag), var(mag) BY magType

The results look like this:

magType count mean(mag) stdev(mag) var(mag)
H 123 0.549593 0.356985 0.127438
MbLg 1 0.000000 0.000000 0.000000
Md 1565 1.056486 0.580042 0.336449
Me 2 1.800000 0.346410 0.120000
Ml 1202 1.226622 0.629664 0.396476
Mw 6 3.650000 0.716240 0.513000
ml 10 0.934000 0.560401 0.314049

stdevp(<value>)

This function returns the population standard deviation of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

A population standard deviation is different from a sample standard deviation. A population standard deviation is calculated from the entire population, by dividing the sum of the squared deviations from the mean by the total number of data points.

For information about creating a sample standard deviation, see the stdev function.

Basic example

The following example calculates the population standard deviation of the field lag.

... | stats stdevp(lag)
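
To compare the population standard deviation with the sample standard deviation, you can run both functions over the same data. This sketch reuses the temperature dataset literal from the sumsq and var examples later in this topic. The population value, roughly 11.68, is smaller than the sample value, roughly 12.62, because the sum of the squared deviations is divided by 7 instead of 6:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats stdev(temp), stdevp(temp)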

sum(<value>)

This function returns the sum of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

You can create totals for any numeric field. For example:

...| stats sum(bytes)

The results look something like this:

sum(bytes)
21502


You can rename the column using the AS keyword:

...| stats sum(bytes) AS "total bytes"

The results look something like this:

total bytes
21502


You can organize the results using a BY clause:

...| stats sum(bytes) AS "total bytes" by date_hour

The results look something like this:

date_hour total bytes
07 6509
11 3726
15 6569
23 4698

sumsq(<value>)

This function returns the sum of the squares of the values in a field.

The sum of the squares is used to evaluate the variance of a dataset from the dataset mean. A large sum of the squares indicates a large variance, which tells you that individual values fluctuate widely from the mean.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Example

The following table contains the temperatures taken every day at 8 AM for a week.

You calculate the mean of these temperatures and get 48.9 degrees. To calculate the deviation from the mean for each day, take the temperature and subtract the mean. If you square each deviation, you get results like this:

day temperature mean deviation squared deviation
sunday 65 48.9 16.1 260.6
monday 42 48.9 -6.9 47.0
tuesday 40 48.9 -8.9 78.4
wednesday 31 48.9 -17.9 318.9
thursday 47 48.9 -1.9 3.4
friday 53 48.9 4.1 17.2
saturday 64 48.9 15.1 229.3

Take the total of the squared deviations, 954.9, and divide by 6, which is the number of days minus 1. This gives you the sample variance for this series of temperatures, approximately 159.1. The standard deviation is the square root of the variance, approximately 12.6. The larger the standard deviation, the larger the fluctuation in temperatures during the week.

You can calculate the mean, sum of the squares, and standard deviation with a few statistical functions. Note that the sumsq function returns the sum of the squares of the field values themselves, not the sum of the squared deviations from the mean. Let's use a dataset literal so that you can try this on your own:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats mean(temp), sumsq(temp), stdev(temp)

This search returns these results:

mean(temp) sumsq(temp) stdev(temp)
48.857142857142854 17664 12.615183595289349
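
As a check on these numbers, the sample variance can be derived from the count, mean, and sum of squares: variance = (sumsq - n * mean^2) / (n - 1). For this data, (17664 - 7 * 48.857^2) / 6 is approximately 159.14, and its square root is the standard deviation shown above. The following sketch performs the same calculation in the search itself (the field names n, m, and ss are only for illustration):

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats count() AS n, mean(temp) AS m, sumsq(temp) AS ss | eval sample_variance = (ss - n * m * m) / (n - 1)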

upperperc(<value>,<percentile>)

This function returns an approximate upper bound percentile value of a numeric field that contains more than 1000 values.

The upperperc function returns an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the data points are below the estimate and 5% of the data points are above the estimate.

Usage

There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

See the perc function.

var(<value>)

Returns the sample variance of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

Variance is often used to determine the standard deviation of a set of values.

The following table contains the temperatures taken every day at 8 AM for a week.

day temperature
sunday 65
monday 42
tuesday 40
wednesday 31
thursday 47
friday 53
saturday 64

Let's use a dataset literal so that you can try this on your own:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] |stats mean(temp)

You calculate the mean, which is the average, of these temperatures and get 48.9 degrees.

mean(temp)
48.857142857142854

To calculate the deviation from the mean for each day, take the temperature and subtract the mean.

day temp mean deviation
sunday 65 48.9 16.1
monday 42 48.9 -6.9
tuesday 40 48.9 -8.9
wednesday 31 48.9 -17.9
thursday 47 48.9 -1.9
friday 53 48.9 4.1
saturday 64 48.9 15.1

To calculate the sample variance, square each deviation, total the squares, and divide by 1 less than the number of values. Or just run this search:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] |stats mean(temp), var(temp), stdev(temp)


This search returns these results:

mean(temp) var(temp) stdev(temp)
48.857142857142854 159.14285714285748 12.615183595289349

The larger the standard deviation, the larger the fluctuation in temperatures during the week.

Extended example

See the Extended example for the mean() function. That example includes the var() function.

varp(<value>)

Returns the population variance of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example
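
The varp function is the population counterpart of the var function: the sum of the squared deviations from the mean is divided by the total number of values instead of by 1 less than that number. As a sketch, reusing the temperature dataset literal from the var examples, varp returns approximately 136.41 while var returns approximately 159.14:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats var(temp), varp(temp)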

See also

Function information
Overview of SPL2 stats and chart functions
Quick Reference for SPL2 Stats and Charting Functions
Naming function arguments in the SPL2 Search Manual