
Aggregate functions

Aggregate functions summarize the values from each event to create a single, meaningful value. Common aggregate functions include Average, Count, Minimum, Maximum, Standard Deviation, Sum, and Variance.

Most aggregate functions are used with numeric fields. However, there are some functions that you can use with either alphabetic string fields or numeric fields. The function descriptions indicate which functions you can use with alphabetic strings.

For an overview, see Overview of SPL2 stats functions.

avg(<value>)

This function returns the average, or mean, of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The following example returns the average of the values in the size field for each distinct value in the host field.

... | stats avg(size) BY host


The following example returns the average thruput of each host for each 5 minute time span.

... | bin _time span=5m | stats avg(thruput) BY host


The following example displays a timechart of the average of the cpu_seconds field by processor, rounded to 2 decimal places.

... | timechart eval(round(avg(cpu_seconds),2)) BY processor

When you use an eval expression with the timechart command, you must also use a BY clause.

count(<value>) or c(<value>)

This function returns the number of occurrences in a field.

Usage

To use this function, you can specify count(<value>), or the abbreviation c(<value>).

This function processes field values as strings.

To indicate a specific field value to match, use the format <field>=<value>.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

Several of these examples use an eval expression with the count function. See Using eval expressions in stats functions.


The following example returns the count of events where the status field has the value "404".

...| stats count(eval(status="404")) AS count_status BY sourcetype


The following example separates the search results into 10 bins and counts the values in the _raw field for each bin.

... | bin size bins=10 | stats count(_raw) BY size


You can use search literals in functions that accept predicate expressions. Search literals are enclosed in backtick characters ( ` ). The following search uses a search literal to count the occurrences of the value 500 in your events. The results are organized by the host field:

... | stats count(`500`) by host

For more information, see Search literals in expressions.


The following example uses the timechart command to count the events where the action field contains the value purchase.

| from my_dataset where sourcetype="access_*" | timechart count(eval(action="purchase")) BY productName usenull=f useother=f


distinct_count(<value>) or dc(<value>)

This function returns the count of distinct values in a field.

Usage

To use this function, you can specify distinct_count(), or the abbreviation dc().

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic examples

The following example removes duplicate results with the same host value and returns the total count of the remaining results.

... | stats dc(host)


The following example generates the distinct count of the values in the device field. The results are returned in a field called numdevices.

...| stats dc(device) AS numdevices


The following example counts the distinct client IP addresses for each host and category, and then bins the count for each result into five minute spans.

... | stats dc(clientip) BY host, categoryId | bin _time span=5m

Extended example

The following search counts the number of different customers who purchased something from the Buttercup Games online store yesterday. The search organizes the count by the type of product that customers purchased, such as accessories, t-shirts, and types of games.

| from my_dataset where sourcetype="access_*" AND action="purchase" | stats dc(clientip) BY categoryId

  • This example first searches for purchase events, action=purchase.
  • These results are piped into the stats command and the dc() function counts the number of distinct users who made purchases.
  • The BY clause is used to organize the distinct count based on the different category of products, the categoryId.


The results look like this:

categoryId dc(clientip)
ACCESSORIES 37
ARCADE 58
NULL 8
SHOOTER 31
SIMULATION 34
SPORTS 13
STRATEGY 74
TEE 38

estdc(<value>)

This function returns the estimated count of the distinct values in a field.

Usage

This function processes field values as strings.

The string values 1.0 and 1 are considered distinct values and counted separately.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The following example removes duplicate results with the same host value and returns the estimated total count of the remaining results.

... | stats estdc(host)

The results look like this:

estdc(host)
6
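
Because this function processes field values as strings, numerically equivalent values in different string forms are counted separately. Here is a sketch that you can try with a dataset literal (the field name n is only for illustration), where the values 1 and 1.0 produce an estimated distinct count of 2:

FROM [{n: "1"}, {n: "1.0"}, {n: "1"}] | stats estdc(n)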

The following example generates the estimated distinct count of the values in the devices field. The results are returned in a field called numdevices.

...| stats estdc(devices) AS numdevices


The following example estimates the distinct count for the values in the source field for each sourcetype. The results are organized into 5 minute time spans.

... | stats estdc(source) BY sourcetype | bin _time span=5m

estdc_error(<value>)

This function returns the theoretical error of the estimated count of the distinct values in a field.

The error represents this ratio:

absolute_value(estimate_distinct_count - real_distinct_count) / real_distinct_count

Usage

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Example

The following example determines the error ratio for the estimated distinct count of the values in the host field.

... | stats estdc_error(host)
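
To see how close the estimate is for your data, you can compare the exact distinct count, the estimated distinct count, and the error ratio in a single search, as in this sketch:

... | stats dc(host), estdc(host), estdc_error(host)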

exactperc(<value>,<percentile>)

This function returns an exact percentile based on the values in a numeric field.

The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.

Usage

There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

See the perc function.

max(<value>)

This function returns the maximum value in a field.

Usage

This function processes field values as numbers if possible, otherwise processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If the values in the field are non-numeric, the maximum value is found using lexicographical ordering.

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

  • Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
  • Uppercase letters are sorted before lowercase letters.
  • Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.
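
For example, the following sketch uses a dataset literal with hypothetical host values. Because the values are non-numeric, lexicographical ordering applies: www9 sorts after www10, so max returns www9 and min returns www1:

FROM [{host: "www1"}, {host: "www10"}, {host: "www9"}] | stats max(host), min(host)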

Basic example

This example returns the maximum value of the size field.

... | stats max(size)

Extended example

Calculate aggregate statistics for the magnitudes of earthquakes in an area

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Search for earthquakes in and around California. Calculate the number of earthquakes that were recorded. Use statistical functions to calculate the minimum, maximum, range (the difference between the min and max), and average magnitudes of the recent earthquakes. List the values by magnitude type.

source=all_month.csv place=*California* | stats count, max(mag), min(mag), range(mag), avg(mag) BY magType

The results look like this:

magType count max(mag) min(mag) range(mag) avg(mag)
H 123 2.8 0.0 2.8 0.549593
MbLg 1 0 0 0 0.0000000
Md 1565 3.2 0.1 3.1 1.056486
Me 2 2.0 1.6 0.4 1.800000
Ml 1202 4.3 -0.4 4.7 1.226622
Mw 6 4.9 3.0 1.9 3.650000
ml 10 1.56 0.19 1.37 0.934000

mean(<value>)

This function returns the arithmetic mean of the values in a field.

The mean values should be exactly the same as the values calculated using the avg() function.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example

The following example returns the mean of the values in the kbps field.

...| stats mean(kbps)
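
Because the mean is the same calculation as the average, you can confirm that the two functions return identical values by running them together, as in this sketch:

...| stats avg(kbps), mean(kbps)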

Extended example

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.

source=usgs place=*California* | stats count, mean(mag), stdev(mag), var(mag) BY magType

The results look like this:

magType count mean(mag) stdev(mag) var(mag)
H 123 0.549593 0.356985 0.127438
MbLg 1 0.000000 0.000000 0.000000
Md 1565 1.056486 0.580042 0.336449
Me 2 1.800000 0.346410 0.120000
Ml 1202 1.226622 0.629664 0.396476
Mw 6 3.650000 0.716240 0.513000
ml 10 0.934000 0.560401 0.314049

median(<value>)

This function returns the middle-most value of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If you have an even number of search results, the median calculation is the average of the two middle-most values.

Basic example

Consider the following list of values, which shows the number of different customers who purchased something from the Buttercup Games online store yesterday. The values are organized by the type of product (accessories, t-shirts, and types of games) that customers purchased.

categoryId count
ACCESSORIES 37
ARCADE 58
NULL 8
SIMULATION 34
SPORTS 13
STRATEGY 74
TEE 38

When the list is sorted, the middle-most value is 37.

categoryId count
NULL 8
SPORTS 13
SIMULATION 34
ACCESSORIES 37
TEE 38
ARCADE 58
STRATEGY 74

The middle-most value is returned when there are an odd number of results. When there are an even number of results, the average of the two middle-most numbers is returned.
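
To try this yourself, you can build a dataset literal from the list above, as in the following sketch. Because there are seven values, an odd number, the search returns the middle-most value:

FROM [{categoryId: "NULL", count: 8}, {categoryId: "SPORTS", count: 13}, {categoryId: "SIMULATION", count: 34}, {categoryId: "ACCESSORIES", count: 37}, {categoryId: "TEE", count: 38}, {categoryId: "ARCADE", count: 58}, {categoryId: "STRATEGY", count: 74}] | stats median(count)

The results look like this:

median(count)
37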

min(<value>)

This function returns the minimum value in a field.

Usage

This function processes field values as numbers if possible, otherwise processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If the values in the field are non-numeric, the minimum value is found using lexicographical ordering.

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

  • Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
  • Uppercase letters are sorted before lowercase letters.
  • Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.

Basic examples

The following example returns the minimum size and maximum size of the HotBucketRoller component in the _internal index.

index=_internal component=HotBucketRoller | stats min(size), max(size)


The following example returns a list of processors and calculates the minimum cpu_seconds and the maximum cpu_seconds.

index=_internal | chart min(cpu_seconds), max(cpu_seconds) BY processor

Extended example

See the Extended example for the max() function. That example includes the min() function.

mode(<value>)

This function returns the most frequent value in a field.

Usage

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The mode returns the most frequent value. Consider the following data:

firstname surname age
Claudia Garcia 32
David Mayer 45
Alex Garcia 29
Wei Zhang 45
Javier Garcia 37

When you search for the mode in the age field, the value 45 is returned.

...| stats mode(age)


You can also use mode with fields that contain string values. When you search for the mode in the surname field, the value Garcia is returned.

...| stats mode(surname)
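
To try these searches yourself, you can build a dataset literal from the sample data, as in this sketch, which returns both modes in one search:

FROM [{firstname: "Claudia", surname: "Garcia", age: 32}, {firstname: "David", surname: "Mayer", age: 45}, {firstname: "Alex", surname: "Garcia", age: 29}, {firstname: "Wei", surname: "Zhang", age: 45}, {firstname: "Javier", surname: "Garcia", age: 37}] | stats mode(age), mode(surname)

The results look like this:

mode(age) mode(surname)
45 Garcia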


Here's another set of sample data:

_time host sourcetype
04-06-2020 17:06:23.000 PM www1 access_combined
04-06-2020 10:34:19.000 AM www1 access_combined
04-03-2020 13:52:18.000 PM www2 access_combined
04-02-2020 07:39:59.000 AM www3 access_combined
04-01-2020 19:35:58.000 PM www1 access_combined

If you run a search that looks for the mode in the host field, the value www1 is returned because it is the most common value in the host field. For example:

... |stats mode(host)

The results look something like this:

mode(host)
www1

perc(<value>,<percentile>)

This function returns an approximate percentile based on the values in a numeric field.

There are three percentile functions. These functions return the nth percentile of the values in a numeric field. You can think of this as an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the values in the field are below the estimate and 5% of the values in the field are above the estimate.

Valid percentiles are floating point numbers between 0 and 100, such as 99.95.

Differences between the percentile functions:

Function Description
perc(<value>,<percentile>) Use the perc function to calculate an approximate threshold, such that the specified percentage of the values in a field fall below that threshold. The perc function returns a single number that represents the lower end of the approximate values for the percentile requested.
upperperc(<value>,<percentile>) When there are more than 1000 values, the upperperc function gives the approximate upper bound for the percentile requested. Otherwise the upperperc function returns the same percentile as the perc function.
exactperc(<value>,<percentile>) The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.

The percentile functions process field values as strings.

The perc and upperperc functions give approximate values for the integer percentile requested. The approximation algorithm that is used, which is based on dynamic compression of a radix tree, provides a strict bound of the actual value for any percentile.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

In the search results, the field names use this format: perc<percentile>(<value>), regardless of which syntax you use to specify the function.

Valid syntax

To use this function, you can specify perc() using one of the following syntaxes:

Syntax Example
perc(<value>,<percentile>) ... | stats perc(host, 95)
perc<percentile>(<value>) ... | stats perc95(host)

Differences between Splunk and Excel percentile algorithms

If there are fewer than 1000 distinct values, the Splunk percentile functions use the nearest rank algorithm. See http://en.wikipedia.org/wiki/Percentile#Nearest_rank. Excel uses the NIST interpolated algorithm, which means that you can get a value for a percentile that does not exist in the actual data. That is not possible with the nearest rank approach.


Splunk algorithm with more than 1000 distinct values

If there are more than 1000 distinct values for the field, the percentiles are approximated using a custom radix-tree digest-based algorithm. This algorithm is much faster and uses much less memory (a constant amount) than an exact computation, which uses memory in linear relation to the number of distinct values. By default this approach limits the approximation error to less than 1% of rank error. That means if you ask for the 95th percentile, the number you get back is between the 94th and 96th percentiles.

To get exact percentiles, even when there are more than 1000 distinct values, use the exactperc function instead of the perc function.
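
To compare the three percentile functions on the same data, you can run them in a single search. The following sketch uses a hypothetical numeric field called response_time:

... | stats perc(response_time, 95), upperperc(response_time, 95), exactperc(response_time, 95)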

Basic examples

Consider this list of values {10,9,8,7,6,5,4,3,2,1} in the myfield field.

The following example returns the 90th percentile for the values:

...| stats perc(myfield, 90)

The results look like this:

perc90(myfield)
9.1

The following example returns several percentiles:

...| stats perc(myfield,95), perc(myfield,50), perc(myfield,25)

The results look like this:

perc95(myfield) perc50(myfield) perc25(myfield)
9.55 5.5 3.25

Extended example

Consider the following set of data, which shows the number of visitors for each hour a store is open:

hour visitors
0800 0
0900 212
1000 367
1100 489
1200 624
1300 609
1400 492
1500 513
1600 367
1700 337

Let's use a dataset literal to create a temporary dataset from this data and then run the streamstats command to create a cumulative total for the visitors.

| FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total'

The results from this search look like this:

hour visitors visitors total
0800 0 0
0900 212 212
1000 367 579
1100 489 1068
1200 624 1692
1300 609 2301
1400 492 2793
1500 513 3306
1600 367 3673
1700 337 4010

Let's add the stats command with the perc function to determine the 50th and 95th percentiles.

| FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total' | stats perc50('visitors total') perc95('visitors total')

The results from this search look like this:

perc50(visitors total) perc95(visitors total)
1996.5 3858.35

The perc50 estimates the 50th percentile, when 50% of the visitors had arrived. You can see from the data that the 50th percentile was reached between visitor number 1996 and 1997, which was sometime between 1200 and 1300 hours. The perc95 estimates the 95th percentile, when 95% of the visitors had arrived. The 95th percentile was reached with visitor 3858, which occurred between 1600 and 1700 hours.

range(<value>)

Returns the difference between the maximum and minimum values in a field.

Usage

The values in the field must be numeric.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example

This example uses events that list the numeric sales for each product and quarter, for example:

products quarter sales quota
ProductA QTR1 1200 1000
ProductB QTR1 1400 1550
ProductC QTR1 1650 1275
ProductA QTR2 1425 1300
ProductB QTR2 1175 1425
ProductC QTR2 1550 1450
ProductA QTR3 1300 1400
ProductB QTR3 1250 1125
ProductC QTR3 1375 1475
ProductA QTR4 1550 1300
ProductB QTR4 1700 1225
ProductC QTR4 1625 1350

It is easiest to understand the range if you also determine the min and max values. To determine the range of sales by quarter, run this search:

... | stats sum(sales), min(sales), max(sales), range(sales) BY quarter

The results look something like this:

quarter sum(sales) min(sales) max(sales) range(sales)
QTR1 4250 1200 1650 450
QTR2 4150 1175 1550 375
QTR3 3925 1250 1375 125
QTR4 4875 1550 1700 150

The range(sales) is max(sales) minus min(sales).

Extended example

See the Extended example for the max() function. That example includes the range() function.

stdev(<value>)

This function returns the sample standard deviation of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

A sample standard deviation is different from a population standard deviation. A sample standard deviation is calculated from a sample of a larger population, by dividing the sum of the squared deviations from the mean by 1 less than the number of data points.

For information about creating a population standard deviation, see the stdevp function.

Basic example

This example returns the standard deviation of fields whose names end in delay, which matches both delay and xdelay. Because a wildcard is used, you must enclose the field name in single quotation marks.

... | stats stdev('*delay')

Extended example

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.

source=usgs place=*California* | stats count(), mean(mag), stdev(mag), var(mag) BY magType

The results look like this:

magType count mean(mag) stdev(mag) var(mag)
H 123 0.549593 0.356985 0.127438
MbLg 1 0.000000 0.000000 0.000000
Md 1565 1.056486 0.580042 0.336449
Me 2 1.800000 0.346410 0.120000
Ml 1202 1.226622 0.629664 0.396476
Mw 6 3.650000 0.716240 0.513000
ml 10 0.934000 0.560401 0.314049

stdevp(<value>)

This function returns the population standard deviation of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

A population standard deviation is different from a sample standard deviation. A population standard deviation is calculated from the entire population, by dividing the sum of the squared deviations from the mean by the total number of data points.

For information about creating a sample standard deviation, see the stdev function.

Basic example

The following example calculates the population standard deviation of the field lag.

... | stats stdevp(lag)
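
To compare the population standard deviation with the sample standard deviation, you can run both functions over the same data. This sketch reuses the temperature dataset literal from the sumsq and var examples later in this topic. The population value, roughly 11.68, is smaller than the sample value, roughly 12.62, because the sum of the squared deviations is divided by 7 instead of 6:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats stdev(temp), stdevp(temp)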

sum(<value>)

This function returns the sum of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

You can create totals for any numeric field. For example:

...| stats sum(bytes)

The results look something like this:

sum(bytes)
21502


You can rename the column using the AS keyword:

...| stats sum(bytes) AS "total bytes"

The results look something like this:

total bytes
21502


You can organize the results using a BY clause:

...| stats sum(bytes) AS "total bytes" by date_hour

The results look something like this:

date_hour total bytes
07 6509
11 3726
15 6569
23 4698

sumsq(<value>)

This function returns the sum of the squares of the values in a field.

The sum of the squares is used to evaluate the variance of a dataset from the dataset mean. A large sum of the squares indicates a large variance, which tells you that individual values fluctuate widely from the mean.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Example

The following table contains the temperatures taken every day at 8 AM for a week.

You calculate the mean of these temperatures and get 48.9 degrees. To calculate the deviation from the mean for each day, take the temperature and subtract the mean. If you square each deviation, you get results like this:

day temperature mean deviation squared deviation
sunday 65 48.9 16.1 260.6
monday 42 48.9 -6.9 47.0
tuesday 40 48.9 -8.9 78.4
wednesday 31 48.9 -17.9 318.9
thursday 47 48.9 -1.9 3.4
friday 53 48.9 4.1 17.2
saturday 64 48.9 15.1 229.3

Take the total of the squared deviations, 954.9, and divide by 6, which is the number of days minus 1. This gives you the sample variance for this series of temperatures, approximately 159.1. The standard deviation is the square root of the variance, approximately 12.6. The larger the standard deviation, the larger the fluctuation in temperatures during the week.

You can calculate the mean, sum of the squares, and standard deviation with a few statistical functions. Note that the sumsq function returns the sum of the squares of the field values themselves, not the sum of the squared deviations from the mean. Let's use a dataset literal so that you can try this on your own:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats mean(temp), sumsq(temp), stdev(temp)

This search returns these results:

mean(temp) sumsq(temp) stdev(temp)
48.857142857142854 17664 12.615183595289349
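
As a check on these numbers, the sample variance can be derived from the count, mean, and sum of squares: variance = (sumsq - n * mean^2) / (n - 1). For this data, (17664 - 7 * 48.857^2) / 6 is approximately 159.14, and its square root is the standard deviation shown above. The following sketch performs the same calculation in the search itself (the field names n, m, and ss are only for illustration):

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats count() AS n, mean(temp) AS m, sumsq(temp) AS ss | eval sample_variance = (ss - n * m * m) / (n - 1)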

upperperc(<value>,<percentile>)

This function returns an approximate upper bound percentile value of a numeric field that contains more than 1000 values.

The upperperc function returns an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the data points are below the estimate and 5% of the data points are above the estimate.

Usage

There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

See the perc function.

var(<value>)

Returns the sample variance of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

Variance is often used to determine the standard deviation of a set of values.

The following table contains the temperatures taken every day at 8 AM for a week.

day temperature
sunday 65
monday 42
tuesday 40
wednesday 31
thursday 47
friday 53
saturday 64

Let's use a dataset literal so that you can try this on your own:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] |stats mean(temp)

You calculate the mean, which is the average, of these temperatures and get 48.9 degrees.

mean(temp)
48.857142857142854

To calculate the deviation from the mean for each day, take the temperature and subtract the mean.

day temp mean deviation
sunday 65 48.9 16.1
monday 42 48.9 -6.9
tuesday 40 48.9 -8.9
wednesday 31 48.9 -17.9
thursday 47 48.9 -1.9
friday 53 48.9 4.1
saturday 64 48.9 15.1

To calculate the sample variance, square each deviation, total the squares, and divide by 1 less than the number of values. Or just run this search:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] |stats mean(temp), var(temp), stdev(temp)


This search returns these results:

mean(temp) var(temp) stdev(temp)
48.857142857142854 159.14285714285748 12.615183595289349

The larger the standard deviation, the larger the fluctuation in temperatures during the week.

Extended example

See the Extended example for the mean() function. That example includes the var() function.

varp(<value>)

Returns the population variance of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example
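
The varp function is the population counterpart of the var function: the sum of the squared deviations from the mean is divided by the total number of values instead of by 1 less than that number. As a sketch, reusing the temperature dataset literal from the var examples, varp returns approximately 136.41 while var returns approximately 159.14:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats var(temp), varp(temp)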

See also

Function information
Overview of SPL2 stats and chart functions
Quick Reference for SPL2 Stats and Charting Functions
Naming function arguments in the SPL2 Search Manual