Aggregate functions
The SPL2 aggregate functions summarize the values from each event to create a single, meaningful value. Common aggregate functions include Average, Count, Minimum, Maximum, Standard Deviation, Sum, and Variance.
Most aggregate functions are used with numeric fields. However, there are some functions that you can use with either alphabetic string fields or numeric fields. The function descriptions indicate which functions you can use with alphabetic strings.
For an overview, see Overview of SPL2 stats functions.
avg(<value>)
This function returns the average, or mean, of the values in a field.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
The following example returns the average of the values in the size field for each distinct value in the host field.
... | stats avg(size) BY host
The following example returns the average thruput of each host for each 5 minute time span.
... | bin _time span=5m | stats avg(thruput) BY host
The following example displays a timechart of the average of the cpu_seconds field by processor, rounded to 2 decimal points.
... | timechart eval(round(avg(cpu_seconds),2)) BY processor
When you use an eval expression with the timechart command, you must also use a BY clause.
count(<value>) or c(<value>)
This function returns the number of occurrences of a field.
Usage
To use this function, you can specify count(<value>), or the abbreviation c(<value>).
This function processes field values as strings.
To indicate a specific field value to match, use the format <field>=<value>.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
Several of these examples use an eval expression with the count function. See Using eval expressions in stats functions.
The following example returns the count of events where the status field has the value "404".
... | stats count(eval(status="404")) AS count_status BY sourcetype
The following example separates the search results into 10 bins and counts the values in the _raw field for each bin.
... | bin size bins=10 | stats count(_raw) BY size
You can use search literals in functions that accept predicate expressions. Search literals are enclosed in backtick characters ( ` ). The following search uses a search literal to count the occurrences of the value 500 in your events. The results are organized by the host field:
... | stats count(`500`) by host
For more information, see Search literals in expressions.
The following example uses the timechart command to count the events where the action field contains the value purchase.
| from my_dataset where sourcetype="access_*"
| timechart count(eval(action="purchase")) BY productName usenull=f useother=f
distinct_count(<value>) or dc(<value>)
This function returns the count of distinct values in a field.
Usage
To use this function, you can specify distinct_count(), or the abbreviation dc().
This function processes field values as strings.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Basic examples
The following example removes duplicate results with the same host value and returns the total count of the remaining results.
... | stats dc(host)
The following example generates the distinct count of the values in the device field. The results are returned in a field called numdevices.
... | stats dc(device) AS numdevices
The following example counts the distinct client IP addresses for each host and category, and then bins the count for each result into five minute spans.
... | stats dc(clientip) BY host, categoryId | bin _time span=5m
Extended example
The following search counts the number of different customers who purchased something from the Buttercup Games online store yesterday. The search organizes the count by the type of product, such as accessories, t-shirts, and types of games, that customers purchased.
| from my_dataset where sourcetype="access_*" action=purchase
| stats dc(clientip) BY categoryId
This example first searches for purchase events, action=purchase. These results are piped into the stats command, where the dc() function counts the number of distinct users who made purchases. The BY clause is used to organize the distinct count based on the different categories of products, the categoryId.
The results look like this:
categoryId  dc(clientip) 

ACCESSORIES  37 
ARCADE  58 
NULL  8 
SHOOTER  31 
SIMULATION  34 
SPORTS  13 
STRATEGY  74 
TEE  38 
estdc(<value>)
This function returns the estimated count of the distinct values in a field.
Usage
This function processes field values as strings.
The string values 1.0 and 1 are considered distinct values and counted separately.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
The following example removes duplicate results with the same host value in a field, and returns the estimated total count of the remaining results.
... | stats estdc(host)
The results look like this:
estdc(host) 

6 
The following example generates the estimated distinct count of the values in the devices field. The results are returned in a field called numdevices.
... | stats estdc(devices) AS numdevices
The following example estimates the distinct count of the values in the source field for each sourcetype. The results are organized into 5 minute time spans.
... | stats estdc(source) BY sourcetype | bin _time span=5m
estdc_error(<value>)
This function returns the theoretical error of the estimated count of the distinct values in a field.
The error represents this ratio:
absolute_value(estimated_distinct_count - real_distinct_count) / real_distinct_count
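As an illustration of the ratio (plain Python, not SPL2; the function name and sample counts are hypothetical):

```python
def estdc_error(estimated_count, real_count):
    """Relative error of an estimated distinct count:
    |estimated - real| / real."""
    return abs(estimated_count - real_count) / real_count

# For example, if estdc() reports 6 distinct hosts but there are really 5:
print(estdc_error(6, 5))  # 0.2, that is, a 20% error
```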
Usage
This function processes field values as strings.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Example
The following example determines the error ratio for the estimated distinct count of the values in the host field.
... | stats estdc_error(host)
exactperc(<value>,<percentile>)
This function returns an exact percentile based on the values in a numeric field.
The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.
Usage
There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
See the perc function.
max(<value>)
This function returns the maximum value in a field.
Usage
This function processes field values as numbers if possible, otherwise processes field values as strings.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
If the values in the field are non-numeric, the maximum value is found using lexicographical ordering.
Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.
* Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
* Uppercase letters are sorted before lowercase letters.
* Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.
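Outside of Splunk, you can reproduce this ordering with any string sort. As an illustration (Python, not SPL2):

```python
values = ["10", "9", "70", "100"]

# Lexicographic (string) order vs. numeric order:
print(sorted(values))            # ['10', '100', '70', '9']
print(sorted(values, key=int))   # ['9', '10', '70', '100']

# Uppercase letters sort before lowercase in ASCII/UTF-8:
print(sorted(["apple", "Banana"]))  # ['Banana', 'apple']
```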
Basic example
This example returns the maximum value of the size field.
... | stats max(size)
Extended example
Calculate aggregate statistics for the magnitudes of earthquakes in an area
This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded. 
Search for earthquakes in and around California. Calculate the number of earthquakes that were recorded. Use statistical functions to calculate the minimum, maximum, range (the difference between the min and max), and average magnitudes of the recent earthquakes. List the values by magnitude type.
source=all_month.csv place=*California* | stats count, max(mag), min(mag), range(mag), avg(mag) BY magType
The results look like this:
magType  count  max(mag)  min(mag)  range(mag)  avg(mag) 

H  123  2.8  0.0  2.8  0.549593 
MbLg  1  0  0  0  0.0000000 
Md  1565  3.2  0.1  3.1  1.056486 
Me  2  2.0  1.6  0.4  1.800000 
Ml  1202  4.3  -0.4  4.7  1.226622 
Mw  6  4.9  3.0  1.9  3.650000 
ml  10  1.56  0.19  1.37  0.934000 
mean(<value>)
This function returns the arithmetic mean of the values in a field.
The mean values should be exactly the same as the values calculated using the avg() function.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Basic example
The following example returns the mean of the values in the kbps field.
... | stats mean(kbps)
Extended example
This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded. 
Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.
source=usgs place=*California* | stats count, mean(mag), stdev(mag), var(mag) BY magType
The results look like this:
magType  count  mean(mag)  stdev(mag)  var(mag) 

H  123  0.549593  0.356985  0.127438 
MbLg  1  0.000000  0.000000  0.000000 
Md  1565  1.056486  0.580042  0.336449 
Me  2  1.800000  0.346410  0.120000 
Ml  1202  1.226622  0.629664  0.396476 
Mw  6  3.650000  0.716240  0.513000 
ml  10  0.934000  0.560401  0.314049 
median(<value>)
This function returns the middlemost value of the values in a field.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
If you have an even number of search results, the median calculation is the average of the two middlemost values.
Basic example
Consider the following list of values, which counts the number of different customers who purchased something from the Buttercup Games online store yesterday. The values are organized by the type of product (accessories, t-shirts, and types of games) that customers purchased.
categoryId  count 

ACCESSORIES  37 
ARCADE  58 
NULL  8 
SIMULATION  34 
SPORTS  13 
STRATEGY  74 
TEE  38 
When the list is sorted, the middlemost value is 37.
categoryId  count 

NULL  8 
SPORTS  13 
SIMULATION  34 
ACCESSORIES  37 
TEE  38 
ARCADE  58 
STRATEGY  74 
The middlemost value is returned when there are an odd number of results. When there are an even number of results, the average of the two middlemost numbers is returned.
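The same rule is what standard median implementations follow. As an illustration (Python, not SPL2), using the counts from the table above:

```python
from statistics import median

counts = [8, 13, 34, 37, 38, 58, 74]   # 7 values (odd): the middle value
print(median(counts))                   # 37

counts_even = [8, 13, 34, 37, 38, 58]   # 6 values (even): average of 34 and 37
print(median(counts_even))              # 35.5
```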
min(<value>)
This function returns the minimum value in a field.
Usage
This function processes field values as numbers if possible, otherwise processes field values as strings.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
If the values in the field are non-numeric, the minimum value is found using lexicographical ordering.
Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.
* Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
* Uppercase letters are sorted before lowercase letters.
* Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.
Basic examples
The following example returns the minimum size and maximum size of the HotBucketRoller component in the _internal index.
index=_internal component=HotBucketRoller | stats min(size), max(size)
The following example returns a list of processors and calculates the minimum cpu_seconds and the maximum cpu_seconds.
index=_internal | chart min(cpu_seconds), max(cpu_seconds) BY processor
Extended example
See the Extended example for the max() function. That example includes the min() function.
mode(<value>)
This function returns the most frequent value in a field.
Usage
This function processes field values as strings.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
The mode function returns the most frequent value. Consider the following data:
firstname  surname  age 

Claudia  Garcia  32 
David  Mayer  45 
Alex  Garcia  29 
Wei  Zhang  45 
Javier  Garcia  37 
When you search for the mode in the age field, the value 45 is returned.
... | stats mode(age)
You can also use mode with fields that contain string values. When you search for the mode in the surname field, the value Garcia is returned.
... | stats mode(surname)
Here's another set of sample data:
_time  host  sourcetype 

04-06-2020 17:06:23.000 PM  www1  access_combined 
04-06-2020 10:34:19.000 AM  www1  access_combined 
04-03-2020 13:52:18.000 PM  www2  access_combined 
04-02-2020 07:39:59.000 AM  www3  access_combined 
04-01-2020 19:35:58.000 PM  www1  access_combined 
If you run a search that looks for the mode in the host field, the value www1 is returned because it is the most common value in the host field. For example:
... | stats mode(host)
The results look something like this:
mode(host) 

www1 
perc(<value>,<percentile>)
This function returns an approximate percentile based on the values in a numeric field.
There are three percentile functions. These functions return the nth percentile of the values in a numeric field. You can think of this as an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the values in the field are below the estimate and 5% of the values in the field are above the estimate.
Valid percentiles are floating point numbers between 0 and 100, such as 99.95.
Differences between the percentile functions:
Function  Description 

perc(<value>,<percentile>)

Use the perc function to calculate an approximate threshold, such that the values in a field fall below the percentile threshold you specify. The perc function returns a single number that represents the lower end of the approximate values for the percentile requested.

upperperc(<value>,<percentile>)  When there are more than 1000 values, the upperperc function gives the approximate upper bound for the percentile requested. Otherwise the upperperc function returns the same percentile as the perc function.

exactperc(<value>,<percentile>)  The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.

The percentile functions process field values as strings.
The perc and upperperc functions give approximate values for the integer percentile requested. The approximation algorithm that is used, which is based on dynamic compression of a radix tree, provides a strict bound of the actual value for any percentile.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
In the search results, the field names use this format: perc<percentile>(<value>), regardless of which syntax you use to specify the function.
Valid syntax
To use this function, you can specify perc() using one of the following syntaxes:
* perc(<value>,<percentile>)
* perc<percentile>(<value>)

Differences between Splunk and Excel percentile algorithms
If there are fewer than 1000 distinct values, the Splunk percentile functions use the nearest rank algorithm. See http://en.wikipedia.org/wiki/Percentile#Nearest_rank. Excel uses the NIST interpolated algorithm, which means that Excel can return a value for a percentile that does not exist in the actual data, which is not possible with the nearest rank approach.
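To make the difference concrete, here is a sketch of both approaches (illustrative Python, not Splunk's implementation), using a simple ten-value list:

```python
import math

def nearest_rank(data, p):
    """Nearest-rank percentile: always returns a value that exists in the data."""
    ordered = sorted(data)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

def interpolated(data, p):
    """Linear interpolation between ranks: may return a value not in the data."""
    ordered = sorted(data)
    pos = (len(ordered) - 1) * p / 100
    lower = int(pos)
    frac = pos - lower
    if lower + 1 < len(ordered):
        return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])
    return ordered[lower]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nearest_rank(data, 90))            # 9 (a value in the data)
print(round(interpolated(data, 90), 2))  # 9.1 (between 9 and 10)
```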
Splunk algorithm with more than 1000 distinct values
If there are more than 1000 distinct values for the field, the percentiles are approximated using a custom radix-tree digest-based algorithm. This algorithm is much faster and uses much less memory (a constant amount) than an exact computation, which uses memory in linear relation to the number of distinct values. By default this approach limits the approximation error to less than 1% of rank error. That means if you ask for the 95th percentile, the number you get back is between the 94th and 96th percentiles.
You can always get the exact percentiles, even for more than 1000 distinct values, by using the exactperc function instead of the perc function.
Basic examples
Consider this list of values {10,9,8,7,6,5,4,3,2,1} in the myfield field.
The following example returns the 90th percentile of the values:
... | stats perc(myfield, 90)
The results look like this:
perc90(myfield) 

9.1 
The following example returns several percentiles:
... | stats perc(myfield,95), perc(myfield,50), perc(myfield,25)
The results look like this:
perc95(myfield)  perc50(myfield)  perc25(myfield) 

9.55  5.5  3.25 
Extended example
Consider the following set of data, which shows the number of visitors for each hour a store is open:
hour  visitors 

0800  0 
0900  212 
1000  367 
1100  489 
1200  624 
1300  609 
1400  492 
1500  513 
1600  367 
1700  337 
Let's use a dataset literal to create a temporary dataset from this data and then run the streamstats command to create a cumulative total for the visitors.
FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total'
The results from this search look like this:
hour  visitors  visitors total 

0800  0  0 
0900  212  212 
1000  367  579 
1100  489  1068 
1200  624  1692 
1300  609  2301 
1400  492  2793 
1500  513  3306 
1600  367  3673 
1700  337  4010 
Let's add the stats command with the perc function to determine the 50th and 95th percentiles.
FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total' | stats perc50('visitors total') perc95('visitors total')
The results from this search look like this:
perc50(visitors total)  perc95(visitors total) 

1996.5  3858.35 
The perc50 function estimates the 50th percentile, when 50% of the visitors had arrived. You can see from the data that the 50th percentile was reached between visitor number 1996 and 1997, which was sometime between 1200 and 1300 hours. The perc95 function estimates the 95th percentile, when 95% of the visitors had arrived. The 95th percentile was reached with visitor 3858, which occurred between 1600 and 1700 hours.
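As a cross-check outside of Splunk, a linear-interpolation percentile over the cumulative totals reproduces the same numbers. This Python sketch is an illustration, not Splunk's implementation:

```python
from statistics import median, quantiles

# Cumulative visitor totals from the streamstats results above
totals = [0, 212, 579, 1068, 1692, 2301, 2793, 3306, 3673, 4010]

# 50th percentile: the average of the two middle values
print(median(totals))  # 1996.5

# 95th percentile, interpolating linearly between data points
p95 = quantiles(totals, n=100, method='inclusive')[94]
print(p95)  # 3858.35
```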
range(<value>)
Returns the difference between the maximum and minimum values in a field.
Usage
The values in the field must be numeric.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Basic example
This example uses events that list the numeric sales for each product and quarter, for example:
products  quarter  sales  quota 

ProductA  QTR1  1200  1000 
ProductB  QTR1  1400  1550 
ProductC  QTR1  1650  1275 
ProductA  QTR2  1425  1300 
ProductB  QTR2  1175  1425 
ProductC  QTR2  1550  1450 
ProductA  QTR3  1300  1400 
ProductB  QTR3  1250  1125 
ProductC  QTR3  1375  1475 
ProductA  QTR4  1550  1300 
ProductB  QTR4  1700  1225 
ProductC  QTR4  1625  1350 
It is easiest to understand the range if you also determine the min and max values. To determine the range of sales by quarter, run this search:
... | stats sum(sales), min(sales), max(sales), range(sales) BY quarter
The results look something like this:
quarter  sum(sales)  min(sales)  max(sales)  range(sales) 

QTR1  4250  1200  1650  450 
QTR2  4150  1175  1550  375 
QTR3  3925  1250  1375  125 
QTR4  4875  1550  1700  150 
The range(sales) is max(sales) minus min(sales).
Extended example
See the Extended example for the max() function. That example includes the range() function.
stdev(<value>)
This function returns the sample standard deviation of the values in a field.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
A sample standard deviation is different from a population standard deviation. A sample standard deviation is calculated from a sample of a larger population. The sum of the squared deviations from the mean is divided by the number of data points minus 1, and the standard deviation is the square root of that result.
For information about creating a population standard deviation, see the stdevp function.
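The distinction matches the standard statistical definitions. As an illustration (Python, not SPL2; the sample values are made up), where stdev divides by n-1 and pstdev divides by n:

```python
from statistics import stdev, pstdev

lag = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical lag values

print(stdev(lag))   # sample standard deviation (divides by n-1): ~2.138
print(pstdev(lag))  # population standard deviation (divides by n): 2.0
```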
Basic example
This example returns the standard deviation of fields that end in the word delay, which can apply to both delay and xdelay. Because a wildcard is used, you must use single quotation marks to enclose the field name.
... | stats stdev('*delay')
Extended example
This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded. 
Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.
source=usgs place=*California*
| stats count(), mean(mag), stdev(mag), var(mag) BY magType
The results look like this:
magType  count  mean(mag)  stdev(mag)  var(mag) 

H  123  0.549593  0.356985  0.127438 
MbLg  1  0.000000  0.000000  0.000000 
Md  1565  1.056486  0.580042  0.336449 
Me  2  1.800000  0.346410  0.120000 
Ml  1202  1.226622  0.629664  0.396476 
Mw  6  3.650000  0.716240  0.513000 
ml  10  0.934000  0.560401  0.314049 
stdevp(<value>)
This function returns the population standard deviation of the values in a field.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
A population standard deviation is different from a sample standard deviation. A population standard deviation is calculated from the entire population.
For information about creating a sample standard deviation, see the stdev function.
Basic example
The following example calculates the population standard deviation of the field lag.
... | stats stdevp(lag)
sum(<value>)
This function returns the sum of the values in a field.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
You can create totals for any numeric field. For example:
... | stats sum(bytes)
The results look something like this:
sum(bytes) 

21502 
You can rename the column using the AS keyword:
... | stats sum(bytes) AS "total bytes"
The results look something like this:
total bytes 

21502 
You can organize the results using a BY clause:
... | stats sum(bytes) AS "total bytes" by date_hour
The results look something like this:
date_hour  total bytes 

07  6509 
11  3726 
15  6569 
23  4698 
sumsq(<value>)
This function returns the sum of the squares of the values in a field.
The sum of the squares is used to evaluate the variance of a dataset from the dataset mean. A large sum of the squares indicates a large variance, which tells you that individual values fluctuate widely from the mean.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Example
The following table contains the temperatures taken every day at 8 AM for a week.
You calculate the mean of these temperatures and get 48.9 degrees. To calculate the deviation from the mean for each day, take the temperature and subtract the mean. If you square each deviation, you get results like this:
day  temperature  mean  deviation  square of deviation 

sunday  65  48.9  16.1  260.6 
monday  42  48.9  -6.9  47.0 
tuesday  40  48.9  -8.9  78.4 
wednesday  31  48.9  -17.9  318.9 
thursday  47  48.9  -1.9  3.4 
friday  53  48.9  4.1  17.2 
saturday  64  48.9  15.1  229.3 
Take the total of the squares of the deviations, 954.9, and divide by 6, which is the number of days minus 1. This gives you the variance for this series of temperatures. The standard deviation is the square root of the variance. The larger the standard deviation, the larger the fluctuation in temperatures during the week.
You can calculate the mean, sum of the squares, and standard deviation with a few statistical functions. Let's use a dataset literal so that you can try this on your own:
FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}]
| stats mean(temp), sumsq(temp), stdev(temp)
This search returns these results:
mean(temp)  sumsq(temp)  stdev(temp) 

48.857142857142854  17664  12.615183595289349 
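For a sample, the sum of squares relates to the variance through the identity var = (sumsq - n * mean²) / (n - 1). This Python sketch (an illustration, not SPL2) checks that identity against the same temperatures:

```python
import math

temps = [65, 42, 40, 31, 47, 53, 64]
n = len(temps)

mean = sum(temps) / n
sumsq = sum(t * t for t in temps)               # sum of the squared values
variance = (sumsq - n * mean * mean) / (n - 1)  # sample variance
stdev = math.sqrt(variance)                     # sample standard deviation

print(sumsq)            # 17664, matching sumsq(temp) above
print(round(stdev, 3))  # 12.615, matching stdev(temp) above
```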
upperperc(<value>,<percentile>)
This function returns an approximate upper bound percentile value of a numeric field that contains more than 1000 values.
The upperperc function returns an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the data points are below the estimate and 5% of the data points are above the estimate.
Usage
There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
See the perc function.
var(<value>)
Returns the sample variance of the values in a field.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Examples
Variance is often used to determine the standard deviation of a set of values.
The following table contains the temperatures taken every day at 8 AM for a week.
day  temperature 

sunday  65 
monday  42 
tuesday  40 
wednesday  31 
thursday  47 
friday  53 
saturday  64 
Let's use a dataset literal so that you can try this on your own:
FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats mean(temp)
You calculate the mean, which is the average, of these temperatures and get 48.9 degrees.
mean(temp) 

48.857142857142854 
To calculate the deviation from the mean for each day, take the temperature and subtract the mean.
day  temp  mean  deviation 

sunday  65  48.9  16.1 
monday  42  48.9  -6.9 
tuesday  40  48.9  -8.9 
wednesday  31  48.9  -17.9 
thursday  47  48.9  -1.9 
friday  53  48.9  4.1 
saturday  64  48.9  15.1 
To calculate the sample variance, square each deviation, add the squares together, and divide by the number of values minus 1. Or just run this search:
FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats mean(temp), var(temp), stdev(temp)
This search returns these results:
mean(temp)  var(temp)  stdev(temp) 

48.857142857142854  159.14285714285748  12.615183595289349 
The larger the standard deviation the larger the fluctuation in temperatures during the week.
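As an outside-of-Splunk check (illustrative Python, not SPL2), the standard library reproduces the sample variance shown above, with the population variance for contrast:

```python
from statistics import variance, pvariance

temps = [65, 42, 40, 31, 47, 53, 64]

print(variance(temps))   # sample variance (divides by n-1): ~159.14
print(pvariance(temps))  # population variance (divides by n): ~136.41
```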
Extended example
See the Extended example for the mean() function. That example includes the var() function.
varp(<value>)
Returns the population variance of the values in a field.
Usage
You can use this function with the stats, eventstats, streamstats, and timechart commands.
Basic example
The following example calculates the population variance of the field lag.
... | stats varp(lag)
See also
 Function information
 Overview of SPL2 stats and chart functions
 Quick Reference for SPL2 Stats and Charting Functions
 Naming function arguments in the SPL2 Search Manual
This documentation applies to the following versions of Splunk^{®} Cloud Services: current