Finding and removing outliers
This section describes outliers. For a complete list of topics on detecting anomalies, detecting patterns, and forecasting time series, see About advanced statistics in this manual.
What is an outlier?
An outlier is a data point that is far removed from the typical distribution of data points.
In some situations you might want to simply identify the outliers. In other situations you might want to remove the outliers so that they do not skew your statistical results or create issues displaying data in your charts.
Statistical outliers
A statistical outlier is a data point that is far removed from some measure of centrality. Typical measures of centrality are mean, median, and mode. Mean is the average value. Median is the value at the middle of a sorted list of all values. Mode is the most frequent value.
Centrality measure | Splunk centrality commands |
---|---|
Mean | stats avg(field) |
Median | stats median(field) |
Mode | stats mode(field) |
There are different types of statistical outliers. An outlier can be far from the mean, far from the median, or a value that occurs rarely compared with the mode. An outlier can also be a data point that is far removed from the typical range of data points.
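As a minimal sketch of how a single extreme value affects the three centrality measures, the following search builds 11 synthetic events with the makeresults command and calculates all three measures in one stats command. The field names n and price and the generated values are assumptions made up for illustration. They do not come from any real data set.

```spl
| makeresults count=11
| streamstats count as n
| eval price=case(n=11, 500, n<=5, 20, true(), 20+n)
| stats avg(price) as mean median(price) as median mode(price) as mode
```

The one event with a price of 500 pulls the mean up to 740/11, or roughly 67, while the median and the mode should stay at the typical values of 26 and 20. This is why the median is often preferred as the center when the data might already contain outliers.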
Identifying outliers
There are several methods you can use to identify outliers. Often these methods involve calculating some measure of centrality and then identifying the outliers.
In some cases you identify outliers only after you notice an anomaly. For example, when you chart the data you might notice that the axis is skewed. An outlier can be the culprit.
Use the number of statistical outliers as a bellwether. If you see more statistical outliers than usual, that phenomenon itself is an anomaly.
To calculate the measures of spread and frequency that are typically compared against a centrality measure, you can use the following commands.
Measure | Splunk commands |
---|---|
Standard deviation | stats stdev(field) |
Quartiles and percentiles | stats perc75(field) perc25(field) |
Top 3 most frequent values | top 3 field |
In many cases, you need additional commands to calculate the information that you are looking for. The following sections contain some of these examples.
Calculate the lower and upper boundaries of an acceptable range to identify outliers
The following example takes the first 500 events from the quote.csv file. The streamstats command and a moving window of 100 events are used to calculate the average and standard deviation.
The average and standard deviation are used with the eval command to calculate the lower and upper boundaries. A sensitivity is added into the calculation by multiplying the stdev by 2. The eval command uses those boundaries to flag each event in the isOutlier field. The results are then sorted in descending order of isOutlier so that the outliers are listed first.
| inputlookup quote.csv
| head 500
| eval _time=(round(strptime(time, "%Y-%m-%d %H:%M:%SZ")))
| streamstats window=100 avg("price") as avg stdev("price") as stdev
| eval lowerBound=(avg-stdev*2)
| eval upperBound=(avg+stdev*2)
| eval isOutlier=if('price' < lowerBound OR 'price' > upperBound, 1, 0)
| fields "_time", "symbol", "sourcetype", "time", "price", "lowerBound", "upperBound", "isOutlier"
| sort - isOutlier
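To use the number of outliers as a bellwether, one possible variation of the search above replaces the final fields and sort commands with a timechart command that counts the flagged events in each time span. This is a sketch only. The span=1h value and the outlierCount field name are assumptions, and the right span depends on the time range that your data covers.

```spl
| inputlookup quote.csv
| head 500
| eval _time=(round(strptime(time, "%Y-%m-%d %H:%M:%SZ")))
| streamstats window=100 avg("price") as avg stdev("price") as stdev
| eval lowerBound=(avg-stdev*2)
| eval upperBound=(avg+stdev*2)
| eval isOutlier=if('price' < lowerBound OR 'price' > upperBound, 1, 0)
| timechart span=1h sum(isOutlier) as outlierCount
```

A span whose outlierCount is much higher than in the neighboring spans is a candidate anomaly in its own right.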
Use the interquartile range (IQR) to identify outliers
The following example takes the first 500 events from the quote.csv file. The eventstats command is used to calculate the median, the 25th percentile (p25), and the 75th percentile (p75). The IQR is calculated with the eval command by subtracting the 25th percentile from the 75th percentile. The median and the IQR are used with the eval command to calculate the lower and upper boundaries. A sensitivity is added into the calculation by multiplying the IQR by 20. The eval command uses those boundaries to flag each event in the isOutlier field. The results are then sorted in descending order of isOutlier so that the outliers are listed first.
| inputlookup quote.csv
| head 500
| eval _time=(round(strptime(time, "%Y-%m-%d %H:%M:%SZ")))
| eventstats median("price") as median p25("price") as p25 p75("price") as p75
| eval IQR=(p75-p25)
| eval lowerBound=(median-IQR*20)
| eval upperBound=(median+IQR*20)
| eval isOutlier=if('price' < lowerBound OR 'price' > upperBound, 1, 0)
| fields "_time", "symbol", "sourcetype", "time", "price", "lowerBound", "upperBound", "isOutlier"
| sort - isOutlier
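The example above centers the boundaries on the median and uses a multiplier of 20. A different and widely used convention, often called Tukey's fences, places the boundaries 1.5 times the IQR below the 25th percentile and above the 75th percentile. The following sketch applies that convention to 20 synthetic events. It is not the method used in the example above, and the field names and generated values are assumptions for illustration only.

```spl
| makeresults count=20
| streamstats count as n
| eval price=if(n=20, 500, 100 + n)
| eventstats p25(price) as p25 p75(price) as p75
| eval IQR=(p75-p25)
| eval lowerBound=(p25-IQR*1.5)
| eval upperBound=(p75+IQR*1.5)
| eval isOutlier=if(price < lowerBound OR price > upperBound, 1, 0)
| where isOutlier=1
```

With prices of 101 through 119 plus one price of 500, the IQR is about 10, so only the event with the price of 500 should be returned. A smaller multiplier flags more events, and a larger multiplier flags fewer.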
Removing outliers in charts
You can use the outlier command to remove outlying numerical values from your search results.
You have the option to remove or transform the events with outliers. Setting action=remove removes the events. Setting action=transform truncates the outlying value to the threshold for outliers. The threshold is specified with the param option.
A value is considered an outlier if the value is outside of the param threshold multiplied by the interquartile range (IQR). The default value for param is 2.5.
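As a small sketch of these options, the following search generates 20 synthetic events, one of which has a very large value, and asks the outlier command to drop events that fall outside the threshold for the bytes field. The bytes and n field names and the generated values are assumptions for illustration.

```spl
| makeresults count=20
| streamstats count as n
| eval bytes=if(n=20, 50000, 100 + n)
| outlier action=remove param=2.5 bytes
```

The event with bytes=50000 is the one expected to be removed. Changing action=remove to action=transform should keep all 20 events and truncate that value to the threshold instead.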
Create a chart of web server events, transform the outlying values
For a timechart of web server events, transform the outlying average CPU values.
host="web_server" 404
| timechart avg(cpu_seconds) by host
| outlier action=transform
Remove outliers that interfere with displaying the y-axis in a chart
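Sometimes when you create a chart, a small number of values are so far from the other values that the chart is unreadable. You can use the outlier command to rein in those values so that the rest of the chart values are visible. When you do not specify an action, the outlier command uses its default action, transform, which truncates the outlying values to the threshold rather than removing the events.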
index=_internal source=*access*
| timechart span=1h max(bytes)
| fillnull
| outlier
Remove outliers using the three-sigma rule across transactions
This example uses the eventstats command to calculate the average and the standard deviation of the duration. The three-sigma limit is then calculated as the average plus three standard deviations. The where command filters the search results so that only the events with a duration less than the three-sigma limit are returned.
... | eval durationMins = (duration/60)
| eventstats avg(durationMins) as Avrg, stdev(durationMins) as StDev
| eval threeSigmaLimit = (Avrg + (StDev * 3))
| where durationMins < threeSigmaLimit
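The search above removes only the values that are above the upper limit. If unusually short durations also matter, a two-sided variant compares the absolute distance from the average with three standard deviations. The following sketch runs that variant on 30 synthetic events. The generated values are assumptions for illustration, and the field names follow the example above.

```spl
| makeresults count=30
| streamstats count as n
| eval durationMins=if(n=30, 500, 10 + (n % 5))
| eventstats avg(durationMins) as Avrg stdev(durationMins) as StDev
| where abs(durationMins - Avrg) < (StDev * 3)
```

The single 500-minute event should be filtered out and the other 29 events returned. The sketch uses 30 events because, with about ten or fewer events, one extreme value inflates the sample standard deviation so much that no value can be more than three standard deviations from the average.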
Manage alerts for outliers
When setting up an alert, it is important to review the outlier threshold you have set.
- If the threshold value is too low, too many alerts are returned for non-critical outliers.
- If the threshold value is too high, not enough alerts are returned and you might not identify the outliers.
Typically a small percentage of events in your data are outliers. If you have 1,000,000 events a day and 5% of the events are outliers, an alert would trigger 50,000 alert actions a day unless you specify a throttle. A throttle suppresses additional alerts that have the same field value in a given time range.
For example, your search returns on average 100 events every minute. You only want to be alerted when the status for an event is 404. You can set up the alert to perform the alert action one time every 60 seconds instead of alerting you for every event that has a 404 status in that 60-second window.
Set alert throttling
You set throttling as part of setting an alert.
1. Determine what percent of your events are outliers.
2. Under Settings, choose Searches, reports, and alerts.
3. Under Schedule and alert, click the Schedule this search check box. The screen expands to display the scheduling and alerting options.
4. Specify the alert condition and mode.
5. Mark the Throttling check box and specify when the throttling expires.
6. Specify the alert action.
7. Click Save.
See also
Related information
- About advanced statistics
- Detecting patterns
This documentation applies to the following versions of Splunk Cloud Platform™: 8.2.2112, 8.2.2201, 8.2.2202, 8.2.2203, 9.0.2205, 9.0.2208, 9.0.2209, 9.0.2303, 9.0.2305, 9.1.2308, 9.1.2312, 9.2.2403, 9.2.2406 (latest FedRAMP release), 9.3.2408