Detect Numeric Outliers
The Detect Numeric Outliers assistant identifies values that are unusually high or low compared with the rest of the data. Outliers can indicate interesting, unusual, and possibly dangerous events. This assistant analyzes a single numeric field.
Algorithm
- Distribution statistics (standard deviation, median absolute deviation, interquartile range)
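As a rough illustration of how these three statistics can define an outlier envelope, the Python sketch below computes a lower and upper bound as center ± multiplier × spread. The function names, the `method` labels, and the exact formulas (mean with sample standard deviation, median with median absolute deviation, median with interquartile range) are assumptions made for this example; the toolkit's own calculations may differ.

```python
import statistics

def outlier_envelope(values, method="std", multiplier=2.0):
    """Return (lower, upper) bounds as center +/- multiplier * spread.

    Hypothetical helper: the center/spread pairs below are assumptions,
    not the toolkit's documented formulas.
    """
    if method == "std":
        center = statistics.mean(values)
        spread = statistics.stdev(values)  # sample standard deviation
    elif method == "mad":
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
    elif method == "iqr":
        center = statistics.median(values)
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
    else:
        raise ValueError(f"unknown method: {method}")
    return center - multiplier * spread, center + multiplier * spread

def find_outliers(values, method="std", multiplier=2.0):
    """Return the values that fall outside the envelope."""
    lower, upper = outlier_envelope(values, method, multiplier)
    return [v for v in values if v < lower or v > upper]

values = [10, 11, 9, 10, 12, 10, 11, 95]
print(find_outliers(values, method="std", multiplier=3))
print(find_outliers(values, method="mad", multiplier=3))
print(find_outliers(values, method="iqr", multiplier=3))
```

With this sample data, the single extreme value inflates the standard deviation enough that the standard deviation envelope at a multiplier of 3 no longer flags it, while the median-based methods still do. This is the sense in which the median-based methods are more robust to outliers.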
Workflow
To detect numeric outliers, you supply data and choose the parameters that define the expected range of values. A value that falls outside that range is flagged as an outlier. The basic steps are as follows:
- Enter a search to retrieve your data, then click the search button to run it.
- Select the numeric field for which to detect outliers as the field to analyze. This list of fields is populated by the search you just ran.
- Select a value for threshold method, based on the distribution of the data and the impact you'd like outliers to have. Select Standard Deviation if your data is normally distributed and you don't mind outliers having a big impact on the outlier threshold. If you want more robustness to outliers, try Median Absolute Deviation or Interquartile Range.
- Specify a value for threshold multiplier. The larger the number, the larger the outlier envelope (and therefore, the fewer the outliers).
- Select whether or not to use a sliding window and specify the number of values to use to compute each slice of the outlier envelope. Select Include current point to include the current point in the statistical calculations before assessing whether it is an outlier. Using a sliding window is useful if the distribution of your data changes frequently. If you don't specify a sliding window, the envelope is computed using the entire dataset at once, therefore creating an outlier envelope with a uniform size.
- Optionally, select up to 5 fields to split by. In the visualizations, the data points of the field you are analyzing are grouped by the values of the field specified here or, if more than one split by field is specified, by the combination of the values of the fields. Categorical fields, as opposed to numeric fields, work best. For example, if you are detecting outliers in grocery store purchases and analyzing the quantity field, you could split by store_ID to group the quantity data points by store.
- Click Detect Outliers.
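The sliding window option in the steps above can be sketched in Python as follows. This is a minimal illustration that assumes a trailing window (the most recent `window` values) and a median absolute deviation envelope; the function name and parameters (`sliding_outliers`, `window`, `include_current`) are hypothetical and are not taken from the toolkit.

```python
import statistics

def sliding_outliers(values, window=4, multiplier=3.0, include_current=False):
    """Flag each point against an envelope built from a trailing window.

    Assumption: the window covers the `window` most recent values, ending
    at the current point if include_current is True, otherwise just before it.
    """
    flags = []
    for i, v in enumerate(values):
        end = i + 1 if include_current else i
        start = max(0, end - window)
        history = values[start:end]
        if len(history) < 2:
            # Too few points to estimate a spread; do not flag.
            flags.append(False)
            continue
        center = statistics.median(history)
        spread = statistics.median([abs(h - center) for h in history])
        lower = center - multiplier * spread
        upper = center + multiplier * spread
        flags.append(v < lower or v > upper)
    return flags

values = [10, 12, 11, 13, 12, 50, 11, 12]
flags = sliding_outliers(values, window=4, multiplier=3.0)
print([i for i, flagged in enumerate(flags) if flagged])
```

Because the envelope is recomputed for every point, its width follows local changes in the data, whereas computing it once over the entire dataset yields a single pair of bounds.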
Interpret and validate
After you detect outliers, review the visualizations to see how many outliers are identified. Typically, only a small number of data points should be flagged as outliers.
- Data and Outliers chart: Displays a graph of values. Values that fall outside the blue envelope (the outliers) are marked with yellow dots. The number of outliers found is displayed to the right of the chart. Hover over a dot to display the value and quantity of the outlier. Click the dot to drill down to a search query that shows the base data of the point, which lets you investigate the nature of an outlier.
- With Split by fields: If you have specified one or more fields to split by, up to 10 split by field values are displayed under the chart heading. If there are more than 10 split values, you can click in the list to see a dropdown list of all the values. You can remove or change the values you want to display in the chart, up to a maximum of 10. The data points in the chart are grouped by the values of the split by field or, if there is more than one split by field, by the combination of their values. For example, if there are 3 split by field values, the chart is broken out into 3 separate charts, one for each split value. The number of outliers for each split by field value is displayed to the right of the chart.
- Outlier Count Over Time chart: This chart displays only if you are using time series data and shows the outlier count plotted by time. If one or more split by fields is specified, it shows the outlier count for each split field value or combination of split field values. Check the Split outliers above and below threshold box to see the outliers that are above the threshold (the values that are too high) in a different color from the outliers that are below the threshold (the values that are too low). If using split by fields, each split by field value will have one value for outliers above the threshold and one value for outliers below the threshold.
- Data Distribution histogram: Displays the number of data points within the threshold (the light blue area) and the number of data points outside the threshold.
- Data and Outliers table: Displays a table of the outliers and their values. Also lists the values of any split by fields.
- Outlier Split Value Distribution: If you specified one or more split by fields, this table displays the number of outliers for each split value or combination of split values.
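The per-split outlier counts shown in these panels, separated into values above and below the threshold, can be approximated with the short Python sketch below. It assumes a fixed envelope (`lower`, `upper`) shared by all rows for simplicity, and reuses the `store_ID` and `quantity` field names from the earlier grocery store example; the function name and the sample rows are hypothetical.

```python
from collections import Counter

def count_outliers_by_split(rows, lower, upper, split_field, value_field):
    """Count outliers per split value, separately above and below the envelope.

    Assumption: one fixed (lower, upper) envelope applies to every row.
    """
    counts = Counter()
    for row in rows:
        value = row[value_field]
        if value > upper:
            counts[(row[split_field], "above")] += 1
        elif value < lower:
            counts[(row[split_field], "below")] += 1
    return counts

# Illustrative sample rows, not real data.
rows = [
    {"store_ID": "A", "quantity": 5},
    {"store_ID": "A", "quantity": 120},
    {"store_ID": "B", "quantity": -3},
    {"store_ID": "B", "quantity": 7},
    {"store_ID": "B", "quantity": 200},
]
counts = count_outliers_by_split(rows, lower=0, upper=100,
                                 split_field="store_ID", value_field="quantity")
for (store, side), n in sorted(counts.items()):
    print(store, side, n)
```

Each split value yields at most two counts, one for outliers above the upper threshold and one for outliers below the lower threshold.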
Deploy outlier detection
Once you have detected outliers, you can take the following actions:
- Click the Open in Search button next to the Detect Outliers button to open a new Search tab, filled out with a search query that uses all data (not just the training set).
- Click the Show SPL button next to the Open in Search button to see the search query that was used to detect outliers. For example, you could use this same query on a different data set.
- Click the Schedule Alert button in a panel to set up an alert when the number of outliers outside both thresholds, above the upper threshold, or below the lower threshold exceeds a specified value. After you save the alert, you can access it from the Scheduled Jobs > Alerts menu.
- Click any title to go to a new Search tab, filled out with a search query to replicate the outlier detection calculations.
This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 2.3.0