Splunk® Machine Learning Toolkit

User Guide

# Detect Numeric Outliers Classic Assistant workflow

Classic Assistants enable machine learning through a guided user interface. The Detect Numeric Outliers Classic Assistant identifies values that appear to be extraordinarily higher or lower than the rest of the data. Identified outliers can indicate interesting, unusual, and possibly dangerous events. The Detect Numeric Outliers Classic Assistant is restricted to one numeric data field.

In the following visualization, the yellow dots indicate outliers.

## Algorithms

The Detect Numeric Outliers Classic Assistant is compatible with the following distribution statistics:

• Standard deviation
• Median absolute deviation
• Interquartile range
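
As a rough illustration of how these three statistics can define an outlier envelope, the following Python sketch computes a lower and upper bound for each method and reports the values that fall outside it. This is not the SPL that the Assistant generates. The function names, the use of the sample standard deviation, and the placement of the interquartile range bounds at the first and third quartiles are assumptions made for this example.

```python
import statistics


def envelope(values, method, multiplier):
    """Return (lower, upper) bounds. Hypothetical helper, not an MLTK API."""
    if method == "stdev":
        # Center on the mean, spread is the sample standard deviation.
        center = statistics.mean(values)
        spread = statistics.stdev(values)
        return center - multiplier * spread, center + multiplier * spread
    if method == "mad":
        # Center on the median, spread is the median absolute deviation.
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
        return center - multiplier * spread, center + multiplier * spread
    if method == "iqr":
        # Assumption: bounds extend from the first and third quartiles.
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        return q1 - multiplier * iqr, q3 + multiplier * iqr
    raise ValueError(f"unknown method: {method}")


def find_outliers(values, method, multiplier):
    """Return the values that fall outside the envelope."""
    lower, upper = envelope(values, method, multiplier)
    return [v for v in values if v < lower or v > upper]


values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
for method, multiplier in [("stdev", 2), ("stdev", 3), ("mad", 3), ("iqr", 1.5)]:
    print(method, multiplier, find_outliers(values, method, multiplier))
```

In this sample, the single extreme value inflates the mean and standard deviation enough that raising the multiplier from 2 to 3 hides it from the standard deviation method, while the median-based envelope still flags it.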

## Detect Numeric Outliers

Input the data and select the parameters you want to investigate. When a situation violates the expectations for a parameter, it results in an outlier.

### Workflow

Follow these steps for the Detect Numeric Outliers Classic Assistant.

1. From the MLTK navigation bar select Classic > Assistants > Detect Numeric Outliers.
2. Run a search, and be sure to select a date range.
3. Select a numeric field in `Field to analyze`.
The list populates every time you run a search.
4. Select a method in `Threshold method`.
When you pick a method, consider both the distribution of the data and how the method affects outlier detection. Use the following table to guide your decision:

| Method | Application |
| --- | --- |
| Standard Deviation | This method is appropriate if your data exhibits a normal distribution. Because the standard deviation method centers on the mean, it is more affected by outliers. |
| Median Absolute Deviation | This method applies a stricter interpretation of outliers than standard deviation because the measurement centers on the median and uses Median Absolute Deviation (MAD) instead of standard deviation. |
| Interquartile Range | This method is appropriate when your data exhibits an asymmetric distribution. Instead of centering the measurement on a mean or median, it uses quartiles to determine whether a value is an outlier. |
5. Specify a value for `Threshold multiplier`.
The larger the number, the larger the outlier envelope, and therefore, the fewer the outliers.
6. (Optional) In the `Sliding window` field, specify the number of values to use to compute each slice of the outlier envelope.
A sliding window is useful if the distribution of your data changes frequently. If you do not specify a sliding window, the assistant uses the whole dataset, which results in an outlier envelope of uniform size.
7. Select `Include current point` to include the current point in the calculations before assessing whether it is an outlier.
8. (Optional) In `Fields to split by`, select up to 5 fields.
In the visualizations, the data points are grouped by the values of the split by field. If you specify more than one split by field, the data points are grouped by the combination of the values of those fields. It is better to split by a categorical field than a numeric field. For example, if you detect outliers in grocery store purchases and analyze the `quantity` field, you could split by `store_ID` to group the `quantity` data points by store.
9. Click Detect Outliers.
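
To show how the `Sliding window` and `Include current point` settings can interact, the following Python sketch recomputes a standard deviation envelope for each data point from a trailing window of recent values. It is a simplified illustration and not the search that the Assistant runs. The function name, the trailing-window behavior, and the choice to leave a point unflagged when fewer than two values are available are all assumptions.

```python
import statistics


def sliding_outliers(values, window, multiplier, include_current=True):
    """Flag each value against an envelope built from a trailing window.

    Hypothetical helper: the window holds up to `window` values and ends
    at the current point (include_current=True) or just before it.
    """
    flags = []
    for i, value in enumerate(values):
        end = i + 1 if include_current else i
        start = max(0, end - window)
        recent = values[start:end]
        if len(recent) < 2:
            # Assumption: too little data to estimate spread, so no flag.
            flags.append(False)
            continue
        center = statistics.mean(recent)
        spread = statistics.stdev(recent)
        lower = center - multiplier * spread
        upper = center + multiplier * spread
        flags.append(value < lower or value > upper)
    return flags


values = [10, 11, 10, 11, 10, 40, 10, 11]
print(sliding_outliers(values, window=4, multiplier=2, include_current=False))
print(sliding_outliers(values, window=4, multiplier=2, include_current=True))
```

With a small window, including the current point lets an extreme value widen its own envelope, so in this sample the spike at 40 is flagged only when the current point is excluded.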

## Interpret and validate

After you detect outliers, review your results in the tables and visualizations. Results commonly have a few outliers.

| Result | Application |
| --- | --- |
| Data and Outliers | Outliers, represented by yellow dots, are data points that fall outside of the light blue envelope. A chart to the right of the graph reports the total number of outliers. Hover over a yellow dot to see the value and quantity of an outlier. To learn more about the nature of the outlier, click it to drill down to the search query to see the base of the data point. |
| Split by fields | If you selected a field to split by, then the Data and Outliers chart displays up to 10 values that you can add or remove from the chart. The chart groups the data points based on split by field values. For example, if there are 3 split by field values, the chart is broken out into 3 separate charts, one for each split value. The number of outliers for each split by field value is displayed to the right of the chart. |
| Outlier Count Over Time chart | This chart plots the outliers over time, and only appears if you use time series data. If you specify more than one split, the chart shows the outlier count for each field value. To see which values are too high or too low, check the box for Split outliers above and below threshold. If you split by field, each field contains a value for outliers above and below the threshold. |
| Data Distribution histogram | This histogram shows the distribution, and displays the number of data points within the threshold (the light blue area) and the number of data points outside the threshold. |
| Data and Outliers table | This table shows each outlier and its corresponding value, as well as the lists of values for any split by field. |
| Outlier Split Value Distribution | If you specified one or more split by fields, this table displays the number of outliers for each split value or combination of split values. |
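
The per-split counts described above can be approximated outside of Splunk with a short Python sketch. It groups rows by a split by field, builds a median absolute deviation envelope for each group, and counts the outliers above and below the threshold. The function name is hypothetical, the `store_ID` and `quantity` fields follow the grocery store example, and computing a separate envelope for each split value is an assumption about how the grouping works.

```python
import statistics
from collections import defaultdict


def outlier_counts_by_split(rows, value_key, split_key, multiplier):
    """Count outliers above and below the envelope for each split value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[split_key]].append(row[value_key])

    counts = {}
    for split_value, values in groups.items():
        # Assumption: each split value gets its own MAD-based envelope.
        center = statistics.median(values)
        mad = statistics.median([abs(v - center) for v in values])
        lower = center - multiplier * mad
        upper = center + multiplier * mad
        counts[split_value] = {
            "above": sum(v > upper for v in values),
            "below": sum(v < lower for v in values),
        }
    return counts


rows = (
    [{"store_ID": "A", "quantity": q} for q in [2, 3, 2, 3, 2, 30]]
    + [{"store_ID": "B", "quantity": q} for q in [20, 21, 19, 20, 22, 1]]
)
print(outlier_counts_by_split(rows, "quantity", "store_ID", multiplier=3))
```

Grouping first matters here: a quantity of 20 is ordinary for store B but would look extreme if it were compared against the values for store A.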

## Deploy numeric outlier detection

1. Click Open in Search to generate a New Search tab for this same dataset. This new search opens in a new browser tab, away from the Classic Assistant.
This search query uses all your data, not just the training set. You can adjust the SPL directly and see results immediately. You can also save the query as a Report, Dashboard Panel, or Alert.
2. Click Show SPL to generate a new window showing the search query that was used to calculate the outliers. Copy the SPL here for use in other aspects of your Splunk instance.

Once you navigate away from the Classic Assistant page, you cannot return to it through the Classic or Models tabs. Classic Assistants are great for generating SPL, but may not be ideal for longer-term projects.