Splunk® Machine Learning Toolkit

User Guide

This documentation does not apply to the most recent version of Splunk® Machine Learning Toolkit. For documentation on the most recent version, go to the latest release.

Detect Numeric Outliers Classic Assistant

Classic Assistants enable machine learning through a guided user interface. The Detect Numeric Outliers Classic Assistant determines values that appear to be extraordinarily higher or lower than the rest of the data. Identified outliers are indicative of interesting, unusual, and possibly dangerous events. The Detect Numeric Outliers Assistant is restricted to one numeric data field.

In the following visualization, the yellow dots indicate outliers.

This visualization illustrates a time series line chart with yellow dots that mark the numeric outliers.

Algorithms

The Detect Numeric Outliers Classic Assistant is compatible with the following distribution statistics:

  • Standard deviation
  • Median absolute deviation
  • Interquartile range

Detect Numeric Outliers

Input the data and select the parameters you want to investigate. When a situation violates the expectations for a parameter, it results in an outlier.

Workflow

Follow these steps for the Detect Numeric Outliers Classic Assistant.

  1. From the MLTK navigation bar select Classic > Assistants > Detect Numeric Outliers.
  2. Run a search, and be sure to select a date range.
  3. Select a numeric field in Field to analyze.
    The list populates every time you run a search.
  4. Select a method in Threshold method.
    When picking a method, consider both the distribution of the data and how the method affects outlier detection. Use the following table to guide your decision:
    Method | Application
    Standard Deviation | This method is appropriate if your data exhibits a normal distribution. Because the standard deviation method centers on the mean, extreme values pull the threshold toward themselves, so this method is more affected by outliers.
    Median Absolute Deviation | This method applies a stricter interpretation of outliers than standard deviation because the measurement centers on the median and uses Median Absolute Deviation (MAD) instead of standard deviation.
    Interquartile Range | This method is appropriate when your data exhibits an asymmetric distribution. Instead of centering the measurement on a mean or median, it uses quartiles to determine whether a value is an outlier.
  5. Specify a value for Threshold multiplier.
    The larger the number, the larger the outlier envelope, and therefore, the fewer the outliers.
  6. (Optional) In the Sliding window field, specify the number of values to use to compute each slice of the outlier envelope.
    A sliding window is useful if the distribution of your data changes frequently. If you do not specify a sliding window, the assistant uses the whole dataset, which results in an outlier envelope of uniform size.
  7. Select Include current point to include the current point in the calculations before assessing whether it is an outlier.
  8. (Optional) In Fields to split by, select up to 5 fields.
    In the visualizations the data points are grouped by field, or if more than one split by field is specified, by the combination of the values of the fields. It is better to split by a categorical field than a numeric field. For example, if you detect outliers in grocery store purchases and analyze the quantity field, you could split by store_ID to group the quantity data points by store.
  9. Click Detect Outliers.
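
The settings in steps 4 through 7 can be sketched in Python as follows. This is an illustrative sketch, not the SPL that the assistant generates. The function names `envelope` and `detect_outliers` are hypothetical. The interquartile range bounds use the conventional form (first quartile minus, and third quartile plus, the multiplier times the range), and the sliding window is assumed to be a trailing window. Both are assumptions that might differ from the assistant's actual calculation.

```python
import statistics


def envelope(values, method, multiplier):
    """Return the (lower, upper) bounds of the outlier envelope for a list of numbers."""
    if method == "stdev":
        # Centered on the mean; the spread is the sample standard deviation.
        center = statistics.mean(values)
        spread = statistics.stdev(values)
        return center - multiplier * spread, center + multiplier * spread
    if method == "mad":
        # Centered on the median; the spread is the median absolute deviation.
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
        return center - multiplier * spread, center + multiplier * spread
    if method == "iqr":
        # Assumption: conventional quartile fences, not a formula confirmed for the assistant.
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
        return q1 - multiplier * spread, q3 + multiplier * spread
    raise ValueError(f"unknown method: {method}")


def detect_outliers(values, method="stdev", multiplier=2.0,
                    window=None, include_current=True):
    """Return one boolean per value: True if the value falls outside its envelope."""
    flags = []
    for i, value in enumerate(values):
        if window is None:
            # No sliding window: one envelope computed from the whole dataset.
            sample = values
        else:
            # Assumption: a trailing window that ends at, or just before, the current point.
            end = i + 1 if include_current else i
            sample = values[max(0, end - window):end]
        if len(sample) < 2:
            # Too few values to estimate a spread; do not flag the point.
            flags.append(False)
            continue
        lower, upper = envelope(sample, method, multiplier)
        flags.append(value < lower or value > upper)
    return flags


values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(detect_outliers(values, method="stdev", multiplier=2))
print(detect_outliers(values, method="stdev", multiplier=3))
```

In this toy dataset, a multiplier of 2 flags only the value 50, and a multiplier of 3 flags nothing, which matches the rule that a larger multiplier widens the envelope and yields fewer outliers. With a short sliding window, including the current point can hide a spike, because the spike inflates the spread of its own window.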

Interpret and validate

After you detect outliers, review your results in the tables and visualizations. Results commonly have a few outliers.

Result | Application
Data and Outliers | Outliers, represented by yellow dots, are data points that fall outside of the light blue envelope. A chart to the right of the graph reports the total number of outliers. Hover over a yellow dot to see the value and quantity of an outlier. To learn more about the nature of the outlier, click it to drill down to the search query and see the data behind that point.
Split by fields | If you selected a field to split by, then the Data and Outliers chart displays up to 10 values that you can add or remove from the chart. The chart groups the data points based on split by field values. For example, if there are 3 split by field values, the chart is broken out into 3 separate charts, one for each split value. The number of outliers for each split by field value is displayed to the right of the chart.
Outlier Count Over Time chart | This chart plots the outliers over time, and only appears if you use time series data. If you specify more than one split, the chart shows the outlier count for each field value. To see which values are too high or too low, check the box for Split outliers above and below threshold. If you split by field, each field contains a value for outliers above and below the threshold.
Data Distribution histogram | This histogram shows the distribution of the data, and displays the number of data points within the threshold (the light blue area) and the number of data points outside the threshold.
Data and Outliers table | This table shows each outlier and its corresponding value, as well as the lists of values for any split by field.
Outlier Split Value Distribution | If you specified one or more split by fields, this table displays the number of outliers for each split value or combination of split values.
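
The per-split counts described above can be sketched in Python as follows. This is an illustrative sketch only. The function name `count_outliers_by_split` and the sample records are hypothetical, the field names `quantity` and `store_ID` come from the grocery store example, and the sketch assumes that a separate envelope is computed for each split value using the median absolute deviation method.

```python
import statistics
from collections import defaultdict


def count_outliers_by_split(rows, value_field, split_field, multiplier=2.0):
    """Count outliers above and below the envelope for each split field value."""
    # Group the analyzed values by the value of the split by field.
    groups = defaultdict(list)
    for row in rows:
        groups[row[split_field]].append(row[value_field])

    counts = {}
    for key, values in groups.items():
        # Assumption: each split value gets its own envelope (median +/- multiplier * MAD).
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
        lower = center - multiplier * spread
        upper = center + multiplier * spread
        counts[key] = {
            "above": sum(v > upper for v in values),
            "below": sum(v < lower for v in values),
        }
    return counts


# Hypothetical purchase records for two stores.
rows = (
    [{"store_ID": "A", "quantity": q} for q in [5, 6, 5, 7, 6, 40]]
    + [{"store_ID": "B", "quantity": q} for q in [20, 21, 19, 20, 22, 1]]
)
print(count_outliers_by_split(rows, "quantity", "store_ID"))
```

For these sample records, store A has one outlier above its envelope and store B has one outlier below its envelope, which is the kind of breakdown that the Split outliers above and below threshold option reports.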

Deploy numeric outlier detection

  1. Click Open in Search to generate a New Search tab for this same dataset. This new search opens in a new browser tab, away from the Classic Assistant.
    This search query uses all data, not just the training set. You can adjust the SPL directly and see results immediately. You can also save the query as a Report, Dashboard Panel, or Alert.
  2. Click Show SPL to generate a new window showing the search query that was used to calculate the outliers. Copy the SPL here for use in other aspects of your Splunk instance.

Once you navigate away from the Classic Assistant page, you cannot return to it through the Classic or Models tabs. Classic Assistants are great for generating SPL, but may not be ideal for longer-term projects.

For more information about alerts, see Getting started with alerts in the Splunk Enterprise Alerting Manual.

Last modified on 24 August, 2018

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 3.4.0, 4.0.0, 4.1.0, 4.2.0, 4.3.0

