Splunk® IT Service Intelligence

Administration Manual




Detect anomalous KPI behavior in ITSI

ITSI provides anomaly detection algorithms that learn KPI patterns continuously in real time, and detect when KPIs depart from their own historical behavior. You can use anomaly detection to identify trends and outliers in KPI search results that might indicate an issue with your system.

Prerequisite

Anomaly detection in ITSI 2.5.0 or later requires Java 7 or Java 8. If you are using Java 7, any version of ITSI running on Splunk Enterprise version 6.6.0 or later also requires the Java Cryptography Extension (JCE). See Install required Java components in this manual.

Anomaly detection algorithms

ITSI provides two anomaly detection algorithms: trending and entity cohesion.

Trending

The trending algorithm detects anomalous patterns in a single time series (metric). A sliding window on the time series is monitored by a scoring function based on non-parametric statistics, which continuously generates scores that reflect how anomalous the patterns in the current window are compared to patterns in the historical data. Thresholds are computed adaptively, without any distribution assumptions, and are robust to outliers. Anomalously high scores generate an alert.

Trending anomaly detection applies to aggregate KPI events and is useful for tracking anomalous KPI behavior at the service level.

Entity cohesion

The entity cohesion (cohesive) algorithm detects anomalous patterns across multiple time series simultaneously. The group of time series the algorithm monitors is assumed to have similar, or "cohesive," behavior and patterns. A scoring function continuously monitors all time series within a sliding window and generates a score for each time series that reflects how far its patterns depart from the rest of the group.

Significant departures of a time series from its cohesive peers produce high anomaly scores, which trigger alerts using thresholding techniques similar to those of the trending algorithm.

Entity cohesion anomaly detection applies to KPIs that are shared across multiple entities (a minimum of 4) and is useful for tracking anomalous behavior at the entity level. To use entity cohesion, a KPI must be split by entity.
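
To make the scoring idea above more concrete, the following minimal Python sketch scores a sliding window against historical data with a rank-based (non-parametric) statistic and flags windows whose scores exceed a robust, adaptively computed threshold (median plus a multiple of the median absolute deviation). This is a conceptual illustration only, not ITSI's implementation; the window size, scoring function, and threshold multiplier are illustrative assumptions.

from statistics import median

def rank_score(history, window):
    # For each point in the current window, compute the fraction of
    # historical points it exceeds. Values near 0 or 1 place the point in
    # the tails of the historical distribution, so the score below is 0 for
    # a typical point and approaches 1 for an extreme one. This rank-based
    # measure makes no assumptions about the data's distribution.
    scores = []
    for x in window:
        frac = sum(1 for h in history if h < x) / len(history)
        scores.append(abs(frac - 0.5) * 2)
    return sum(scores) / len(scores)

def adaptive_threshold(past_scores, k=3.0):
    # Median + k * MAD of previously observed scores: a threshold that
    # adapts to the score history and is robust to outliers.
    med = median(past_scores)
    mad = median(abs(s - med) for s in past_scores)
    return med + k * mad

def detect(series, window_size=10, min_history=50):
    # Slide a window over the series and alert when the window's score
    # exceeds the adaptive threshold computed from earlier scores.
    past_scores = []
    alerts = []
    for end in range(min_history + window_size, len(series) + 1):
        window = series[end - window_size:end]
        history = series[:end - window_size]
        score = rank_score(history, window)
        if len(past_scores) >= min_history and score > adaptive_threshold(past_scores):
            alerts.append((end - 1, score))
        past_scores.append(score)
    return alerts

Calling detect() on a list of numeric KPI values returns the positions of windows that look anomalous relative to the series' own history, which is the same general pattern both algorithms follow.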

Analyze KPI data and enable algorithms

Before you enable anomaly detection algorithms, it is a best practice to use the Analyze KPI Data tool to confirm that anomaly detection is recommended for your KPI. The tool determines whether the algorithms will produce meaningful results for the KPI based on specific criteria. KPIs that do not meet an algorithm's criteria are likely to generate false positives and are not recommended.

You can enable an anomaly detection algorithm for any KPI, regardless of the analysis results. The one exception is entity cohesion, which you cannot enable if the KPI is not split by entity.

  1. Click Configure > Services.
  2. Open the service containing the KPI for which you want to apply anomaly detection.
  3. Select the KPI.
  4. Expand the Search and Calculate panel.
  5. In the Unit row, click Edit.
  6. Select Enable backfill and define the backfill period over which you want to analyze KPI data.
  7. Click Finish.
  8. Expand the Anomaly Detection panel.
  9. In Analysis Time Window, select the time range for KPI data analysis.
  10. Click Analyze KPI Data.
    • If Algorithm Analysis Result shows Recommended, then the KPI meets the criteria for the algorithm.
    • If Algorithm Analysis Result shows a warning message, then the KPI does not meet the criteria for use with the algorithm. Mouse over the tooltip to learn more about the algorithm requirements. See Algorithm analysis criteria below.
  11. For each recommended algorithm, click Yes to enable it.
  12. Adjust the Algorithm Sensitivity slider to set the algorithm's sensitivity to variance in the data. The more sensitive the algorithm, the more likely it is to generate an anomalous event.
  13. Click Save. The algorithm now evaluates the KPI data continuously and generates anomalous events based on the algorithm sensitivity threshold.

Algorithm analysis criteria

The following table shows the specific criteria a KPI must meet to be recommended for use with each algorithm.

KPI criteria                   Trending algorithm    Entity cohesion algorithm
Minimum amount of data         24 hours              24 hours
% of anomalous data points     < 10%                 < 10%
Minimum number of entities     N/A                   4 entities
Maximum number of entities     N/A                   30 entities per KPI

Triage and investigate anomalous events

An anomalous event is an event that is inconsistent with, or deviates from, usual behavior. ITSI generates a notable event in Episode Review when it detects an anomalous event. You can then open the event in a deep dive to perform root cause analysis.

  • The trending algorithm generates notable events with the heading "Service level alert on KPI."
  • The entity cohesion (cohesive) algorithm generates notable events with the heading "Entity level alert on KPI."

The type of algorithm that generated the notable event appears in the Details section of the Overview tab.

For more information, see Overview of Episode Review in ITSI in the Splunk ITSI User Manual.

Open anomalous events in a deep dive

You can drill down to a deep dive from any anomalous notable event. This lets you view the event over a default 10-minute time range and perform root cause analysis in the context of other service KPIs.

  1. Select the anomalous notable event in Episode Review.
  2. Under Drilldowns, click the Drilldown to <service_name> link. A deep dive opens with an overlay of the anomaly in the KPI lane.
  3. (Optional) Add additional KPIs to the deep dive for contextual analysis of the anomalous event.

For more information, see Add anomaly overlays in the Splunk ITSI User Manual.

Set max entity limit

The entity cohesion anomaly detection algorithm supports a maximum of 30 entities per KPI. If you run KPI analysis against a KPI that has more than 30 entities, a warning message appears stating that the KPI has too many entities, and the KPI is not recommended.

If you want to lower the maximum number of entities at which KPI analysis triggers a warning message, lower the value of metrics_maximum in the [cohesive] stanza of $SPLUNK_HOME/etc/apps/SA-ITSI-MetricAD/local/mad.conf. For example:

[cohesive]
metrics_maximum = 15

Change memory configuration in SA-ITSI-MetricAD

The default memory configuration for anomaly detection is 1GB, which supports up to 1000 KPIs for trending analysis or up to 1000 metrics for cohesive analysis. To support more than 1000 KPIs for trending analysis or 1000 metrics for cohesive analysis, increase the Java heap size in $SPLUNK_HOME/etc/apps/SA-ITSI-MetricAD/local/commands.conf on your search heads. To determine how much memory to allocate, size your analysis requirements first, then calculate your memory needs from those requirements.

Trending anomaly detection requires about 600MB per 1000 KPIs. Cohesive anomaly detection requires about 1GB per 1000 metrics (for example, a combination of 10 KPIs with 100 entities or 20 KPIs with 50 entities).
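
For example, based on these per-unit figures, running trending detection on 2000 KPIs and cohesive detection on 2000 metrics (say, 20 KPIs split across 100 entities each) would call for roughly (2 x 600MB) + (2 x 1GB), or about 3.2GB of heap. This is only an illustration derived from the estimates above.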

These recommendations are guidelines only. Memory usage for anomaly detection can be influenced by many factors, such as the size of historical data, algorithm configurations, and available CPU.

On each search head:

  1. Go to $SPLUNK_HOME/etc/apps/SA-ITSI-MetricAD/local/.
  2. Edit commands.conf.
  3. In the [MAD] stanza, increase the heap value in command.arg.1=-J-Xmx1G (see the example after these steps).
  4. Restart Splunk Enterprise.
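
For example, the following commands.conf setting doubles the default heap to 2GB. The 2GB value is only an illustration; choose a value based on the sizing guidance above.

[MAD]
command.arg.1=-J-Xmx2G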