Splunk® IT Service Intelligence

Administer Splunk IT Service Intelligence

Apply adaptive thresholds to a KPI in ITSI

Adaptive thresholding uses machine learning techniques to analyze historical data and determine what should be considered normal in your IT environment. Because the shape of your data can vary dramatically, ITSI supports standard deviation, quantile, and range-based thresholds. Adaptive thresholds are recalculated automatically each night so that slow changes in behavior don't trigger false alerts.

By dynamically calculating time-dependent thresholds, adaptive thresholding lets operations teams match alerts more closely to the expected workload on an hour-by-hour basis.

When to use adaptive thresholds

Consider the following guidelines when deciding whether to enable adaptive thresholding for a KPI:

  • Because adaptive thresholding looks for historic patterns in your data, it is best to enable it for KPIs that have established baselines of data points and show a pattern or trend over time.
  • Make sure your historic data isn't too random. If the historic data is noisy, a pattern will be difficult to detect.

How are thresholds calculated?

When you click Apply Adaptive Thresholding for a KPI, ITSI calculates the thresholds from the specific KPI that you're previewing. If you like the calculated thresholds, you can save the template and use it. However, if you apply that same template to other KPIs whose results fall in a different range, such as 0-100 versus 1000-2000, the thresholds are not immediately useful. Instead, the preview displays the other KPI's results against the thresholds calculated for the original KPI.

To ensure that other KPIs use a threshold template with adaptive thresholding correctly, you can do one of the following:

  • Immediately calculate the thresholds for each individual KPI using the template by clicking Apply Adaptive Thresholding.
  • Apply the template and let the modular input that runs at midnight calculate and update the thresholds. When the modular input runs, it updates the KPI's local copy of the thresholding template.
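The mismatch described above can be illustrated with a short Python sketch. It is not ITSI code: the `severity_for` function, the threshold values, and the sample KPI results are all hypothetical, and the two-threshold layout simply mirrors a policy with one upper and one lower bound.

```python
def severity_for(value, high_above, medium_below):
    """Classify one KPI result against an upper and a lower threshold."""
    if value > high_above:
        return "high"
    if value < medium_below:
        return "medium"
    return "normal"


# Thresholds calculated while previewing KPI A, whose results fall in 0-100.
kpi_a_thresholds = {"high_above": 95, "medium_below": 5}

# Hypothetical results from KPI B, whose results fall in 1000-2000.
kpi_b_results = [1200, 1500, 1800]

# Reusing KPI A's thresholds flags every KPI B result as "high".
stale = [severity_for(v, **kpi_a_thresholds) for v in kpi_b_results]

# After thresholds are recalculated from KPI B's own data (values assumed),
# the same results are classified as "normal".
kpi_b_thresholds = {"high_above": 1950, "medium_below": 1050}
fresh = [severity_for(v, **kpi_b_thresholds) for v in kpi_b_results]
```

Until one of the two recalculation options runs for the second KPI, its preview behaves like the `stale` case.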

Each night at midnight, ITSI calculates adaptive threshold values for a KPI by organizing the data from the training window into distinct buckets and then analyzing each bucket separately. ITSI creates empty buckets for each time block of the threshold template you select. Each bucket is then populated with all data points from that time block for every day of the training window.

After all the data from the training window is distributed and organized into buckets, ITSI applies the selected algorithm to each bucket of data to calculate the threshold value. For descriptions of each algorithm type, see Create KPI threshold time policies in ITSI.
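The bucket-then-calculate flow can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ITSI's implementation: the function names, the `(time_block, value)` input shape, and the formulas (mean plus a multiple of the standard deviation, a percentile, and a fraction of the min-max span) are chosen for the example only.

```python
from collections import defaultdict
from statistics import mean, pstdev, quantiles


def bucket_by_time_block(points):
    """Group (time_block, value) pairs into one bucket per time block."""
    buckets = defaultdict(list)
    for block, value in points:
        buckets[block].append(value)
    return dict(buckets)


def stdev_threshold(values, multiplier):
    """Assumed form: mean + multiplier * population standard deviation."""
    return mean(values) + multiplier * pstdev(values)


def quantile_threshold(values, percentile):
    """Value at the given whole-number percentile (1-99) of the bucket."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[percentile - 1]


def range_threshold(values, fraction):
    """Assumed form: min + fraction * (max - min)."""
    low, high = min(values), max(values)
    return low + fraction * (high - low)


def thresholds_per_block(points, algorithm, parameter):
    """Apply one algorithm independently to every bucket."""
    return {
        block: algorithm(values, parameter)
        for block, values in bucket_by_time_block(points).items()
    }
```

For example, `thresholds_per_block(points, quantile_threshold, 95)` returns one 95th-percentile value per time block, each computed only from the data points that fell into that block. Each bucket needs at least two data points for the quantile calculation.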

Performance considerations

Adaptive thresholds are automatically recalculated on a nightly basis so that gradual changes in behavior don't trigger false alerts. Each time a threshold is recalculated, the service must be re-saved into the itsi_services KV store collection. This process can moderately impact performance, and the impact increases with the number of services that contain KPIs with adaptive thresholding enabled.

The individual threshold calculations performed for each KPI do not have a significant impact on ITSI performance.

Prerequisites

  • You must have the write_itsi_kpi_threshold_template capability to apply adaptive thresholds to a KPI. The itoa_admin role is assigned this capability by default.
  • Before you apply adaptive thresholds, decide which algorithm best fits your data based on the descriptions in Create KPI threshold time policies in ITSI.

Apply adaptive thresholds to a KPI

The following scenario walks through the process of configuring adaptive thresholding for a KPI. The sample KPI represents logins to a web server that exhibits different behaviors each day of the week and each hour of the day.

In this example, the training window is 7 days. However, as you define shorter and shorter time policies, you might need to increase the window to 14, 30, or 60 days so that each short time block contains enough data points to generate meaningful threshold values.
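A rough way to judge whether a training window is long enough is to estimate how many data points land in each time block. The helper below is a hypothetical back-of-the-envelope sketch: the function name and parameters are not ITSI settings, and it assumes the KPI produces one data point per search interval and counts only complete weeks.

```python
def points_per_time_block(training_days, days_per_week, block_hours, kpi_interval_minutes):
    """Estimate how many KPI data points fall into one time-block bucket.

    training_days: length of the training window in days
    days_per_week: number of weekdays the time block applies to (2 for a weekend block)
    block_hours: length of the time block in hours
    kpi_interval_minutes: how often the KPI produces a data point (assumed)
    """
    complete_weeks = training_days // 7
    points_per_day = block_hours * 60 // kpi_interval_minutes
    return complete_weeks * days_per_week * points_per_day
```

With a KPI that runs every 5 minutes, a one-hour block that applies Monday through Friday collects 60 points in a 7-day window, while a one-hour weekend-only block collects 24. If the KPI runs every 15 minutes instead, the same weekday block collects only 20 points, which is the kind of case where a longer window helps.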

To begin your policy configuration, you must decide on the severity parameters for the chosen adaptive thresholding algorithm that align with your severity definitions. You've determined the following information about your data:

  • Quantile is the right algorithm for this KPI.
  • >95% is the high threshold.
  • <5% is the medium threshold.

AT1.png

You click Apply Adaptive Thresholding.

ATPreview1.png

The first thing you're likely to notice when looking at the week-long KPI graph is that certain times of the day or days of the week are predictably different from others. Perhaps mornings differ from afternoons, or weekends differ from weekdays. These variations are almost always explainable and expected, but you should work with the service owners to confirm.

Presuming the variation is expected, the next step is to create a time policy to encapsulate that difference. In your case, you expect weekend traffic to your site to be very light. You start by separating weekend traffic from the work week with a new time policy. Apply the same adaptive threshold algorithm and severity values to your new time policy, and apply adaptive thresholds again.

AT2.png

ATPreview2.png

ITSI uses only the historical data points within each time policy to determine that policy's threshold values, so the difference between weekend and work week traffic is now better accounted for.

It's clear that you've made improvements, but you still see problems. There are some spikes going into the red on Monday. When you consult the service team, they tell you that logins predictably spike around 8 AM and 5 PM almost every day of the work week. You can create time policies to isolate those spikes. You can also create time policies to isolate the work week evenings, when things are quieter.

AT3.png

ATPreview3.png
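The resulting set of time policies can be thought of as a mapping from a timestamp to a policy name. The Python sketch below illustrates that idea only. The policy names and the exact hour boundaries are assumptions made for the example, since the scenario states only that spikes occur around 8 AM and 5 PM and that evenings are quieter.

```python
from datetime import datetime


def time_policy_for(ts):
    """Return the name of the hypothetical time policy that covers a timestamp."""
    if ts.weekday() >= 5:            # Saturday is 5, Sunday is 6
        return "weekend"
    if ts.hour == 8:                 # assumed 8:00-8:59 login spike
        return "weekday_morning_spike"
    if ts.hour == 17:                # assumed 17:00-17:59 login spike
        return "weekday_evening_spike"
    if ts.hour >= 18 or ts.hour < 6: # assumed quiet hours
        return "weekday_quiet_hours"
    return "weekday_default"
```

Each returned name corresponds to one bucket of training data, so every policy added this way should be justified by a behavior the service team can explain.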

The thresholds might not be perfect and you'll probably have to continue this process to create the right number of time policies. However, you've applied a methodical approach and can justify the purpose of each time policy.

Configure alerts for abnormal thresholds

After you apply adaptive thresholds to your KPIs, consider which alert configurations make sense for transforming abnormal KPI results into actionable alerts.

The following are two common alert strategies:

  • Alert when a KPI is exhibiting extremely abnormal behavior.
  • Alert when multiple KPIs are simultaneously exhibiting abnormal behavior.

However, depending on the algorithm and threshold values you choose, you might not be able to determine that a KPI is extremely abnormal. This is particularly true if you select a quantile algorithm. If that's the case, consider alerting when a KPI spends an excessive amount of time in an abnormal state.
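The "excessive amount of time in an abnormal state" condition can be sketched as a sliding-window count. The Python function below is an illustration of the idea, not an ITSI alert definition: the function name, the parameters, and the choice of which severity labels count as abnormal are assumptions.

```python
def sustained_abnormal(severities, window, min_abnormal,
                       abnormal=("medium", "high", "critical")):
    """Return True if enough of the most recent samples are abnormal.

    severities: severity labels for one KPI, oldest first
    window: how many of the most recent samples to inspect
    min_abnormal: how many of those samples must be abnormal to alert
    """
    recent = severities[-window:]
    if len(recent) < window:
        # Not enough history yet to evaluate the full window.
        return False
    count = sum(1 for severity in recent if severity in abnormal)
    return count >= min_abnormal
```

For a KPI sampled every 5 minutes, `sustained_abnormal(history, window=12, min_abnormal=6)` would flag a KPI that spent at least half of the last hour outside its normal range.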

When creating alerts based on the normalcy of multiple KPIs, try to identify two or three KPIs that are highly indicative of service health and create an alert only when most or all of them start to exhibit abnormal behavior. For example, to identify looming service issues, you might alert based on abnormal results from KPIs for count of errors in a log file and the number of successful logins.

You might also want to consider looking at multiple KPIs across two or more critical tiers of your service. For instance, if you're seeing abnormal error counts in your web tier and your services tier, you might have an issue.
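The multi-KPI strategy can be sketched the same way. The Python function below is a hypothetical illustration: the function name, the KPI names, and the default of requiring at least two abnormal KPIs are assumptions chosen to match "most or all" of a set of two or three KPIs.

```python
def multi_kpi_alert(kpi_abnormal, min_abnormal=2):
    """Return True when enough KPIs are abnormal at the same time.

    kpi_abnormal: dict mapping a KPI name to True if it is currently abnormal
    min_abnormal: how many KPIs must be abnormal simultaneously
    """
    abnormal_count = sum(1 for is_abnormal in kpi_abnormal.values() if is_abnormal)
    return abnormal_count >= min_abnormal


# Hypothetical snapshot of KPIs across two tiers of a service.
snapshot = {
    "web_tier_error_count": True,
    "services_tier_error_count": True,
    "successful_logins": False,
}
should_alert = multi_kpi_alert(snapshot)
```

Requiring agreement between KPIs trades some sensitivity for fewer false alerts, because a single noisy KPI cannot trigger the alert on its own.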

This documentation applies to the following versions of Splunk® IT Service Intelligence: 4.2.0, 4.2.1, 4.2.2, 4.2.3, 4.3.0, 4.3.1

