Create adaptive KPI thresholds in ITSI

Adaptive thresholding in IT Service Intelligence (ITSI) uses machine learning techniques to analyze your data and automatically adjust threshold values based on historical data and current conditions. Unlike static thresholds that don't account for changes in your environment, adaptive thresholds account for varying patterns in data over time. Since data patterns can change dramatically, ITSI supports standard deviation, quantile, range-based, and percentage thresholds. The adaptive thresholds automatically recalculate on a nightly basis so that changes in KPI behavior don't trigger false alerts.

By dynamically calculating time-dependent thresholds, adaptive thresholding allows operations to more closely match alerts to the expected workload on an hour-by-hour basis.

Adaptive thresholds are intended to analyze and predict behavior for KPIs only. You currently can't perform adaptive thresholding on a per-entity basis. If you want to perform thresholding at the per-entity level, you must use the standard thresholding procedure. For instructions, see Set per-entity thresholds.

Prerequisites

You must have the write_itsi_kpi_threshold_template capability to apply adaptive thresholds to a KPI. The itoa_admin role is assigned this capability by default.
Before you apply adaptive thresholds, decide which algorithm you want or need based on the descriptions in Create time-based static KPI thresholds in ITSI.

When to use adaptive thresholds

Consider the following guidelines when deciding whether to enable adaptive thresholding for a KPI:

Because adaptive thresholding looks for historic patterns in your data, it is best to enable it for KPIs that have established baselines of data points and show a pattern or trend over time.
Make sure your historic data isn't too random. If the historic data is noisy, a pattern will be difficult to detect.

How are thresholds calculated?

Adaptive thresholds for a KPI are calculated based on the specific KPI that you're previewing. If you like the calculated thresholds, you can save the template and use it. However, if you apply that same template to other KPIs with different ranges of results, such as results in the 0-100 range versus 1000-2000, you won't see immediately useful thresholds. Instead, the preview displays the KPI scores with the thresholds calculated for the original KPI.

To ensure that other KPIs use a threshold template with adaptive thresholding correctly, you can do either of the following:

Calculate the thresholds for each individual KPI using the template by clicking Apply Adaptive Thresholding. Clicking Apply Adaptive Thresholding assigns the calculation for the statistical policy type you've chosen (such as standard deviation) to the selected KPI and previews the results for you in a graph on the same page. To make the calculation you've previewed part of the threshold template, click Save.
Apply the template and let the search that runs by default at midnight calculate and update the thresholds. The search updates the KPI's local copy of the threshold template.

Each night at midnight, ITSI recalculates adaptive threshold values for a KPI by organizing the data from the training window into distinct buckets and then analyzing each bucket separately. ITSI creates empty buckets for each time block of the threshold template you select. Each bucket is then populated with all data points from that time block for every day of the training window.

After all the data from the training window (the last 7 days by default) is distributed and organized into buckets, ITSI applies the selected algorithm to each bucket of data to calculate the threshold value. For descriptions of each algorithm type, see Create time-based static KPI thresholds in ITSI.
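The bucketing described above can be illustrated with a minimal Python sketch. This is not ITSI's implementation. It assumes that time blocks are fixed-width slices of the day (real threshold templates can also vary by day of week), and the three formulas are simplified stand-ins for the standard deviation, quantile, and range algorithms. All function names and the synthetic data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def bucket_by_time_block(points, block_hours=6):
    """Group (timestamp, value) pairs by time-of-day block, across all days."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp.hour // block_hours].append(value)
    return dict(buckets)


def stdev_threshold(values, multiplier):
    """Assumed form: mean plus a multiple of the standard deviation."""
    return mean(values) + multiplier * pstdev(values)


def quantile_threshold(values, q):
    """Value at quantile q (0 <= q <= 1), using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def range_threshold(values, fraction):
    """Assumed form: a fraction of the way from the minimum to the maximum."""
    low, high = min(values), max(values)
    return low + fraction * (high - low)


def calculate_thresholds(points, algorithm, parameter, block_hours=6):
    """Apply one algorithm to each bucket separately; returns {block: threshold}."""
    buckets = bucket_by_time_block(points, block_hours)
    return {block: algorithm(values, parameter)
            for block, values in sorted(buckets.items())}


def synthetic_week():
    """Hypothetical hourly KPI: 100 during the day (06:00-17:59), 10 at night."""
    start = datetime(2024, 1, 1)
    points = []
    for offset in range(7 * 24):
        timestamp = start + timedelta(hours=offset)
        points.append((timestamp, 100 if 6 <= timestamp.hour < 18 else 10))
    return points
```

Calling calculate_thresholds(synthetic_week(), range_threshold, 0.5) produces one threshold per six-hour block. Because each block is analyzed on its own, the overnight blocks get thresholds near the overnight values rather than a single value stretched across the whole day.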

Why aren't thresholds being recalculated?

By default, adaptive thresholds are recalculated each night based on the previous seven days of data. However, you might occasionally see that ITSI fails to recalculate thresholds at midnight, causing health scores to be high or critical because they're out of range. The thresholds are automatically calculated by a search called itsi_at_search_kpi_minus7d that you can run manually if you experience problems.

To run the search manually, perform the following steps:

Navigate to Settings > Searches, reports, and alerts.
Locate the itsi_at_search_kpi_minus7d search.
Click Run.
If the search fails, decrease the time range to 3600 minutes and re-run the search.
After the search runs successfully, check the adaptive thresholds in your environment and confirm that they've been recalculated.

You can reapply adaptive threshold settings for a specific KPI by selecting that KPI and then doing the following:

Click Apply Adaptive Thresholding.
Click Save.

The itsi_at_search_kpi_minus7d search calculates each individual KPI's adaptive thresholds, not the thresholds in the KPI threshold template.

Detect and remove outliers in adaptive thresholds

Unplanned service outages, service degradation, or other disruptions in your historical data can skew your adaptive thresholds. Turn on the Enable outlier exclusion toggle to choose which data points are excluded from adaptive threshold calculations.

Note: You must have a KPI with historical data and adaptive thresholding turned on to exclude the outliers.

To detect and remove outliers from your adaptive thresholds, perform the following steps:

Turn on the Enable outlier exclusion toggle. The KPI should have historical data or be backfilled in order for adaptive thresholding to work.

Small dots on the chart in the Preview Aggregate Thresholds section represent outliers that will be excluded from your historical data. Zoom out to view your data points at different time intervals. Configure the following settings:

Option: Outlier algorithm
Description: The statistical method used to identify and classify data points as outliers. The method calculates lower and upper bounds for your data, and data points that fall outside those bounds are excluded as outliers. Select the statistical method that best fits your data:

Standard Deviation
Interquartile Range
Mean absolute deviation

For more information about these algorithms, see Available KPI threshold templates.

Option: Trigger threshold
Description: Controls the sensitivity of the algorithm selected to identify outliers. Adjust this setting to decrease the outlier count if too many outliers are detected. Too many outliers can skew your thresholds and generate false alerts.

Select Preview Adaptive Thresholds to preview your adaptive threshold values with the outliers excluded.
Select Save to save your changes.
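To make the outlier options more concrete, here is a rough Python sketch of how each algorithm could derive lower and upper bounds, with the trigger threshold acting as a multiplier on the spread. It is an illustration under assumptions, not ITSI's actual calculation: the function names are hypothetical, the exact formulas ITSI uses are not given in this topic, and the sketch assumes that a larger trigger threshold widens the bounds and so flags fewer points.

```python
from statistics import mean, pstdev, quantiles


def stdev_bounds(values, trigger):
    """Bounds at `trigger` standard deviations on either side of the mean."""
    center, spread = mean(values), pstdev(values)
    return center - trigger * spread, center + trigger * spread


def iqr_bounds(values, trigger):
    """Bounds at `trigger` interquartile ranges beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - trigger * iqr, q3 + trigger * iqr


def mad_bounds(values, trigger):
    """Bounds at `trigger` mean absolute deviations on either side of the mean."""
    center = mean(values)
    spread = mean([abs(v - center) for v in values])
    return center - trigger * spread, center + trigger * spread


def exclude_outliers(values, bounds_fn, trigger):
    """Split values into (kept, outliers) using the chosen bounds function."""
    lower, upper = bounds_fn(values, trigger)
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers
```

For a hypothetical series such as [10, 11, 9, 10, 12, 10, 11, 95], the interquartile range bounds with a trigger of 1.5 flag only the spike at 95. Under the standard deviation bounds, the same spike is flagged with a trigger of 2 but not with a trigger of 3, which shows how the trigger threshold controls sensitivity in this sketch.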

Performance considerations

Adaptive thresholds are automatically recalculated on a nightly basis so that gradual changes in behavior don't trigger false alerts. Each time a threshold is recalculated, the service must be re-saved into the itsi_services KV store collection. This process can moderately impact performance. The performance impact increases if many of your services contain KPIs with adaptive thresholding enabled.

The individual threshold calculations performed for each KPI do not have a significant impact on ITSI performance.

Configure alerts for abnormal thresholds

After you apply adaptive thresholds to your KPIs, consider which alert configurations make sense for turning abnormal KPI results into actionable alerts.

The following are two common alert strategies:

Alert when a KPI is exhibiting extremely abnormal behavior.
Alert when multiple KPIs are simultaneously exhibiting abnormal behavior.

However, depending on the algorithm and threshold values you choose, you might not be able to determine that a KPI is extremely abnormal. This is particularly true if you select a quantile algorithm. If that's the case, consider alerting when a KPI spends an excessive amount of time in an abnormal state.
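As an illustration of alerting on time spent in an abnormal state, the following Python sketch finds the longest run of consecutive abnormal intervals in a KPI's severity history. It is a minimal sketch under assumptions: the function names are hypothetical, the history is a plain list with one severity label per interval, and treating "high" and "critical" as abnormal is a choice made for this example.

```python
ABNORMAL = frozenset({"high", "critical"})  # assumption: which severities count as abnormal


def longest_abnormal_run(severities, abnormal=ABNORMAL):
    """Return the length of the longest run of consecutive abnormal intervals."""
    longest = current = 0
    for severity in severities:
        current = current + 1 if severity in abnormal else 0
        longest = max(longest, current)
    return longest


def should_alert_on_duration(severities, min_consecutive, abnormal=ABNORMAL):
    """Alert only when the KPI stays abnormal for at least `min_consecutive` intervals."""
    return longest_abnormal_run(severities, abnormal) >= min_consecutive
```

With one label per 5-minute interval, a min_consecutive value of 6 would correspond to roughly 30 minutes of sustained abnormal behavior. Brief excursions that return to normal reset the count and don't alert.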

When creating alerts based on the normalcy of multiple KPIs, try to identify two or three KPIs that are highly indicative of service health and create an alert only when most or all of them start to exhibit abnormal behavior. For example, to identify looming service issues, you might alert based on abnormal results from KPIs that track the count of errors in a log file and the number of successful logins.

You might also want to consider looking at multiple KPIs across two or more critical tiers of your service. For instance, if you're seeing abnormal error counts in your web tier and your services tier, you might have an issue.
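The multi-KPI strategy can be sketched in Python as a simple count of how many selected KPIs are currently abnormal. This is only an illustration: the function names and KPI names are hypothetical, the input is a plain mapping of KPI name to current severity, and treating "high" and "critical" as abnormal is an assumption.

```python
def abnormal_kpis(kpi_severities, abnormal=frozenset({"high", "critical"})):
    """Return the names of KPIs whose current severity counts as abnormal."""
    return [name for name, severity in kpi_severities.items() if severity in abnormal]


def should_alert_on_group(kpi_severities, min_abnormal):
    """Alert only when at least `min_abnormal` of the selected KPIs are abnormal."""
    return len(abnormal_kpis(kpi_severities)) >= min_abnormal
```

For example, passing {"log_error_count": "high", "successful_logins": "critical"} with min_abnormal set to 2 would alert, while a single abnormal KPI would not. The same check could be run per tier to compare, for instance, the web tier against the services tier.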
