Docs » Built-in alert conditions » Outlier Detection

Outlier Detection 🔗

Outlier Detection alerts when a signal is significantly different from its peers in the same time period. Use this condition to identify inconsistent behavior among a population of emitters (within the same time period), such as which node in a cluster is using more CPU than the others.

Note

To compare current signal values to past values of the same signal, use Sudden Change or Historical Anomaly.

Example 🔗

Use this condition to determine if you have not added a host to your load balancer, or if there is a problem between the host and the load balancer. For example, if a metric tracks requests routed to a host in a load balancer, trigger an outlier alert when, for example, the value of the metric is more than 2.5 standard deviations below the mean of similar signals for 80% of 5m.

Basic settings 🔗

Parameter

Values

Notes

Alert when

Too high, Too low, Too high or Too low

Trigger Sensitivity

Low, Medium, High, Custom

Approximately how often alerts are triggered, where Low can result in fewer alerts being triggered and alerts taking longer to clear (least flappy). Choose Custom to modify the settings that determine triggering and clearing sensitivity (listed below).

Advanced settings 🔗

Paramter

Values

Notes

Define thresholds by

Deviations from norm, Norm plus percentage change

Whether to express comparison in terms of a statistic (number of deviations) or a percentage

Normal based on (when Define thresholds by is Deviations from norm)

Mean plus standard deviation, Median plus median absolute deviation

Median plus median absolute deviation is recommended for small populations (<15).

Normal defined by (when Define thresholds by is Norm plus percentage change)

Mean,` Median

Median is less influenced by extreme values.

(Optional) Group by

Dimension or property chosen from dropdown menu

Use a dimension or property when you want the norm to be different according to the different values of the dimension or property. For example, if you choose aws_availability_zone and your zones are US-east and US-west, instances in US-east are being compared only to other instances in US-east, and likewise for US-west. If you choose None, there is one norm, and all members are compared to this norm.

Trigger threshold and Clear threshold (when Define thresholds by is Deviations from norm)

Number >= 0; Clear threshold must be lower than Trigger threshold.

The number of deviations away from the norm required to trigger an alert.

For example, a trigger value of 3.5 triggers an alert when the values being compared differ from the norm by 3.5 standard deviations or more. Higher values result in lower sensitivity and potentially fewer alerts.

A clear value of 2.5 clears the alert when the values being compared differ by 2.5 standard deviations or less. Higher values result in alerts taking longer to clear.

Trigger threshold and Clear threshold (when Define thresholds by is Norm plus percentage change)

Number between 0 and 100, inclusive; Clear threshold must be lower than Trigger threshold.

The percentage change required to trigger or clear the alert.

For example, a trigger value of 30 triggers an alert when the values being compared differ by 30% or more. Higher values result in lower sensitivity and potentially fewer alerts.

A clear value of 20 clears the alert when the values being compared differ by 20% or less. A gap between Trigger threshold and Clear thresholds results in alerts taking longer to clear.

Trigger duration

Percent: Integer between 1 and 100; Time indicator: Integer >= 1, followed by time indicator (s, m, h, d, w). For example, 30s, 10m, 2h, 5d, 1w.

The number of times the signal must meet the trigger threshold, compared to the number of expected data points. Higher percentages and/or longer time periods result in lower sensitivity and potentially fewer alerts. For more information about this option, see The Duration option.

Clear duration

Percent: Integer between 1 and 100; Time indicator: Integer >= 1, followed by time indicator (s, m, h, d, w), For example, 30s, 10m, 2h, 5d, 1w.

The number of times the signal must meet the clear threshold, compared to the number of expected data points. Higher percentages and/or longer time periods result in longer times for alerts to clear, increasing confidence that the alert condition is in fact no longer occurring. For more information about this option, see The Duration option.

The Duration option 🔗

The Trigger duration and Clear duration options are used to trigger or clear alerts based on how many signals met the threshold during the specified time window, compared to how many were expected.

  • Specifying 100% means that all expected data points arrived (there were no delayed or missing data points) and all met the threshold. In other words, if you specify 100% of a time range, an alert isn’t triggered if any data points are delayed or do not arrive at all during that time range, even if all the data points that are received do meet the threshold. (For more information about delayed or missing data points, see Handle delayed or missing data points.)

    Tip

    To specify that an alert triggers immediately, specify 100% of 1 second for infrastructure detectors, and 100% of 10 seconds for µAPM detectors. If the signal resolution is greater than the value you enter, a message indicates that you need to change it to at least the value of the signal resolution.

  • Specifying a percentage below 100 has a few effects:

    • For the Alert threshold, a lower percentage is more sensitive (might trigger more alerts) than using 100%, because fewer signals are needed to trigger an alert. Also, it can trigger alerts even if some data points are missing, as long as the required number of anomalous signals arrive.

    • For the Clear threshold, it can clear alerts more quickly than using 100%, because fewer signals are needed to trigger the clear condition. Also, it can clear an alert even if some data points are missing, as long as the required number of non-anomalous signals arrive.

The following examples illustrate how this option affects triggering and clearing alerts in various situations.

Alert example 1 🔗

  • Percent of duration you specify: 100% of 10 minutes

  • Resolution of the signal: 10 seconds

  • Number of data points expected in 10 minutes: 6 per minute * 10 minutes (60)

  • Number of anomalous data points (how many times the threshold must be met) to trigger alert: 100% of 60 (60)

    Total data points expected

    Total data points received

    Anomalous data points required

    Anomalous data points received

    Alert is triggered?

    60

    60

    60

    60

    Yes

    60

    60

    60

    59 or fewer

    No

    60

    59

    60

    59

    No

    Note that in the last example above, even though 100% of the data points that arrived were anomalous, the required number of anomalous data points (60) did not arrive. Therefore, the alert isn’t triggered. The percent you specify represents percent of expected data points, not percent of received data points.

Alert example 2 🔗

  • Percent of duration you specify: 80% of 10 minutes

  • Resolution of the signal: 10 seconds

  • Number of data points expected in 10 minutes: 6 per minute * 10 minutes (60)

  • Number of anomalous data points (how many times the threshold must be met) to trigger alert: 80% of 60 (48)

    Total data points expected

    Total data points received

    Anomalous data points required

    Anomalous data points received

    Alert is triggered?

    60

    60

    48

    48-60

    Yes

    60

    50

    48

    48-50

    Yes

    60

    50

    48

    47

    No

    Note that in the last example above, even though 47/50 is greater than the 80% you specified, the required number of anomalous data points (48) did not arrive. Therefore, the alert isn’t triggered. The percent you specify represents percent of expected data points, not percent of received data points.

Clear example 1 🔗

  • Percent of duration you specify: 100% of 15 minutes

  • Resolution of the signal: 30 seconds

  • Number of data points expected in 15 minutes: 2 per minute * 15 minutes (30)

  • Number of anomalous data points (how many times the threshold must be met) to clear alert: 100% of 30 (30)

    Total data points expected

    Total data points received

    Normal data points required

    Normal data points received

    Alert is cleared?

    30

    30

    30

    30

    Yes

    30

    30

    30

    29 or fewer

    No

    30

    25

    30

    25

    No

    Note that in the last example above, even though 100% of the data points that arrived were anomalous, only 35 out of the 36 expected data points arrived. Therefore, the alert isn’t cleared. The percent you specify represents percent of expected data points, not percent of received data points.

Clear example 2 🔗

  • Percent of duration you specify: 50% of 15 minutes

  • Resolution of the signal: 30 seconds

  • Number of data points expected in 15 minutes: 2 per minute * 15 minutes (30)

  • Number of anomalous data points (how many times the threshold must be met) to clear alert: 50% of 30 (15)

    Total data points expected

    Total data points received

    Normal data points required

    Normal data points received

    Alert is cleared?

    30

    30

    15

    15-30

    Yes

    30

    20

    15

    15-20

    Yes

    30

    20

    15

    14

    No

    Note that in the last example above, even if 14 anomalous data points arrive, and 14/15 is greater than the 50% you specified, the required number of anomalous data points (15) did not arrive. Therefore, the alert isn’t triggered. The percent you specify represents percent of expected data points, not percent of received data points.

Further reading 🔗

Parameters

Remarks

Alert when

The setting “Too high or Too low” triggers an alert for a signal that oscillates between above and below the bands (provided of course it spends enough time outside of the band).

Trigger and clear duration

Set these parameters to be significantly larger than native resolution.

Trigger threshold and Outlier algorithm

Mean plus standard deviation never triggers an alert for n standard deviations if n^2 + 1 is greater than or equal to the size of the population being monitored. Therefore, Median plus median absolute deviation is recommended for small populations (n <  15).

Trigger threshold and clear threshold

These produce dynamic thresholds, which can be somewhat disorienting. For example, an alert can be triggered when the signal value is 31.4 (units of the original metric, not deviations) and clear when the value is 55.1 (because the rest of the population now also shows elevated values).