Introduction to alerts and detectors in Splunk Observability Cloud
A detector monitors a signal for the conditions that you want to be alerted about. You express those conditions as one or more rules, and a rule triggers an alert when its conditions are met. Each rule in a detector is labeled with a severity: Info, Warning, Minor, Major, or Critical.
For example, a detector that monitors the latency of an API call may go into a critical state when the latency is significantly higher than normal, as defined in the detector rules.
Detectors evaluate streams against a specific condition over a period of time. When you apply analytics to a metric time series (MTS), it produces a stream, which is an object in the SignalFlow query language. The MTS can contain raw data or the output of an analytics function.
When data in an input MTS matches a condition, the detector generates a trigger event and an alert that has a specific severity level. You can configure an alert to send a notification using Splunk On-Call. For more information, see the Splunk On-Call documentation.
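To make "a condition over a period of time" concrete, the following is a minimal Python sketch of how a single rule might be evaluated against a stream. It is an illustration only, not how Observability Cloud implements detectors. The names Rule and evaluate, the latency values, and the idea of measuring duration in consecutive data points are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical detector rule: not a Splunk API object."""
    severity: str     # Info, Warning, Minor, Major, or Critical
    threshold: float  # the signal must exceed this value
    duration: int     # consecutive data points the condition must hold


def evaluate(rule: Rule, stream: List[float]) -> Optional[int]:
    """Return the index at which the rule first triggers, or None."""
    run = 0  # length of the current run of points above the threshold
    for i, value in enumerate(stream):
        run = run + 1 if value > rule.threshold else 0
        if run >= rule.duration:
            return i
    return None


# Assumed sample data: API latency in milliseconds, one point per interval.
latency_ms = [120, 130, 480, 510, 530, 140]
rule = Rule(severity="Critical", threshold=400, duration=3)
triggered_at = evaluate(rule, latency_ms)
```

In this sketch the rule triggers only after latency stays above the threshold for three consecutive points, so a single spike does not raise an alert.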
Built-in content
Observability Cloud offers built-in content such as dashboards. From a dashboard, you can create charts and alerts, which include detectors. To access built-in content in Observability Cloud, go to Main Menu > Dashboards > Built-in.
Using metadata in detectors
You can use the metadata associated with MTS to make detector definitions simpler, more compact, and more resilient.
For example, if you have a group of 30 virtual machines that provide a clustered service like Kafka, you normally include the dimension service:kafka with all of the metrics coming from those virtual machines.
If you want to track whether the CPU utilization remains below 80 for each of those virtual machines, you can create a single detector that queries for the CPU utilization metrics that include the service:kafka dimension and evaluates those metrics against the threshold of 80. This single detector triggers an individual alert for each virtual machine whose CPU utilization exceeds the threshold, as if you had 30 separate detectors. You do not need to create 30 individual detectors to monitor each of your 30 virtual machines.
If the population changes, for example because the cluster has grown to 40 virtual machines, a cluster- or service-level detector continues to work without modification. As long as you have included the service:kafka dimension for the newly added virtual machines, the existing detector’s query includes all new virtual machines in the cluster in the threshold evaluation.
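The behavior described above can be sketched in Python as a filter on a dimension followed by a per-MTS threshold check. This is a simplified model under stated assumptions, not the detector engine itself: the function firing_hosts, the host dimension, the host names, and the sample values are all hypothetical.

```python
from typing import Dict, List, Tuple

# One MTS is modeled as (dimensions, latest CPU utilization value).
MTS = Tuple[Dict[str, str], float]


def firing_hosts(series: List[MTS], dim_key: str, dim_value: str,
                 threshold: float) -> List[str]:
    """Return the hosts whose MTS matches the dimension and exceeds the threshold."""
    alerts = []
    for dims, value in series:
        if dims.get(dim_key) != dim_value:
            continue  # this MTS is outside the monitored population
        if value > threshold:
            alerts.append(dims["host"])  # one alert per offending MTS
    return alerts


# Assumed sample population: two Kafka hosts and one unrelated host.
series: List[MTS] = [
    ({"host": "kafka-01", "service": "kafka"}, 72.0),
    ({"host": "kafka-02", "service": "kafka"}, 91.5),
    ({"host": "web-01", "service": "nginx"}, 95.0),
]
alerts = firing_hosts(series, "service", "kafka", 80.0)

# The cluster grows: a new host reports with the same dimension.
series.append(({"host": "kafka-31", "service": "kafka"}, 88.0))
alerts_after_growth = firing_hosts(series, "service", "kafka", 80.0)
```

The same call picks up kafka-31 after the cluster grows because the selection depends on the service:kafka dimension and not on a fixed list of hosts.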
Dynamic threshold conditions
Setting static values for detector conditions can lead to noisy alerting because the appropriate value for one service or for a particular time of day may not be suitable for another service or a different time of day. For example, if your applications or services run on elastic infrastructure, like Docker containers or EC2 autoscaling, the appropriate values for your alerts might vary by time of day.
You can define dynamic thresholds to account for changes in streaming data. For example, if your metric exhibits cyclical behavior, you can define a threshold that is a one-week timeshifted version of the same metric. If the relevant basis of comparison for your data is the behavior of a population, such as a clustered service, you can define your threshold as a value that reflects that behavior, for example the 90th percentile for the metric across the entire cluster over a moving 15-minute window. For more information, see Detectors and Alerts.
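The two kinds of dynamic threshold mentioned above can be sketched in Python as follows. This is a simplified illustration, not SignalFlow and not the product's implementation. The function names, the 1.2 ratio, the nearest-rank percentile method, and the sample values are assumptions, and the population example uses only the latest value per host instead of a moving 15-minute window.

```python
import math
from typing import Dict, List


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def above_last_week(current: float, week_ago: float, ratio: float = 1.2) -> bool:
    """Timeshift-style check: compare now with the same metric one week ago.

    The ratio of 1.2 is an arbitrary tolerance chosen for this example.
    """
    return current > week_ago * ratio


def population_outliers(latest: Dict[str, float], pct: float = 90.0) -> List[str]:
    """Hosts whose value exceeds the population's percentile threshold."""
    threshold = nearest_rank_percentile(list(latest.values()), pct)
    return sorted(host for host, value in latest.items() if value > threshold)


# Assumed sample data: ten hosts with values 51.0 through 60.0.
latest = {f"vm-{i:02d}": 50.0 + i for i in range(1, 11)}
outliers = population_outliers(latest)
spike = above_last_week(current=130.0, week_ago=100.0)
```

In both cases the threshold is derived from the data itself, so it moves with the weekly cycle or with the cluster's overall behavior instead of staying fixed.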