Best practices for creating detectors in Splunk Observability Cloud

Splunk Observability Cloud uses detectors to set conditions that determine when to send an alert or notification to the appropriate team members. Detectors evaluate metric time series against a specified condition, and optionally for a duration. When a condition is met, detectors generate events with a level of severity. Severity levels are Info, Warning, Minor, Major, and Critical. These events are alerts that can trigger notifications in incident management platforms, such as PagerDuty, or messaging systems, such as Slack or email.

Using static thresholds

The most basic kind of alert triggers immediately when a simple metric crosses a static threshold. For example, a detector can trigger whenever CPU utilization goes above 70%. Fixed thresholds are easy to implement and interpret when there are absolute goals to measure against. For example, if you know the typical memory per CPU profile of a certain application, you can set bounds that describe its normal state. Or, if you have a business requirement to serve requests within a certain time period, you know what latency is unacceptable for that function. See Static threshold for more information.
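The threshold-plus-duration idea can be sketched in plain Python. This is a minimal illustration, not the Splunk Observability Cloud API: the function name `static_threshold_alerts`, the `min_consecutive` parameter, and the sample values are all assumptions made for the example.

```python
def static_threshold_alerts(values, threshold, min_consecutive=1):
    """Return the indices at which an alert would fire.

    An alert fires when the signal has stayed strictly above `threshold`
    for `min_consecutive` points in a row (1 means "immediately").
    """
    alerts = []
    run = 0  # consecutive points above the threshold so far
    for i, value in enumerate(values):
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            alerts.append(i)  # fire once per run, when the duration is first met
    return alerts


# Hypothetical CPU utilization samples, in percent.
cpu_utilization = [55, 72, 68, 75, 80, 90, 60]

immediate = static_threshold_alerts(cpu_utilization, threshold=70)
sustained = static_threshold_alerts(cpu_utilization, threshold=70, min_consecutive=3)
print(immediate, sustained)
```

With no duration, the brief excursion to 72 triggers an alert. Requiring three consecutive points above 70 suppresses it and fires only on the sustained rise.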

Consistent signal types

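For a detector to work properly, the signal it evaluates must represent a consistent type of measurement. For example, when Splunk Observability Cloud reports cpu.utilization, the value is between 0 and 100 and represents the average utilization across all CPU cores of a single Linux instance or host.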

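Avoid wildcards where possible. If you do use a wildcard in a metric name, make sure it does not inadvertently include metrics of different types. For example, if you enter jvm.* as the metric name, the detector might evaluate jvm.heap, jvm.uptime, and jvm.cpu.load (assuming each is a metric name used in your organization) against the same threshold, which can lead to unexpected results.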
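The wildcard pitfall can be reproduced with Python's standard `fnmatch` module. This is only an approximation: shell-style matching is assumed here, and the wildcard semantics in Splunk Observability Cloud may differ. The metric list is hypothetical, built from the names in the example above.

```python
from fnmatch import fnmatchcase

# Hypothetical metric names in an organization.
metric_names = ["jvm.heap", "jvm.uptime", "jvm.cpu.load", "cpu.utilization"]


def matching_metrics(pattern, names):
    """Return the metric names selected by a shell-style pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# One pattern pulls in a size, a duration, and a load figure together.
wildcard_matches = matching_metrics("jvm.*", metric_names)

# An exact name selects a single, consistent type of measurement.
exact_matches = matching_metrics("jvm.cpu.load", metric_names)

print(wildcard_matches)
print(exact_matches)
```

A single threshold cannot be meaningful for heap size, uptime, and CPU load at once, which is why a detector on the wildcard result behaves unpredictably.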

Viewing at native data resolution

A common and easy way to create a detector is to first create a chart, which lets you visualize the behavior of the signal you want to alert on, then convert it to a detector. See Create a detector from a chart to learn how. If you choose to use this method to create a detector, make sure you are visualizing the data at its native resolution, as this gives you the most accurate picture of the data that your detector evaluates. For example, if you create a detector using a metric that reports once every 10 seconds, make sure the time range for your chart is small enough (say, 15 minutes) to see individual measurements every 10 seconds.

By default, Splunk Observability Cloud chooses a chart display resolution that fits within the time range you choose, and summarizes the data to match that resolution. For example, if you use a metric that reports every 10 seconds, but you look at a 1-day window, then by default the data you see on the chart represents 30-minute intervals. Depending on the rollup or summarization method, this could mean that any peaks or dips average out, which gives you an inaccurate understanding of your signal and what constitutes an appropriate detector threshold. Also, analytics pipelines are applied to the rolled-up data, so the meaning of a calculation might change if the resolution changes. For example, duration parameters, which you can use for timeshifting and smoothing data, have no effect when they are smaller than the resolution.
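The effect of summarization on peaks can be checked with a small calculation. The sketch below is a simulation with made-up data, not a description of how Splunk Observability Cloud computes rollups: it takes one 30-minute window of 10-second data points, places a one-minute spike in it, and compares an average rollup with a maximum rollup.

```python
from statistics import mean

REPORT_INTERVAL_S = 10       # native resolution of the metric
ROLLUP_INTERVAL_S = 30 * 60  # chart resolution for a 1-day window
POINTS_PER_ROLLUP = ROLLUP_INTERVAL_S // REPORT_INTERVAL_S  # 180 points

# Hypothetical signal: steady 20% with one minute (6 points) at 95%.
native = [20.0] * POINTS_PER_ROLLUP
native[60:66] = [95.0] * 6


def rollup(points, size, summarize):
    """Summarize consecutive chunks of `size` points into one value each."""
    return [summarize(points[i:i + size]) for i in range(0, len(points), size)]


average_rollup = rollup(native, POINTS_PER_ROLLUP, mean)
max_rollup = rollup(native, POINTS_PER_ROLLUP, max)

print(average_rollup)  # the spike is averaged away
print(max_rollup)      # the spike survives
```

At native resolution the signal clearly exceeds a threshold of 70, but the averaged 30-minute point sits far below it, so a threshold chosen from the coarse chart would not reflect what the detector evaluates.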

Create detectors that monitor a single signal across a population

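Splunk Observability Cloud provides a simple and concise way to define detectors that monitor a large number of similar items, such as the CPU utilization of all hosts in a given cluster. This is made possible by the metadata associated with metric time series, in much the same way that this metadata (dimensions, properties, and tags) is used to create charts.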

Let’s look at an example. If you have a group of 30 hosts that provide a clustered service like Kafka, it normally includes a dimension like service:kafka with all of the metrics coming from those hosts. In this case, if you want to track whether CPU utilization remains below 70% for each of those hosts, you can create a single detector for the cpu.utilization metric that filters hosts using the service:kafka dimension and evaluates them against the static threshold of 70. This detector triggers individual alerts for each host whose CPU utilization exceeds the threshold - just as if you had 30 separate detectors - but you only need to create one detector, not 30.

In addition, if the population changes - say, because the cluster grows to 40 hosts - you do not need to make any changes to your detector. As long as you include the service:kafka dimension for metrics coming from the new hosts, the existing detector finds them and automatically includes them in the threshold evaluation.
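The Kafka example can be modeled as a filter followed by a per-member threshold check. This is a simplified Python simulation rather than SignalFlow or the Splunk API: the host names, values, and the helpers `make_point` and `hosts_over_threshold` are hypothetical.

```python
def make_point(host, service, value):
    """Build one hypothetical cpu.utilization data point with its dimensions."""
    return {"host": host, "dimensions": {"service": service}, "value": value}


def hosts_over_threshold(points, dimension, wanted, threshold):
    """Filter points by a dimension, then evaluate each host on its own."""
    return sorted(
        p["host"]
        for p in points
        if p["dimensions"].get(dimension) == wanted and p["value"] > threshold
    )


# 30 Kafka hosts, two of them running hot, plus one unrelated host.
points = [make_point(f"kafka-{i:02d}", "kafka", 50.0) for i in range(30)]
points[3]["value"] = 82.0
points[17]["value"] = 91.0
points.append(make_point("web-01", "web", 99.0))  # excluded by the filter

alerts_before = hosts_over_threshold(points, "service", "kafka", 70)

# The cluster grows to 40 hosts; the detector definition does not change.
points += [make_point(f"kafka-{i:02d}", "kafka", 50.0) for i in range(30, 40)]
points[-1]["value"] = 75.0

alerts_after = hosts_over_threshold(points, "service", "kafka", 70)
print(alerts_before, alerts_after)
```

Each host is still evaluated individually, so the result matches what 30 (or 40) separate detectors would report, and the new hosts are picked up only because they carry the same service dimension.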

Detectors that monitor a single signal work best when all of the members of the population have the same threshold, and the same notification policy. For example, they might publish alerts into the same Slack channel. If you have different thresholds or notification policies, you must create multiple detectors (one for each permutation of threshold and notification) or take advantage of the const function in SignalFlow. In any case, the likely number of such detectors is still fewer than the count of individual members that it monitors. It is important to create a detector for a signal, not for a microservice, in order to avoid accumulating too many detectors that trigger a multitude of alerts.

Use aggregation to monitor sub-groups within a population

You can also use detectors to monitor sub-groups within the population. For example, let’s say you have 100 hosts in total, divided among 10 services. You want to make sure the 95th percentile of CPU utilization across the cluster of hosts that provide each of those services remains below 70%. In this case, create a single detector for cpu.utilization, then apply an analytics function of P95, and group by service. The aggregation approach works only if service is a dimension or property. The aggregation approach does not work if service is a tag.

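This aggregation detector triggers alerts for each service, just as if you had set up 10 separate detectors, but you need to create only 1 detector, not 10. If you add a service, the detector automatically monitors it as long as the metrics for the new service include the service dimension or property.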
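Grouping by service and applying a percentile can be sketched as follows. This is an illustrative Python simulation, not SignalFlow. It assumes the nearest-rank percentile definition, which may differ from the method Splunk Observability Cloud uses, and it shrinks the example to two hypothetical services with 20 hosts each.

```python
from collections import defaultdict


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def p95_by_group(points, group_key):
    """Aggregate values per group, then take the 95th percentile of each."""
    groups = defaultdict(list)
    for point in points:
        groups[point[group_key]].append(point["value"])
    return {name: percentile_nearest_rank(vals, 95) for name, vals in groups.items()}


# Hypothetical data: "api" has one hot host, "db" has three.
points = (
    [{"service": "api", "value": 40.0}] * 19
    + [{"service": "api", "value": 99.0}]
    + [{"service": "db", "value": 60.0}] * 17
    + [{"service": "db", "value": 85.0}] * 3
)

p95_by_service = p95_by_group(points, "service")
alerting_services = sorted(s for s, v in p95_by_service.items() if v > 70)
print(p95_by_service, alerting_services)
```

One alert is produced per service, not per host: a single hot host in "api" does not move its P95 above 70, while three hot hosts in "db" do.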

You can also use the built-in Outlier Detection alert condition to monitor individual members of a population for deviation from the population norm, optionally grouped by a dimension or property. For the population_comparison detector in the signalflow-library on GitHub, see https://github.com/signalfx/signalflow-library/tree/master/library/signalfx/detectors/population_comparison.
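The general idea of comparing each member with the population norm can be illustrated with a mean and standard deviation check. This sketch is not the population_comparison implementation from signalflow-library: the function `outliers_above`, the `num_stddevs` parameter, and the data are assumptions for the example.

```python
from statistics import mean, pstdev


def outliers_above(values_by_member, num_stddevs=2.0):
    """Return members whose value exceeds mean + num_stddevs * stddev."""
    values = list(values_by_member.values())
    limit = mean(values) + num_stddevs * pstdev(values)
    return sorted(m for m, v in values_by_member.items() if v > limit)


# Hypothetical CPU utilization: nine hosts at 50%, one at 90%.
cpu_by_host = {f"host-{i}": 50.0 for i in range(9)}
cpu_by_host["host-9"] = 90.0

flagged = outliers_above(cpu_by_host)
print(flagged)
```

Unlike a static threshold, the limit here moves with the population, so the same detector keeps working if every host's baseline shifts together.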

This page was last updated on November 12, 2024.