Use histogram metrics
Histograms are complex metric data types. A histogram data point defines a set of differently sized data buckets. Its metrics answer the following questions about the distribution of measurements (such as request durations or response sizes) across those buckets at a given point in time:
- How many measurements of a metric are less than or equal to the value that is the upper boundary of each data bucket? For example, at the given time, how many recorded request duration measurements fit into a bucket for request durations of 0.1 seconds or less? How many recorded request duration measurements fit into a bucket for request durations of 0.5 seconds or less? And so on.
- What is the sum of all of the measurements that have been recorded for the metric?
- What is the full count of all of the measurements that have been recorded for the metric?
There are many things you can do with this information.
- For a metric such as a request duration, you can use the count of request duration measurements and the sum of the measurement values to get the average request duration for a given range of time.
- You can design alerts that monitor the measurement counts for specific buckets. For example, you could set up an alert that tells you if the number of requests served within 300ms drops below 95% of the total number of requests.
- You can estimate percentile values, such as the request duration within which you have served 95% of requests.
Histogram metrics before and after indexing
To use histogram metrics in the Splunk platform, you need to ingest histogram-formatted metric data points from Prometheus or a similar metrics monitoring client, using either the HTTP Event Collector (HEC) or the Splunk Data Stream Processor (DSP).
A raw Prometheus histogram data point should look something like this before your Splunk platform implementation ingests it:
http_request_duration_seconds_bucket{le="0.05",server="ronnie",endpoint="/"} 24054 1568073334000
http_request_duration_seconds_bucket{le="0.1",server="ronnie",endpoint="/"} 33444 1568073334000
http_request_duration_seconds_bucket{le="0.2",server="ronnie",endpoint="/"} 100392 1568073334000
http_request_duration_seconds_bucket{le="0.5",server="ronnie",endpoint="/"} 129389 1568073334000
http_request_duration_seconds_bucket{le="1",server="ronnie",endpoint="/"} 133988 1568073334000
http_request_duration_seconds_bucket{le="+Inf",server="ronnie",endpoint="/"} 144320 1568073334000
http_request_duration_seconds_sum{server="ronnie",endpoint="/"} 53423 1568073334000
http_request_duration_seconds_count{server="ronnie",endpoint="/"} 144320 1568073334000
Each line of the Prometheus histogram data point has this format:
<metric_name>{<dim0>=<dim_value0>,<dim1>=<dim_value1>,...,<dimN>=<dim_valueN>} <_value> <timestamp>
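The line format above can be illustrated with a small parser. This is a minimal sketch in Python, not part of any Splunk or Prometheus library: the function name `parse_histogram_line` and the regular expressions are assumptions made for illustration, and they only handle the simple quoted-dimension form shown in the example.

```python
import re

# Hypothetical helper for illustration only. It handles the simple
# <metric_name>{<dim>="<value>",...} <_value> <timestamp> form shown above.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<dims>[^}]*)\}\s+'
    r'(?P<value>\S+)\s+(?P<ts>\d+)$'
)
DIM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_histogram_line(line):
    """Split one histogram line into metric name, value, dimensions, timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "metric_name": match.group("name"),
        "_value": float(match.group("value")),
        "dims": dict(DIM_RE.findall(match.group("dims"))),
        "timestamp": int(match.group("ts")),
    }


sample = ('http_request_duration_seconds_bucket'
          '{le="0.05",server="ronnie",endpoint="/"} 24054 1568073334000')
point = parse_histogram_line(sample)
print(point["metric_name"], point["_value"], point["dims"])
```

Each dimension becomes a separate key, which mirrors how the dimensions appear as separate columns after ingestion.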
After ingestion, the data point looks like this:
metric_name | _value | le | server | endpoint | timestamp (in milliseconds) |
---|---|---|---|---|---|
http_request_duration_seconds_bucket | 24054 | 0.05 | ronnie | / | 1568073334000 |
http_request_duration_seconds_bucket | 33444 | 0.1 | ronnie | / | 1568073334000 |
http_request_duration_seconds_bucket | 100392 | 0.2 | ronnie | / | 1568073334000 |
http_request_duration_seconds_bucket | 129389 | 0.5 | ronnie | / | 1568073334000 |
http_request_duration_seconds_bucket | 133988 | 1 | ronnie | / | 1568073334000 |
http_request_duration_seconds_bucket | 144320 | +Inf | ronnie | / | 1568073334000 |
http_request_duration_seconds_sum | 53423 | | ronnie | / | 1568073334000 |
http_request_duration_seconds_count | 144320 | | ronnie | / | 1568073334000 |
Anatomy of a histogram metric data point
Each histogram data point has three types of metrics. Each metric type says something about the measurements that have been recorded for a specific metric as of the timestamp of the histogram. Each of these metrics is also an example of an accumulating counter metric, which means their values do not decrease over time. For more information about counter metrics see Investigate counter metrics.
Metric type | Definition | Example |
---|---|---|
<metric_name>_bucket | Provides the count of measurements for a histogram bucket. Each bucket count corresponds to a value of the le dimension, which sets the upper boundary value of the bucket. | In our example metric data point, 24,054 http_request_duration_seconds measurements were less than or equal to 0.05 seconds as of the timestamp of the histogram data point. Meanwhile, 33,444 http_request_duration_seconds measurements were less than or equal to 0.1 seconds. |
<metric_name>_sum | Provides the sum of all of the <metric_name> measurements captured in this histogram data point. | In our example metric data point, the sum of all of the http_request_duration_seconds measurements included in this metric data point is 53,423 seconds. |
<metric_name>_count | Provides the count of all of the measurements captured in this data point. | Our example histogram data point represents the distribution of 144,320 http_request_duration_seconds measurements. |
Bucket count metrics and the bucket boundary dimension
There are usually several bucket count metrics in a histogram data point. They form a sequence of simultaneous measurement counts for buckets with larger and larger bucket boundaries, which are defined by the le dimension. The first bucket count only represents measurements with relatively small values. The next bucket count represents the measurements from the first count plus measurements with slightly larger values.
This sequence continues until the final bucket count, which corresponds with a le dimension value of +Inf. +Inf is shorthand for positive infinity. This means that the final bucket captures all of the measurements that were captured by the preceding bucket and any measurements that exceed the upper boundary of the preceding bucket. The count for the +Inf <metric_name>_bucket should be equal to the value of the <metric_name>_count field. Both fields provide a count of all of the measurements categorized by the histogram data point as of the histogram's timestamp.
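The cumulative relationship between buckets can be checked with a short sketch. The Python below uses the bucket counts from the example data point; the function name `per_bucket_counts` is hypothetical and exists only for this illustration.

```python
# Cumulative bucket counts from the example data point, ordered by le.
cumulative = [
    ("0.05", 24054),
    ("0.1", 33444),
    ("0.2", 100392),
    ("0.5", 129389),
    ("1", 133988),
    ("+Inf", 144320),
]
total_count = 144320  # value of http_request_duration_seconds_count


def per_bucket_counts(cumulative_buckets):
    """Convert cumulative counts into counts that fall only within each bucket."""
    counts = []
    previous = 0
    for le, running_total in cumulative_buckets:
        counts.append((le, running_total - previous))
        previous = running_total
    return counts


individual = per_bucket_counts(cumulative)
for le, count in individual:
    print(f"le={le}: {count} measurements above the previous boundary")

# The +Inf bucket and the _count metric describe the same set of measurements.
assert cumulative[-1][1] == total_count
```

Subtracting each bucket from the next shows how many measurements fall between two adjacent boundaries, which is the distribution that the cumulative counts encode.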
There are no le values for the <metric_name>_sum and <metric_name>_count fields by design.
The Prometheus client requires that the bucket-boundary dimension be named le (the field name is an abbreviation for "less than or equal to"), but the Splunk software is flexible and can use a different name if le does not fit your needs.
Using histogram metrics in searches
Because histogram metric data points contain interconnected sets of counter metrics, you use the rate(x) function in conjunction with mstats to expose the bucket distribution for a given time span.
Count and sum of measurements
Each histogram data point has a count of all of the measurements that have been recorded as of the histogram timestamp and a sum of the values of all of those measurements. You can use this information to calculate the average measurement during a given period of time. In the following example we use it to calculate the average request duration during the last five minutes.
| mstats rate(http_request_duration_seconds_sum) as sum_req_duration
WHERE index="metrics" AND earliest="-5m"
BY [ | mcatalog values(_dims) as dims WHERE index="metrics"
AND metric_name="http_request_duration_seconds_sum"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ]
| appendcols [
| mstats rate(http_request_duration_seconds_count) as num_req
WHERE index="metrics" AND earliest="-5m"
BY [ | mcatalog values(_dims) as dims WHERE index="metrics"
AND metric_name="http_request_duration_seconds_count"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ]
]
| eval avg_req_duration=sum_req_duration / num_req
| fields avg_req_duration
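The arithmetic behind this search can be sketched outside of the Splunk platform. The Python below is a simplified model, assuming that a rate is the increase of a counter divided by the elapsed seconds and that no counter reset occurs in the window. The starting values come from the example data point; the ending values and the `counter_rate` helper are hypothetical.

```python
def counter_rate(earlier, later, seconds):
    """Per-second increase of an accumulating counter between two readings."""
    if later < earlier:
        raise ValueError("counter decreased; a reset would need separate handling")
    return (later - earlier) / seconds


WINDOW_SECONDS = 300  # the last five minutes

# Readings at the start of the window (from the example data point).
sum_start, count_start = 53423.0, 144320
# Readings at the end of the window (hypothetical values for illustration).
sum_end, count_end = 53573.0, 144920

sum_req_duration = counter_rate(sum_start, sum_end, WINDOW_SECONDS)
num_req = counter_rate(count_start, count_end, WINDOW_SECONDS)

# Seconds of request time per second, divided by requests per second,
# gives the average seconds per request during the window.
avg_req_duration = sum_req_duration / num_req
print(avg_req_duration)
```

Because both rates cover the same window, the elapsed time cancels out and the result depends only on how much the sum and the count increased.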
Alert on poor service rates
You can use histogram metrics to design an alert that is triggered when your HTTP request service rate dips below a certain threshold. For example, say you have a service level agreement to serve 95% of HTTP requests within 300ms. You can configure a histogram that has a bucket with an upper boundary of 0.3 seconds. The following search calculates the relative amount of requests served within 300ms by job within the last 5 minutes. You can use this search in the definition of an alert that triggers when the percent_requests_served for a job is less than 95.
| mstats rate(http_request_duration_seconds_bucket) as bkt_req_per_sec
WHERE index="metrics" AND le=0.3 AND earliest="-5m"
BY [ | mcatalog values(_dims) as dims WHERE index="metrics"
AND metric_name="http_request_duration_seconds_bucket"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ]
| stats sum(bkt_req_per_sec) as sum_bkt_req_per_sec by job
| appendcols [
| mstats rate(http_request_duration_seconds_count) as req_per_sec
WHERE index="metrics" AND earliest="-5m"
BY [ | mcatalog values(_dims) as dims WHERE index="metrics"
AND metric_name="http_request_duration_seconds_count"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ]
| stats sum(req_per_sec) as sum_req_per_sec by job
]
| eval percent_requests_served=100 * sum_bkt_req_per_sec / sum_req_per_sec
| fields job, percent_requests_served
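The threshold check that such an alert performs can be sketched as follows. This Python is an illustration only: the function names and the sample increases are hypothetical, and the result is expressed on a 0 to 100 scale so that it can be compared with a threshold of 95.

```python
def percent_requests_served(bucket_increase, count_increase):
    """Percentage of requests in the window that landed in the le=0.3 bucket."""
    if count_increase <= 0:
        return None  # no requests in the window, so nothing to evaluate
    return 100.0 * bucket_increase / count_increase


def breaches_sla(percent, threshold=95.0):
    """True when a measured percentage falls below the agreed threshold."""
    return percent is not None and percent < threshold


# Hypothetical increases over the last five minutes for one job.
served_within_300ms = 540
all_requests = 600

percent = percent_requests_served(served_within_300ms, all_requests)
print(percent, breaches_sla(percent))
```

Returning no value when the window contains no requests avoids a division by zero and keeps an idle job from triggering the alert.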
Approximate Apdex scores
An Apdex score provides a numerical measure of user satisfaction with the performance of an application by calculating the ratio of satisfactory performance measurements to unsatisfactory performance measurements. You can use histogram metrics to approximate an Apdex score.
Let's say you want to do this for your http_request_duration_seconds histogram. Start by configuring a bucket with the target request duration as its upper bound. Then configure another bucket to have the tolerated request duration (usually 4 times the target request duration) as the upper bound. For example, if the target request duration is 300 milliseconds, the tolerable request duration is 1.2 seconds.
The following expression yields the Apdex score for each job over the last 5 minutes:
| mstats rate(http_request_duration_seconds_bucket) as bkt_req_per_sec_0.3
WHERE index="metrics" AND le=0.3 AND earliest="-5m"
BY [ | mcatalog values(_dims) as dims WHERE index="metrics"
AND metric_name="http_request_duration_seconds_bucket"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ]
| stats sum(bkt_req_per_sec_0.3) as sum_bkt_req_per_sec_0.3 by job
| appendcols [
| mstats rate(http_request_duration_seconds_bucket) as bkt_req_per_sec_1.2
WHERE index="metrics" AND le=1.2 AND earliest="-5m"
BY [ | mcatalog values(_dims) as dims WHERE index="metrics"
AND metric_name="http_request_duration_seconds_bucket"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ]
| stats sum(bkt_req_per_sec_1.2) as sum_bkt_req_per_sec_1.2 by job
]
| appendcols [
| mstats rate(http_request_duration_seconds_count) as req_per_sec
WHERE index="metrics" AND earliest="-5m"
BY [ | mcatalog values(_dims) as dims WHERE index="metrics"
AND metric_name="http_request_duration_seconds_count"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ]
| stats sum(req_per_sec) as sum_req_per_sec by job
]
| eval apdex_score=('sum_bkt_req_per_sec_0.3' + 'sum_bkt_req_per_sec_1.2') / 2 / sum_req_per_sec
| fields job, apdex_score
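This search divides the sum of both buckets by 2 because the histogram buckets are cumulative. The le=0.3 bucket is contained in the le=1.2 bucket, so adding the two buckets counts the satisfactory requests twice and the tolerated requests once. Dividing the sum by 2 gives the satisfactory requests full weight and the tolerated requests half weight.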
The calculation does not exactly match the traditional Apdex score, as it includes errors in the satisfied and tolerable parts of the calculation.
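A worked example shows why dividing by 2 works. The Python below is a sketch with hypothetical request rates; it compares the bucket-based expression used in the search with the usual Apdex form of satisfied requests plus half of the tolerated requests, divided by all requests.

```python
def apdex_from_buckets(rate_le_target, rate_le_tolerated, rate_total):
    """Apdex approximation from two cumulative bucket rates and the total rate."""
    return (rate_le_target + rate_le_tolerated) / 2 / rate_total


def apdex_from_categories(satisfied, tolerated, total):
    """Usual Apdex form: satisfied plus half of tolerated, over all requests."""
    return (satisfied + tolerated / 2) / total


# Hypothetical per-second rates for one job.
rate_le_0_3 = 60.0   # requests served within 0.3 seconds
rate_le_1_2 = 90.0   # requests served within 1.2 seconds (includes the 60 above)
rate_total = 100.0   # all requests

bucket_score = apdex_from_buckets(rate_le_0_3, rate_le_1_2, rate_total)

# Requests between 0.3 and 1.2 seconds are the tolerated ones.
tolerated = rate_le_1_2 - rate_le_0_3
category_score = apdex_from_categories(rate_le_0_3, tolerated, rate_total)

print(bucket_score, category_score)
```

Both expressions produce the same score because the le=1.2 rate already includes the le=0.3 rate.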
Calculate percentile values with the histperc macro
The histperc macro enables you to calculate percentile values for your histogram metrics. This macro accounts for the bucket boundaries and the rate of increase of their counters, and estimates the value associated with the specified percentile by using linear interpolation between histogram boundaries.
Say you have a histogram metric named http_request_duration_seconds that provides the distribution of HTTP request duration measurements. You could use the histperc macro to calculate the request duration within which you have served 95% of requests, otherwise known as the P95 value for your request service.
To do this, you would use the histperc macro. The histperc macro takes four arguments, the last of which is optional.
histperc(<perc>, <rate_field>, <bucket_upper_boundary_dimension> [, <groupby-fields>])
Argument | Description | Required? |
---|---|---|
perc | The desired percentile value. Must be between 0.0 and 1.0. | Yes |
rate_field | The name of the field containing the output of the mstats rate(x) function. The histperc macro uses this output to generate the histogram distribution for some time period. | Yes |
bucket_upper_boundary_dimension | The name of the dimension that represents the inclusive upper boundary of the buckets in the histogram data structure. Prometheus metrics use le, which stands for "less than or equal to". | Yes |
groupby-fields | One or more dimensions to group by during the percentile calculation. Lists of fields must be quoted and comma-separated. | No |
The three-argument version of histperc is listed on the Search Macros page in Settings as histperc(3). The four-argument version with the groupby-fields argument is listed on the Search Macros page in Settings as histperc(4).
Histperc macro example
This search calculates the HTTP request duration within which you have served 99% of requests. It groups the results by _time for charting purposes.
| mstats rate(http_request_duration_seconds_bucket) as requests_per_sec WHERE index="metrics"
BY [ | mcatalog values(_dims) as dims
WHERE index="metrics" AND metric_name="http_request_duration_seconds_bucket"
| mvexpand dims
| mvcombine delim=" " dims
| nomv dims
| rename dims as search ] span=5m
| stats sum(requests_per_sec) as total_requests_per_sec by _time, le
| `histperc(0.99, total_requests_per_sec, le, _time)`
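The kind of linear interpolation that a histogram percentile estimate relies on can be sketched in Python. This is not the implementation of the histperc macro. It is a minimal illustration that assumes measurements are spread evenly within each bucket, that the lowest bucket starts at 0, and that a percentile falling in the +Inf bucket is reported as the highest finite boundary. The function name `estimate_percentile` is hypothetical.

```python
import math


def estimate_percentile(perc, buckets):
    """Estimate a percentile from (upper_boundary, cumulative_value) pairs.

    The pairs must be sorted by boundary and end with the +Inf bucket.
    """
    if not 0.0 <= perc <= 1.0:
        raise ValueError("perc must be between 0.0 and 1.0")
    rank = perc * buckets[-1][1]  # position of the percentile in the total
    prev_bound, prev_value = 0.0, 0.0
    for upper, value in buckets:
        if value >= rank:
            if math.isinf(upper):
                return prev_bound  # no finite upper boundary to interpolate to
            if value == prev_value:
                return upper  # empty bucket, nothing to interpolate
            fraction = (rank - prev_value) / (value - prev_value)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_value = upper, value
    return prev_bound


# Cumulative values from the example data point, keyed by le.
example = [(0.05, 24054), (0.1, 33444), (0.2, 100392),
           (0.5, 129389), (1.0, 133988), (math.inf, 144320)]

p50 = estimate_percentile(0.5, example)
p95 = estimate_percentile(0.95, example)
print(p50, p95)
```

With the example data, more than 5% of the measurements exceed the 1-second boundary, so this sketch can only report that boundary for the 95th percentile. A histogram with finer boundaries near the percentile of interest gives a more precise estimate.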
About these examples
These examples are based on examples that Prometheus—an open-source metrics monitoring and alerting system—uses to illustrate their support for the histogram metric type.
This documentation applies to the following versions of Splunk® Enterprise: 7.2.8, 7.2.9, 7.2.10, 7.3.2, 7.3.3, 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9