Use histogram metrics

Histograms are complex metric datatypes. A histogram data point defines a set of differently sized data buckets. Its metrics answer the following questions about the distribution of measurements, such as request durations or response sizes, across those buckets at a given point in time:

How many measurements of a metric are less than or equal to the value that is the upper boundary of each data bucket? For example, at the given time, how many recorded request duration measurements fit into a bucket for request durations of 0.1 seconds or less? How many recorded request duration measurements fit into a bucket for request durations of 0.5 seconds or less? And so on.
What is the sum of all of the measurements that have been recorded for the metric?
What is the full count of all of the measurements that have been recorded for the metric?

There are many things you can do with this information.

For a metric such as a request duration, you can use the count of request duration measurements and the sum of the measurement values to get the average request duration for a given range of time.
You can design alerts that monitor the measurement counts for specific buckets. For example, you could set up an alert that tells you if the number of requests served within 300ms drops below 95% of the total number of requests.
You can estimate percentile values, such as the request duration within which you have served 95% of requests.

Histogram metrics before and after indexing

To use histogram metrics in the Splunk platform, you need to ingest histogram-formatted metric data points from Prometheus or a similar metrics monitoring client by using either the HTTP Event Collector or the Stream Processor Service.

Histogram data point structures are made up of several individual metric data points. A raw histogram data point that deals with HTTP request durations in seconds should look something like this before your Splunk platform implementation ingests it:

http_req_dur_sec_bucket{le="0.05",server="ronnie",endpoint="/"} 24054 1568073334000
http_req_dur_sec_bucket{le="0.1",server="ronnie",endpoint="/"} 33444 1568073334000
http_req_dur_sec_bucket{le="0.2",server="ronnie",endpoint="/"} 100392 1568073334000
http_req_dur_sec_bucket{le="0.5",server="ronnie",endpoint="/"} 129389 1568073334000
http_req_dur_sec_bucket{le="1",server="ronnie",endpoint="/"} 133988 1568073334000
http_req_dur_sec_bucket{le="+Inf",server="ronnie",endpoint="/"} 144320 1568073334000
http_req_dur_sec_sum{server="ronnie",endpoint="/"} 53423 1568073334000
http_req_dur_sec_count{server="ronnie",endpoint="/"} 144320 1568073334000

Each line of the histogram data point has this format:

<metric_name>{<dim0>=<dim_value0>,<dim1>=<dim_value1>,...,<dimN>=<dim_valueN>} <_value> <timestamp>
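The line format above can be illustrated with a small parser. This is a minimal sketch in Python, not part of the Splunk platform or the Prometheus client: the function name parse_line and the regular expressions are assumptions made for illustration, and the sketch assumes that dimension values are double-quoted as in the example data point.

```python
import re

# Matches: <metric_name>{<dim>="<value>",...} <_value> <timestamp>
# Both patterns are illustrative assumptions, not an official grammar.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<dims>.*)\}\s+(?P<value>\S+)\s+(?P<ts>\d+)$'
)
DIM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Split one histogram line into metric name, dimensions, value, and timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "metric_name": match.group("name"),
        "dimensions": dict(DIM_RE.findall(match.group("dims"))),
        "value": float(match.group("value")),
        "timestamp": int(match.group("ts")),
    }


point = parse_line(
    'http_req_dur_sec_bucket{le="0.05",server="ronnie",endpoint="/"} 24054 1568073334000'
)
print(point["metric_name"], point["dimensions"]["le"], point["value"])
```

Each parsed line carries its own dimensions, which is why the bucket, sum, and count lines of one histogram can be tied together by their shared server and endpoint values.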

After ingestion, the histogram data point looks like this:

metric_name: http_req_dur_sec_bucket	metric_name: http_req_dur_sec_sum	metric_name: http_req_dur_sec_count	le	server	endpoint	timestamp (in seconds)
24054			0.05	ronnie	/	1568073334000
33444			0.1	ronnie	/	1568073334000
100392			0.2	ronnie	/	1568073334000
129389			0.5	ronnie	/	1568073334000
133988			1	ronnie	/	1568073334000
144320			+Inf	ronnie	/	1568073334000
	53423			ronnie	/	1568073334000
		144320		ronnie	/	1568073334000

Each individual metric in this table is a component of the overall histogram metric data point.

Anatomy of a histogram metric data point

Each histogram data point is composed of three types of metrics. Each metric type says something about the measurements that have been recorded for a specific metric as of the timestamp of the histogram. All of these metrics are accumulating counter metrics, which means that their values do not decrease over time. For more information about counter metrics, see Investigate counter metrics.

Metric type	Definition	Example
<metric_name>_bucket	Provides the count of measurements for a histogram bucket. The value of this field corresponds to the value of the `le` dimension, which sets the upper boundary value of the bucket.	In our example metric data point, 24,054 `http_req_dur_sec` measurements were less than or equal to `0.05` seconds as of the timestamp of the histogram data point. Meanwhile, 33,444 `http_req_dur_sec` measurements were less than or equal to `0.1` seconds.
<metric_name>_sum	Provides the sum of all of the `<metric_name>` measurements captured in this histogram data point.	In our example metric data point, the sum of all of the `http_req_dur_sec` measurements included in this metric data point is 53,423 seconds.
<metric_name>_count	Provides the count of all of the measurements captured in this data point.	Our example histogram data point represents the distribution of 144,320 `http_req_dur_sec` measurements.

Bucket count metrics and the bucket boundary dimension

There are usually several bucket count metrics in a histogram data point. They form a sequence of simultaneous measurement counts for buckets with larger and larger bucket boundaries, which are defined by the le dimension. The first bucket count only represents measurements with relatively small values. The next bucket count represents the measurements from the first count plus measurements with slightly larger values.

This sequence continues until the final bucket count, which corresponds with a le dimension value of +Inf. +Inf is shorthand for "Infinite." This means that the final bucket captures all of the measurements that were captured by the preceding bucket and any measurements that exceed the preceding bucket. The count for the +Inf <metric_name>_bucket should be equivalent to the value of the <metric_name>_count field. Both fields provide a count of all of the measurements categorized by the histogram data point as of the histogram's timestamp.

There are no le values for the <metric_name>_sum and <metric_name>_count fields by design.
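Because the bucket counts are cumulative, the number of measurements that fall inside a single bucket is the difference between its count and the count of the bucket before it. The following Python sketch works this out for the example data point. The helper name per_bucket_counts is hypothetical and introduced only for illustration.

```python
# Cumulative bucket counts from the example data point, keyed by the le value.
cumulative = [
    ("0.05", 24054),
    ("0.1", 33444),
    ("0.2", 100392),
    ("0.5", 129389),
    ("1", 133988),
    ("+Inf", 144320),
]
total_count = 144320  # value of http_req_dur_sec_count in the example


def per_bucket_counts(buckets):
    """Turn cumulative counts into the count that falls inside each bucket alone."""
    result = []
    previous = 0
    for le, count in buckets:
        if count < previous:
            raise ValueError(f"bucket le={le} is smaller than the bucket before it")
        result.append((le, count - previous))
        previous = count
    return result


individual = per_bucket_counts(cumulative)
for le, count in individual:
    print(f"le={le}: {count}")

# The +Inf bucket and the _count metric should agree.
print(cumulative[-1][1] == total_count)
```

For the example data, 9,390 measurements fall between 0.05 and 0.1 seconds, and the per-bucket counts add up to the 144,320 measurements reported by the count metric.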

The Prometheus client requires that the bucket-boundary dimension be named le—the field name is an acronym for "less than or equal to"—but the Splunk software is flexible and can use a different name if le does not fit your needs.

Use histogram metrics in searches

Because histogram metric data points contain interconnected sets of counter metrics, you use the rate(x) function in conjunction with mstats to expose the bucket distribution for a given time span.

The _timeseries field is also essential. It lets you group by various dimension fields in commands that follow your rate(x) calculation. This enables you to carry out calculations similar to those that the Prometheus client allows, where every stats-like operation implicitly does something like by _timeseries when there is no explicit by clause. See Perform statistical calculations on metric time series.

Count and sum of measurements

Each histogram data point has a count of all of the measurements that have been recorded as of the histogram timestamp and a sum of the values of all of those measurements. You can use this information to calculate the average measurement during a given period of time. In the following example we use it to calculate the average request duration during the last five minutes.

| mstats rate(http_req_dur_sec_sum) as sum_req_duration where index="metrics" AND earliest="-5m" by _timeseries | appendcols [ | mstats rate(http_req_dur_sec_count) as num_req where index="metrics" AND earliest="-5m" by _timeseries ] | eval avg_req_duration=sum_req_duration / num_req | fields avg_req_duration
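The arithmetic behind this search can be sketched in Python. The sketch assumes that the rate of a counter is its increase over the window divided by the length of the window. The starting values of the sum and count counters are hypothetical, chosen only so that the window ends at the values in the example data point.

```python
def counter_rate(first_value, last_value, elapsed_seconds):
    """Per-second rate of increase of an accumulating counter over a time window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (last_value - first_value) / elapsed_seconds


window_seconds = 300  # five minutes

# Hypothetical counter values at the start of the window.
sum_start, count_start = 53000.0, 143000.0
# Counter values at the end of the window, taken from the example data point.
sum_end, count_end = 53423.0, 144320.0

sum_req_duration = counter_rate(sum_start, sum_end, window_seconds)
num_req = counter_rate(count_start, count_end, window_seconds)

# Seconds of request time per second, divided by requests per second,
# gives the average seconds per request during the window.
avg_req_duration = sum_req_duration / num_req
print(round(avg_req_duration, 4))
```

The window length cancels out in the division, so the result equals the increase in the sum divided by the increase in the count.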

Alert on poor service rates

You can use histogram metrics to design an alert that is triggered when your HTTP request service rate dips below a certain threshold. For example, say you have a service level agreement to serve 95% of HTTP requests within 300ms. You can configure a histogram that has a bucket with an upper boundary of 0.3 seconds. The following search calculates the percentage of requests served within 300ms by job within the last 5 minutes. You can use this search in the definition of an alert that triggers when the percent_requests_served for a job is less than 95.

| mstats rate(http_req_dur_sec_bucket) as bkt_req_per_sec where index="metrics" AND le=0.3 AND earliest="-5m" by _timeseries, job | stats sum(bkt_req_per_sec) as sum_bkt_req_per_sec by job | appendcols [ | mstats rate(http_req_dur_sec_count) as req_per_sec where index="metrics" AND earliest="-5m" by _timeseries, job | stats sum(req_per_sec) as sum_req_per_sec by job ] | eval percent_requests_served=100 * sum_bkt_req_per_sec / sum_req_per_sec | fields job, percent_requests_served
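The per-job calculation can be sketched in Python as follows. The job names and per-series rates are hypothetical, and the result is expressed as a percentage so that it can be compared directly against a threshold of 95.

```python
from collections import defaultdict

# Hypothetical per-series rates: (job, rate of the le=0.3 bucket, rate of the count metric).
series_rates = [
    ("api", 46.0, 48.0),
    ("api", 50.0, 52.0),
    ("web", 18.0, 20.0),
]

THRESHOLD_PERCENT = 95.0

# Sum the rates of all series that belong to the same job.
bucket_by_job = defaultdict(float)
total_by_job = defaultdict(float)
for job, bucket_rate, count_rate in series_rates:
    bucket_by_job[job] += bucket_rate
    total_by_job[job] += count_rate

# Percentage of requests served within the bucket boundary, per job.
percent_by_job = {
    job: 100.0 * bucket_by_job[job] / total_by_job[job]
    for job in total_by_job
    if total_by_job[job] > 0
}

# Jobs that would trigger the alert.
alerting_jobs = sorted(
    job for job, percent in percent_by_job.items() if percent < THRESHOLD_PERCENT
)
print(percent_by_job, alerting_jobs)
```

Summing the rates per job before dividing weights each series by its traffic, which averaging the per-series percentages would not do.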

Approximate Apdex scores

An Apdex score provides a numerical measure of user satisfaction with the performance of an application by calculating the ratio of satisfactory performance measurements to unsatisfactory performance measurements. You can use histogram metrics to approximate an Apdex score.

Let's say you want to do this for your http_req_dur_sec histogram. Start by configuring a bucket with the target request duration as its upper bound. Then configure another bucket to have the tolerated request duration—usually 4 times the target request duration—as the upper bound. For example, if the target request duration is 300 milliseconds, the tolerable request duration is 1.2 seconds.

The following expression yields the Apdex score for each job over the last 5 minutes:

| mstats rate(http_req_dur_sec_bucket) as bkt_req_per_sec_0.3 where index="metrics" AND le=0.3 AND earliest="-5m" by _timeseries, job | stats sum(bkt_req_per_sec_0.3) as sum_bkt_req_per_sec_0.3 by job | appendcols [ | mstats rate(http_req_dur_sec_bucket) as bkt_req_per_sec_1.2 where index="metrics" AND le=1.2 AND earliest="-5m" by _timeseries, job | stats sum(bkt_req_per_sec_1.2) as sum_bkt_req_per_sec_1.2 by job ] | appendcols [ | mstats rate(http_req_dur_sec_count) as req_per_sec where index="metrics" AND earliest="-5m" by _timeseries, job | stats sum(req_per_sec) as sum_req_per_sec by job ] | eval apdex_score=('sum_bkt_req_per_sec_0.3' + 'sum_bkt_req_per_sec_1.2') / 2 / sum_req_per_sec | fields job, apdex_score

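This search divides the sum of both buckets by 2 because the histogram buckets are cumulative. The le=0.3 bucket is contained in the le=1.2 bucket, so requests served within 0.3 seconds are counted twice in the sum. Dividing by 2 corrects for that: it gives requests within the target duration full weight and requests within the tolerated duration half weight.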

The calculation does not exactly match the traditional Apdex score, as it includes errors in the satisfied and tolerable parts of the calculation.
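The following Python sketch shows why halving the sum of the two cumulative buckets gives the same result as the usual Apdex weighting of satisfied requests plus half of the tolerating requests. The rates are hypothetical and the function names are introduced only for illustration.

```python
def apdex_from_cumulative(rate_le_target, rate_le_tolerated, rate_total):
    """Approximate Apdex from two cumulative bucket rates, as in the search above."""
    return (rate_le_target + rate_le_tolerated) / 2 / rate_total


def apdex_weighted(rate_le_target, rate_le_tolerated, rate_total):
    """Apdex weighting: satisfied requests plus half of the tolerating requests."""
    satisfied = rate_le_target
    # The tolerated bucket is cumulative, so subtract the satisfied requests it contains.
    tolerating = rate_le_tolerated - rate_le_target
    return (satisfied + tolerating / 2) / rate_total


# Hypothetical request rates: le=0.3 bucket, le=1.2 bucket, and all requests.
score_a = apdex_from_cumulative(80.0, 95.0, 100.0)
score_b = apdex_weighted(80.0, 95.0, 100.0)
print(score_a, score_b)
```

Both functions return 0.875 for these rates, because (a + b) / 2 is the same quantity as a + (b - a) / 2.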

Calculate percentile values with the histperc macro

The histperc macro enables you to calculate percentile values for your histogram metrics. This macro accounts for the bucket boundaries and the rate of increase of their counters, and estimates the value associated with the specified percentile based on some linear interpolation between histogram boundaries.
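To make the idea of linear interpolation between bucket boundaries concrete, here is a minimal Python sketch. It is modeled on the way Prometheus-style histogram quantiles are commonly estimated and is not the implementation of the histperc macro. The function name estimate_percentile, the assumption that the lowest bucket starts at 0, and the choice to return the highest finite boundary when the percentile falls in the +Inf bucket are all assumptions made for illustration.

```python
import math


def estimate_percentile(perc, buckets):
    """Estimate a percentile from cumulative buckets by linear interpolation.

    buckets is a list of (upper_boundary, cumulative_value) pairs sorted by
    boundary, ending with the +Inf bucket. The values can be counts or rates.
    """
    if not 0.0 <= perc <= 1.0:
        raise ValueError("perc must be between 0.0 and 1.0")
    total = buckets[-1][1]
    if total <= 0:
        return math.nan
    rank = perc * total
    lower_bound, lower_value = 0.0, 0.0  # assumption: measurements are not negative
    for upper_bound, value in buckets:
        if value >= rank:
            if math.isinf(upper_bound):
                # No finite upper boundary to interpolate toward.
                return lower_bound
            if value == lower_value:
                return upper_bound
            fraction = (rank - lower_value) / (value - lower_value)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_value = upper_bound, value
    return lower_bound


# Cumulative values from the example data point, used here in place of rates.
example_buckets = [
    (0.05, 24054.0),
    (0.1, 33444.0),
    (0.2, 100392.0),
    (0.5, 129389.0),
    (1.0, 133988.0),
    (math.inf, 144320.0),
]

p50 = estimate_percentile(0.5, example_buckets)
p95 = estimate_percentile(0.95, example_buckets)
print(round(p50, 4), p95)
```

With the example data, the 50th percentile falls inside the bucket between 0.1 and 0.2 seconds, so the estimate is interpolated within that range. The 95th percentile falls in the +Inf bucket, so this sketch can only report the highest finite boundary, which shows why bucket boundaries should bracket the percentiles you care about.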

Say you have a histogram metric named http_req_dur_sec that provides the distribution of HTTP request duration measurements in seconds. You could use the histperc macro to calculate the request duration within which you have served 95% of requests, otherwise known as the P95 value for your request service.

To do this, you would set up a histperc macro. The histperc macro takes four arguments, the last of which is optional.

histperc(<perc>, <rate_field>, <bucket_upper_boundary_dimension> [, <groupby-fields>])

Argument	Description	Required?
perc	The desired percentile value. Must be between 0.0 and 1.0.	Yes
rate_field	The name of the field containing the output of the `mstats rate(x)` command. The histperc macro uses this output to generate the histogram distribution for some time period.	Yes
bucket_upper_boundary_dimension	The name of the dimension that represents the inclusive upper boundary of the buckets in the histogram data structure. Prometheus metrics use `le`, which stands for "less than or equal to".	Yes
groupby-fields	One or more dimensions to group by during the percentile calculation. Lists of fields must be quoted and comma-separated.	No

The three-argument version of histperc is listed on the Search Macros page in Settings as histperc(3). The four-argument version with the groupby-fields argument is listed on the Search Macros page in Settings as histperc(4).

Histperc macro example

This search calculates the HTTP request duration within which you have served 99% of requests. It groups the results by _time for charting purposes.

| mstats rate(http_req_dur_sec_bucket) as requests_per_sec where index="metrics" by _timeseries, le span=5m | stats sum(requests_per_sec) as total_requests_per_sec by _time, le | `histperc(0.99, total_requests_per_sec, le, _time)`

About these examples

These examples are based on examples that Prometheus, an open-source metrics monitoring and alerting system, uses to illustrate its support for the histogram metric type.
