Scenario: Combine aggregation and dropping rules to control your metric cardinality and volume
The following scenario features an example from Buttercup Games, a fictitious e-commerce company.
Background
Skyler is an admin for the central observability team at Buttercup Games. Skyler is in charge of monitoring observability usage across different teams to make sure they stay within the company’s budget.
Lately, Skyler notices a spike in their metrics usage. With the help of the Splunk Observability Cloud account team, Skyler obtains a detailed metrics usage report. The report gives Skyler insights into their metrics volume, high cardinality dimensions, usage of those metrics in charts and detectors, and distribution of metrics across different teams.
Skyler realizes that one team in particular is approaching their allocated usage limit. Skyler reaches out to Kai, the site reliability engineer (SRE) lead on that team, and asks them to optimize their team’s usage. Skyler shares the team’s high cardinality metrics and usage details with Kai.
Findings
The metrics usage report shows that Kai’s team sends about 50,000 metric time series (MTS) for the service.latency metric to Splunk Observability Cloud, but not all the data at full granularity is essential. Kai looks at the report to understand more about the cardinality of different dimensions. They notice that the instance_id and host_name dimensions are the highest cardinality dimensions for service.latency.
However, Kai knows their team cares most about different regions when it comes to service latency, so they only want to monitor the region dimension. The instance_id and host_name dimensions are not information they need to monitor.
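To see why those two dimensions dominate the MTS count, consider the following Python sketch. The regions, instance IDs, and host counts in it are made up for illustration: each unique combination of dimension values on service.latency is a separate MTS, so keeping only region collapses the count to one series per region.

```python
# Illustrative only: hypothetical dimension values, not real Buttercup Games data.
from itertools import product

regions = ["us-east-1", "us-west-2", "eu-central-1"]       # low cardinality
instance_ids = [f"i-{n:05d}" for n in range(200)]          # high cardinality
host_names = [f"host-{n:04d}" for n in range(80)]          # high cardinality

# Each unique combination of dimension values on service.latency is one MTS.
full_mts = set(product(regions, instance_ids, host_names))

# Keeping only the region dimension (as an aggregation rule would) collapses
# every per-instance, per-host series into one series per region.
aggregated_mts = {region for region, _, _ in full_mts}

print(f"MTS with all dimensions: {len(full_mts):,}")       # 48,000 in this sketch
print(f"MTS keeping only region: {len(aggregated_mts):,}") # 3 in this sketch
```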
Actions
Kai decides to use metrics pipeline management to control how Splunk Observability Cloud ingests their team’s data.
1. In Splunk Observability Cloud, Kai creates an aggregation rule that reduces the cardinality of service.latency by keeping the region dimension and discarding instance_id and host_name.
2. Kai has a new aggregated service.latency_by_region metric that yields only 1,623 MTS.
3. Kai downloads the list of charts and detectors that use the service.latency metric.
4. For each associated chart and detector, Kai replaces service.latency with service.latency_by_region. (One way to script this step is sketched after this list.)
5. Kai lets Skyler know that they have created an aggregated metric and updated all the associated charts and detectors, so Skyler can drop the unaggregated raw metric that the team no longer needs to monitor.
6. Skyler selects service.latency on the Metrics pipeline management page to view current rules for the metric.
7. Skyler changes Keep data to Drop data.
8. Skyler verifies the new metric volume after dropping the data they don’t need, and saves the rules.
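For step 4, Kai could update each chart by hand in the UI, or script the change against the Splunk Observability Cloud REST API. The following Python sketch shows one possible approach using the requests library; the realm, endpoint paths, and the programText field are assumptions based on the v2 charts API, so verify them against the API reference for your organization before running anything like this.

```python
# Hypothetical sketch: point charts at service.latency_by_region instead of
# service.latency. The realm, paths, and response fields are assumptions;
# confirm them against the Splunk Observability Cloud API reference.
import requests

REALM = "us0"                          # assumption: your realm may differ
TOKEN = "YOUR_API_TOKEN"               # assumption: a token with write access
BASE = f"https://api.{REALM}.signalfx.com/v2"
HEADERS = {"X-SF-TOKEN": TOKEN, "Content-Type": "application/json"}

OLD = "'service.latency'"              # quoted, as it appears in SignalFlow
NEW = "'service.latency_by_region'"

# List charts, then update any whose SignalFlow program references the old metric.
charts = requests.get(f"{BASE}/chart", headers=HEADERS).json().get("results", [])

for chart in charts:
    program = chart.get("programText", "")
    if OLD in program:
        chart["programText"] = program.replace(OLD, NEW)
        resp = requests.put(f"{BASE}/chart/{chart['id']}", headers=HEADERS, json=chart)
        resp.raise_for_status()
        print(f"Updated chart {chart['id']}: {chart.get('name', '')}")
```

A similar loop over the detector endpoint, which also exposes SignalFlow program text, could cover the detector updates.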
Summary
By combining aggregation and data dropping rules, Kai and Skyler have successfully summarized a high cardinality metric, creating a more focused monitoring experience for their team while minimizing storage costs for Buttercup Games.