Analyze error spans in Splunk APM
With Splunk APM error detection, you can isolate specific causes of errors in your system and applications.
Use these sections to answer the following questions you might have about error detection in Splunk APM:
How are error spans detected?
Each span in Splunk APM captures a single operation. Splunk APM considers a span an error span if the operation that the span captures results in an error.
A span is considered an error span when any of the following conditions are met:
The span’s span.status, set via OpenTelemetry instrumentation, is Error. See the OpenTelemetry Tracing API specification to learn more about the span.status.
The span’s error tag is set to a truthy value, which is any value other than False or 0.
The value of the span’s http.status_code tag is set to a 5xx error code. See How does Splunk APM handle HTTP status codes? to learn more.
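The three conditions above can be modeled in a few lines of Python. This is a minimal sketch, not Splunk APM's implementation: the Span class and the helper names are hypothetical, the span status is represented as a plain string, and the overrides discussed later on this page are not modeled.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    """Hypothetical, simplified span used only for this sketch."""
    status: Optional[str] = None  # "Error", "OK", or None when unset
    tags: Dict[str, Any] = field(default_factory=dict)


def has_truthy_error_tag(span: Span) -> bool:
    # Truthy here means: the tag is present and its value is not False or 0.
    if "error" not in span.tags:
        return False
    value = span.tags["error"]
    return not (value is False or value == 0)


def has_5xx_status(span: Span) -> bool:
    code = span.tags.get("http.status_code")
    try:
        return 500 <= int(code) <= 599
    except (TypeError, ValueError):
        return False  # tag is missing or not numeric


def is_error_span(span: Span) -> bool:
    # A span is an error span when any one of the three conditions holds.
    return (
        span.status == "Error"
        or has_truthy_error_tag(span)
        or has_5xx_status(span)
    )
```

For instance, a span tagged with http.status_code 503 is classified as an error span by this sketch, while a span tagged with 404 and no other error signal is not.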
Error counting in Splunk APM is based on the OpenTelemetry specification. See OpenTelemetry’s Tracing API specification on GitHub to learn more about the OpenTelemetry standard.
How does Splunk APM handle HTTP status codes?
The following table provides an overview of how HTTP status codes are treated in Splunk APM, in accordance with OpenTelemetry semantic conventions. To learn more, see the OpenTelemetry semantic conventions for HTTP spans on GitHub.
Error type | Server-side spans (span.kind = SERVER) | Client-side spans (span.kind = CLIENT) |
---|---|---|
4xx | Not considered a server-side error | Counted as a client error |
How are error spans counted in MetricSets?
To generate endpoint-level Monitoring MetricSets, Splunk APM turns endpoint spans, which are spans with span.kind = SERVER or span.kind = CONSUMER, into error metric data. If a span is considered an error per the Error rules in Splunk APM, that span counts towards errors in the Monitoring MetricSet for the endpoint associated with that span.
Service-level Monitoring MetricSets are based on the number of error spans in each of the service’s endpoints.
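The roll-up from endpoint spans to endpoint-level and service-level error counts can be pictured as follows. This is only an illustrative sketch: the dictionary fields (service, endpoint, kind, is_error), the sample data, and the function names are assumptions, and is_error stands in for the outcome of the error rules.

```python
from collections import Counter

# Endpoint spans are spans whose kind is SERVER or CONSUMER.
ENDPOINT_KINDS = {"SERVER", "CONSUMER"}

# Hypothetical sample spans for one service.
spans = [
    {"service": "cart", "endpoint": "/add", "kind": "SERVER", "is_error": True},
    {"service": "cart", "endpoint": "/add", "kind": "SERVER", "is_error": False},
    {"service": "cart", "endpoint": "orders-queue", "kind": "CONSUMER", "is_error": True},
    {"service": "cart", "endpoint": "/add", "kind": "CLIENT", "is_error": True},
]


def endpoint_error_counts(spans):
    """Count error spans per (service, endpoint), endpoint spans only."""
    counts = Counter()
    for span in spans:
        if span["kind"] in ENDPOINT_KINDS and span["is_error"]:
            counts[(span["service"], span["endpoint"])] += 1
    return counts


def service_error_counts(endpoint_counts):
    """Sum the endpoint-level error counts for each service."""
    totals = Counter()
    for (service, _endpoint), count in endpoint_counts.items():
        totals[service] += count
    return totals


per_endpoint = endpoint_error_counts(spans)
per_service = service_error_counts(per_endpoint)
```

In this sample, the CLIENT span is ignored because it is not an endpoint span, so the service-level total is the sum of the two endpoint-level counts.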
Server-side and client-side error counting
Splunk APM captures all spans from all instrumented services, including spans capturing requests that services make as clients (client-side spans) and requests received by services (server-side spans). In certain cases, when a service returns an error, the error can be registered in both the initiating span and the receiving span. To avoid duplicated error reports, Splunk APM counts only the server-side error spans in MetricSets and error totals.
For example, when service_a makes a call to service_b and both services are fully instrumented, Splunk APM receives the following two spans:
span_1, a span with span.kind = CLIENT that captures service_a making the call to service_b.
span_2, a span with span.kind = SERVER that captures service_b receiving the request.
If service_b returns a 500 error, both spans receive that error. To avoid double-counting, Splunk APM counts only the server-side span, span_2, as an error in MetricSets and error totals.
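The service_a and service_b example can be sketched as a small counting routine. The data layout and the function name are hypothetical; the sketch only illustrates the idea of counting the server-side span when both spans carry the same 500 error.

```python
from collections import Counter

# The two spans from the example, both carrying the 500 returned by service_b.
spans = [
    {"name": "span_1", "service": "service_a", "kind": "CLIENT", "http.status_code": 500},
    {"name": "span_2", "service": "service_b", "kind": "SERVER", "http.status_code": 500},
]


def count_server_side_errors(spans):
    """Count 5xx error spans per service, skipping client-side spans."""
    counts = Counter()
    for span in spans:
        is_error = 500 <= span["http.status_code"] <= 599
        if is_error and span["kind"] == "SERVER":
            counts[span["service"]] += 1
    return counts


error_counts = count_server_side_errors(spans)
```

Here only span_2 contributes to the total, so the single failed call is counted once, against service_b.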
What is the difference between an error and a root cause error?
To help you identify the root cause of an error, Splunk APM differentiates between errors and root cause errors. For instance, the request and error graph in Tag Spotlight differentiates root cause errors from total errors with a darker red color:

When a particular span (operation) within a trace results in an error, the error can propagate through other spans in the trace. Any span determined to contain an error based on the criteria described in How are error spans detected? is an error span. Splunk APM designates the originating error of a chain of error spans as the root cause error.
For instance, consider the checkout trace in the following screenshot:

The checkout service makes HTTP requests to the authorization service, the checkout service, and the payment service. The HTTP request to the payment service results in a 402 “Payment Required” error. Because the request to the payment service failed, the initiating requests to the checkout service and http.Request also result in errors.
In this case, the source error, or root cause error, is the 402 error in the payment service. The 500 errors appearing in the checkout and api services are subsequent errors.
The root cause error count indicates the count of these root cause errors, while the standard error count indicates the total count of all root cause errors as well as any subsequent errors.
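The relationship between the two counts can be illustrated with a toy trace. This sketch rests on a simplifying assumption that is not stated in this documentation: a root cause error is modeled as an error span that has no error spans among its direct children. The trace layout, span IDs, and function names are hypothetical.

```python
# Hypothetical trace: span id -> (parent id, service, is_error)
trace = {
    "s1": (None, "api", True),
    "s2": ("s1", "checkout", True),
    "s3": ("s2", "authorization", False),
    "s4": ("s2", "payment", True),
}


def error_spans(trace):
    """Return the ids of all error spans, in trace order."""
    return [sid for sid, (_parent, _svc, is_error) in trace.items() if is_error]


def root_cause_errors(trace):
    """Return error spans that have no error span among their children."""
    parents_of_errors = {
        parent
        for (parent, _svc, is_error) in trace.values()
        if is_error and parent is not None
    }
    return [sid for sid in error_spans(trace) if sid not in parents_of_errors]


total_error_count = len(error_spans(trace))
root_cause_count = len(root_cause_errors(trace))
```

Under this model, the payment span is the only root cause error, while the total error count also includes the subsequent errors in checkout and api.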
How can you customize the error logic in Splunk APM?
In certain cases, you might want to modify your instrumentation to override defaults in the error logic or devise another method of tracking errors that matter to you.
Count 4xx status codes as errors
By default, Splunk APM does not count server-side spans with 4xx status codes as errors, because a 4xx status code is often associated with a problem with the request itself, rather than a problem with the service handling a request.
For example, if a user makes a request to endpoint/that/does/not/exist, the 404 status code the service returns does not mean there’s a problem with the service. Instead, it means there was a problem with the request, which is trying to call an endpoint that does not actually exist. Similarly, if a user tries to access a resource they don’t have access to, the service might return a 401 status code, which is typically not the result of an error on the server side.
However, depending on your application’s logic, a 4xx status code might actually represent a meaningful error, particularly for client-side requests. To monitor for 4xx errors, try doing the following:
Break down performance by HTTP status code span tags, if available. See Example use case: Alert on the rate of 401 errors for a service to learn more.
Customize your instrumentation to set the span.status of spans with meaningful 4xx status codes to Error.
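The second option can be sketched as a small helper in a request handler. Everything here is illustrative: RecordingSpan is a stand-in for a real span object, MEANINGFUL_4XX is a hypothetical choice of codes, and the status is recorded as a plain string. With the OpenTelemetry Python API, the corresponding call would be span.set_status(Status(StatusCode.ERROR)), with Status and StatusCode imported from opentelemetry.trace.

```python
# Hypothetical set of 4xx codes that this application treats as real errors.
MEANINGFUL_4XX = {401, 403}


class RecordingSpan:
    """Stand-in for a real span; it only records what is set on it."""

    def __init__(self):
        self.status = None
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def set_status(self, status):
        self.status = status


def record_response(span, status_code):
    """Tag the span with the status code and flag meaningful 4xx codes."""
    span.set_attribute("http.status_code", status_code)
    if status_code in MEANINGFUL_4XX:
        span.set_status("Error")


unauthorized = RecordingSpan()
record_response(unauthorized, 401)

not_found = RecordingSpan()
record_response(not_found, 404)
```

With this policy, the 401 span is marked as an error, while the 404 span keeps its default status and is still tagged with its status code for breakdowns.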
Example use case: Alert on the rate of 401 errors for a service
For example, if Kai wanted to alert on the rate of 401 errors returned by a given service, they would do the following:
Index http.status_code. See Index span tags to generate Troubleshooting MetricSets.
Create a custom Monitoring MetricSet on http.status_code for the service’s endpoints to get a time series for each status code. See Generate a Monitoring MetricSet with a custom dimension.
Set up an alert on the rate of 401 errors as compared to all requests. See Configure detectors and alerts in Splunk APM.
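The quantity that the alert in the last step watches is the share of 401 responses among all requests. The following sketch shows that arithmetic in Python with hypothetical request counts and a hypothetical threshold; it does not represent detector syntax or any Splunk API.

```python
def status_code_rate(counts, code):
    """Return the fraction of all requests that ended with the given code."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic, so there is no rate to report
    return counts.get(code, 0) / total


# Hypothetical request counts per http.status_code for one time window.
counts = {200: 940, 401: 50, 500: 10}

# Hypothetical alert threshold: more than 3% of requests returning 401.
THRESHOLD = 0.03

rate = status_code_rate(counts, 401)
should_alert = rate > THRESHOLD
```

In this window, 50 of 1,000 requests returned 401, which is a rate of 5% and exceeds the assumed 3% threshold.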
Customize error logic to discard 5xx status codes
By default, Splunk APM counts server-side spans with 5xx status codes as errors, because a 5xx error is typically associated with service unavailability.
For example, a 503: service too busy error in a server-side span counts as an error by default. If the service you’re monitoring is the front-end of a public website, users encountering a 503 error are not able to use the website, thus potentially resulting in lost user interactions or lost revenue. In this case, a 503 would be a true error.
Depending on your application’s logic, however, you might not consider 5xx codes to be meaningful errors. For example, if your service is a batch processor, a 503 can be a normal flow control mechanism, simply triggering clients to retry their requests later. To override the default that counts 503 status codes as errors, you could modify your instrumentation to set span.status to OK in the spans where a 503 error is not a concern.
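For the batch processor case, the override might look like the following sketch. RecordingSpan is a stand-in for a real span, FLOW_CONTROL_CODES is a hypothetical set, and the status is a plain string. With the OpenTelemetry Python API, the corresponding call would be span.set_status(Status(StatusCode.OK)).

```python
# Hypothetical set of 5xx codes that this service uses for flow control.
FLOW_CONTROL_CODES = {503}


class RecordingSpan:
    """Stand-in for a real span; it only records what is set on it."""

    def __init__(self):
        self.status = None
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def set_status(self, status):
        self.status = status


def finish_batch_request(span, status_code):
    """Tag the span and mark expected flow-control responses as OK."""
    span.set_attribute("http.status_code", status_code)
    if status_code in FLOW_CONTROL_CODES:
        span.set_status("OK")


busy = RecordingSpan()
finish_batch_request(busy, 503)

failed = RecordingSpan()
finish_batch_request(failed, 500)
```

Only the 503 span is explicitly set to OK; the 500 span is left untouched, so the default error logic still applies to it.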