Analyze error spans in Splunk APM
With Splunk APM error detection, you can isolate specific causes of errors in your system and applications.
Use these sections to answer the following questions you might have about error detection in Splunk APM:
What is the difference between an error and a root cause error?
How are error spans detected?
Each span in Splunk APM captures a single operation. Splunk APM considers a span an error span if the operation that the span captures results in an error.
A span is considered an error span when any of the following conditions are met:
span.status, set via OpenTelemetry instrumentation, is Error. See the OpenTelemetry Tracing API specification to learn more.
The span’s error tag is set to a truthy value.
The value of the span’s http.status_code tag is set to a 5xx error code. See How does Splunk APM handle HTTP status codes? to learn more.
Error counting in Splunk APM is based on the OpenTelemetry specification. See OpenTelemetry’s Tracing API specification on GitHub to learn more about the OpenTelemetry standard.
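The three conditions above can be sketched as a small predicate. This is only an illustration, not Splunk APM's implementation: the span is modeled as a plain dict with hypothetical status and tags keys, and the set of values treated as "not truthy" for the error tag is an assumption. The sketch also does not model explicit status overrides.

```python
# Hypothetical span representation: a plain dict, not a Splunk or OpenTelemetry type.
# Assumption: these values are treated as "not truthy" for the error tag.
FALSY_ERROR_VALUES = (None, False, 0, "", "false", "0")


def is_error_span(span):
    """Return True if any of the three detection conditions holds."""
    tags = span.get("tags", {})

    # Condition 1: span.status is Error.
    if span.get("status") == "Error":
        return True

    # Condition 2: the error tag is set to a truthy value.
    error_tag = tags.get("error")
    if isinstance(error_tag, str):
        error_tag = error_tag.strip().lower()
    if error_tag not in FALSY_ERROR_VALUES:
        return True

    # Condition 3: http.status_code is a 5xx code.
    code = tags.get("http.status_code")
    return code is not None and 500 <= int(code) <= 599
```

For example, is_error_span({"tags": {"http.status_code": 503}}) evaluates to True, while a span whose only tag is http.status_code = 404 is not flagged.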
How does Splunk APM handle HTTP status codes?
The following table provides an overview of how HTTP status codes are treated in Splunk APM, in accordance with OpenTelemetry semantic conventions. To learn more, see the OpenTelemetry semantic conventions for HTTP spans on GitHub.
Server-side spans (span.kind = SERVER), 4xx status codes: Not considered a server-side error.
Client-side spans (span.kind = CLIENT), 4xx status codes: Counted as a client error.
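The kind-dependent treatment of status codes can be sketched as follows. The function follows the OpenTelemetry semantic conventions for HTTP spans, where 5xx codes are errors for both span kinds and 4xx codes are errors only on the client side. The function name and the string span kinds are hypothetical, and Splunk APM's exact handling may differ in details this sketch does not cover.

```python
def http_status_is_error(span_kind, status_code):
    """Classify an HTTP status code as an error for the given span kind."""
    if 500 <= status_code <= 599:
        # 5xx: treated as an error on both server-side and client-side spans.
        return True
    if 400 <= status_code <= 499:
        # 4xx: a client error, so only client-side spans count it.
        return span_kind == "CLIENT"
    # 1xx, 2xx, and 3xx are not errors for either kind.
    return False
```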
How are error spans counted in MetricSets?
To generate endpoint-level Monitoring MetricSets, Splunk APM turns endpoint spans, which are spans with span.kind = SERVER or span.kind = CONSUMER, into error metric data. If a span is considered an error per the Error rules in Splunk APM, that span counts towards errors in the Monitoring MetricSet for the endpoint associated with that span.
Service-level Monitoring MetricSets are based on the number of error spans in each of the service’s endpoints.
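As a rough illustration of this roll-up, the sketch below counts error spans per endpoint and then sums them per service. The dict fields (service, endpoint, kind, is_error) are hypothetical names chosen for the example, and the code is not how Splunk APM computes MetricSets internally.

```python
from collections import Counter

# Endpoint spans are spans with span.kind = SERVER or span.kind = CONSUMER.
ENDPOINT_KINDS = {"SERVER", "CONSUMER"}


def endpoint_error_counts(spans):
    """Count error spans per (service, endpoint), using endpoint spans only."""
    counts = Counter()
    for span in spans:
        if span["kind"] in ENDPOINT_KINDS and span["is_error"]:
            counts[(span["service"], span["endpoint"])] += 1
    return counts


def service_error_counts(endpoint_counts):
    """Roll endpoint-level error counts up to one total per service."""
    totals = Counter()
    for (service, _endpoint), count in endpoint_counts.items():
        totals[service] += count
    return totals
```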
Server-side and client-side error counting
Splunk APM captures all spans from all instrumented services, including spans that capture outgoing requests made by a service acting as a client (client-side spans) and spans that capture requests received by a service (server-side spans). In certain cases, when a service returns an error, the error can be registered in both the initiating span and the receiving span. To avoid duplicated error reports, Splunk APM counts only the server-side error spans in MetricSets and error totals.
For example, when service_a makes a call to service_b and both services are fully instrumented, Splunk APM receives the following two spans:
span_1, a span with span.kind = CLIENT that captures service_a making the call to service_b.
span_2, a span with span.kind = SERVER that captures service_b receiving the request.
If service_b returns a 500 error, both spans receive that error. To avoid double-counting, Splunk APM counts only the server-side span, span_2, as an error in MetricSets and error totals.
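The service_a and service_b example can be written out as a short sketch. The dict layout and the counted_error_spans helper are hypothetical. Only the span names, span kinds, and the 500 status code come from the example above.

```python
# Both spans record the same 500 error returned by service_b.
span_1 = {"name": "span_1", "service": "service_a", "kind": "CLIENT", "http.status_code": 500}
span_2 = {"name": "span_2", "service": "service_b", "kind": "SERVER", "http.status_code": 500}


def counted_error_spans(spans):
    """Return the names of spans that count toward error totals.

    Only server-side spans are counted, so an error seen by both the
    caller and the callee is reported once.
    """
    return [
        span["name"]
        for span in spans
        if span["kind"] == "SERVER" and 500 <= span["http.status_code"] <= 599
    ]


counted = counted_error_spans([span_1, span_2])
print(counted)  # ['span_2']
```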
What is the difference between an error and a root cause error?
To help you identify the root cause of an error, Splunk APM differentiates between errors and root cause errors. For instance, the request and error graph in Tag Spotlight differentiates root cause errors from total errors with a darker red color:
When a particular span (operation) within a trace results in an error, the error can propagate through other spans in the trace. Any span determined to contain an error based on the criteria described in How are error spans detected? is an error span. Splunk APM designates the originating error of a chain of error spans as the root cause error.
For instance, consider the checkout trace in the following screenshot:
The checkout service makes HTTP requests to the authorization service, the checkout service, and the payment service. The HTTP request to the payment service results in a 402 “Payment Required” error. Because the request to the payment service failed, the initiating requests to the checkout service and http.Request also result in errors.
In this case, the source error, or root cause error, is the 402 error in the payment service. The 500 errors appearing in the api services are subsequent errors.
The root cause error count indicates the count of these root cause errors, while the standard error count indicates the total count of all root cause errors as well as any subsequent errors.
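One way to picture the distinction is the sketch below, which splits the error spans of a trace into root cause errors and subsequent errors. It assumes that an error span is a root cause error when none of its child spans is itself an error span. This is a simplification for illustration, not Splunk APM's documented algorithm. The trace is a reduced, hypothetical version of the checkout example, with one span per service.

```python
def classify_errors(spans):
    """Split error spans into (root_causes, subsequent).

    Assumption: an error span with no error span among its children is
    a root cause error; every other error span is a subsequent error.
    """
    error_ids = {s["id"] for s in spans if s["is_error"]}
    parents_with_error_child = {
        s["parent_id"]
        for s in spans
        if s["id"] in error_ids and s["parent_id"] is not None
    }
    root_causes = sorted(error_ids - parents_with_error_child)
    subsequent = sorted(error_ids & parents_with_error_child)
    return root_causes, subsequent


# Simplified trace: the span layout and field names are hypothetical.
trace = [
    {"id": "api", "parent_id": None, "is_error": True},
    {"id": "checkout", "parent_id": "api", "is_error": True},
    {"id": "authorization", "parent_id": "checkout", "is_error": False},
    {"id": "payment", "parent_id": "checkout", "is_error": True},
]
root_causes, subsequent = classify_errors(trace)
root_cause_error_count = len(root_causes)                # 1
total_error_count = len(root_causes) + len(subsequent)   # 3
```

Here the payment span is the only root cause error, and the api and checkout spans are subsequent errors, so the root cause error count is 1 and the total error count is 3.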
How can you customize the error logic in Splunk APM?
In certain cases, you might want to modify your instrumentation to override defaults in the error logic or devise another method of tracking errors that matter to you.
4xx status codes as errors
By default, Splunk APM does not count server-side spans with 4xx status codes as errors, because a 4xx status code is often associated with a problem with the request itself, rather than a problem with the service handling a request.
For example, if a user makes a request to an endpoint that does not exist, the 404 status code the service returns does not mean there’s a problem with the service. Instead, it means there was a problem with the request, which is trying to call an endpoint that does not actually exist. Similarly, if a user tries to access a resource they don’t have access to, the service might return a 401 status code, which is typically not the result of an error on the server side.
However, depending on your application’s logic, a 4xx status code might actually represent a meaningful error, particularly for client-side requests. To monitor for 4xx errors, try doing the following:
Break down performance by HTTP status code span tags, if available. See Example use case: Alert on the rate of 401 errors for a service to learn more.
Customize your instrumentation to set the span.status of spans with meaningful 4xx status codes to Error.
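The second option can be sketched as a small decision helper that your instrumentation could call before it finishes a span. The helper name and the set of meaningful codes are hypothetical. In real instrumentation, you would apply the returned value through the set-status call of the OpenTelemetry API for your language.

```python
# Hypothetical: the 4xx codes that this application treats as real errors.
MEANINGFUL_4XX = frozenset({401, 403})


def span_status_for(status_code, meaningful_4xx=MEANINGFUL_4XX):
    """Return the span.status to set explicitly, or None to keep the default."""
    if status_code in meaningful_4xx:
        return "Error"
    return None
```

With these assumed codes, span_status_for(401) returns "Error", while span_status_for(404) returns None and leaves the default behavior in place.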
Example use case: Alert on the rate of 401 errors for a service
For example, if Kai wanted to alert on the rate of 401 errors returned by a given service, they would do the following:
Index the http.status_code span tag. See Index span tags to generate Troubleshooting MetricSets.
Create a custom Monitoring MetricSet on http.status_code for the service’s endpoints to get a time series for each status code. See Generate a Monitoring MetricSet with a custom dimension.
Set up an alert on the rate of 401 errors as compared to all requests. See Configure detectors and alerts in Splunk APM.
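The quantity that the alert in the last step compares against a threshold can be sketched as follows. The request counts and the 2% threshold are made-up values for illustration. In practice, the detector evaluates the time series produced by the Monitoring MetricSet rather than a Python dict.

```python
def rate_of_status(counts_by_status, status_code):
    """Return the share of all requests that ended with the given status code."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0  # No traffic in the window, so there is nothing to alert on.
    return counts_by_status.get(status_code, 0) / total


# Hypothetical request counts per http.status_code for one time window.
counts = {200: 940, 401: 50, 500: 10}

rate_401 = rate_of_status(counts, 401)  # 50 of 1000 requests
should_alert = rate_401 > 0.02          # hypothetical 2% threshold
```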
Customize error logic to discard 5xx status codes
By default, Splunk APM counts server-side spans with 5xx status codes as errors, because a 5xx error is typically associated with service unavailability.
For example, a 503: service too busy error in a server-side span counts as an error by default. If the service you’re monitoring is the front end of a public website, users who encounter a 503 error cannot use the website, which can result in lost user interactions or lost revenue. In this case, a 503 is a true error.
Depending on your application’s logic, however, you might not consider 5xx codes to be meaningful errors. For example, if your service is a batch processor, a 503 can be a normal flow control mechanism, simply triggering clients to retry their requests later. To override the default that counts 503 status codes as errors, you could modify your instrumentation to set span.status to OK in the spans where a 503 error is not a concern.
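The effect of such an override can be sketched as follows. The sketch assumes that an explicitly set span.status takes precedence over the default 5xx rule, which is how this section describes the override. The function name and the string status values are hypothetical.

```python
def counts_as_error(status, http_status_code):
    """Decide whether a server-side span counts as an error.

    Assumption: an explicit span.status of OK or Error takes precedence
    over the default rule based on http.status_code.
    """
    if status == "OK":
        return False   # Explicit override: this response is expected.
    if status == "Error":
        return True
    # No explicit status: fall back to the default 5xx rule.
    return 500 <= http_status_code <= 599
```

Under this assumption, a batch processor span with a 503 status code and span.status set to OK is not counted as an error, while the same span with no explicit status is.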