Find the root cause of an error using Tag Spotlight
Deepu, the payment service owner at Buttercup Games, receives an alert that the error rate of the payment service has risen above the threshold set for a Splunk APM detector. After checking the infrastructure and finding no problem, Deepu reviews the Request/Errors rate and Latency that Splunk APM generated for the service on the alert.
In the meantime, Deepu receives a notification from Kai, the site reliability engineer. The notification says that a high root cause error rate on the /PaymentService/Charge endpoint is preventing customers from shopping on the Buttercup Games website. The notification also includes a link to the endpoint on the Splunk APM service map. Deepu clicks the paymentservice node on the service map and selects the Tag Spotlight view on the sidebar.
Scanning through the requests and errors correlated with each indexed tag, Deepu sees that the errors are evenly distributed across the values of every tag except the version tag. All errors occur in version 350.10, a recent code release to the service. Deepu notifies the engineers and rolls back to the previous release, version 350.9, to keep the site running while they solve the issue.
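The comparison Deepu makes by eye in Tag Spotlight can be illustrated with a minimal Python sketch. It is not Splunk's implementation and does not call any Splunk API. The span records, the `region` tag, and the helper names `error_rates_by_tag` and `most_skewed_tag` are hypothetical, chosen only to show how grouping requests and errors by tag value makes one tag stand out.

```python
from collections import defaultdict

# Hypothetical sample data: 40 spans across two regions and two versions.
# Only version 350.10 produces errors, mirroring the scenario above.
spans = []
for region in ("us-east", "us-west"):
    for version in ("350.9", "350.10"):
        for i in range(10):
            spans.append({
                "tags": {"region": region, "version": version},
                "error": version == "350.10" and i < 4,
            })


def error_rates_by_tag(spans):
    """Return {tag_name: {tag_value: (requests, errors, error_rate)}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for span in spans:
        for name, value in span["tags"].items():
            bucket = counts[name][value]
            bucket[0] += 1          # every span counts as one request
            if span["error"]:
                bucket[1] += 1      # count the span as an error as well
    return {
        name: {
            value: (requests, errors, errors / requests)
            for value, (requests, errors) in values.items()
        }
        for name, values in counts.items()
    }


def most_skewed_tag(rates):
    """Return the tag whose values differ most in error rate (max minus min)."""
    def spread(values):
        observed = [rate for _, _, rate in values.values()]
        return max(observed) - min(observed)
    return max(rates, key=lambda name: spread(rates[name]))


rates = error_rates_by_tag(spans)
for name, values in rates.items():
    for value, (requests, errors, rate) in values.items():
        print(f"{name}={value}: {errors}/{requests} errors ({rate:.0%})")
print("Tag with the most uneven error distribution:", most_skewed_tag(rates))
```

In this sample the error rate is the same for both `region` values, while the `version` tag separates a clean release from a failing one. That is the same pattern that points Deepu to version 350.10.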
Deepu narrows the investigation to the code in version 350.10 of the /PaymentService/Charge endpoint and clicks the Request/Errors chart to get an example trace that shows what the error is. Because Deepu enabled Related Content in Splunk APM, Deepu can click Logs for trace to switch to Splunk Log Observer for further troubleshooting.
For details about how to configure Splunk APM detectors, see Configure detectors and alerts in Splunk APM.
For details about Tag Spotlight, see Analyze service performance with Tag Spotlight in Splunk Observability Cloud.
For details about using Related Content, see Enable Related Content in Splunk Observability Cloud.
For more information about using Splunk Log Observer to detect the source of problems, see Introduction to Splunk Log Observer.