Use case: Troubleshoot an issue from the browser to the back end using Splunk Observability Cloud
Buttercup Games, a fictitious company, runs an e-commerce site to sell its products. They recently refactored their site to use a cloud-native approach with a microservices architecture and Kubernetes for the infrastructure.
Buttercup Games site reliability engineers (SREs) and service owners work together to monitor and maintain the site so that people have a great experience when they visit. One of the many reasons they took a cloud-native approach is that it makes their system observable by tools. They implemented Splunk Observability Cloud as their observability solution.
This use case describes how Kai, an SRE, and Deepu, a service owner, performed the following tasks using Splunk Observability Cloud to troubleshoot and identify the root cause of a recent Buttercup Games site incident:
1. Receive alerts about outlier behavior
2. Assess user impact using Splunk Real User Monitoring (RUM)
3. Investigate the root cause of a business workflow error using Splunk Application Performance Monitoring (APM)
4. Check on infrastructure health using Splunk Infrastructure Monitoring
5. Look for patterns in application errors in Splunk APM
6. Examine error logs for meaningful messages and patterns using Splunk Log Observer
7. Monitor a fix using Splunk Log Observer
8. Take preventative action and create metrics from logs to power dashboards and alerts
For a video version of this use case, see Splunk’s Observability Cloud Demo.
1. Receive alerts about outlier behavior
Kai, the on-call SRE, receives an alert showing that the number of purchases on the Buttercup Games site has dropped significantly in the past hour and the checkout completion rate is abnormally low.
Kai trusts that these are true outlier behaviors because the alert rule their team set up in Splunk Observability Cloud takes into account the time and day of week as a dynamic baseline, rather than simply using a static threshold.
Kai logs in to Splunk Observability Cloud on their laptop to investigate.
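To illustrate why a baseline that depends on time and day of week can be more trustworthy than a static threshold, here is a minimal Python sketch. It is not how Splunk Observability Cloud detectors are implemented. The function names, the three-standard-deviation rule, and the purchase counts are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def build_baseline(history):
    """Group historical purchase counts by (weekday, hour) slot."""
    slots = defaultdict(list)
    for ts, count in history:
        slots[(ts.weekday(), ts.hour)].append(count)
    return slots


def is_outlier(slots, ts, count, num_stdevs=3.0):
    """Return True if count is far below what is typical for this slot."""
    samples = slots.get((ts.weekday(), ts.hour), [])
    if len(samples) < 2:
        return False  # not enough history to judge this slot
    mu, sigma = mean(samples), stdev(samples)
    return count < mu - num_stdevs * sigma


# Hypothetical history: four Mondays at 14:00 and four Mondays at 03:00.
history = [
    (datetime(2024, 1, 1, 14), 500), (datetime(2024, 1, 8, 14), 520),
    (datetime(2024, 1, 15, 14), 480), (datetime(2024, 1, 22, 14), 510),
    (datetime(2024, 1, 1, 3), 40), (datetime(2024, 1, 8, 3), 35),
    (datetime(2024, 1, 15, 3), 45), (datetime(2024, 1, 22, 3), 38),
]
baseline = build_baseline(history)

# The same count of 40 purchases is judged differently depending on the slot.
afternoon_alert = is_outlier(baseline, datetime(2024, 1, 29, 14), 40)
night_alert = is_outlier(baseline, datetime(2024, 1, 29, 3), 40)
```

In this sketch, 40 purchases in an hour is flagged on a Monday afternoon but not on a Monday at 03:00. A single static threshold cannot make that distinction.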
Learn more
For details about setting up detectors and alerts, see Introduction to alerts and detectors in Splunk Observability Cloud.
For details about integrating alerts with notification services, like Splunk On-Call, PagerDuty, and Jira, see Send alert notifications to third-party services using Splunk Observability Cloud.
For details about creating a team to coordinate work in Splunk Observability Cloud, see Create and manage teams.
2. Assess user impact
The first thing Kai wants to know about the alert they just received is: What’s the user impact?
Kai opens Splunk Real User Monitoring (RUM) to look for clues about the issue based on the site’s browser-based performance. They notice two issues that may be related to the drop in purchases and the low checkout completion rate.
Kai isn’t sure whether the two issues are related or whether they are the cause of the problems on the site. They decide to dig into the high latency of the /cart/checkout endpoint because the page load time and largest contentful paint for /cart/checkout are also high.
Kai clicks the /cart/checkout endpoint link and sees multiple errors in the Tag Spotlight view in Splunk RUM. The errors don’t seem to be related to any one tag in particular, so they click the Example Sessions tab to look at example sessions.
Kai sees one session that seems to be taking longer than the others. They click it to see the full trace, from the front end through to the back end, where they can see that it is taking longer to complete than normal. Based on this example data, Kai understands that the latency isn’t a front-end problem and that they need to follow the trace through to the back end.
Kai clicks the APM link to get a performance summary, as well as access to the session’s trace and workflow details.
Kai decides to take a look at the end-to-end transaction workflow.
3. Investigate the root cause of a business workflow error
In Splunk RUM, Kai clicks the frontend:/cart/checkout business workflow link to display its service map in Splunk Application Performance Monitoring (APM). A business workflow is a group of logically related traces, such as a group of traces that reflect an end-to-end transaction in your system.
The service map shows Kai the dependency interactions among the full set of services backing the /cart/checkout action that they’re troubleshooting, including the propagation of errors from one service to another.
In particular, Kai sees that the paymentservice is having issues. Splunk APM has identified the issues as root cause errors, meaning that the paymentservice has the highest number of downstream errors that are contributing to a degraded experience for the workflow.
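One way to picture how an error can be attributed to the service where it starts, instead of to the services it propagates through, is the following Python sketch. It is an illustrative heuristic and not the algorithm that Splunk APM uses. The span fields and the `checkoutservice` and `cartservice` names are hypothetical; only `frontend` and `paymentservice` come from the scenario.

```python
from collections import Counter


def root_cause_error_counts(spans):
    """Count error spans per service that have no failing child span.

    Assumption for this sketch: an error span whose children all succeeded
    is treated as the place where the error originated.
    """
    parents_of_failed_children = {
        s["parent_id"] for s in spans if s["error"] and s["parent_id"]
    }
    counts = Counter()
    for s in spans:
        if s["error"] and s["span_id"] not in parents_of_failed_children:
            counts[s["service"]] += 1
    return counts


# Hypothetical trace: the error starts in paymentservice and propagates up.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend", "error": True},
    {"span_id": "b", "parent_id": "a", "service": "checkoutservice", "error": True},
    {"span_id": "c", "parent_id": "b", "service": "paymentservice", "error": True},
    {"span_id": "d", "parent_id": "b", "service": "cartservice", "error": False},
]
root_causes = root_cause_error_counts(spans)
```

Here three services report errors, but only `paymentservice` is counted, because the other two error spans each have a failing child.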
Kai selects the paymentservice. In addition to displaying more details about the service’s errors and latency, Splunk Observability Cloud surfaces Related Content tiles that provide access to relevant data in other areas of the application.
For example, Kai can look at the health of the Kubernetes cluster where the paymentservice is running or examine logs being issued by the paymentservice.
Kai decides to take a look at the Kubernetes cluster to see if the errors are based on an infrastructure issue.
Learn more
For details about business workflows, see Correlate traces to track Business Workflows.
For details about using Related Content, see Enable Related Content in Splunk Observability Cloud.
For more Splunk APM-specific use case-based documentation, see Use cases: Troubleshoot errors and monitor application performance using Splunk APM.
4. Check on infrastructure health
Kai clicks the K8s cluster(s) for paymentservice Related Content tile in Splunk APM to display the Kubernetes navigator in Splunk Infrastructure Monitoring, where their view is automatically narrowed down to the paymentservice to preserve the context they were just looking at in Splunk APM.
They click the paymentservice pod in the cluster map to dive deeper into the data.
Kai sees that the pod looks stable with no errors or events.
Now that Kai can rule out the Kubernetes infrastructure as the source of the issue, they decide to return to their investigation in Splunk APM. Kai clicks the paymentservice in map Related Content tile in their current view of Splunk Infrastructure Monitoring.
5. Look for patterns in application errors
Now back in Splunk APM, Kai clicks into Tag Spotlight to look for correlations in tag values for the errors they’re seeing.
For example, when Kai looks at the tenant.level module, they see that errors are occurring for all levels, so the root cause is likely not tenant-specific.
However, when Kai looks at the version module, they see an interesting pattern: errors are happening on version v350.10 only and not on the earlier v350.9 version.
This seems like a strong lead, so Kai decides to dig into the log details. They click the Logs for paymentservice Related Content tile.
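The comparison Kai makes in this step amounts to breaking down the error rate by the values of each tag. The Python sketch below shows that idea on a handful of made-up spans. It is an assumption-laden illustration and not the Tag Spotlight implementation; the span structure and the `gold` and `silver` tenant levels are hypothetical.

```python
from collections import defaultdict


def error_rate_by_tag(spans, tag):
    """Return the fraction of spans with errors for each value of a tag."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for span in spans:
        value = span["tags"].get(tag, "unknown")
        totals[value] += 1
        if span["error"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Hypothetical spans: errors occur on v350.10 regardless of tenant level.
spans = [
    {"tags": {"version": "v350.9", "tenant.level": "gold"}, "error": False},
    {"tags": {"version": "v350.9", "tenant.level": "silver"}, "error": False},
    {"tags": {"version": "v350.10", "tenant.level": "gold"}, "error": True},
    {"tags": {"version": "v350.10", "tenant.level": "silver"}, "error": True},
]
by_version = error_rate_by_tag(spans, "version")
by_tenant = error_rate_by_tag(spans, "tenant.level")
```

In this sample, the error rate is evenly spread across tenant levels but concentrated entirely in one version, which is the kind of skew that points toward a version-specific cause.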
6. Examine error logs for meaningful messages and patterns
Now in Splunk Log Observer, Kai’s view is automatically narrowed to display log data coming in for the paymentservice only.
Kai sees some error logs, so they click one to see more details in a structured view.
As Kai looks at the log details, they see this error message: Failed payment processing through ButtercupPayments: Invalid API Token (test-20e26e90-356b-432e-a2c6-956fc03f5609).
In the error message, Kai sees what they think is a clear indication of the error. The API token starts with test. It seems that a team pushed v350.10 live with a test token that doesn’t work in production.
Just to double-check their hypothesis, Kai clicks the error message and selects Add to filter to show only the logs that contain this error message.
Next, Kai changes the Group by method from severity to version.
Now, Kai can see that all of the logs that contain this test API token error are on version v350.10 and none are on version v350.9.
Just to be sure, Kai clicks the “eye” icon for the message filter value to temporarily exclude the filter. Now there are logs that show up for version v350.9 too, but they don’t include the error message.
This exploration convinces Kai that the test API token in v350.10 is the most likely source of the issue. Kai notifies Deepu, the paymentservice owner, about their findings.
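The filter and group-by steps in this section can be sketched in a few lines of Python. This is only an illustration of the logic, not a Splunk Log Observer API. The log record fields and the "Payment processed" message are hypothetical; the error message is the one quoted in the log details.

```python
from collections import Counter

ERROR_TEXT = "Invalid API Token"


def count_by_version(logs, message_filter=None):
    """Count logs per version, optionally keeping only matching messages."""
    selected = (
        log for log in logs
        if message_filter is None or message_filter in log["message"]
    )
    return Counter(log["version"] for log in selected)


token_error = (
    "Failed payment processing through ButtercupPayments: "
    "Invalid API Token (test-20e26e90-356b-432e-a2c6-956fc03f5609)"
)
# Hypothetical log records for the two versions.
logs = [
    {"version": "v350.10", "severity": "error", "message": token_error},
    {"version": "v350.10", "severity": "error", "message": token_error},
    {"version": "v350.9", "severity": "info", "message": "Payment processed"},
]

with_filter = count_by_version(logs, ERROR_TEXT)  # filter applied
without_filter = count_by_version(logs)           # filter temporarily excluded
```

With the filter applied, only v350.10 appears. With the filter excluded, v350.9 logs show up as well, which mirrors the check Kai performs with the "eye" icon.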
7. Monitor a fix
Based on Kai’s findings, Deepu, the paymentservice owner, looks at the error logs in Splunk Log Observer. They agree with Kai’s assessment that the test API token is the likely cause of the problem.
Deepu decides to implement a temporary fix by reverting to version v350.9 to bring the Buttercup Games site back into a known good state while the team works on a fix for v350.10.
As one way to see if reverting to version v350.9 fixed the issue, Deepu opens the time picker in the upper left corner of Splunk Log Observer and selects Live Tail. Live Tail provides Deepu with a real-time streaming view of a sample of incoming logs.
Deepu watches the Live Tail view and sure enough, the failed payment messages have stopped appearing in paymentservice logs. Reassured that the Buttercup Games site is back in a stable state, Deepu moves on to helping their team fix v350.10.
Learn more
For details about using Splunk Log Observer Live Tail view, see Verify changes to monitored systems with Live Tail.
8. Take preventative action and create metrics from logs to power dashboards and alerts
Now that Kai knows that this particular issue can cause a problem on the Buttercup Games site, they decide to do some preventative work for their SRE team. Kai takes the query they created in Splunk Log Observer and saves it as a metric.
Doing this defines log metricization rules that create a log-derived metric that shows aggregate counts. Kai’s team can embed this log-derived metric in charts, dashboards, and alerts that can help them identify this issue faster should it come up again in the future.
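Conceptually, a log-derived metric that shows aggregate counts turns matching log lines into a time series. The following Python sketch shows that idea by counting matching logs per minute. It does not reflect how log metricization rules are defined or evaluated in Splunk Log Observer; the function name, the one-minute bucket size, and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime


def count_per_minute(logs, match_text):
    """Count logs whose message contains match_text, bucketed by minute."""
    buckets = Counter()
    for log in logs:
        if match_text in log["message"]:
            minute = log["timestamp"].replace(second=0, microsecond=0)
            buckets[minute] += 1
    return dict(sorted(buckets.items()))


# Hypothetical log stream around the time of the incident.
logs = [
    {"timestamp": datetime(2024, 1, 29, 10, 0, 5), "message": "Invalid API Token"},
    {"timestamp": datetime(2024, 1, 29, 10, 0, 40), "message": "Invalid API Token"},
    {"timestamp": datetime(2024, 1, 29, 10, 1, 10), "message": "Invalid API Token"},
    {"timestamp": datetime(2024, 1, 29, 10, 1, 30), "message": "Payment processed"},
]
token_errors_per_minute = count_per_minute(logs, "Invalid API Token")
```

A series like this can be charted or compared against a threshold, which is what makes a log-derived metric useful for dashboards and alerts.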