Monitoring
This documentation does not apply to the most recent version of Splunk.
You can use Splunk to monitor your application deployment for availability, performance, and normalized status. In Splunk, it's not that big a step from troubleshooting to monitoring. You can take the same information you use to troubleshoot and create alerts, reports, and dashboards. You can track things that you know cause problems, like operating system degradation, web server errors or missing pages, or increases in transaction time. In addition, since the information you have in Splunk gives a deep picture of your log data over time, you can step back from problems and use Splunk to map and understand what is normal for your environment. Understanding what's typical helps you to build smarter alerts and manage your systems in a proactive manner.
Note: This topic is primarily an outline, which will be expanded over time. The outline and contents may change.
What to monitor
You can monitor using the same logs and inputs that you use for troubleshooting. But you may also find that you want to incorporate additional information to get a more holistic view of your application and its environment. For example, monitoring OS and device metrics may allow you to find the causes of performance degradation before it becomes a problem.
Sources to monitor that are specifically related to availability and performance include:
- Server uptimes: If you are forwarding logs from the servers in your application environment, you can alert when you do not receive logs from a particular server.
- Critical process monitoring: Monitor whether your critical processes are running, along with their current CPU and memory usage.
- Web Services invocations: Each time your application makes a call, the WSDL name of the operation is logged.
- JMS or MQ invocations: Asynchronous operations such as JMS or MQ invocations appear as SOAP messages in the logs.
- Number of services accessed/sec
- Response time from end to end (duration) using the delta command
- The number of services being used
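The server uptime check in the list above comes down to comparing each server's most recent event time against the current time. The following is a minimal Python sketch of that logic, outside of Splunk, assuming you already have the newest event timestamp for each host. The host names, timestamps, and 15-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def silent_hosts(last_seen, now, max_silence=timedelta(minutes=15)):
    """Return the hosts whose newest event is older than max_silence, sorted by name."""
    return sorted(host for host, ts in last_seen.items() if now - ts > max_silence)


# Hypothetical data: the newest event time received from each forwarding server.
now = datetime(2010, 6, 1, 12, 0)
last_seen = {
    "web01": datetime(2010, 6, 1, 11, 58),  # 2 minutes ago
    "web02": datetime(2010, 6, 1, 11, 20),  # 40 minutes ago
    "db01": datetime(2010, 6, 1, 11, 50),   # 10 minutes ago
}

print(silent_hosts(last_seen, now))  # ['web02']
```

A threshold that is too tight produces false alarms for servers that simply log infrequently, so it helps to pick the value from what is normal for each server.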
Monitor performance and availability
The following are suggested methods for monitoring performance and availability with Splunk.
- Typical causes of problems:
  - Application versioning (for example, V2 shipped yesterday and performed well at first, then degraded after 24 hours)
  - Gradual application or platform problems (e.g., memory leak)
  - Intermittent issues
  - Configuration change (e.g., DB index change)
  - Hardware issues (e.g., failed disk)
  - Network problems (e.g., router unavailable)
- What do you want to monitor?
  - Performance of each component, from the top down
  - Changes to critical configuration files, which is useful for forensics and root-cause analysis
  - Availability of the entire application and of each component (for example, by tracking 404/500 errors). Even if the application as a whole is available, some pieces may not be.
- Where to get info?
  - Simulated transactions can help identify application availability and performance problems. You can run them with existing tools (e.g., HTTP ping) or with custom scripts. Simulated transaction data coupled with Splunk can help you identify availability and performance issues before they are visible to the end user.
  - Logs
  - Counters (these vary by platform and application)
  - Unusual behavior (e.g., 50% of normal event count)
  - Dependency check (e.g., test Oracle query)
  - Ask the app (e.g., "/AppCheck.jsp")
  - Suggestion: use all of the above!
- Cross-tier performance troubleshooting
  - Transactions
  - Using summary indexing
- Reporting and proactive monitoring
  - Examples of dashboards and how they're used in practice (include search text)
  - Description and example of alerting
  - SLA validation
  - Spotting trends
  - Measuring impact of changes (e.g., "V2 vs. V1")
  - Correlating failures with other things that are happening
  - Capacity planning
- Integration with other monitoring and performance management tools
Additional resources for monitoring availability may come from your development team, who often have an availability-centric tool or page you can leverage.
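The simulated transactions mentioned above can be as simple as a script that requests a page, times the response, and writes one log line per attempt. The following Python sketch is one possible shape for such a script, not a Splunk-provided tool. The function name `probe`, the field names `probe_url`, `status`, and `response_ms`, and the 10-second timeout are all assumptions chosen for illustration. The output uses key=value pairs so that the values can be picked up as fields at search time.

```python
import time
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen, clock=time.monotonic):
    """Request url once and return a key=value log line describing the result."""
    start = clock()
    try:
        with opener(url, timeout=10) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error such as 404 or 500.
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:
        # No usable answer: DNS failure, refused connection, timeout, and so on.
        status = type(exc).__name__
    elapsed_ms = round((clock() - start) * 1000)
    return f'probe_url="{url}" status={status} response_ms={elapsed_ms}'
```

You could call `probe()` on a schedule against a status page such as the "/AppCheck.jsp" example above and index the printed lines like any other log. The `opener` and `clock` parameters exist only so that the function can be exercised without a network connection.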
Monitor normalized status
Gathering information about the baseline performance of your application can help you create intelligent alerts. Splunk includes an extensive set of statistical operators that you can use to find average values over time, the most common or most rare values of a field, and so on.
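To make the idea of a baseline concrete, the following Python sketch computes the mean and standard deviation of a set of past measurements and flags recent values that fall far outside them. It illustrates the arithmetic only and is not how Splunk's statistical operators are implemented. The sample response times and the threshold of three standard deviations are hypothetical.

```python
from statistics import mean, pstdev


def outliers(baseline, recent, threshold=3.0):
    """Return the recent values more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is unusual.
        return [v for v in recent if v != mu]
    return [v for v in recent if abs(v - mu) / sigma > threshold]


# Hypothetical response times in milliseconds.
baseline = [120, 130, 125, 135, 128, 122, 131, 127]
recent = [126, 133, 410, 119]

print(outliers(baseline, recent))  # [410]
```

An alert built on this kind of comparison adapts to each system's normal range, where a fixed threshold would not.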
Here are some ways to look for anomalies in your data:
- Search for unexpected events by looking at those that do not cluster into large groups. For example, you can cluster the errors in the last hour and report on the events that belong in the smallest clusters:
error | cluster showcount=true | sort cluster_count | head 5
- Find unexpected events by finding values that are far from the mean, measured in standard deviations. For example, you can search for sendmail events with anomalous 'delay' values:
sourcetype=sendmail_syslog | anomalousvalue delay action=filter pthresh=0.02
- Use machine learning to find events that have unexpected values based on historical context:
* | anomalies blacklist=boringevents
- Use graphical reports to make anomalies visibly obvious. Sometimes something as simple as a timechart of average cpu_seconds by host can help you see problems:
sourcetype=top | timechart avg(cpu_seconds) by host
- Expand Splunk by defining your own search operators. If you know how to find the events that interest you, you can write a simple script and integrate it with Splunk.
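As an illustration of the first technique in the list above, the following Python sketch groups messages by a crude pattern and reports the least common groups. It is a deliberately simplified stand-in and is not the algorithm behind the cluster command: here two messages belong to the same group only if they are identical once every run of digits is replaced with N. The function name `rarest_patterns` and the sample messages are hypothetical.

```python
import re
from collections import Counter


def rarest_patterns(messages, n=5):
    """Group messages by a crude pattern and return the n least common (pattern, count) pairs."""
    patterns = Counter(re.sub(r"\d+", "N", message) for message in messages)
    # Sort by count first, then by pattern text so that ties come out in a stable order.
    return sorted(patterns.items(), key=lambda item: (item[1], item[0]))[:n]


# Hypothetical error messages from the last hour.
messages = [
    "error: timeout after 30s on host 12",
    "error: timeout after 45s on host 7",
    "error: timeout after 31s on host 12",
    "error: disk 3 failed",
    "error: timeout after 60s on host 9",
]

print(rarest_patterns(messages, n=1))  # [('error: disk N failed', 1)]
```

The timeouts collapse into one large group, which leaves the single disk failure as the event worth a closer look.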
This documentation applies to the following versions of Splunk: 4.1, 4.1.1, 4.1.2, 4.1.3, 4.1.4, 4.1.5, 4.1.6, 4.1.7, 4.1.8