Troubleshoot inputs with metrics.log
This topic is an example of a problem you can solve using metrics.log.
- To learn about metrics.log, read "About metrics.log."
- To learn about other log files, read "What Splunk logs about itself."
Example: Troubleshoot data inputs
You might want to identify a data input that has suddenly begun to generate an uncharacteristically large number of events. If this input is hidden in a large quantity of similar data, it can be difficult to determine which one is actually the problem. You can find it by searching the internal index (add index=_internal to your search) or by looking in metrics.log itself in $SPLUNK_HOME/var/log/splunk.
There's a lot more in metrics.log than just volume data, but for now let's focus on investigating data inputs.
For incoming events, the amount of data processed is in the thruput groups, such as per_host_thruput. In this example, you're only indexing data from one host, so per_host_thruput can actually tell us something useful: right now host "grumpy" indexes around 8 KB in each 30-second period. Since there is only one host, you can add it all up and get a good picture of what you're indexing. If you had more than 10 hosts, you would only get a sample, because by default metrics.log reports only the 10 busiest series in each group per interval.
03-13-2008 10:49:57.634 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.245401, eps=1.774194, kb=7.607422
03-13-2008 10:50:28.642 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.237053, eps=1.612903, kb=7.348633
03-13-2008 10:50:59.648 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.217584, eps=1.548387, kb=6.745117
03-13-2008 10:51:30.656 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.245621, eps=1.741935, kb=7.614258
03-13-2008 10:52:01.661 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.311051, eps=2.290323, kb=9.642578
03-13-2008 10:52:32.669 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.296938, eps=2.322581, kb=9.205078
03-13-2008 10:53:03.677 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.261593, eps=1.838710, kb=8.109375
03-13-2008 10:53:34.686 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.263136, eps=2.032258, kb=8.157227
03-13-2008 10:54:05.692 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.261530, eps=1.806452, kb=8.107422
03-13-2008 10:54:36.699 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.313855, eps=2.354839, kb=9.729492
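If you read metrics.log directly rather than searching it, it can help to split each line into fields first. The following is a minimal Python sketch, not Splunk tooling: the function name parse_metrics_line and the regular expression are illustrative assumptions, and the line layout is inferred only from the sample events shown above.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the sample lines in this topic:
# "<MM-DD-YYYY HH:MM:SS.mmm> <LEVEL> Metrics - key=value, key=value, ..."
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>\w+)\s+Metrics - (?P<fields>.+)$"
)
NUMERIC_FIELDS = {"kbps", "eps", "kb"}


def parse_metrics_line(line):
    """Return a dict of fields for one metrics.log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    event = {
        "_time": datetime.strptime(match.group("ts"), "%m-%d-%Y %H:%M:%S.%f"),
        "level": match.group("level"),
    }
    # Assumes field values never contain ", " themselves, which holds for the samples.
    for pair in match.group("fields").split(", "):
        key, _, value = pair.partition("=")
        value = value.strip('"')
        event[key] = float(value) if key in NUMERIC_FIELDS else value
    return event


SAMPLE = (
    '03-13-2008 10:49:57.634 INFO Metrics - group=per_host_thruput, '
    'series="grumpy", kbps=0.245401, eps=1.774194, kb=7.607422'
)
event = parse_metrics_line(SAMPLE)
```

For this sample line, event["kb"] / event["kbps"] comes out at roughly 31, which matches the spacing of about 31 seconds between consecutive timestamps in the sample.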
For example, you might know that access_common is a popular source type for events on this Web server, so its per_sourcetype_thruput series would give you a good idea of what was happening:
03-13-2008 10:51:30.656 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.022587, eps=0.193548, kb=0.700195
03-13-2008 10:52:01.661 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.053585, eps=0.451613, kb=1.661133
03-13-2008 10:52:32.670 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.031786, eps=0.419355, kb=0.985352
03-13-2008 10:53:34.686 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.030998, eps=0.387097, kb=0.960938
03-13-2008 10:54:36.700 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.070092, eps=0.612903, kb=2.172852
03-13-2008 10:56:09.722 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.023564, eps=0.290323, kb=0.730469
03-13-2008 10:56:40.730 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.006048, eps=0.096774, kb=0.187500
03-13-2008 10:57:11.736 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.017578, eps=0.161290, kb=0.544922
03-13-2008 10:58:13.748 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.025611, eps=0.225806, kb=0.793945
But you probably have more than 10 source types, so at any particular time some other one could spike and access_common wouldn't be reported. per_index_thruput and per_source_thruput work similarly.
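Because the per-host and per-source-type lines shown in this topic share the same field layout, one routine can total kb for any of these groups. The following Python sketch assumes that layout also holds for per_index_thruput and per_source_thruput; the name total_kb_by_series and the regular expression are illustrative, not part of Splunk.

```python
import re
from collections import defaultdict

# Matches key=value pairs; values are either double-quoted or run up to a comma or space.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def total_kb_by_series(lines):
    """Sum kb per (group, series) over raw metrics.log lines from per_*_thruput groups."""
    totals = defaultdict(float)
    for line in lines:
        fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
        group = fields.get("group", "")
        # Skip anything that is not a per_*_thruput line with a series and a kb value.
        if not group.endswith("_thruput") or "series" not in fields or "kb" not in fields:
            continue
        totals[(group, fields["series"])] += float(fields["kb"])
    return dict(totals)


SAMPLE_LINES = [
    '03-13-2008 10:51:30.656 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.245621, eps=1.741935, kb=7.614258',
    '03-13-2008 10:52:01.661 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.311051, eps=2.290323, kb=9.642578',
    '03-13-2008 10:51:30.656 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.022587, eps=0.193548, kb=0.700195',
]
totals = total_kb_by_series(SAMPLE_LINES)
```

Keep in mind that a series missing from an interval contributes nothing to its total, so these sums understate any series that dropped out of the reported sample.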
With this in mind, let's examine the standard saved search "KB indexed per hour last 24 hours".
index=_internal metrics group=per_index_thruput NOT debug NOT sourcetype=splunk_web_access | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB
This means: look in the internal index for metrics data of group per_index_thruput, exclude events that contain the term debug and events of source type splunk_web_access, and make a report showing the sum of the kb values in one-hour bins. We also rename the output to something meaningful, "totalKB". The result looks like this:
sum of kb vs. time for results in the past day
   _time                 totalKB
1  03/12/2008 11:00:00   922.466802
2  03/12/2008 12:00:00   1144.674811
3  03/12/2008 13:00:00   1074.541995
4  03/12/2008 14:00:00   2695.178730
5  03/12/2008 15:00:00   1032.747082
6  03/12/2008 16:00:00   898.662123
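The timechart step in that search amounts to placing each event in a one-hour bin and summing kb per bin. As a rough illustration only, here is the same idea in Python. The function name sum_kb_per_hour and its input shape, a list of (timestamp, kb) pairs, are assumptions made for this sketch.

```python
from collections import defaultdict
from datetime import datetime


def sum_kb_per_hour(samples):
    """Sum kb into one-hour bins keyed by the start of each hour.

    samples is an iterable of (datetime, kb) pairs.
    """
    totals = defaultdict(float)
    for timestamp, kb in samples:
        # Truncate the timestamp to the top of its hour to get the bin key.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += kb
    # Return the bins in chronological order.
    return dict(sorted(totals.items()))
```

Unlike a chart, this sketch returns only the hours that contain at least one sample.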
Those totalKB values are simply the sum of kb over each one-hour interval. If you like, you can change the search to get just the values from grumpy:
index=_internal metrics grumpy group=per_host_thruput | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB
sum of kb vs. time for results in the past day
   _time                 totalKB
1  03/12/2008 11:00:00   746.471681
2  03/12/2008 12:00:00   988.568358
3  03/12/2008 13:00:00   936.092772
4  03/12/2008 14:00:00   2529.226566
5  03/12/2008 15:00:00   914.945313
6  03/12/2008 16:00:00   825.353518
We see that grumpy was unusually active in the 2 p.m. time bin, indexing about 2529 KB against roughly 750 to 990 KB in the other hours. With this knowledge, we can start to hunt down the culprit by, for example, source type or host.
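Spotting the unusual bin by eye works for six rows, but a simple rule can do the same for longer ranges. The Python sketch below flags any hour whose total exceeds a multiple of the median. The function name flag_spikes and the factor of 2 are arbitrary assumptions for illustration, not a Splunk feature or a recommended threshold. The totals are the grumpy values from the table above.

```python
from statistics import median


def flag_spikes(hourly_totals, factor=2.0):
    """Return the bins whose total is more than factor times the median total."""
    baseline = median(hourly_totals.values())
    return [hour for hour, kb in hourly_totals.items() if kb > factor * baseline]


# Hourly totalKB values for grumpy, copied from the table above.
GRUMPY_TOTALS = {
    "03/12/2008 11:00:00": 746.471681,
    "03/12/2008 12:00:00": 988.568358,
    "03/12/2008 13:00:00": 936.092772,
    "03/12/2008 14:00:00": 2529.226566,
    "03/12/2008 15:00:00": 914.945313,
    "03/12/2008 16:00:00": 825.353518,
}

spikes = flag_spikes(GRUMPY_TOTALS)  # ['03/12/2008 14:00:00']
```

The median is used as the baseline because, unlike the mean, it is barely moved by the spike itself.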
Answers
Have questions? Visit Splunk Answers and see what questions and answers the Splunk community has about working with metrics.log.
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9, 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10, 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.1.13, 8.1.14, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 8.2.10, 8.2.11, 8.2.12, 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4, 9.0.5, 9.0.6, 9.0.7, 9.0.8, 9.0.9, 9.0.10, 9.1.0, 9.1.1, 9.1.2, 9.1.3, 9.1.4, 9.1.5, 9.1.6, 9.1.7, 9.2.0, 9.2.1, 9.2.2, 9.2.3, 9.2.4, 9.3.0, 9.3.1, 9.3.2