Troubleshoot inputs with metrics.log

This topic is an example of a problem you can solve using metrics.log.

To learn about metrics.log, read "About metrics.log."
To learn about other log files, read "What Splunk logs about itself."

Example: Troubleshoot data inputs

You might want to identify a data input that has suddenly begun to generate uncharacteristically large numbers of events. If this input is hidden in a large quantity of similar data, it can be difficult to determine which one is actually the problem. You can find it by searching the internal index (add index=_internal to your search) or just look in metrics.log itself in $SPLUNK_HOME/var/log/splunk.

There's a lot more in metrics.log than just volume data, but for now let's focus on investigating data inputs.

For incoming events, the amount of data processed is in the thruput group, as in per_host_thruput. In this example, you're only indexing data from one host, so per_host_thruput actually can tell us something useful: that right now host "grumpy" indexes around 8k in a 30-second period. Since there is only one host, you can add it all up and get a good picture of what you're indexing, but if you had more than 10 hosts you would only get a sample.

03-13-2008 10:49:57.634 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.245401, eps=1.774194, kb=7.607422
03-13-2008 10:50:28.642 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.237053, eps=1.612903, kb=7.348633
03-13-2008 10:50:59.648 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.217584, eps=1.548387, kb=6.745117
03-13-2008 10:51:30.656 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.245621, eps=1.741935, kb=7.614258
03-13-2008 10:52:01.661 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.311051, eps=2.290323, kb=9.642578
03-13-2008 10:52:32.669 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.296938, eps=2.322581, kb=9.205078
03-13-2008 10:53:03.677 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.261593, eps=1.838710, kb=8.109375
03-13-2008 10:53:34.686 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.263136, eps=2.032258, kb=8.157227
03-13-2008 10:54:05.692 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.261530, eps=1.806452, kb=8.107422
03-13-2008 10:54:36.699 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.313855, eps=2.354839, kb=9.729492

For example, you might know that access_common is a popular source type for events on this Web server, so it would give you a good idea of what was happening:

03-13-2008 10:51:30.656 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.022587, eps=0.193548, kb=0.700195
03-13-2008 10:52:01.661 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.053585, eps=0.451613, kb=1.661133
03-13-2008 10:52:32.670 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.031786, eps=0.419355, kb=0.985352
03-13-2008 10:53:34.686 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.030998, eps=0.387097, kb=0.960938
03-13-2008 10:54:36.700 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.070092, eps=0.612903, kb=2.172852
03-13-2008 10:56:09.722 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.023564, eps=0.290323, kb=0.730469
03-13-2008 10:56:40.730 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.006048, eps=0.096774, kb=0.187500
03-13-2008 10:57:11.736 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.017578, eps=0.161290, kb=0.544922
03-13-2008 10:58:13.748 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.025611, eps=0.225806, kb=0.793945

But you probably have more than 10 source types, so at any particular time some other one could spike and access_common wouldn't be reported. per_index_thruput and per_source_thruput work similarly.

With this in mind, let's examine the standard saved search "KB indexed per hour last 24 hours".

index=_internal metrics group=per_index_thruput NOT debug NOT sourcetype=splunk_web_access | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB

This means: look in the internal index for metrics data of group per_index_thruput, ignore some internal stuff and make a report showing the sum of the kb values. For cleverness, we'll also rename the output to something meaningful, "totalKB". The result looks like this:

sum of kb vs. time for results in the past day
_time totalKB
1 03/12/2008 11:00:00 922.466802
2 03/12/2008 12:00:00 1144.674811
3 03/12/2008 13:00:00 1074.541995
4 03/12/2008 14:00:00 2695.178730
5 03/12/2008 15:00:00 1032.747082
6 03/12/2008 16:00:00 898.662123

Those totalKB values just come from the sum of kb over a one hour interval. If you like, you can change the search and get just the ones from grumpy:

index=_internal metrics grumpy group=per_host_thruput | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB

sum of kb vs. time for results in the past day
_time totalKB
1 03/12/2008 11:00:00 746.471681
2 03/12/2008 12:00:00 988.568358
3 03/12/2008 13:00:00 936.092772
4 03/12/2008 14:00:00 2529.226566
5 03/12/2008 15:00:00 914.945313
6 03/12/2008 16:00:00 825.353518

We see that grumpy was unusually active in the 2 pm time bin. With this knowledge, we can start to hunt down the culprit by, for example, source type or host.

Answers

Have questions? Visit Splunk Answers and see what questions and answers the Splunk community has about working with metrics.log.

Related answers from Splunk Community

Troubleshoot inputs with metrics.log

Example: Troubleshoot data inputs

Answers

Comments

Troubleshoot inputs with metrics.log

Was this topic useful?