Work with metrics.log
This documentation does not apply to the most recent version of Splunk. Click here for the latest version.
Work with metrics.log
Splunk's internal metrics.log file is a sampling of the top ten items in each category for 30-second intervals, based on the size of _raw. It does not give you an exact accounting of all your inputs, just the top 10 hot data sources. You can examine its contents for limited analysis of volume trends for data inputs. It is different from the numbers reported by LicenseManager, which include the indexed fields. Also, the default configuration only maintains the metrics data in the internal index a few days, but by going to the files you can see trends over a period of months if your rolled files go that far back.
Note: You can change the number of series that metrics.log tracks from the default of 10 by editing the value of maxseries in the [metrics] stanza in limits.conf.
You can find more information about metrics.log in this developer blog posting about forwarder and indexer metrics and in this more recent post on the Splunk Community Wiki.
Use metrics log to troubleshoot issues with data inputs
You might want to identify a data input that has suddenly begun to generate uncharacteristically large numbers of events. If this input is hidden in a large quantity of similar data, it can be difficult to determine which one is actually the problem. You can find it by searching the internal index (add index=_internal to your search) or just look in metrics.log itself in $SPLUNK_HOME/var/log/splunk.
A typical metrics.log has stuff like this:
03-13-2008 10:48:55.620 INFO Metrics - group=pipeline, name=tail, processor=tail, cpu_seconds=0.000000, executes=31, cumulative_hits=73399 03-13-2008 10:48:55.620 INFO Metrics - group=pipeline, name=typing, processor=annotator, cpu_seconds=0.000000, executes=63, cumulative_hits=134912 03-13-2008 10:48:55.620 INFO Metrics - group=pipeline, name=typing, processor=clusterer, cpu_seconds=0.000000, executes=63, cumulative_hits=134912 03-13-2008 10:48:55.620 INFO Metrics - group=pipeline, name=typing, processor=readerin, cpu_seconds=0.000000, executes=63, cumulative_hits=134912 03-13-2008 10:48:55.620 INFO Metrics - group=pipeline, name=typing, processor=sendout, cpu_seconds=0.000000, executes=63, cumulative_hits=134912 03-13-2008 10:48:55.620 INFO Metrics - group=thruput, name=index_thruput, instantaneous_kbps=0.302766, instantaneous_eps=2.129032, average_kbps=0.000000, total_k_processed=19757, load_average=0.124023 03-13-2008 10:48:55.620 INFO Metrics - group=per_host_thruput, series="fthost", kbps=0.019563, eps=0.096774, kb=0.606445 03-13-2008 10:48:55.620 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.283203, eps=2.032258, kb=8.779297 03-13-2008 10:48:55.620 INFO Metrics - group=per_index_thruput, series="_internal", kbps=0.275328, eps=1.903226, kb=8.535156 03-13-2008 10:48:55.620 INFO Metrics - group=per_index_thruput, series="_thefishbucket", kbps=0.019563, eps=0.096774, kb=0.606445 03-13-2008 10:48:55.620 INFO Metrics - group=per_index_thruput, series="default", kbps=0.007876, eps=0.129032, kb=0.244141 03-13-2008 10:48:55.620 INFO Metrics - group=per_source_thruput, series="/applications/splunk3.2/var/log/splunk/metrics.log", kbps=0.272114, eps=1.870968, kb=8.435547 03-13-2008 10:48:55.620 INFO Metrics - group=per_source_thruput, series="/applications/splunk3.2/var/log/splunk/splunkd.log", kbps=0.003213, eps=0.032258, kb=0.099609 03-13-2008 10:48:55.620 INFO Metrics - group=per_source_thruput, series="/var/log/apache2/somedomain_access_log", kbps=0.007876, eps=0.096774, kb=0.244141 03-13-2008 10:48:55.620 INFO Metrics - group=per_source_thruput, series="filetracker", kbps=0.019563, eps=0.096774, kb=0.606445 03-13-2008 10:48:55.620 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.007876, eps=0.129032, kb=0.244141 03-13-2008 10:48:55.620 INFO Metrics - group=per_sourcetype_thruput, series="filetrackercrclog", kbps=0.019563, eps=0.096774, kb=0.606445 03-13-2008 10:48:55.620 INFO Metrics - group=per_sourcetype_thruput, series="splunkd", kbps=0.275328, eps=1.903226, kb=8.535156 03-13-2008 10:48:55.620 INFO Metrics - group=queue, name=aeq, max_size=10, filled_count=0, empty_count=0, current_size=0, largest_size=0, smallest_size=0 03-13-2008 10:48:55.620 INFO Metrics - group=queue, name=aq, max_size=10, filled_count=0, empty_count=0, current_size=0, largest_size=0, smallest_size=0 03-13-2008 10:48:55.620 INFO Metrics - group=queue, name=tailingq, current_size=0, largest_size=0, smallest_size=0 03-13-2008 10:48:55.620 INFO Metrics - group=queue, name=udp_queue, max_size=1000, filled_count=0, empty_count=0, current_size=0, largest_size=0, smallest_size=0
There's a lot more there than just volume data, but for now let's focus on investigating data inputs.
group identifies what type of thing is being reported on and series gives the particular item.
For incoming events, the amount of data processed is in the thruput group, as in per_host_thruput. In this example, you're only indexing data from one host, so per_host_thruput actually can tell us something useful: that right now host "grumpy" indexes around 8k in a 30-second period. Since there is only one host, you can add it all up and get a good picture of what you're indexing, but if you had more than 10 hosts you would only get a sample.
03-13-2008 10:49:57.634 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.245401, eps=1.774194, kb=7.607422 03-13-2008 10:50:28.642 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.237053, eps=1.612903, kb=7.348633 03-13-2008 10:50:59.648 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.217584, eps=1.548387, kb=6.745117 03-13-2008 10:51:30.656 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.245621, eps=1.741935, kb=7.614258 03-13-2008 10:52:01.661 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.311051, eps=2.290323, kb=9.642578 03-13-2008 10:52:32.669 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.296938, eps=2.322581, kb=9.205078 03-13-2008 10:53:03.677 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.261593, eps=1.838710, kb=8.109375 03-13-2008 10:53:34.686 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.263136, eps=2.032258, kb=8.157227 03-13-2008 10:54:05.692 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.261530, eps=1.806452, kb=8.107422 03-13-2008 10:54:36.699 INFO Metrics - group=per_host_thruput, series="grumpy", kbps=0.313855, eps=2.354839, kb=9.729492
For example, you might know that access_common is a popular sourcetype for events on this Web server, so it would give you a good idea of what was happening:
03-13-2008 10:51:30.656 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.022587, eps=0.193548, kb=0.700195 03-13-2008 10:52:01.661 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.053585, eps=0.451613, kb=1.661133 03-13-2008 10:52:32.670 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.031786, eps=0.419355, kb=0.985352 03-13-2008 10:53:34.686 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.030998, eps=0.387097, kb=0.960938 03-13-2008 10:54:36.700 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.070092, eps=0.612903, kb=2.172852 03-13-2008 10:56:09.722 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.023564, eps=0.290323, kb=0.730469 03-13-2008 10:56:40.730 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.006048, eps=0.096774, kb=0.187500 03-13-2008 10:57:11.736 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.017578, eps=0.161290, kb=0.544922 03-13-2008 10:58:13.748 INFO Metrics - group=per_sourcetype_thruput, series="access_common", kbps=0.025611, eps=0.225806, kb=0.793945
But you have probably got more than 10 sourcetypes, so at any particular time some other one could spike and access_common wouldn't be reported. per_index_thruput and per_source_thruput work similarly.
With this in mind, let's examine the standard saved search "KB indexed per hour last 24 hours".
index::_internal metrics group=per_index_thruput NOT debug NOT sourcetype::splunk_web_access | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB
This means: look in the internal index for metrics data of group per_index_thruput, ignore some internal stuff and make a report showing the sum of the kb values. For cleverness, we'll also rename the output to something meaningful, "totalKB". The result looks like this:
sum of kb vs. time for results in the past day _time totalKB 1 03/12/2008 11:00:00 922.466802 2 03/12/2008 12:00:00 1144.674811 3 03/12/2008 13:00:00 1074.541995 4 03/12/2008 14:00:00 2695.178730 5 03/12/2008 15:00:00 1032.747082 6 03/12/2008 16:00:00 898.662123
Those totalKB values just come from the sum of kb over a one hour interval. If you like, you can change the search and get just the ones from grumpy:
index::_internal metrics grumpy group=per_host_thruput | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB sum of kb vs. time for results in the past day _time totalKB 1 03/12/2008 11:00:00 746.471681 2 03/12/2008 12:00:00 988.568358 3 03/12/2008 13:00:00 936.092772 4 03/12/2008 14:00:00 2529.226566 5 03/12/2008 15:00:00 914.945313 6 03/12/2008 16:00:00 825.353518 index::_internal metrics access_common group=per_sourcetype_thruput | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB sum of kb vs. time for results in the past day _time totalKB 1 03/12/2008 11:00:00 65.696285 2 03/12/2008 12:00:00 112.035162 3 03/12/2008 13:00:00 59.775395 4 03/12/2008 14:00:00 35.008788 5 03/12/2008 15:00:00 62.478514 6 03/12/2008 16:00:00 14.173828
Answers
Have questions? Visit Splunk Answers and see what questions and answers the Splunk community has about working with metrics.log.
This documentation applies to the following versions of Splunk: 4.1 , 4.1.1 , 4.1.2 , 4.1.3 , 4.1.4 , 4.1.5 , 4.1.6 , 4.1.7 , 4.1.8 , 4.2 , 4.2.1 , 4.2.2 View the Article History for its revisions.