This topic is an overview of metrics.log.
- To learn about other log files, read "What Splunk logs about itself."
- For an example using metrics.log, read "Troubleshoot inputs with metrics.log."
What information is in metrics.log?
Metrics.log has a variety of introspection information for reviewing Splunk's behavior. Let's look at what's available.
First, metrics.log is a periodic report of various Splunk behaviors. Essentially it's a snapshot, taken every 30 seconds or so, of recent goings-on.
By default, metrics.log reports the top 10 results for each type. You can change that number of series from the default by editing the value of
maxseries in the
[metrics] stanza in
Structure of the lines
Here is a sample line from metrics.log:
01-27-2010 15:43:54.913 INFO Metrics - group=pipeline, name=parsing, processor=utf8, cpu_seconds=0.000000, executes=66, cumulative_hits=301958
First, boiler plate: the timestamp, the "severity," which is always
INFO for metrics events, and then the kind of event, "
The next field is the "
group." This indicates what kind of metrics data it is. There are a few groups in the file, including:
Pipeline messages are reports on the Splunk pipelines, which are the strung-together pieces of "machinery" that process and manipulate events flowing into and out of the Splunk system. You can see how many times data reached a given machine in the Splunk system (executes), and you can see how much cpu time each machine used (cpu_seconds).
Plotting totals of cpu seconds by processor can show you where the cpu time is going in indexing activity. Looking at numbers for executes can give you an idea of data flow. For example if you see:
group=pipeline, name=merging, processor=aggregator, ..., executes=998 group=pipeline, name=merging, processor=readerin, ... , executes=698 group=pipeline, name=merging, processor=regexreplacement, ..., executes=698 group=pipeline, name=merging, processor=sendout, ... , executes=698
then it's pretty clear that a large portion of your items aren't making it past the aggregator. This might indicate that many of your events are multiline and are being combined in the aggregator before being passed along.
Read more about Splunk's data pipeline in "How data moves through Splunk" in the Distributed Deployment Manual.
Queue messages look like
... group=queue, name=parsingqueue, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
Most of these values aren't interesting. But current_size, especially considered in aggregate, across events, can tell you which portions of Splunk indexing are the bottlenecks. If current_size remains near zero, then probably the Splunk (indexing) system is not being taxed in any way. If the queues remain near 1000, then more data is being fed into the system (at the time) than it can process in total.
Sometimes you will see messages such as
... group=queue, name=parsingqueue, blocked!!=true, max_size=1000,
filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
This message contains the "blocked" string, indicating that it was full, and someone tried to add more, and couldn't. A queue will become unblocked as soon as the code pulling items out of it pulls an item. Many blocked queue messages in a sequence indicate that data is not flowing at all for some reason. A few scattered blocked messages indicate that flow control is operating, and is normal for a busy indexer.
If you want to look at the queue data in aggregate, graphing the average of current_size is probably a good starting point.
There are queues in place for data going into the parsing pipeline, and for data between parsing and indexing. Each networking output also has its own queue, which can be useful to determine whether the data is able to be sent promptly, or alternatively whether there's some network or receiving system limitation.
Generally, filled_count and empty_count cannot be productively used for inferences.
Thruput messages (similar to the English word "throughput") come in a few varieties.
Thruput is measured in the indexing pipeline. If your data is not reaching this pipeline for some reason, it will not appear in this data.
First there is a catchall line, which looks like this:
... group=thruput, name=index_thruput, instantaneous_kbps=0.287598, instantaneous_eps=1.000000, average_kbps=0.270838, total_k_processed=74197, load_average=1.345703
This is the best line to look at when tuning performance or evaluating indexing load. It tries to capture the total indexing data load.
Note: In thruput lingo, "kbps" does not mean kilobits per second, it means kilobytes per second. The industry standard term would be to write this something like KBps.
The most useful figure to look at in aggregate is probably
instantaneous_kbps over time.
Following the catchall can be variety of breakouts of the indexing thruput, lines like:
... group=per_host_thruput, series="jombook.splunk.com", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241 ... group=per_index_thruput, series="_internal", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241 ... group=per_source_thruput, series="/applications/splunk4/var/log/splunk/metrics.log", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241 ... group=per_sourcetype_thruput, series="splunkd", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
In thruput messages the data load is broken out by host, index, source, and source type. This can be useful for answering two questions:
- Which data categories are busy?
- When were my data categories busy?
The series value identifies the host or index, etc. The kb value indicates the number of kilobytes processed since the last sample. Graphing kb in aggregate can be informative. The summary indexing status dashboard uses this data, for example.
evis a simple total count of events during the sampling period.
kbps, as before, indicates kilobytes per second averaged over the sampling period.
max_age refer to the difference between the time that the event was seen by the thruput processor in the indexing queue, and the time when the event occurred (or more accurately, the time that Splunk decided the event occurred).
max_ageis the largest difference between the current time and the perceived time of the events coming through the thruput processor.
avg_ageis the average difference between the current time and the perceived time of the events coming through the thruput processor.
per_x_thruput categories are not complete. Remember that by default metrics.log shows the 10 busiest of each type, for each sampling window. If you have 2000 active forwarders, you cannot expect to see the majority of them in this data. You can adjust the sampling quantity, but this will increase the chattiness of metrics.log and the resulting indexing load and _internal index size. The sampling quantity is adjustable in limits.conf,
[metrics] maxseries = num.
eps value, as it has accuracy issues.
udpin_connections lines are essentially metering on udp input.
group=udpin_connections, 2514, sourcePort=2514, _udp_bps=0.00, _udp_kbps=0.00, _udp_avg_thruput=0.00, _udp_kprocessed=0.00, _udp_eps=0.00
Some should be relatively self-explanatory.
- bps: bytes per second
- kbps: kilobytes per specond
- eps: events (packets) per second
- _udp_kprocessed is a running total of kilobytes processed since udp input processor start (typically since splunk start, but if reconfigured may reset).
Don't have info on avg_thruput at this time.
Be aware that it's quite achievable to max out the ability of the operating system, let alone Splunk, to handle UDP packets at high rates. This data might be useful to determine if any data is coming in at all, and at what times it rises. There is no guarantee that all packets sent to this port will be received and thus metered.
The mpool lines represent memory used by the Splunk indexer code only (not any other pipeline components). This information is probably not useful to anyone other than Splunk developers.
group=mpool, max_used_interval=4557, max_used=53878, avg_rsv=180, capacity=268435456, used=0
max_used_intervalrepresents the number of bytes used during the reporting interval (since the last output).
max_usedrepresents the maximum amount of memory, in bytes, in use at any time during the component's lifetime (most likely since last starting Splunk).
avg_rsvis the average size of a memory allocation across the run time of the system.
capacityis the limit on memory use for the indexer.
usedis the current indexer's current memory use.
In this case we can see that some memory is sometimes in use, although at the time of the sample, none is in use, and that generally the use is low.
map, pipelineinputchannel name messages
These messages are primarily debugging information over the Splunk internal cache of processing state and configuration data for a given data stream (host, source, or source type).
group=map, name=pipelineinputchannel, current_size=29, inactive_channels=4, new_channels=0, removed_channels=0, reclaimed_channels=0, timedout_channels=0, abandoned_channels=0
current_sizeis the number of total channels loaded in the system at the end of the sampling period.
inactive_channelsis the number of channels that have no entries in any pipeline referring to them (typically for recently seen data but not for data currently being processed) at the end of the sampling period.
new_channelsis the number of channels created during the sampling period, meaning that new data streams arrived, or a data stream that was aged out was created again.
removed_channelsis the number of channels destructed during the sampling period, which means that enough pressure existed to push these channels out of set (there were too many other new data streams).
reclaimed_channelsis the number of channels that were repurposed during the sampling period. This will happen for reasons similar to
new_channels, based on the size of the utilization, etc.
timedout_channelsis the number of channels that became unused for a long enough time to be considered stale and the information to be culled. Typically a timedout file or data source hasn't been producing data for some time.
abandoned_channelsis the number of channels that were terminated by Splunk forwarding, where the forwarder stopped communicating to an indexer, so the indexer shut them down.
Enable debug logging
Troubleshoot inputs with metrics.log
This documentation applies to the following versions of Splunk® Enterprise: 4.3, 4.3.1, 4.3.2, 4.3.3, 4.3.4, 4.3.5, 4.3.6, 4.3.7