About metrics.log

This topic is an overview of metrics.log.

To learn about other log files, read "What Splunk logs about itself."
For an example using metrics.log, read "Troubleshoot inputs with metrics.log."

What information is in metrics.log?

Metrics.log has a variety of introspection information for reviewing product behavior.

First, metrics.log is a periodic report, taken every 30 seconds or so, of recent Splunk software activity.

By default, metrics.log reports the top 10 results for each type. You can change that number of series from the default by editing the value of maxseries in the [metrics] stanza in limits.conf.

Structure of the lines

Here is a sample line from metrics.log:

01-27-2010 15:43:54.913 INFO Metrics - group=pipeline, name=parsing, processor=utf8, cpu_seconds=0.000000, executes=66, cumulative_hits=301958

First, boiler plate: the timestamp, the "severity," which is always INFO for metrics events, and then the kind of event, "Metrics."

The next field is the group. This indicates what kind of metrics data it is. There are a few groups in the file, including:

pipeline
queue
thruput
tcpout_connections
udpin_connections
mpool
per_host_regex_cpu
per_index_regex_cpu
per_source_regex_cpu
per_sourcetype_regex_cpu

Pipeline messages

Pipeline messages are reports on the Splunk pipelines, which are the strung-together pieces of "machinery" that process and manipulate events flowing into and out of the Splunk system. You can see how many times data reached a given machine in the Splunk system (executes), and you can see how much cpu time each machine used (cpu_seconds).

Plotting totals of cpu seconds by processor can show you where the cpu time is going in indexing activity. Looking at numbers for executes can give you an idea of data flow. For example if you see:

 group=pipeline, name=merging, processor=aggregator, ..., executes=998
 group=pipeline, name=merging, processor=readerin, ... , executes=698
 group=pipeline, name=merging, processor=regexreplacement, ..., executes=698
 group=pipeline, name=merging, processor=sendout, ... , executes=698

then it's pretty clear that a large portion of your items aren't making it past the aggregator. This might indicate that many of your events are multiline and are being combined in the aggregator before being passed along.

Read more about Splunk's data pipeline in "How data moves through Splunk" in the Distributed Deployment Manual.

Queue messages

Queue messages look like

... group=queue, name=parsingqueue, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0

Most of these values are not interesting. But current_size, especially considered in aggregate, across events, can tell you which portions of Splunk indexing are the bottlenecks. If current_size remains near zero, then probably the indexing system is not being taxed in any way. If the queues remain near 1000, then more data is being fed into the system (at the time) than it can process in total.

Sometimes you will see messages such as ... group=queue, name=parsingqueue, blocked!!=true, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0

This message contains the blocked string, indicating that it was full, and someone tried to add more, and couldn't. A queue becomes unblocked as soon as the code pulling items out of it pulls an item. Many blocked queue messages in a sequence indicate that data is not flowing at all for some reason. A few scattered blocked messages indicate that flow control is operating, and is normal for a busy indexer.

If you want to look at the queue data in aggregate, graphing the average of current_size is probably a good starting point.

There are queues in place for data going into the parsing pipeline, and for data between parsing and indexing. Each networking output also has its own queue, which can be useful to determine whether the data is able to be sent promptly, or alternatively whether there's some network or receiving system limitation.

Generally, filled_count and empty_count cannot be productively used for inferences.

Thruput messages

Thruput messages (similar to the English word "throughput") come in a few varieties.

Thruput is measured in the indexing pipeline. If your data is not reaching this pipeline for some reason, it will not appear in this data. Thruput numbers relate to the size of "raw" of the items flowing through the system, which is typically the chunks of the original text from the log sources. This differs from the tcpout measurements, which measure the total byte count written to sockets, including protocol overhead as well as descriptive information like host, source, source type.

First there is a catchall line, which looks like this:

... group=thruput, name=index_thruput, instantaneous_kbps=0.287598, instantaneous_eps=1.000000, average_kbps=0.270838, total_k_processed=74197, load_average=1.345703M

This is the best line to look at when tuning performance or evaluating indexing load. It tries to capture the total indexing data load.

Note: In thruput lingo, kbps does not mean kilobits per second, it means kilobytes per second. The industry standard term would be to write this something like KBps.

The average_kbps in the group=thruput catchall indicates the average since Splunk Enterprise started.

instantaneous_kbps indicates the average kbps for the reporting interval (equivalent to kbps for the breakouts below.)

The most useful figure to look at in aggregate is probably instantaneous_kbps over time.

Following the catchall, there can be variety of breakouts of the indexing thruput, including lines like:

... group=per_host_thruput, series="jombook.splunk.com", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241

... group=per_index_thruput, series="_internal", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_source_thruput, series="/applications/splunk4/var/log/splunk/metrics.log", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_sourcetype_thruput, series="splunkd", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241

In thruput messages the data load is broken out by host, index, source, and source type. This can be useful for answering two questions:

Which data categories are busy?
When were my data categories busy?

The series value identifies the host or index, etc. The kb value indicates the number of kilobytes processed since the last sample. Graphing kb in aggregate can be informative. The summary indexing status dashboard uses this data, for example.

ev is a simple total count of events during the sampling period.
kbps, as before, indicates kilobytes per second averaged over the sampling period.

The avg_age and max_age refer to the difference between the time that the event was seen by the thruput processor in the indexing queue, and the time when the event occurred (or more accurately, the time that Splunk decided the event occurred).

max_age is the largest difference, in seconds, between the current time and the perceived time of the events coming through the thruput processor.
avg_age is the average difference, in seconds, between the current time and the perceived time of the events coming through the thruput processor.

Note: The per_x_thruput categories are not complete. Remember that by default metrics.log shows the 10 busiest of each type, for each sampling window. If you have 2000 active forwarders, you cannot expect to see the majority of them in this data. You can adjust the sampling quantity, but this will increase the chattiness of metrics.log and the resulting indexing load and _internal index size. The sampling quantity is adjustable in limits.conf, [metrics] maxseries = num.

Ignore the eps value, as it has accuracy issues.

Tcpout Connections messages

Tcpout connections messages provide a variety of information about the systems that the tcpout component is currently connected to, via an open socket. One line is produced per connected system.

These lines look like this:

   ... group=tcpout_connections, name=undiag_indexers:10.159.4.67:9997:0, sourcePort=11089, 
   destIp=10.159.4.67, destPort=9997, _tcp_Bps=28339066.17, _tcp_KBps=27674.87,
   _tcp_avg_thruput=27674.87, _tcp_Kprocessed=802571, _tcp_eps=33161.10, kb=802571.21

name is a combination of conf stanza and entries that fully define the target system to the software.
sourcePort is the dynamically assigned port by the operating system for this socket.
destIP and destPort define the destination of the socket. destPort is typically statically configured, while destIP may be configured or returned by name resolution from the network.

All of the size-related fields are based on the number of bytes that Splunk has successfully written to the socket, or SSL-provided socket proxy. When SSL is not enabled for forwarding (the default), these numbers represent the number of bytes written to the socket, so effectively the number of bytes conveyed by the tcp transport (irrespective of overhead issues such as keepalive, tcp headers, ip headers, ethernet frames and so on). When SSL is enabled for forwarding, this number represents the number of bytes Splunk handed off to the OpenSSL layer.

_tcp_Bps is the bytes transmitted during the metrics interval divided by the duration of the interval (in seconds)
_tcp_KBps is the same value divided by 1024
_tcp_avg_thruput is an average rate of bytes sent since the last time the tcp output processor was reinitialized/reconfigured. Typically this means an average since Splunk started.
_tcp_KProcessed is the total number of bytes written since the processor was reinitialized/reconfigured, divided by 1024.
_tcp_eps is the number of items transmitted during the interval divided by the direction of the interval (in seconds). Note that items will frequently not be events for universal/light forwarders (instead, data chunks)
kb is the bytes transmitted during the metrics interval divided by 1024.

udpin messages

udpin_connections lines are essentially metering on udp input.

group=udpin_connections, 2514, sourcePort=2514, _udp_bps=0.00, _udp_kbps=0.00,
_udp_avg_thruput=0.00, _udp_kprocessed=0.00, _udp_eps=0.00

Some should be relatively self-explanatory.

bps: bytes per second
kbps: kilobytes per specond
eps: events (packets) per second
_udp_kprocessed is a running total of kilobytes processed since udp input processor start (typically since splunk start, but if reconfigured may reset).

Don't have info on avg_thruput at this time.

Be aware that it's quite achievable to max out the ability of the operating system, let alone Splunk, to handle UDP packets at high rates. This data might be useful to determine if any data is coming in at all, and at what times it rises. There is no guarantee that all packets sent to this port will be received and thus metered.

mpool messages

The mpool lines represent memory used by the Splunk indexer code only (not any other pipeline components). This information is probably not useful to anyone other than Splunk developers.

group=mpool, max_used_interval=4557, max_used=53878, avg_rsv=180, capacity=268435456, used=0

max_used_interval represents the number of bytes used during the reporting interval (since the last output).
max_used represents the maximum amount of memory, in bytes, in use at any time during the component's lifetime (most likely since last starting Splunk).
avg_rsv is the average size of a memory allocation across the run time of the system.
capacity is the limit on memory use for the indexer.
used is the current indexer's current memory use.

In this case we can see that some memory is sometimes in use, although at the time of the sample, none is in use, and that generally the use is low.

map, pipelineinputchannel name messages

These messages are primarily debugging information over the Splunk internal cache of processing state and configuration data for a given data stream (host, source, or source type).

group=map, name=pipelineinputchannel, current_size=29, inactive_channels=4,
new_channels=0, removed_channels=0, reclaimed_channels=0, timedout_channels=0,
abandoned_channels=0

current_size is the number of total channels loaded in the system at the end of the sampling period.
inactive_channels is the number of channels that have no entries in any pipeline referring to them (typically for recently seen data but not for data currently being processed) at the end of the sampling period.
new_channels is the number of channels created during the sampling period, meaning that new data streams arrived, or a data stream that was aged out was created again.
removed_channels is the number of channels destructed during the sampling period, which means that enough pressure existed to push these channels out of set (there were too many other new data streams).
reclaimed_channels is the number of channels that were repurposed during the sampling period. This will happen for reasons similar to new_channels, based on the size of the utilization, etc.
timedout_channels is the number of channels that became unused for a long enough time to be considered stale and the information to be culled. Typically a timedout file or data source hasn't been producing data for some time.
abandoned_channels is the number of channels that were terminated by Splunk forwarding, where the forwarder stopped communicating to an indexer, so the indexer shut them down.

subtask_seconds messages

Currently there is only one type of these messages, name=indexer, task=indexer_service. The message describes how the indexer has spent time over each interval window.

group=subtask_seconds, name=indexer, task=indexer_service, replicate_semislice=0.000000, throttle_optimize=0.00015,  flushBlockSig=0.000000, retryMove_1hotBkt=0.000000, size_hotBkt=0.000000, roll_hotBkt=0.000000, chillOrFreeze=0.000000, update_checksums=0.000000, fork_recovermetadata=0.000000, rebuild_metadata=0.000300, update_bktManifest=0.000000, service_volumes=0.000105, update_bktManifest=0.000000, service_volumes=0.000105, service_maxSizes=0.000000, service_externProc=0.000645

The throttle_optimize subtask represents time that the indexer spends waiting for splunk_optimize processes to reduce the count of .tsidx files to a reasonable level within hot buckets. Because splunk_optimize can in some cases run more slowly merging .tsidx files than the indexer runs while generating them, this flow-control state must exist for splunk_optimize to catch up. When this throttling occurs, messages are logged to splunkd.log in category DatabasePartitionPolicy similar to idx=<idxname> Throttling indexer, too many tsidx files in bucket.
The rebuild_metadata subtask represents time that the indexer spends generating new copies of .data files in hot buckets. This can mean writing out Hosts.data, Sourcetypes.data, Sources.data, or Strings.data. Typically the largest cost is for Strings.data in configurations where a large amount of data is directed to this file. Setting the MetaData log channel to DEBUG can provide more information on which buckets and files might be involved in a slowdown.

regular expression (regex) messages

These messages represent the amount of CPU time spent processing regular expressions on events, in seconds. The group field will identify if these metrics are based on host, index, source, or sourcetype. The CPU time metrics for the regex processor are not enabled by default. See the configuration regex_cpu_profiling in limits.conf in the Admin Manual.

Sample Event:

09-06-2018 20:00:54.667 +0000 INFO Metrics - group=per_sourcetype_regex_cpu, series="audittrail", cpu=428, cpupe=0.9749430523917996, bytes=138929, ev=439

Valid group names are:

per_host_regex_cpu
per_index_regex_cpu
per_source_regex_cpu
per_sourcetype_regex_cpu

For each interval, the following fields are produced.

cpu represents the CPU time in seconds for a given series.
cpupe represents the CPU time in seconds per event for a given series.
bytes represents the number of bytes processes for a given series.
ev represents total events for a given series.

The default interval for each measurement is 30 seconds.

Related answers from Splunk Community

About metrics.log

What information is in metrics.log?

Structure of the lines

Pipeline messages

Queue messages

Thruput messages

Tcpout Connections messages

udpin messages

mpool messages

map, pipelineinputchannel name messages

subtask_seconds messages

regular expression (regex) messages

Comments

About metrics.log

Was this topic useful?