About metrics.log
This topic is an overview of metrics.log.
- To learn about other log files, read "What Splunk logs about itself."
- For an example using metrics.log, read "Troubleshoot inputs with metrics.log."
What information is in metrics.log?
Metrics.log has a variety of introspection information for reviewing product behavior.
First, metrics.log is a periodic report, taken every 30 seconds or so, of recent Splunk software activity.
By default, metrics.log reports the top 10 results for each type. You can change that number of series from the default by editing the value of maxseries in the [metrics] stanza in limits.conf.
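For example, to report 20 series per type instead of the default 10, you could set the following in limits.conf (the value 20 is just illustrative):

```
[metrics]
maxseries = 20
```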
Structure of the lines
Here is a sample line from metrics.log:
01-27-2010 15:43:54.913 INFO Metrics - group=pipeline, name=parsing, processor=utf8, cpu_seconds=0.000000, executes=66, cumulative_hits=301958
First, the boilerplate: the timestamp, the "severity," which is always INFO for metrics events, and then the kind of event, "Metrics."
The next field is the group. This indicates what kind of metrics data it is. There are a few groups in the file, including:
- pipeline
- queue
- thruput
- tcpout_connections
- udpin_connections
- mpool
- per_host_regex_cpu
- per_index_regex_cpu
- per_source_regex_cpu
- per_sourcetype_regex_cpu
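Because every metrics line is a timestamp followed by comma-separated key=value fields, it is straightforward to pull apart outside of Splunk. Here is a minimal sketch of such a parser (this is ad hoc illustration, not a Splunk API; it assumes simple `key=value` pairs and does not handle markers like `blocked!!=true`):

```python
import re

def parse_metrics_line(line):
    """Split a metrics.log line into its timestamp and a dict of fields."""
    # Split off the 'timestamp INFO Metrics - ' boilerplate
    head, _, body = line.partition(" INFO Metrics - ")
    # key=value pairs; values may be quoted, e.g. series="host.example.com"
    fields = {k: v.strip('"')
              for k, v in re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', body)}
    return head, fields

ts, fields = parse_metrics_line(
    '01-27-2010 15:43:54.913 INFO Metrics - group=pipeline, name=parsing, '
    'processor=utf8, cpu_seconds=0.000000, executes=66, cumulative_hits=301958'
)
```

Grouping the parsed dicts by their group field then lets you aggregate each message type separately.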
Pipeline messages
Pipeline messages are reports on the Splunk pipelines, which are the strung-together pieces of "machinery" that process and manipulate events flowing into and out of the Splunk system. You can see how many times data reached a given machine in the Splunk system (executes), and you can see how much cpu time each machine used (cpu_seconds).
Plotting totals of cpu seconds by processor can show you where the cpu time is going in indexing activity. Looking at numbers for executes can give you an idea of data flow. For example if you see:
group=pipeline, name=merging, processor=aggregator, ..., executes=998
group=pipeline, name=merging, processor=readerin, ..., executes=698
group=pipeline, name=merging, processor=regexreplacement, ..., executes=698
group=pipeline, name=merging, processor=sendout, ..., executes=698
then it's pretty clear that a large portion of your items aren't making it past the aggregator. This might indicate that many of your events are multiline and are being combined in the aggregator before being passed along.
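The drop in executes between processors can be computed directly. A minimal sketch, using hypothetical samples shaped like the lines above:

```python
# Hypothetical parsed pipeline samples (values from the example above)
samples = [
    {"processor": "aggregator", "executes": 998},
    {"processor": "readerin", "executes": 698},
    {"processor": "regexreplacement", "executes": 698},
    {"processor": "sendout", "executes": 698},
]

aggregator = next(s for s in samples if s["processor"] == "aggregator")
sendout = next(s for s in samples if s["processor"] == "sendout")

# Fraction of items that entered the aggregator but never left the pipeline,
# e.g. multiline events combined into fewer merged events
merged_fraction = 1 - sendout["executes"] / aggregator["executes"]
```

Here about 30% of the items were absorbed by the aggregator, consistent with heavy multiline merging.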
Read more about Splunk's data pipeline in "How data moves through Splunk" in the Distributed Deployment Manual.
Queue messages
Queue messages look like
... group=queue, name=parsingqueue, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
Most of these values are not interesting. But current_size, especially considered in aggregate across events, can tell you which portions of Splunk indexing are the bottlenecks. If current_size remains near zero, the indexing system is probably not being taxed in any way. If the queues remain near 1000, more data is being fed into the system (at the time) than it can process in total.
Sometimes you will see messages such as
... group=queue, name=parsingqueue, blocked!!=true, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
This message contains the blocked string, indicating that the queue was full, and something tried to add more and couldn't. A queue becomes unblocked as soon as the code pulling items out of it pulls an item. Many blocked queue messages in a sequence indicate that data is not flowing at all for some reason. A few scattered blocked messages indicate that flow control is operating, which is normal for a busy indexer.
If you want to look at the queue data in aggregate, graphing the average of current_size is probably a good starting point.
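A minimal sketch of that aggregation, using hypothetical (queue name, current_size) samples rather than real parsed output:

```python
from collections import defaultdict

# Hypothetical queue samples: (name, current_size), one pair per interval
queue_samples = [
    ("parsingqueue", 0), ("parsingqueue", 2), ("parsingqueue", 1),
    ("indexqueue", 950), ("indexqueue", 1000), ("indexqueue", 980),
]

totals = defaultdict(lambda: [0, 0])  # name -> [sum of sizes, sample count]
for name, size in queue_samples:
    totals[name][0] += size
    totals[name][1] += 1

averages = {name: s / n for name, (s, n) in totals.items()}
# A queue averaging near its 1000-item cap points at a downstream bottleneck
```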
There are queues in place for data going into the parsing pipeline, and for data between parsing and indexing. Each networking output also has its own queue, which can be useful to determine whether the data is able to be sent promptly, or alternatively whether there's some network or receiving system limitation.
Generally, filled_count and empty_count cannot be productively used for inferences.
Thruput messages
Thruput messages (similar to the English word "throughput") come in a few varieties.
Thruput is measured in the indexing pipeline. If your data is not reaching this pipeline for some reason, it will not appear in this data. Thruput numbers relate to the size of "raw" of the items flowing through the system, which is typically the chunks of the original text from the log sources. This differs from the tcpout measurements, which measure the total byte count written to sockets, including protocol overhead as well as descriptive information like host, source, and source type.
First there is a catchall line, which looks like this:
... group=thruput, name=index_thruput, instantaneous_kbps=0.287598, instantaneous_eps=1.000000, average_kbps=0.270838, total_k_processed=74197, load_average=1.345703M
This is the best line to look at when tuning performance or evaluating indexing load. It tries to capture the total indexing data load.
Note: In thruput lingo, kbps does not mean kilobits per second; it means kilobytes per second. The industry-standard way to write this would be something like KBps.
The average_kbps in the group=thruput catchall indicates the average since Splunk Enterprise started. instantaneous_kbps indicates the average kbps for the reporting interval (equivalent to kbps for the breakouts below).
The most useful figure to look at in aggregate is probably instantaneous_kbps over time.
Following the catchall, there can be a variety of breakouts of the indexing thruput, including lines like:
... group=per_host_thruput, series="jombook.splunk.com", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_index_thruput, series="_internal", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_source_thruput, series="/applications/splunk4/var/log/splunk/metrics.log", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_sourcetype_thruput, series="splunkd", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
In thruput messages the data load is broken out by host, index, source, and source type. This can be useful for answering two questions:
- Which data categories are busy?
- When were my data categories busy?
The series value identifies the host or index, etc. The kb value indicates the number of kilobytes processed since the last sample. Graphing kb in aggregate can be informative. The summary indexing status dashboard uses this data, for example.
ev is a simple total count of events during the sampling period. kbps, as before, indicates kilobytes per second averaged over the sampling period.
The avg_age and max_age fields refer to the difference between the time that the event was seen by the thruput processor in the indexing queue and the time when the event occurred (or, more accurately, the time that Splunk decided the event occurred). max_age is the largest difference, in seconds, between the current time and the perceived time of the events coming through the thruput processor. avg_age is the average difference, in seconds, between the current time and the perceived time of the events coming through the thruput processor.
Note: The per_x_thruput categories are not complete. Remember that by default metrics.log shows the 10 busiest of each type, for each sampling window. If you have 2000 active forwarders, you cannot expect to see the majority of them in this data. You can adjust the sampling quantity, but this will increase the chattiness of metrics.log and the resulting indexing load and _internal index size. The sampling quantity is adjustable with maxseries in the [metrics] stanza of limits.conf.
Ignore the eps value, as it has accuracy issues.
Tcpout Connections messages
Tcpout connections messages provide a variety of information about the systems that the tcpout component is currently connected to, via an open socket. One line is produced per connected system.
These lines look like this:
... group=tcpout_connections, name=undiag_indexers:10.159.4.67:9997:0, sourcePort=11089, destIp=10.159.4.67, destPort=9997, _tcp_Bps=28339066.17, _tcp_KBps=27674.87, _tcp_avg_thruput=27674.87, _tcp_Kprocessed=802571, _tcp_eps=33161.10, kb=802571.21
- name is a combination of conf stanza and entries that fully define the target system to the software.
- sourcePort is the port dynamically assigned by the operating system for this socket.
- destIp and destPort define the destination of the socket. destPort is typically statically configured, while destIp may be configured or returned by name resolution from the network.
All of the size-related fields are based on the number of bytes that Splunk has successfully written to the socket, or SSL-provided socket proxy. When SSL is not enabled for forwarding (the default), these numbers represent the number of bytes written to the socket, so effectively the number of bytes conveyed by the tcp transport (irrespective of overhead issues such as keepalive, tcp headers, ip headers, ethernet frames and so on). When SSL is enabled for forwarding, this number represents the number of bytes Splunk handed off to the OpenSSL layer.
- _tcp_Bps is the bytes transmitted during the metrics interval divided by the duration of the interval (in seconds).
- _tcp_KBps is the same value divided by 1024.
- _tcp_avg_thruput is an average rate of bytes sent since the last time the tcp output processor was reinitialized/reconfigured. Typically this means an average since Splunk started.
- _tcp_Kprocessed is the total number of bytes written since the processor was reinitialized/reconfigured, divided by 1024.
- _tcp_eps is the number of items transmitted during the interval divided by the duration of the interval (in seconds). Note that for universal/light forwarders the items will frequently not be events but data chunks.
- kb is the bytes transmitted during the metrics interval divided by 1024.
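The arithmetic behind these fields is simple enough to check by hand. A minimal sketch with a hypothetical byte count over one 30-second interval:

```python
# Hypothetical tcpout counters for one 30-second metrics interval
interval_s = 30
bytes_written = 849_346_560  # bytes handed to the socket during the interval

tcp_Bps = bytes_written / interval_s   # bytes per second
tcp_KBps = tcp_Bps / 1024              # kilobytes per second
kb = bytes_written / 1024              # kilobytes for the whole interval
```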
udpin messages
udpin_connections lines are essentially metering on udp input.
group=udpin_connections, 2514, sourcePort=2514, _udp_bps=0.00, _udp_kbps=0.00, _udp_avg_thruput=0.00, _udp_kprocessed=0.00, _udp_eps=0.00
Most of these should be relatively self-explanatory:
- bps: bytes per second
- kbps: kilobytes per second
- eps: events (packets) per second
- _udp_kprocessed: a running total of kilobytes processed since the udp input processor started (typically since Splunk started, but it may reset on reconfiguration)
No information is available on avg_thruput at this time.
Be aware that it's quite achievable to max out the ability of the operating system, let alone Splunk, to handle UDP packets at high rates. This data might be useful to determine if any data is coming in at all, and at what times it rises. There is no guarantee that all packets sent to this port will be received and thus metered.
mpool messages
The mpool lines represent memory used by the Splunk indexer code only (not any other pipeline components). This information is probably not useful to anyone other than Splunk developers.
group=mpool, max_used_interval=4557, max_used=53878, avg_rsv=180, capacity=268435456, used=0
- max_used_interval represents the number of bytes used during the reporting interval (since the last output).
- max_used represents the maximum amount of memory, in bytes, in use at any time during the component's lifetime (most likely since last starting Splunk).
- avg_rsv is the average size of a memory allocation across the run time of the system.
- capacity is the limit on memory use for the indexer.
- used is the indexer's current memory use.
In this case we can see that some memory is sometimes in use, although at the time of the sample, none is in use, and that generally the use is low.
map, pipelineinputchannel name messages
These messages are primarily debugging information over the Splunk internal cache of processing state and configuration data for a given data stream (host, source, or source type).
group=map, name=pipelineinputchannel, current_size=29, inactive_channels=4, new_channels=0, removed_channels=0, reclaimed_channels=0, timedout_channels=0, abandoned_channels=0
- current_size is the total number of channels loaded in the system at the end of the sampling period.
- inactive_channels is the number of channels that have no entries in any pipeline referring to them (typically for recently seen data, but not for data currently being processed) at the end of the sampling period.
- new_channels is the number of channels created during the sampling period, meaning that new data streams arrived, or a data stream that was aged out was created again.
- removed_channels is the number of channels destructed during the sampling period, which means that enough pressure existed to push these channels out of the set (there were too many other new data streams).
- reclaimed_channels is the number of channels that were repurposed during the sampling period. This happens for reasons similar to new_channels, based on the size of the utilization, etc.
- timedout_channels is the number of channels that became unused for long enough to be considered stale, so their information was culled. Typically a timed-out file or data source hasn't been producing data for some time.
- abandoned_channels is the number of channels that were terminated by Splunk forwarding: the forwarder stopped communicating with an indexer, so the indexer shut them down.
subtask_seconds messages
Currently there is only one type of these messages, name=indexer, task=indexer_service. The message describes how the indexer has spent its time over each interval window.
group=subtask_seconds, name=indexer, task=indexer_service, replicate_semislice=0.000000, throttle_optimize=0.00015, flushBlockSig=0.000000, retryMove_1hotBkt=0.000000, size_hotBkt=0.000000, roll_hotBkt=0.000000, chillOrFreeze=0.000000, update_checksums=0.000000, fork_recovermetadata=0.000000, rebuild_metadata=0.000300, update_bktManifest=0.000000, service_volumes=0.000105, service_maxSizes=0.000000, service_externProc=0.000645
- The throttle_optimize subtask represents time that the indexer spends waiting for splunk_optimize processes to reduce the count of .tsidx files to a reasonable level within hot buckets. Because splunk_optimize can in some cases run more slowly merging .tsidx files than the indexer runs while generating them, this flow-control state must exist for splunk_optimize to catch up. When this throttling occurs, messages are logged to splunkd.log in category DatabasePartitionPolicy, similar to idx=<idxname> Throttling indexer, too many tsidx files in bucket.
- The rebuild_metadata subtask represents time that the indexer spends generating new copies of .data files in hot buckets. This can mean writing out Hosts.data, Sourcetypes.data, Sources.data, or Strings.data. Typically the largest cost is for Strings.data in configurations where a large amount of data is directed to this file. Setting the MetaData log channel to DEBUG can provide more information on which buckets and files might be involved in a slowdown.
Regular expression (regex) messages
These messages represent the amount of CPU time spent processing regular expressions on events, in seconds. The group field identifies whether these metrics are based on host, index, source, or source type. The CPU time metrics for the regex processor are not enabled by default. See the regex_cpu_profiling setting in limits.conf in the Admin Manual.
Sample Event:
09-06-2018 20:00:54.667 +0000 INFO Metrics - group=per_sourcetype_regex_cpu, series="audittrail", cpu=428, cpupe=0.9749430523917996, bytes=138929, ev=439
Valid group names are:
- per_host_regex_cpu
- per_index_regex_cpu
- per_source_regex_cpu
- per_sourcetype_regex_cpu
For each interval, the following fields are produced:
- cpu represents the CPU time in seconds for a given series.
- cpupe represents the CPU time in seconds per event for a given series.
- bytes represents the number of bytes processed for a given series.
- ev represents the total events for a given series.
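The cpupe field is simply cpu divided by ev, which you can verify against the sample event above (the values below are copied from it):

```python
# Field values from the sample per_sourcetype_regex_cpu event above
cpu = 428              # CPU time for the series during the interval
ev = 439               # events processed for the series
bytes_processed = 138929

# cpupe is just per-event CPU time
cpupe = cpu / ev
```

This matches the cpupe=0.9749430523917996 value reported in the sample event.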
The default interval for each measurement is 30 seconds.
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9, 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10, 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.1.13, 8.1.14, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 8.2.10, 8.2.11, 8.2.12, 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4, 9.0.5, 9.0.6, 9.0.7, 9.0.8, 9.0.9, 9.0.10, 9.1.0, 9.1.1, 9.1.2, 9.1.3, 9.1.4, 9.1.5, 9.1.6, 9.1.7, 9.2.0, 9.2.1, 9.2.2, 9.2.3, 9.2.4, 9.3.0, 9.3.1, 9.3.2, 9.4.0