Event indexing delay
Symptoms
Events collected from a forwarder or from a log file are not yet searchable on Splunk. Even though the time stamps of the events are within the search time range, a search does not return the events. Later, a search over the same time range returns the events.
Diagnosis
Quantify the problem by measuring how long your Splunk deployment is taking to make your data searchable.
To measure the delay between the time stamp of the events and the indexing time (the time that the indexer receives and processes the events), use the following method:
1. Look at the delay in seconds per host.
source=mysource | eval delay_sec=_indextime-_time | timechart min(delay_sec) avg(delay_sec) max(delay_sec) by host
2. Look at the delay in seconds per source for a particular host.
source=mysource host=myhost | eval delay_sec=_indextime-_time | timechart min(delay_sec) avg(delay_sec) max(delay_sec) by source
3. Look at the delay per host for the Splunk internal logs.
index=_internal source=*splunkd.log* | eval delay_sec=_indextime-_time | timechart min(delay_sec) avg(delay_sec) max(delay_sec) by host
Run these searches on realtime - all time mode for a little while to see the events that are being received right now. In addition to real-time, you can run historical searches to compare a day this week to a day from a previous week.
Compare the delay between hosts and forwarders.
- Determine the common denominator between them. For example, all of the delayed events might be from the same log file or the same host or source type.
Compare the delay from your events with the delay from the internal Splunk logs.
- If all the logs are delayed, including the internal logs, then the delay is a forwarding issue.
- If some sources are delayed but not others, this indicates a problem with the input.
As you implement each fix below, you can measure how well it's working by running these searches again.
Root causes
There are several possible root causes. Some might not be applicable to your situation.
Possible thruput limits
Universal and light forwarders have a default thruput limit of 256KBps. This default can be configured in limits.conf. The default value is correct for a forwarder with a low profile, indexing up to ~920 MB/hour. But in the case of higher indexing volumes, or when the forwarder has to collect the historical logs after the first start, the default might be too low. This could delay the recent events.
To check the forwarder default thruput limit, on the command line in the splunk folder type:
cd $SPLUNK_HOME/bin ./splunk cmd btool limits list thruput --debug
Example of result on a universal forwarder. The default limit is 256KBps, and is set up in the app:
/opt/splunkforwarder/etc/apps/SplunkUniversalForwarder/default/limits.conf [thruput] /opt/splunkforwarder/etc/apps/SplunkUniversalForwarder/default/limits.conf maxKBps = 256
Example of result on an indexer or heavy forwarder. The default is unlimited:
/opt/splunk/etc/system/default/limits.conf [thruput] /opt/splunk/etc/system/default/limits.conf maxKBps = 0
To verify in the forwarder: When the thruput limit is reached, monitoring pauses and the following events are recorded in splunkd.log:
INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
To verify how often the forwarder is hitting this limit, check the forwarder's metrics.log. (Look for this on the forwarder because metrics.log is not forwarded by default on universal and light forwarders.)
cd $SPLUNK_HOME/var/log/splunk grep "name=thruput" metrics.log
Example: The instantaneous_kbps and average_kbps are always under 256KBps.
11-19-2013 07:36:01.398 -0600 INFO Metrics - group=thruput, name=thruput, instantaneous_kbps=251.790673, instantaneous_eps=3.934229, average_kbps=110.691774, total_k_processed=101429722, kb=7808.000000, ev=122
Remedy
Create a custom limits.conf with a higher limit or no limit. The configuration can be in system/local, or in an app that will have precedence on the existing limit.
Example: Configure in a dedicated app, in /opt/splunkforwarder/etc/apps/Gofaster/local/limits.conf
Double the thruput limit, from 256 to 512 KBps:
[thruput] maxKBps = 512
Or for unlimited thruput:
[thruput] maxKBps = 0
Notes:
- Unlimited speed can cause higher resource usage on the forwarder. Keep a limit if you need to control the monitoring and network usage.
- Restart to apply.
- Verify the result of the configuration with btool.
- Later, verify in metrics.log that the forwarder is not reaching the new limit constantly.
Possible network limits
Once the thruput limit is removed, if the events are still slow, use the metrics method to check if the forwarders are hitting a network limit. Compare with other forwarders on different networks or different VPN tunnels.
Monitoring archived logs file
Compressed files (like .gz and .zip) are handled by the Archive processor, and are serialized. Therefore if you index a large set of compressed files, they will come through the indexer one after the other. The second file will only come through after the first one has been indexed.
Example: splunk ArchiveProcessor starting to read file in splunkd.log
12-12-2013 00:18:07.388 +0000 INFO ArchiveProcessor - handling file=/var/log/application/rsyslog.log.1.gz
Remedy
An available workaround is to have Splunk to monitor the uncompressed files.
Possible timestamp/timezone issue
Using a real time-alltime search, if an event is visible immediately in real time, but not otherwise (that is, with an historical search), it might be that the time stamp of the event is in the future. Historical searches (even all-time) show events with timestamps only within the window of the search.
Use this search to verify the source type, the time stamp detected (_time), the time of the user on the search head (now), and the time zone applied (date_zone).
source=mysource host=myhost | eval delay_sec=_indextime-_time | convert ctime(_indextime) AS indextime | eval now=now() | table _time indextime now date_zone source sourcetype host
Notes:
- The _time is converted to the user profile time zone configured on the search head at search time.
- The date_zone is applied at index time on the indexer.
Remedy
Fix the time zone and time stamp extraction. Take a sample of the data and test it with data preview.
Windows event logs delay
If the only events delayed are WinEventLogs, and the forwarder is on a busy domain controller, with a high number of events per second, you might be encountering a Windows collection log performance limit.
Or if the forwarder was recently started, it might be still collecting the older events first.
Remedy
- If possible, collect only recent logs (
current_only=1
), or monitor the recent events first (start_from=newest
). See Monitor Windows event log data in Getting Data In.
- Reduce the volume of events to collect. Use the filters in the WinEventLog stanza to exclude particular EventCodes. See inputs.conf.spec.
Identify and triage indexing performance problems | Garbled events |
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9, 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10, 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.1.13, 8.1.14, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 8.2.10, 8.2.11, 8.2.12, 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4, 9.0.5, 9.0.6, 9.0.7, 9.0.8, 9.0.9, 9.0.10, 9.1.0, 9.1.1, 9.1.2, 9.1.3, 9.1.4, 9.1.5, 9.1.6, 9.1.7, 9.2.0, 9.2.1, 9.2.2, 9.2.3, 9.2.4, 9.3.0, 9.3.1, 9.3.2, 9.4.0
Feedback submitted, thanks!