How Splunk handles log file rotation
Splunk recognizes when a file that it is monitoring (such as /var/log/messages) has been rolled (/var/log/messages1) and will not read the rolled file a second time.
How to work with log rotation into compressed files
Splunk does not identify compressed files produced by logrotate (such as .gz files) as being the same as the uncompressed originals. This can lead to duplication of data if those files are then monitored by Splunk.
To avoid this problem, you can choose from two approaches:
- Configure logrotate to move those files into a directory that Splunk is not monitoring.
- Set blacklist rules for archive filetypes to prevent Splunk from reading those files as new logfiles.
blacklist = \.(gz|bz2|z|zip)$
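For example, the blacklist rule goes in the relevant monitor stanza of inputs.conf (the monitored path here is hypothetical):

```
[monitor:///var/log]
blacklist = \.(gz|bz2|z|zip)$
```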
Splunk recognizes the following archive filetypes: tar, gz, bz2, tar.gz, tgz, tbz, tbz2, zip, and z.
For more information on setting blacklist rules, see "Whitelist or blacklist specific incoming data" in this manual.
On a related issue, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in the file. This can result in duplication of events.
How Splunk recognizes log rotation
The monitoring processor picks up new files and reads the first 256 bytes of each file. This data is hashed into a begin and end cyclic redundancy check (CRC), which functions as a fingerprint representing the file content. Splunk uses this CRC to look up an entry in a database that contains the beginning CRCs of all files Splunk has seen before. If the lookup succeeds, it returns a few values; the important ones are a seekAddress, the number of bytes into the known file that Splunk has already read, and a seekCRC, a fingerprint of the data at that location.
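As a rough sketch of the two fingerprints (hypothetical code, not Splunk's actual implementation), using Python's `zlib.crc32` and a plain dict standing in for the internal CRC database:

```python
import zlib

CRC_LEN = 256  # default number of bytes hashed from the start of a file

def begin_crc(path):
    """Fingerprint of the file's first CRC_LEN bytes."""
    with open(path, "rb") as f:
        return zlib.crc32(f.read(CRC_LEN))

def seek_crc(path, seek_address):
    """Fingerprint of the bytes ending at seek_address, i.e. the last
    data read when the file was previously consumed."""
    with open(path, "rb") as f:
        start = max(0, seek_address - CRC_LEN)
        f.seek(start)
        return zlib.crc32(f.read(seek_address - start))

# Stand-in for the internal database: begin CRC -> (seekAddress, seekCRC)
crc_db = {}
```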
Using the results of this lookup, Splunk can attempt to categorize the file.
There are three possible outcomes of a CRC check:
1. There is no matching record for the CRC from the file beginning in the database. This indicates a new file. Splunk picks it up and consumes its data from the start of the file. As the file is consumed, Splunk updates the database with the new CRCs and seekAddresses.
2. There is a matching record for the CRC from the file beginning in the database, the content at the seekAddress location matches the stored seekCRC for that location in the file, and the size of the file is larger than the stored seekAddress. This means that while Splunk has seen the file before, data has been added since it was last read. Splunk opens the file, seeks to the seekAddress (the end of the file when Splunk last finished with it), and starts reading from there. In this way, Splunk reads only the new data, not anything it has read before.
3. There is a matching record for the CRC from the file beginning in the database, but the content at the seekAddress location does not match the stored seekCRC at that location in the file. This means that Splunk has previously read some file with the same initial data, but either some of the material it read has since been modified in place, or this is in fact a wholly different file that simply begins with the same content. Since Splunk's content-tracking database is keyed to the beginning CRC, it has no way to track progress independently for the two different data streams, and further configuration is required.
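The three outcomes can be sketched as a single classification step (again hypothetical and simplified; the real bookkeeping is more involved, and a fourth "unchanged" result is added here for a file that matched but has not grown):

```python
import os
import zlib

CRC_LEN = 256  # bytes hashed from the start of the file

def crc_ending_at(f, end):
    """CRC of the (up to) CRC_LEN bytes that end at byte offset `end`."""
    start = max(0, end - CRC_LEN)
    f.seek(start)
    return zlib.crc32(f.read(end - start))

def classify(path, crc_db):
    """Classify a monitored file as "new", "grown", "unchanged", or "ambiguous".

    `crc_db` stands in for the internal database: it maps a begin CRC
    to the stored (seekAddress, seekCRC) pair.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        begin = zlib.crc32(f.read(CRC_LEN))
        entry = crc_db.get(begin)
        if entry is None:
            # Outcome 1: new file -- consume from byte 0 and record progress
            crc_db[begin] = (size, crc_ending_at(f, size))
            return "new"
        seek_address, stored_seek_crc = entry
        if crc_ending_at(f, seek_address) != stored_seek_crc:
            # Outcome 3: same begin CRC, different data at seekAddress --
            # modified in place, or a different file with the same header
            return "ambiguous"
        if size > seek_address:
            # Outcome 2: known file that has grown -- resume at seek_address
            return "grown"
        # Matching seekCRC but no new bytes: nothing new to read
        return "unchanged"
```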
Important: Since the CRC start check is run against only the first 256 bytes of the file by default, it is possible for non-duplicate files to have duplicate start CRCs, particularly if the files have identical headers. To handle such situations, you can:
- Use the initCrcLength attribute to increase the number of bytes used for the CRC calculation, making it longer than your static header.
- Use the crcSalt attribute when configuring the file in inputs.conf, as described in "Edit inputs.conf" in this manual. The crcSalt attribute ensures that each file has a unique CRC. The effect of this setting is that each pathname is assumed to contain unique content. Do not use this attribute with rolling log files, or in any other scenario in which logfiles are renamed or moved to another monitored location, because it defeats Splunk's ability to recognize rolling logs and will cause Splunk to re-index the data.
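The two attributes might be applied in inputs.conf like this (the monitored paths and the initCrcLength value are hypothetical; size the value to exceed your actual header length):

```
# Hash more than the 256-byte default so the CRC window clears a long static header
[monitor:///var/log/app.log]
initCrcLength = 1024

# Or treat each pathname as unique content -- do not use with rolling logs
[monitor:///var/log/static-header-logs]
crcSalt = <SOURCE>
```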
This documentation applies to the following versions of Splunk: 4.2, 4.2.1, 4.2.2, 4.2.3, 4.2.4, 4.2.5, 4.3, 4.3.1, 4.3.2, 4.3.3, 4.3.4, 4.3.5, 4.3.6, 4.3.7, 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6, 6.0