Getting Data In

 


How Splunk handles log file rotation

NOTE - Splunk version 4.x reached its End of Life on October 1, 2013.

This documentation does not apply to the most recent version of Splunk.

Splunk recognizes when a file that it is monitoring (such as /var/log/messages) has been rolled (for example, to /var/log/messages1) and does not read the rolled file a second time.

How to work with log rotation into compressed files

Splunk does not identify compressed files produced by logrotate (such as bz2 or gz) as being the same as the uncompressed originals. This can lead to a duplication of data if those files are then monitored by Splunk.

To avoid this problem, you can choose from two approaches:

  • Configure logrotate to move those files into a directory that Splunk is not monitoring.
  • Set blacklist rules for archive filetypes to prevent Splunk from reading those files as new logfiles.
Example:
 blacklist = \.(gz|bz2|z|zip)$ 
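
As a quick check of what the example pattern excludes, the following Python sketch applies the same regular expression to a few file names. The file names are hypothetical, and Python's re module stands in for Splunk's own regex matching, which is an assumption that should hold for a pattern this simple.

```python
import re

# Same pattern as the blacklist example above.
BLACKLIST = re.compile(r"\.(gz|bz2|z|zip)$")

# Hypothetical file names, for illustration only.
candidates = [
    "/var/log/messages",
    "/var/log/messages1",
    "/var/log/messages.1.gz",
    "/var/log/app.log.bz2",
    "/var/log/archive.zip",
    "/var/log/backup.tgz",
]

def is_blacklisted(path):
    """Return True if the path matches the blacklist pattern."""
    return BLACKLIST.search(path) is not None

skipped = [p for p in candidates if is_blacklisted(p)]
print(skipped)
```

In this sketch the .gz, .bz2, and .zip names are skipped, while backup.tgz is not, because the pattern requires the extension to follow the dot directly. If such archives should also be excluded, the alternation would need to list them.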

Splunk recognizes the following archive filetypes: tar, gz, bz2, tar.gz, tgz, tbz, tbz2, zip, and z.

For more information on setting blacklist rules see "Whitelist or blacklist specific incoming data" in this manual.

On a related issue, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in the file. This can result in duplication of events.

How Splunk recognizes log rotation

The monitoring processor picks up new files and reads the first 256 bytes of the file. This data is hashed into a begin and end cyclic redundancy check (CRC), which functions as a fingerprint representing the file content. Splunk uses this CRC to look up an entry in a database that contains all the beginning CRCs of files Splunk has seen before. If the lookup succeeds, it returns several values. The important ones are a seekAddress, which is the number of bytes into the known file that Splunk has already read, and a seekCRC, which is a fingerprint of the data at that location.

Using the results of this lookup, Splunk attempts to categorize the file.

There are three possible outcomes of a CRC check:

1. There is no matching record for the CRC from the file beginning in the database. This indicates a new file. Splunk will pick it up and consume its data from the start of the file. Splunk updates the database with the new CRCs and Seek Addresses as the file is being consumed.

2. There is a matching record for the CRC from the file beginning in the database, the content at the Seek Address location matches the stored CRC for that location in the file, and the size of the file is larger than the Seek Address that Splunk stored. This means that Splunk has seen the file before, but data has been added to it since it was last read. Splunk opens the file, seeks to the Seek Address (the end of the file when Splunk last finished with it), and starts reading from there. In this way, Splunk reads only the new data and nothing it has read before.

3. There is a matching record for the CRC from the file beginning in the database, but the content at the Seek Address location does not match the stored CRC at that location in the file. This means that Splunk has previously read some file with the same initial data, but either some of the material that it read has since been modified in place, or it is in fact a wholly different file which simply begins with the same content. Since Splunk's database for content tracking is keyed to the beginning CRC, it has no way to track progress independently for the two different data streams, and further configuration is required.
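
The three outcomes above can be sketched in Python as follows. This is an illustration under stated assumptions, not Splunk's implementation: zlib.crc32 stands in for whatever hash Splunk uses, the seek CRC is assumed to cover the 256 bytes that end at the Seek Address, the database is a plain dictionary, and all names (classify, record_for, WINDOW) are hypothetical. The final branch, a known file with no new data, is not one of the three outcomes listed above and is included only so the function returns something for that input.

```python
import zlib

WINDOW = 256  # bytes fingerprinted at the file start and at the seek address (assumed)

def crc(data):
    """Stand-in fingerprint; the real hash function is not specified in the text."""
    return zlib.crc32(data)

def record_for(content):
    """Build the database entry stored once `content` has been fully read."""
    return {"seek_address": len(content), "seek_crc": crc(content[-WINDOW:])}

def classify(content, database):
    """Return (outcome, offset to start reading from, or None)."""
    entry = database.get(crc(content[:WINDOW]))
    if entry is None:
        return ("new file", 0)                          # outcome 1
    seek = entry["seek_address"]
    tail = content[max(0, seek - WINDOW):seek]
    if len(content) < seek or crc(tail) != entry["seek_crc"]:
        return ("mismatch at seek address", None)       # outcome 3
    if len(content) > seek:
        return ("known file with new data", seek)       # outcome 2
    return ("known file, nothing new", None)            # not listed in the text

# Hypothetical file contents, for illustration only.
original = b"# app log v1\n" + b"event A\n" * 50
database = {crc(original[:WINDOW]): record_for(original)}

appended = original + b"event B\n"
other = original[:WINDOW] + b"completely different tail\n" * 20
brand_new = b"unrelated content\n" * 30

for name, data in [("brand_new", brand_new), ("appended", appended), ("other", other)]:
    print(name, classify(data, database))
```

Because the dictionary is keyed only by the beginning CRC, the file named other cannot be tracked separately from original, which is the limitation described in outcome 3.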

Important: Because the CRC start check runs against only the first 256 bytes of the file by default, it is possible for non-duplicate files to have duplicate start CRCs, particularly when the files begin with identical headers. To handle such situations, you can:

  • Use the initCrcLength attribute to increase the number of characters used for the CRC calculation, and make it longer than your static header.
  • Use the crcSalt attribute when configuring the file in inputs.conf, as described in "Edit inputs.conf" in this manual. The crcSalt attribute ensures that each file has a unique CRC. The effect of this setting is that each pathname is assumed to contain unique content. You do not want to use this attribute with rolling log files, or any other scenario in which logfiles are renamed or moved to another monitored location, because it defeats Splunk's ability to recognize rolling logs and will cause Splunk to re-index the data.
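
The effect of the two options above can be illustrated with a small Python sketch. It is a simplified model, not Splunk's code: zlib.crc32 stands in for the real hash, the salt is assumed to be mixed in by prepending it to the hashed bytes, and the file contents and paths are hypothetical.

```python
import zlib

def begin_crc(content, init_crc_length=256, salt=b""):
    """Fingerprint the first `init_crc_length` bytes, optionally mixed with a salt."""
    return zlib.crc32(salt + content[:init_crc_length])

# Two hypothetical files that share a 300-byte static header but differ afterwards.
header = b"#" * 299 + b"\n"
file_a = header + b"event from application A\n"
file_b = header + b"event from application B\n"

# Default length: both fingerprints cover only the shared header, so they collide.
collide_default = begin_crc(file_a) == begin_crc(file_b)

# A length longer than the header reaches the differing content.
collide_longer = (
    begin_crc(file_a, init_crc_length=400) == begin_crc(file_b, init_crc_length=400)
)

# Salting with the path gives each pathname its own fingerprint.
salted_a = begin_crc(file_a, salt=b"/var/log/a.log")
salted_b = begin_crc(file_b, salt=b"/var/log/b.log")

# The same salt is why a rolled file (same content, new name) looks new again.
rolled_looks_new = (
    begin_crc(file_a, salt=b"/var/log/a.log")
    != begin_crc(file_a, salt=b"/var/log/a.log.1")
)

print(collide_default, collide_longer, salted_a != salted_b, rolled_looks_new)
```

In this model, the longer initial length separates the two files without tying the fingerprint to the file name, while the path-based salt separates them at the cost of treating a renamed file as new content.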

This documentation applies to the following versions of Splunk: 4.2, 4.2.1, 4.2.2, 4.2.3, 4.2.4, 4.2.5, 4.3, 4.3.1, 4.3.2, 4.3.3, 4.3.4, 4.3.5, 4.3.6, 4.3.7, 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6, 5.0.7, 5.0.8, 5.0.9, 5.0.10


Comments

To me the explanation seems incorrect:
- case 2: matching begin CRC and end CRC, but wrong seekPtr
this is not really an existing file, must be reread as explained under (3)
- case 3: begin CRC is present, but the end CRC does not match (and seekPtr does not match, too, I would add)
this is a recently rotated log file; should be re-read starting from seekPtr, as explained in (2)
- case 4 is missing: matching begin CRC and end CRC, and same seekPtr
this is an existing file, do nothing

Icssupport
November 30, 2012

About the length of the CRC, the default is 256 chars, but since 5.0 you can increase it, with initCrcLength.
see inputs.conf specifications.

Ykherian, Splunker
November 26, 2012

It might be preferable to allow CRC to be calculated over a larger portion of the file (e.g. the first 1024 bytes) rather than using the crcSalt. Using crcSalt= still might allow the same file with a different name to be read in twice.

Supersleepwalker
April 13, 2012
