Getting Data In

 


How Splunk handles log file rotation

Splunk recognizes when a file that it is monitoring (such as /var/log/messages) has been rolled (for example, to /var/log/messages1) and does not read the rolled file a second time.

How to work with log rotation into compressed files

Splunk does not identify compressed files produced by logrotate (such as bz2 or gz) as being the same as the uncompressed originals. This can lead to a duplication of data if those files are then monitored by Splunk.

To avoid this problem, you can choose from two approaches:

  • Configure logrotate to move those files into a directory that Splunk is not monitoring.
  • Set blacklist rules for archive filetypes to prevent Splunk from reading those files as new logfiles.
Example:
 blacklist = \.(gz|bz2|z|zip)$ 
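
Before relying on a blacklist pattern, it can help to check which file names it actually covers. The following is a minimal sketch that tests the example pattern against a few hypothetical paths. It uses Python's re module as a stand-in for Splunk's own regex matching, and it assumes the pattern is applied to the full file path.

```python
import re

# The pattern from the example above. Python's re module is only a stand-in
# here; the assumption is that Splunk applies the pattern to the full path.
BLACKLIST = re.compile(r"\.(gz|bz2|z|zip)$")

# Hypothetical file names, chosen for illustration only.
CANDIDATES = [
    "/var/log/messages",
    "/var/log/messages.1.gz",
    "/var/log/messages.2.bz2",
    "/var/log/archive.tar.gz",
    "/var/log/archive.tgz",
]


def is_blacklisted(path: str) -> bool:
    """Return True if the blacklist pattern matches the end of the path."""
    return BLACKLIST.search(path) is not None


for path in CANDIDATES:
    status = "skipped" if is_blacklisted(path) else "monitored"
    print(f"{path}: {status}")
```

Under these assumptions, the .tgz path is not matched, because the pattern requires the dot to be followed directly by gz, bz2, z, or zip. If logrotate or another tool produces such files, the pattern would need to be extended to cover them.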

Splunk recognizes the following archive filetypes: tar, gz, bz2, tar.gz, tgz, tbz, tbz2, zip, and z.

For more information on setting blacklist rules, see "Whitelist or blacklist specific incoming data" in this manual.

On a related issue, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in the file. This can result in duplication of events.

How Splunk recognizes log rotation

The monitoring processor picks up new files and reads the first and last 256 bytes of the file. This data is hashed into a begin and end cyclic redundancy check (CRC). Splunk checks new CRCs against a database that contains all the CRCs of files Splunk has seen before. The location Splunk last read in the file, known as the file's seekPtr, is also stored.
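
The begin and end checks described above can be sketched as follows. This is an illustration only: it uses zlib.crc32 as a stand-in hash, and the function name begin_end_crc is hypothetical. The hash Splunk actually uses is not specified in this topic.

```python
import os
import zlib

CRC_LENGTH = 256  # bytes hashed at each end of the file, per the text above


def begin_end_crc(path: str, length: int = CRC_LENGTH) -> tuple[int, int]:
    """Return (begin_crc, end_crc) over the first and last `length` bytes.

    zlib.crc32 is an assumption made for this sketch, not Splunk's algorithm.
    """
    with open(path, "rb") as f:
        head = f.read(length)
        f.seek(0, os.SEEK_END)           # jump to the end to learn the size
        size = f.tell()
        f.seek(max(0, size - length))    # start of the final window
        tail = f.read(length)
    return zlib.crc32(head), zlib.crc32(tail)
```

For a file shorter than 256 bytes, both windows cover the whole file, so the two values are equal in this sketch.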

There are three possible outcomes of a CRC check:

1. There is no begin and end CRC matching this file in the database. This indicates a new file. Splunk will pick it up and consume its data from the start of the file. Splunk updates the database with the new CRCs and seekPtrs as the file is being consumed.

2. The begin CRC and the end CRC are both present, but the size of the file is larger than the seekPtr Splunk stored. This means that, while Splunk has seen the file before, there has been data added to it since it was last read. Splunk opens the file, seeks to the previous end of the file, and starts reading from there. In this way, Splunk will only grab the new data and not anything it has read before.

3. The begin CRC is present, but the end CRC does not match. This means that Splunk has previously read the file but that some of the material that it read has since changed. In this case, Splunk must re-read the whole file.
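
The three outcomes above can be sketched as a small decision function. This is a simplified model, not Splunk's implementation. It works on in-memory bytes rather than files, uses zlib.crc32 as a stand-in hash, and assumes that the end CRC is compared over the 256 bytes ending at the stored seekPtr, which this topic does not spell out. The names Record, classify, and remember are hypothetical.

```python
import zlib
from dataclasses import dataclass

CRC_LENGTH = 256


@dataclass
class Record:
    end_crc: int   # CRC of the bytes just before seek_ptr at last read
    seek_ptr: int  # offset Splunk last read up to


def crc_at(data: bytes, end: int, length: int = CRC_LENGTH) -> int:
    """CRC of the `length` bytes of `data` that end at offset `end`."""
    return zlib.crc32(data[max(0, end - length):end])


def classify(data: bytes, db: dict[int, Record]) -> str:
    """Classify file content against the database, keyed by begin CRC."""
    record = db.get(zlib.crc32(data[:CRC_LENGTH]))
    if record is None:
        return "new"        # outcome 1: read from the start
    # Assumption: a file shorter than seek_ptr is treated like a mismatch.
    if len(data) < record.seek_ptr or crc_at(data, record.seek_ptr) != record.end_crc:
        return "reread"     # outcome 3: previously read content changed
    if len(data) > record.seek_ptr:
        return "resume"     # outcome 2: read only from seek_ptr onward
    return "unchanged"      # not listed above; assumed to mean nothing to do


def remember(data: bytes, db: dict[int, Record]) -> None:
    """Store the CRCs and seek pointer after consuming all of `data`."""
    db[zlib.crc32(data[:CRC_LENGTH])] = Record(crc_at(data, len(data)), len(data))


db: dict[int, Record] = {}
original = b"header " * 50 + b"line one\n" * 40
print(classify(original, db))                  # first sighting
remember(original, db)
print(classify(original + b"line two\n", db))  # data appended since last read
print(classify(original[:-1] + b"X", db))      # already-read data modified
```

Keying the database on the begin CRC rather than on the file name is what lets a rolled file be recognized under its new name in this model.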

Important: Because the begin CRC is computed over only the first 256 bytes of the file, distinct files can produce the same begin CRC, particularly when they start with identical headers. To handle such situations, you can:

  • Use the initCrcLength attribute to increase the number of characters used for the CRC calculation, and make it longer than your static header.
  • Use the crcSalt attribute when configuring the file in inputs.conf, as described in "Edit inputs.conf" in this manual. The crcSalt attribute ensures that each file has a unique CRC. You do not want to use this attribute with rolling log files, however, because it defeats Splunk's ability to recognize rolling logs and will cause Splunk to re-index the data.
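
The effect of both attributes can be illustrated with a small sketch. It uses zlib.crc32 as a stand-in hash and simply prepends the salt to the hashed bytes. Both choices are assumptions made for illustration, as are the function name begin_crc and the file names and contents.

```python
import zlib


def begin_crc(data: bytes, length: int = 256, salt: bytes = b"") -> int:
    """CRC over an optional salt plus the first `length` bytes of `data`.

    Prepending the salt is an assumption; this topic does not describe how
    Splunk combines crcSalt with the file content.
    """
    return zlib.crc32(salt + data[:length])


# Two hypothetical files that share a 300-byte static header.
header = b"#" * 300
file_a = header + b"events from host A\n"
file_b = header + b"events from host B\n"

# Default length: both windows fall inside the shared header.
print(begin_crc(file_a) == begin_crc(file_b))

# A longer window, in the spirit of initCrcLength, reaches past the header.
print(begin_crc(file_a, length=1024) == begin_crc(file_b, length=1024))

# A per-file salt separates the two files even with the default length.
print(begin_crc(file_a, salt=b"/var/log/a.log")
      == begin_crc(file_b, salt=b"/var/log/b.log"))

# The same content under a new name also gets a new CRC when the salt
# depends on the name, which is why salting interferes with rolled logs.
print(begin_crc(file_a, salt=b"/var/log/app.0")
      == begin_crc(file_a, salt=b"/var/log/app.1"))
```

In this model only the first comparison is equal. The last comparison shows the drawback noted above: content that has merely been renamed by rotation looks like a new file.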

This documentation applies to the following versions of Splunk: 4.2, 4.2.1, 4.2.2, 4.2.3, 4.2.4, 4.2.5, 4.3, 4.3.1, 4.3.2, 4.3.3, 4.3.4, 4.3.5, 4.3.6, 5.0, 5.0.1, 5.0.2.


Comments

To me the explanation seems incorrect:
- case 2: matching begin CRC and end CRC, but wrong seekPtr
this is not really an existing file, must be reread as explained under (3)
- case 3: begin CRC is present, but the end CRC does not match (and seekPtr does not match, too, I would add)
this is a recently rotated log file; should be re-read starting from seekPtr, as explained in (2)
- case 4 is missing: matching begin CRC and end CRC, and same seekPtr
this is an existing file, do nothing

Icssupport
November 30, 2012

About the length of the CRC: the default is 256 chars, but since 5.0 you can increase it with initCrcLength.
See the inputs.conf specification.

Ykherian, Splunker
November 26, 2012

It might be preferable to allow CRC to be calculated over a larger portion of the file (e.g. the first 1024 bytes) rather than using the crcSalt. Using crcSalt= still might allow the same file with a different name to be read in twice.

Supersleepwalker
April 13, 2012
