Archive indexed data

Note: Although SmartStore indexes do not usually contain cold buckets, you still use the attributes described here (coldToFrozenDir and coldToFrozenScript) to archive SmartStore buckets as they roll directly from warm to frozen. See Configure data retention for SmartStore indexes.

You can configure the indexer to archive your data automatically as it ages; specifically, at the point when it rolls to "frozen". To do this, you configure indexes.conf.

Caution: By default, the indexer deletes all frozen data. It removes the data from the index at the moment it becomes frozen. If you need to keep the data around, you must configure the indexer to archive the data before removing it. You do this by either setting the coldToFrozenDir attribute or specifying a valid coldToFrozenScript in indexes.conf.

For detailed information on data storage, see How the indexer stores indexes. For information on editing indexes.conf, see Configure index storage.

How the indexer archives data

The indexer rotates old data out of the index based on your data retirement policy, as described in Set a retirement and archiving policy. Data moves through several stages, which correspond to file directory locations. Data starts out in the hot database, located as subdirectories ("buckets") under $SPLUNK_HOME/var/lib/splunk/defaultdb/db/. It then moves to the warm database, also located as subdirectories under $SPLUNK_HOME/var/lib/splunk/defaultdb/db. Eventually, data is aged into the cold database $SPLUNK_HOME/var/lib/splunk/defaultdb/colddb.

Finally, data reaches the frozen state. This can happen for a number of reasons, as described in Set a retirement and archiving policy. At this point, the indexer erases the data from the index. If you want the indexer to archive the frozen data before erasing it from the index, you must specify that behavior. You can choose two ways of handling the archiving:

The archiving behavior depends on which of these indexes.conf attributes you set:

coldToFrozenDir. This attribute specifes a location where the indexer will automatically archive frozen data.
coldToFrozenScript. This attribute specifes a user-supplied script that the indexer will run when the data is frozen. Typically, this will be a script that archives the frozen data. The script can also serve some other purpose altogether. While the indexer ships with one example archiving script that you can edit and use ($SPLUNK_HOME/bin/coldToFrozenExample.py), you can actually specify any script you want the indexer to run.

Note: You can only set one or the other of these attributes. The coldToFrozenDir attribute takes precedence over coldToFrozenScript, if both are set.

If you don't specify either of these attributes, the indexer runs a default script that simply writes the name of the bucket being erased to the log file $SPLUNK_HOME/var/log/splunk/splunkd_stdout.log. It then erases the bucket.

Let the indexer archive the data for you

If you set the coldToFrozenDir attribute in indexes.conf, the indexer will automatically copy frozen buckets to the specified location before erasing the data from the index.

Add this stanza to $SPLUNK_HOME/etc/system/local/indexes.conf:

[<index>]
coldToFrozenDir = <path to frozen archive>

Note the following:

<index> specifies which index contains the data to archive.
<path to frozen archive> specifies the directory where the indexer will put the archived buckets.

Note: When you use Splunk Web to create a new index, you can also specify a frozen archive path for that index. See Create custom indexes for details.

How the indexer archives the frozen data depends on whether the data was originally indexed in a pre-4.2 release:

For buckets created from version 4.2 and on, the indexer will remove all files except for the rawdata file.

For pre-4.2 buckets, the script simply gzip's all the .tsidx and .data files in the bucket.

This difference is due to a change in the format of rawdata. Starting with 4.2, the rawdata file contains all the information the indexer needs to reconstitute an index bucket.

For information on thawing these buckets, see Restore archived indexed data.

Specify an archiving script

If you set the coldToFrozenScript attribute in indexes.conf, the script you specify will run just before the indexer erases the frozen data from the index.

You'll need to supply the actual script. Typically, the script will archive the data, but you can provide a script that performs any action you want.

Add this stanza to $SPLUNK_HOME/etc/system/local/indexes.conf:

[<index>]
coldToFrozenScript = ["<path to program that runs script>"] "<path to script>"

Note the following:

<index> specifies which index contains the data to archive.
<path to script> specifies the path to the archiving script. The script must be in $SPLUNK_HOME/bin or one of its subdirectories.
<path to program that runs script> is optional. You must set it if your script requires a program, such as python, to run it.
If your script is located in $SPLUNK_HOME/bin and is named myColdToFrozen.py, set the attribute like this:

coldToFrozenScript = "$SPLUNK_HOME/bin/python" "$SPLUNK_HOME/bin/myColdToFrozen.py"

For detailed information on the archiving script, see the indexes.conf spec file.

The indexer ships with an example archiving script that you can edit, $SPLUNK_HOME/bin/coldToFrozenExample.py.

Note: If using the example script, edit it to specify the archive location for your installation. Also, rename the script or move it to another location to avoid having changes overwritten when you upgrade the indexer. This is an example script and should not be applied to a production instance without editing to suit your environment and testing extensively.

The example script archives the frozen data differently, depending on whether the data was originally indexed in a pre-4.2 release:

For buckets created from version 4.2 and on, it will remove all files except for the rawdata file.

For pre-4.2 buckets, the script simply gzip's all the .tsidx and .data files.

This difference is due to a change in the format of rawdata. Starting with 4.2, the rawdata file contains all the information the indexer needs to reconstitute an index bucket.

For information on thawing these buckets, see Restore archived indexed data.

As a best practice, make sure the script you create completes as quickly as possible, so that the indexer doesn't end up waiting for the return indicator. For example, if you want to archive to a slow volume, set the script to copy the buckets to a temporary location on the same (fast) volume as the index. Then use a separate script, outside the indexer, to move the buckets from the temporary location to their destination on the slow volume.

Data archiving and indexer clusters

In an Indexer cluster, each individual peer node rolls its buckets to frozen, in the same way that a non-clustered indexer does; that is, based on its own set of configurations. Because all peers in a cluster should be configured identically, all copies of a bucket should roll to frozen at approximately the same time.

However, there can be some variance in the timing, because the same index can grow at different rates on different peers. The cluster performs processing to ensure that buckets freeze smoothly across all peers in the cluster. Specifically, it performs processing so that, if a bucket is frozen on one peer but not on another, the cluster does not initiate fix-up activities for that bucket. See How the cluster handles frozen buckets.

The problem of archiving multiple copies

Because indexer clusters contain multiple copies of each bucket. If you archive the data using the techniques described earlier in this topic, you archive multiple copies of the data.

For example, if you have a cluster with a replication factor of 3, the cluster stores three copies of all its data across its set of peer nodes. If you set up each peer node to archive its own data when it rolls to frozen, you end up with three archived copies of the data. You cannot solve this problem by archiving just the data on a single node, since there's no certainty that a single node contains all the data in the cluster.

The solution to this would be to archive just one copy of each bucket on the cluster and discard the rest. However, in practice, it is not feasible to create a process to reliably do so.

Specifying the archive destination

If you choose to archive multiple copies of the clustered data, you must guard against name collisions. You cannot route the data from all peer nodes into a single archive directory, because multiple, identically named copies of the bucket will exist across the cluster (for deployments where replication factor >= 2), and the contents of a directory must be named uniquely. Instead, you need to ensure that the buckets from each of your peer nodes go to a separate archive directory. This, of course, will be somewhat difficult to manage if you specify a destination directory in shared storage by means of the coldToFrozenDir attribute in indexes.conf, because the indexes.conf file must be the same across all peer nodes, as discussed in Configure the peer indexes in an indexer cluster. One alternative approach would be to create a script that directs each peer's archived buckets to a separate location on the shared storage, and then use the coldToFrozenScript attribute to specify that script.

Related answers from Splunk Community

Archive indexed data

How the indexer archives data

Let the indexer archive the data for you

Specify an archiving script

Data archiving and indexer clusters

The problem of archiving multiple copies

Specifying the archive destination

Comments

Archive indexed data

Was this topic useful?