Reduce tsidx disk usage

The tsidx retention policy determines how long the indexer retains the tsidx files that it uses to search efficiently and quickly across its data. By default, the indexer retains the tsidx files for all its indexed data for as long as it retains the data itself. By adjusting the policy to remove tsidx files associated with older data, you can set the optimal trade-off between storage costs and search performance.

The indexer stores tsidx files in buckets alongside the rawdata files. The tsidx files are vital for efficient searching across large amounts of data. They also occupy substantial amounts of storage.

For data that you are regularly running searches across, you absolutely need the tsidx files. However, if you have data that requires only infrequent searching as it ages, you can adjust the tsidx retention policy to reduce the tsidx files once they reach a specified age. This allows you to reduce the disk space that your indexed data occupies.

The tsidx reduction process eliminates the full-size tsidx files and replaces them with mini versions of those files that contain essential metadata. The rawdata files and some other metadata files remain untouched. You can continue to search across the aged data, if necessary, but such searches will exhibit significantly worse performance. Rare term searches, in particular, will run slowly.

To summarize, the main use case for tsidx reduction is an environment where most searches run against recent data. In that case, fast access to older data might not be worth the cost of storing the tsidx files. By reducing tsidx files for older data, you incur little performance hit for most searches while gaining large savings in disk usage.

Estimate the storage savings

Tsidx reduction replaces a bucket's full-size tsidx files with smaller versions of those files, known as mini-tsidx files. It also eliminates the bucket's merged_lexicon.lex file.

The full-size tsidx files usually constitute a large portion of the overall bucket size. The exact amount depends on the type of data. Data with many unique terms requires larger tsidx files. As a general guideline, the tsidx reduction process decreases bucket size by approximately one-third to two-thirds. For example, a 1GB bucket decreases in size to somewhere between 350MB and 700MB.

To make a rough estimate of a bucket's reduction potential, look at the size of its merged_lexicon.lex file. The merged_lexicon.lex file is an indicator of the number of unique terms in a bucket's data. Buckets with larger merged_lexicon.lex files have tsidx files that reduce to a greater degree, because of the greater number of unique terms.

The size of a mini-tsidx file is generally about 5% to 10% of the size of the corresponding original, full-size file. As mentioned earlier, however, the overall reduction in bucket size is smaller than that: typically one-third to two-thirds. This is because, in addition to the mini-tsidx files, the reduced bucket retains the rawdata file and a number of metadata files.
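The guidelines above can be turned into a rough back-of-the-envelope calculation. The following Python sketch is illustrative only. The function name and the tsidx_fraction input (the share of the bucket occupied by full-size tsidx files) are assumptions, not values that Splunk reports directly, so you would need to measure that share on your own buckets. The calculation also ignores the additional savings from removing the merged_lexicon.lex file.

```python
def estimate_reduced_bucket_mb(bucket_mb, tsidx_fraction, mini_ratio=0.075):
    """Roughly estimate a bucket's size after tsidx reduction.

    bucket_mb      -- current bucket size in MB
    tsidx_fraction -- assumed share of the bucket taken by full-size tsidx files (0..1)
    mini_ratio     -- mini-tsidx size relative to the full-size file
                      (midpoint of the 5% to 10% guideline by default)
    """
    if not 0 <= tsidx_fraction <= 1:
        raise ValueError("tsidx_fraction must be between 0 and 1")
    full_tsidx_mb = bucket_mb * tsidx_fraction
    mini_tsidx_mb = full_tsidx_mb * mini_ratio
    # Rawdata and other metadata files are untouched; only the tsidx portion shrinks.
    return bucket_mb - full_tsidx_mb + mini_tsidx_mb


# Hypothetical 1000 MB bucket in which tsidx files occupy 60% of the space.
estimate = estimate_reduced_bucket_mb(1000, 0.60)
print(f"Estimated size after reduction: {estimate:.0f} MB")
```

With these assumed inputs, the estimate is about 445 MB, which falls within the one-third to two-thirds guideline.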

How tsidx reduction works

When you enable tsidx reduction, you specify a reduction age, on a per-index basis. When buckets in that index reach the specified age, the indexer reduces their tsidx files.

The reduction process

The tsidx reduction process runs, by default, every ten minutes. It checks each bucket in the index and reduces the tsidx files in any bucket whose most recent event is at least as old as the specified reduction age.

The reduction process runs on only a single bucket at a time. If multiple buckets are ready for reduction, the process handles them sequentially.
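The eligibility rule and the one-bucket-at-a-time behavior can be modeled in a few lines of Python. This is a conceptual sketch, not the indexer's actual implementation. The dictionary keys and function names are hypothetical, and setting a flag stands in for the real reduction work.

```python
import time


def buckets_ready_for_reduction(buckets, reduction_age_secs, now=None):
    """Return unreduced buckets whose most recent event is old enough.

    `buckets` is a list of dicts with hypothetical keys:
    'id', 'latest_event_epoch', and 'reduced' (bool).
    """
    now = time.time() if now is None else now
    return [
        b for b in buckets
        if not b["reduced"] and now - b["latest_event_epoch"] >= reduction_age_secs
    ]


def run_reduction_pass(buckets, reduction_age_secs, now=None):
    """Model one pass: eligible buckets are handled one at a time, in order."""
    reduced_ids = []
    for bucket in buckets_ready_for_reduction(buckets, reduction_age_secs, now):
        bucket["reduced"] = True  # stand-in for the real tsidx reduction work
        reduced_ids.append(bucket["id"])
    return reduced_ids


SEVEN_DAYS = 7 * 24 * 60 * 60
now = 1_700_000_000  # fixed timestamp so the example is reproducible
buckets = [
    {"id": "db_1", "latest_event_epoch": now - 10 * 86400, "reduced": False},
    {"id": "db_2", "latest_event_epoch": now - 2 * 86400, "reduced": False},
    {"id": "db_3", "latest_event_epoch": now - 30 * 86400, "reduced": True},
]
print(run_reduction_pass(buckets, SEVEN_DAYS, now))
```

In this example, only db_1 is reduced: db_2 still contains events newer than the reduction age, and db_3 is already reduced.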

The reduction process is fast. For example, when running on a 1GB bucket, it typically completes in just a few seconds.

Once a tsidx file is reduced, it stays reduced. If you disable the tsidx reduction setting or increase the reduction age, the change affects only buckets that are not already reduced. If necessary, however, there is a way to convert reduced buckets back into buckets with full tsidx files. See Restore reduced buckets to their original state.

Effect of reduction on bucket files

The tsidx reduction process eliminates the full-size tsidx files from each targeted bucket after replacing them with mini versions that contain only essential metadata. The mini-tsidx file consists of the header of the original tsidx file, which contains metadata about each event. In addition, tsidx reduction eliminates the bucket's merged_lexicon.lex file.

The bucket retains its rawdata file, along with the mini-tsidx files and certain other metadata files, including the bloomfilter file.

Full-size tsidx files have a .tsidx filename extension. Mini-tsidx files use the .mini.tsidx extension.

Note: The full-size version of the tsidx file gets deleted only after the mini version has been created. This means that the bucket will briefly contain both versions of the file, with the commensurate increase in disk usage.

Effect of reduction on in-progress searches

If a search is in progress on a particular bucket that qualifies for tsidx reduction, the reduction for that bucket is delayed until the search on the bucket completes. The indexer creates the mini-tsidx files, but it does not delete the full-size files until the search completes.

Note: If the indexer is performing a search that ranges across multiple buckets, including one that is ready for reduction, reduction of the bucket might complete before the search reaches it. As expected, when the search does reach the reduced bucket, it will run slowly on that bucket.

Searches across reduced buckets

Once a bucket has undergone tsidx reduction, you can run searches across the bucket, but they will take much longer to complete. Since the indexer searches the most recent buckets first, it will return results from all non-reduced buckets before it reaches the reduced buckets.

When the search hits the reduced buckets, a message appears in Splunk Web to warn users of a potential delay in search completion: "Search on most recent data has completed. Expect slower search speeds as we search the minified buckets."

The following search commands do not work with reduced buckets: typeahead, tstats, and walklex. A warning is added to search.log if such a search touches a reduced bucket: "The full buckets will return results and the reduced buckets will return 0 results." In addition, for the tstats command only, the following message appears in Splunk Web: "Reduced buckets were found in index={index}. Tstats searches are not supported on reduced buckets. Search results will be incorrect."

Note: Tsidx reduction does not touch tsidx files for accelerated data models, which are maintained in their own directories, separate from the index buckets. Therefore, tstats commands that are restricted to an accelerated data model will continue to function normally and are not affected by this feature.

Configure the tsidx retention policy

By default, the indexer retains all tsidx files for the life of the buckets. To change the policy, you must enable tsidx reduction.

You can also change the tsidx retention period from its default of seven days. A bucket gets reduced only when all events in the bucket are older than the retention period.

Configure through Splunk Web

To enable tsidx reduction on an index, edit the index:

1. Navigate to Settings > Indexes.

2. Click the name of the index that you want to edit.

3. Go to the Storage Optimization section of the Edit screen.

4. In the Tsidx Retention Policy field, click Enable Reduction.

5. To modify the default retention period, edit the "Reduce tsidx files older than" field.

6. Click Save.

Configure in indexes.conf

You can enable tsidx reduction by directly editing indexes.conf. You can enable reduction for one or more indexes individually or for all indexes globally.

To enable tsidx reduction for a single index, place the relevant attributes under the index's stanza in indexes.conf. For example, to enable reduction for the "newone" index and to set the retention period to 10 days:

[newone]
enableTsidxReduction = true
timePeriodInSecBeforeTsidxReduction = 864000

To enable tsidx reduction for all indexes, place the settings under the [default] stanza.

You must restart the indexer for the settings to take effect.
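Because timePeriodInSecBeforeTsidxReduction is expressed in seconds, it is easy to make an arithmetic slip when converting from days. The following Python sketch shows one way to generate the stanza text from a retention period in days. The helper name is hypothetical, and the script only prints text for you to review; it does not write to indexes.conf.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tsidx_reduction_stanza(stanza_name, retention_days):
    """Render an indexes.conf stanza that enables tsidx reduction."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    seconds = int(retention_days * SECONDS_PER_DAY)
    return (
        f"[{stanza_name}]\n"
        "enableTsidxReduction = true\n"
        f"timePeriodInSecBeforeTsidxReduction = {seconds}\n"
    )


# Single index with a 10-day retention period (864000 seconds).
print(tsidx_reduction_stanza("newone", 10))

# All indexes, using the [default] stanza and an assumed 30-day period.
print(tsidx_reduction_stanza("default", 30))
```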

Configure through the CLI

To enable tsidx reduction, with a 10-day retention period, on an index called "newone":

splunk edit index newone -enableTsidxReduction true -timePeriodInSecBeforeTsidxReduction 864000

You do not need to restart the indexer after running this command.

Performance impact when you first enable tsidx reduction

Once you enable tsidx reduction, the indexer begins to look for buckets to reduce. It reduces all buckets that exceed the specified retention period. The indexer reduces only one bucket at a time, so performance impact should be minimal.

Determine whether a bucket is reduced

Run the dbinspect search command:

| dbinspect index=_internal

The tsidxState field in the results specifies "full" or "mini" for each bucket.
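If you export the dbinspect results to CSV, you can tally how many buckets are in each state. The following Python sketch assumes a CSV export that includes a tsidxState column. The sample rows are made up for illustration, and real dbinspect output contains many more fields.

```python
import csv
import io
from collections import Counter


def tally_tsidx_state(csv_text):
    """Count buckets per tsidxState value in CSV-exported dbinspect results."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["tsidxState"] for row in reader)


# Made-up sample rows standing in for an exported result set.
sample = """bucketId,tsidxState
_internal~12~AAAA,full
_internal~13~AAAA,mini
_internal~14~AAAA,mini
"""
counts = tally_tsidx_state(sample)
print(dict(counts))
```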

Tsidx reduction and indexer clusters

An indexer cluster runs tsidx reduction in the same way, and according to the same rules and settings, as a standalone indexer. However, since only searchable bucket copies have tsidx files to begin with, reduction only occurs on searchable copies. With tsidx reduction enabled, a searchable bucket copy can contain either a full-size or a mini tsidx file, depending on the age of the bucket.

You must push changes to the tsidx reduction settings by means of the configuration bundle method. This ensures that all peer nodes use the same settings. Tsidx reduction then occurs at approximately the same time for all searchable copies of a reduction-ready bucket, regardless of which peers they reside on.

If, post-reduction, the cluster must convert a non-searchable copy of a reduced bucket to searchable to meet the search factor, there are two ways that the conversion can proceed:

If another searchable copy of the bucket exists in the cluster, the cluster will stream that copy's mini-tsidx files to the non-searchable copy. When streaming is complete, the copy is considered searchable.
If no other searchable copy of the bucket exists, the cluster has no mini-tsidx files available for streaming to the non-searchable copy. In that case, the cluster must first build full-size tsidx files from the non-searchable copy's rawdata file and then reduce the full-size files. There is no way to create mini-tsidx files directly from a rawdata file.
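The two conversion paths described above can be summarized as a simple decision. The following Python sketch is a conceptual model only; the function name and step descriptions are hypothetical and do not correspond to any Splunk API.

```python
def conversion_steps(other_searchable_copy_exists):
    """List the steps that make a non-searchable copy of a reduced bucket searchable."""
    if other_searchable_copy_exists:
        return ["stream mini-tsidx files from an existing searchable copy"]
    # Mini-tsidx files cannot be created directly from rawdata,
    # so the full-size files must be built first and then reduced.
    return [
        "build full-size tsidx files from the rawdata file",
        "reduce the full-size tsidx files to mini-tsidx files",
    ]


for has_copy in (True, False):
    print(has_copy, conversion_steps(has_copy))
```

The second path takes more work, because the cluster must build full-size tsidx files before it can reduce them.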

For more information on how an indexer cluster makes non-searchable copies of a bucket searchable, see Bucket-fixing scenarios.

Restore reduced buckets to their original state

You cannot restore reduced buckets to their original state merely by increasing the age setting for tsidx reduction. That setting does not affect buckets that have already been reduced.

Instead, to revert a bucket with mini-tsidx files to full-size tsidx files:

1. Stop the indexer.

2. In indexes.conf, either disable tsidx reduction or increase the age setting for tsidx reduction beyond the age of the buckets that you want to restore. Otherwise, the bucket will be reduced for a second time soon after you revert it.

3. Run the splunk rebuild command on the bucket:

splunk rebuild <bucket directory>

See "Rebuild a single bucket."

4. Restart the indexer.
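If you need to restore several buckets, it can help to lay out the command sequence before you run anything. The following Python sketch only builds and prints the commands; it does not execute them. The helper name and the bucket path are hypothetical. It assumes that the splunk executable is on your path and that you complete step 2 (editing indexes.conf) manually after stopping the indexer and before running the rebuild commands.

```python
def restore_commands(bucket_dirs, splunk_bin="splunk"):
    """Build the CLI command sequence for restoring reduced buckets.

    The indexes.conf change (step 2) is not covered here and must be
    made by hand after the stop command and before the rebuild commands.
    """
    if not bucket_dirs:
        raise ValueError("provide at least one bucket directory")
    commands = [[splunk_bin, "stop"]]
    commands += [[splunk_bin, "rebuild", d] for d in bucket_dirs]
    commands.append([splunk_bin, "start"])
    return commands


# Hypothetical bucket path, for illustration only.
example_bucket = "/opt/splunk/var/lib/splunk/newone/db/db_1500000000_1490000000_12"
for cmd in restore_commands([example_bucket]):
    print(" ".join(cmd))
```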
