Splunk® Enterprise

Managing Indexers and Clusters of Indexers

Configure bloom filters

This topic describes bloom filters and how Splunk Enterprise uses them to improve search performance, particularly for searches on rare terms.

Before you read this topic, you should be familiar with how the indexer stores data and how data ages after it is indexed. In brief, indexed data resides in database directories that consist of subdirectories called buckets. Each index has its own set of databases. As data ages, it moves through several bucket stages (hot, warm, cold, and frozen). For more information, see "How the indexer stores data" and "How data ages".

Why bloom filters?

A bloom filter is a data structure used to test whether an element is a member of a set. Splunk Enterprise stores the bloom filter as a file on disk in each bucket. When you run a search, especially a search for rare terms, bloom filters can significantly decrease the time it takes to retrieve events from an index.

As the indexer indexes your time-series data, it creates two kinds of files: a compressed file that contains the raw data, broken into events based on timestamps, and a set of time-series index (tsidx) files. The tsidx files are lexicon files that act as a dictionary of all the keywords in your data (error codes, response times, and so on) and contain references to the location of events in the raw data. When you run a search, the indexer searches the tsidx files for the keywords and retrieves the events from the referenced raw data file.
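The lexicon-and-reference idea described above can be illustrated with a small Python sketch. This is only a conceptual model: the real tsidx and raw data formats are not described in this topic, and the names `raw_events`, `build_lexicon`, and `search` are hypothetical.

```python
from collections import defaultdict

# Toy stand-in for the raw data of one bucket (hypothetical sample events).
raw_events = [
    "2019-06-01 12:00:01 status=200 GET /index.html",
    "2019-06-01 12:00:02 status=404 GET /missing.html",
    "2019-06-01 12:00:03 status=200 POST /login",
]


def build_lexicon(events):
    """Map each keyword to the positions of the events that contain it."""
    lexicon = defaultdict(list)
    for position, event in enumerate(events):
        # set() avoids recording the same event twice for a repeated keyword.
        for keyword in set(event.lower().split()):
            lexicon[keyword].append(position)
    return lexicon


def search(lexicon, events, keyword):
    """Look the keyword up in the lexicon, then fetch the referenced events."""
    return [events[position] for position in lexicon.get(keyword.lower(), [])]


lexicon = build_lexicon(raw_events)
print(search(lexicon, raw_events, "status=404"))
```

The search never scans the raw events directly. It consults the lexicon first and reads only the events that the lexicon points to, which is why the cost of a search depends on how many lexicon files must be opened.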

Bloom filters work at the bucket level and use a separate file, bloomfilter, which works like a compact hash table that can tell you that a keyword definitely does not exist in a bucket. When you run a search, the indexer needs to search the tsidx files only in the buckets that the bloom filters do not rule out. The cost of retrieving events from disk grows with the size and number of tsidx files. Because bloom filters decrease the number of tsidx files that the indexer needs to search, they decrease the time it takes to search.

Instead of storing all of the unique keywords found in a bucket's tsidx files, the bloom filter stores only a hash computed for each keyword. Multiple keywords can produce the same hash, so the filter can return false positives but never false negatives. As a result, a bloom filter can quickly rule out terms that definitely do not exist in a particular bucket, and the indexer moves on to the next bucket. If the bloom filter cannot rule out a bucket (the keyword might or might not exist in the bucket), the indexer searches the bucket normally.
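The following Python sketch shows the general technique, not the Splunk Enterprise implementation. It assumes a textbook bloom filter (a bit array with several hash positions per keyword); the on-disk bloomfilter format, the hash functions, and the sizes that Splunk Enterprise uses are not described here. The names `BloomFilter`, `buckets`, and `buckets_to_search` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Textbook bloom filter: a bit array plus several hash positions per key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, keyword):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{keyword}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, keyword):
        for pos in self._positions(keyword):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, keyword):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(keyword)
        )


# Hypothetical buckets: each set stands in for the keywords in a bucket's tsidx files.
buckets = {
    "bucket_a": {"status=200", "get", "/index.html"},
    "bucket_b": {"status=500", "post", "/checkout"},
    "bucket_c": {"status=200", "get", "/about.html"},
}

# Build one filter per bucket.
filters = {}
for name, keywords in buckets.items():
    bloom = BloomFilter()
    for keyword in keywords:
        bloom.add(keyword)
    filters[name] = bloom


def buckets_to_search(filters, keyword):
    """Return the buckets that the filters cannot rule out for this keyword."""
    return [name for name, bloom in filters.items() if bloom.might_contain(keyword)]


candidates = buckets_to_search(filters, "status=500")
print(candidates)
```

A bucket that really contains the keyword always appears in `candidates`, because adding a keyword sets every bit that a later lookup checks. A bucket that does not contain the keyword can occasionally appear as well, if other keywords happened to set the same bits. That is the false positive case, and it costs only an unnecessary normal search of that bucket.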

Configure bloom filters

Note: This section is not relevant to SmartStore indexes. For SmartStore indexes, bloom filters must be enabled and they must use the default path. See About SmartStore.

Bloom filters are created when buckets roll from hot to warm. By default, they are deleted when the buckets roll to frozen, unless you have configured a different retention behavior. This section describes the configuration file parameters you can use to configure and manage your bloom filter files.

To specify whether to use bloom filters, set the use_bloomfilter parameter in limits.conf:

[search]
use_bloomfilter = true|false
* Control whether to use bloom filters to rule out buckets.
* Defaults to True.

To control whether bloom filters are created for a specific index, and where they are stored, edit the following per-index settings in indexes.conf:

bloomHomePath = <path on indexer>
    * The location where the bloom filter files for the index are stored.
    * If specified, it must be defined in terms of a volume definition.
    * If not specified, bloom filter files for the index will be stored inside the bucket directories.
    * The path must be writable.
    * You must restart splunkd after changing this parameter.

createBloomfilter = true|false
* Determines whether to create bloom filter files for the index.
* Defaults to "true".

If you explicitly define bloomHomePath, you must define it in terms of a volume. A volume represents a directory on the file system where indexed data resides. For more information, read "Configure maximum index size".


This documentation applies to the following versions of Splunk® Enterprise: 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.3.0

