Configure bloom filters
This documentation does not apply to the most recent version of Splunk.
This topic describes bloom filters and how Splunk uses them to improve search performance, particularly for rare term searches.
Before you continue reading this topic, you should be familiar with how Splunk stores data and how the data ages once it's in Splunk. Basically, indexed data resides in database directories consisting of subdirectories called buckets. Each index has its own set of databases. As data ages, it moves through several types of buckets (hot, warm, cold, and frozen). Read more about how data ages.
Why bloom filters?
A Bloom filter is a data structure that is used to test whether an element is a member of a set. Our implementation stores the bloom filter as a file on disk in each bucket. When you run a search, especially when you are searching for rare terms, using bloom filters significantly decreases the time it takes to retrieve events from an index. For more details about the bloom filter algorithm and properties, see the "Bloom Filter" topic on Wikipedia ( http://en.wikipedia.org/wiki/Bloom_filter ).
As Splunk indexes your time-series data, it creates a compressed file that contains the raw data broken into events based on timestamps and a set of time series index (tsidx) files. The tsidx files are lexicon files that act as a dictionary of all the keywords in your data (error codes, response times, etc.) and contain references to the location of events in the raw data. When you run a search, Splunk searches the tsidx files for the keywords and retrieves the events from the referenced raw data file.
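The lexicon idea described above can be illustrated with a minimal Python sketch. It is not Splunk's tsidx format: the function names, the whitespace tokenization, and the sample events are all assumptions made for illustration. It only shows the general pattern of looking up a keyword in a dictionary and then fetching the referenced events from the raw data.

```python
from collections import defaultdict


def build_lexicon(events):
    """Map each lowercased keyword to the sorted positions of events containing it."""
    lexicon = defaultdict(set)
    for position, event in enumerate(events):
        # Assumption: keywords are whitespace-separated tokens.
        for keyword in event.lower().split():
            lexicon[keyword].add(position)
    return {keyword: sorted(positions) for keyword, positions in lexicon.items()}


def search(lexicon, events, keyword):
    """Look the keyword up in the lexicon, then fetch only the referenced events."""
    return [events[position] for position in lexicon.get(keyword.lower(), [])]


# Hypothetical raw events, standing in for the compressed raw data file.
raw_events = [
    "2012-01-01 10:00:00 status=200 GET /index.html",
    "2012-01-01 10:00:05 status=404 GET /missing.html",
    "2012-01-01 10:00:09 status=200 GET /about.html",
]

lexicon = build_lexicon(raw_events)
matches = search(lexicon, raw_events, "status=404")
```

In this sketch, a search for a keyword that is not in the lexicon returns an empty list without touching the raw events, but the lexicon itself still has to be consulted. That lookup is the cost a bloom filter aims to avoid.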
Bloom filters work at the bucket level and use a separate file, bloomfilter, which is basically a hash table that can tell you that a keyword definitely does not exist in a bucket. Then, when you run a search, Splunk only needs to search the tsidx files in the buckets that the bloom filters do not rule out. The execution cost of retrieving events from disk grows with the size and number of tsidx files. Because they decrease the number of tsidx files that Splunk needs to search, bloom filters decrease the time it takes to search each bucket.
Instead of storing all of the unique keywords found in a bucket's tsidx files, the bloom filter computes a hash for each keyword. Multiple keywords can result in the same hash, which means that you can have false positives but never false negatives. Because of this, bloom filters can quickly rule out terms that definitely do not exist in a particular bucket and Splunk moves on to searching the next bucket. If the bloom filter cannot rule out a bucket (the keyword may or may not actually exist in the bucket), Splunk searches the bucket normally.
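The behavior described above can be sketched in a few lines of Python. This is a generic textbook bloom filter, not Splunk's on-disk bloomfilter format. The class, the bit-array size, the number of hash functions, the use of SHA-256, and the bucket names are all assumptions chosen for readability. The sketch shows the key property: a keyword that was added is always reported as possibly present, while a keyword that was never added is usually, but not always, ruled out.

```python
import hashlib


class BloomFilter:
    """A minimal bloom filter: a bit array plus several hash positions per keyword."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, keyword):
        # Derive num_hashes positions by salting the keyword with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{keyword}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, keyword):
        for position in self._positions(keyword):
            self.bits[position] = 1

    def might_contain(self, keyword):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(keyword))


def buckets_to_search(bucket_filters, keyword):
    """Return only the buckets whose filter cannot rule the keyword out."""
    return [name for name, bloom in bucket_filters.items() if bloom.might_contain(keyword)]


# Hypothetical buckets and the keywords each one contains.
bucket_keywords = {
    "bucket_a": ["error", "timeout", "status=500"],
    "bucket_b": ["login", "status=200"],
}

bucket_filters = {}
for name, keywords in bucket_keywords.items():
    bloom = BloomFilter()
    for keyword in keywords:
        bloom.add(keyword)
    bucket_filters[name] = bloom

candidates = buckets_to_search(bucket_filters, "timeout")
```

Here `candidates` always includes `bucket_a`, because a bloom filter never produces a false negative. Whether another bucket also appears depends on hash collisions. A larger bit array or a different number of hash functions changes the false-positive rate, at the cost of a larger filter.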
Configuring bloom filters
Bloom filters are created when buckets roll from hot to warm. By default, they are deleted when the buckets roll to frozen, unless you have configured a different retention behavior. This section describes the configuration file parameters you can use to configure and manage your bloomfilter files.
To specify whether to use bloom filters, set the use_bloomfilter parameter in the [search] stanza:

[search]
use_bloomfilter = true|false
* Control whether to use bloom filters to rule out buckets.
* Defaults to True.
To create bloom filters for a specific index, edit the following per-index options:

bloomHomePath = <path on index server>
* An absolute path that contains the bloomfilter files for the index.
* May be different from homePath and may be on any disk drive.
* Defaults to homePath.
* CAUTION: Path must be writable.

createBloomfilter = true|false
* Control whether to create bloomfilter files for the index.
* TRUE: bloomfilter files will be created. FALSE: not created.
* Defaults to TRUE.
Note: A marker file, corrupt.bloomOnly.maker, appears temporarily when the bucket rolls from hot to warm. The name of this file is misleading; it does not indicate a corrupt bloomfilter. No remedial action is needed on your part; you can ignore the file.