How indexing works
Indexing is how Splunk processes the data you send it. Splunk can index any time-series data, which is data that has a timestamp associated with it. If the data does not have a timestamp, Splunk will apply the current time to the data as it indexes it. When data is indexed, Splunk breaks it into events based on its timestamps; you can also specify other event delimiters, such as a regex match or whitespace.
All data that comes into Splunk is indexed through the universal pipeline. Data enters the universal pipeline as large (10,000 bytes) chunks. As part of pipeline processing, these chunks are broken into events. Initially, newline characters signal an event boundary. In the next stage of processing, Splunk applies line merging rules specified in props.conf.
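The two stages described above can be sketched roughly as follows. This is an illustration only, not Splunk's implementation: the function names are hypothetical, and the merge rule (a line that begins with a date starts a new event) is an assumed stand-in for whatever rules props.conf actually specifies.

```python
import re
from typing import Iterable, List

# Assumed rule for illustration: a line starting with a YYYY-MM-DD date
# begins a new event. Real merging rules come from props.conf.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}")


def split_lines(chunk: str) -> List[str]:
    """Stage 1: treat every newline as a provisional event boundary."""
    return [line for line in chunk.split("\n") if line]


def merge_lines(lines: Iterable[str]) -> List[str]:
    """Stage 2: fold lines that do not start a new event into the previous event."""
    events: List[str] = []
    for line in lines:
        if events and not EVENT_START.match(line):
            events[-1] += "\n" + line
        else:
            events.append(line)
    return events


chunk = (
    "2009-07-01 12:00:00 ERROR request failed\n"
    "  at handler.process\n"
    "  at server.run\n"
    "2009-07-01 12:00:05 INFO request ok\n"
)
events = merge_lines(split_lines(chunk))
print(len(events))
```

In this sketch the two indented stack-trace lines are merged into the first event, so the four input lines become two events.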
As part of indexing, events are broken into sections called segments. Splunk uses a list of breaking characters and other rules (such as the maximum number of characters per segment) that are configurable through segmenters.conf.
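As a rough picture of segmentation, the sketch below splits an event on a set of breaking characters and applies a length cap. The breaker set, the cap of 32 characters, and the choice to skip over-long segments are all assumptions made for illustration; the actual values and behavior are defined in segmenters.conf.

```python
import re
from typing import List

# Hypothetical breaking characters and length cap, for illustration only.
BREAKERS = " ,;:=|&/()[]"
MAX_SEGMENT_LEN = 32


def segment(event: str, breakers: str = BREAKERS,
            max_len: int = MAX_SEGMENT_LEN) -> List[str]:
    """Split an event on runs of breaking characters.

    Segments longer than max_len are skipped; that policy is an assumption
    of this sketch, not documented Splunk behavior.
    """
    pattern = "[" + re.escape(breakers) + "]+"
    pieces = re.split(pattern, event)
    return [p for p in pieces if p and len(p) <= max_len]


segments = segment("src=10.1.2.3 user=bob action=login")
print(segments)
```

Because the period is not in the assumed breaker set, the IP address stays together as a single segment here.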
Indexing is an I/O-intensive process. If you're building a system to index a lot of data, Splunk recommends that you account for this I/O load when you plan the system.
The splunk-optimize process
While Splunk is indexing data, one or more instances of the splunk-optimize process run intermittently, merging index files together to optimize search performance. The splunk-optimize process can use a significant amount of CPU, but only for short periods of time; it should not consume CPU indefinitely. You can change the number of concurrent splunk-optimize instances by changing the value of maxConcurrentOptimizes in indexes.conf, but this is not typically necessary.
splunk-optimize should run only on db-hot.
You can run it manually on a warm database if you find one with a large number of .tsidx files (more than 25): ./splunk-optimize <directory>
If splunk-optimize does not run often enough, search efficiency will be affected.
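To find database directories that might benefit from a manual run, you could count their .tsidx files. The following is a minimal sketch, assuming the .tsidx files sit directly inside the database directory; the function names are hypothetical, and the demo uses a throwaway temporary directory rather than a real Splunk database.

```python
import tempfile
from pathlib import Path

TSIDX_THRESHOLD = 25  # "more than 25" from the guidance above


def count_tsidx(db_dir: Path) -> int:
    """Count .tsidx files directly inside a database directory."""
    return sum(1 for p in db_dir.iterdir()
               if p.is_file() and p.suffix == ".tsidx")


def needs_manual_optimize(db_dir: Path,
                          threshold: int = TSIDX_THRESHOLD) -> bool:
    """Return True if the directory holds more .tsidx files than the threshold."""
    return count_tsidx(db_dir) > threshold


# Demo: a temporary directory stands in for a warm database.
with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp)
    for i in range(26):
        (db_dir / f"{i}.tsidx").touch()   # made-up file names
    (db_dir / "rawdata").mkdir()          # non-.tsidx entries are ignored
    demo_count = count_tsidx(db_dir)
    demo_flag = needs_manual_optimize(db_dir)

print(demo_count, demo_flag)
```

The script only reports a count; running ./splunk-optimize on a flagged directory remains a manual step.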
What's in an index?
Splunk stores all processed data in indexes. Indexes, in turn, are stored in databases, which are located in $SPLUNK_HOME/var/lib/splunk. A database is a directory named db_<starttime>_<endtime>_<seq_num>. An index is a collection of database directories.
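The directory naming pattern can be pulled apart with a short parser. This is a sketch under assumptions: it treats all three fields as plain integers (for example, the times as epoch seconds), which the text above does not state, and the names DbDir and parse_db_dir are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DbDir(NamedTuple):
    start_time: int
    end_time: int
    seq_num: int


# Matches db_<starttime>_<endtime>_<seq_num>, assuming integer fields.
DB_NAME = re.compile(r"db_(\d+)_(\d+)_(\d+)")


def parse_db_dir(name: str) -> Optional[DbDir]:
    """Parse a database directory name; return None if it does not fit the pattern."""
    m = DB_NAME.fullmatch(name)
    if m is None:
        return None
    return DbDir(*(int(g) for g in m.groups()))


parsed = parse_db_dir("db_1246000000_1246003600_7")  # made-up example name
print(parsed)
```

Names that do not fit the pattern, such as db-hot, return None rather than raising an error.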
Splunk comes with preconfigured indexes:
- main: the default Splunk index. All processed data is stored here unless otherwise specified.
- splunklogger: Splunk keeps track of its internal logs in this index.
- _internal: this index includes metrics from Splunk's processors.
- sampledata: a small amount of sample data is stored here for training purposes.
- _thefishbucket: internal information on file processing.
- _audit: events from the file system change monitor, auditing, and all user search history.
Read "About managing indexes" in this manual for more information.
This documentation applies to the following versions of Splunk: 4.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4, 4.0.5, 4.0.6, 4.0.7, 4.0.8, 4.0.9, 4.0.10, 4.0.11
