How the indexer stores indexes

As the indexer indexes your data, it creates a number of files. These files contain two types of data:

The raw data in compressed form (rawdata)
Indexes that point to the raw data (index files, also known as tsidx files), plus some metadata files

Together, these files constitute the Splunk Enterprise index. The files reside in sets of directories organized by age. Some directories contain newly indexed data; others contain previously indexed data. The number of such directories can grow quite large, depending on how much data you're indexing.

Why you might care

You might not care, actually. The indexer handles indexed data by default in a way that gracefully ages the data through several stages. After a long period of time, typically several years, the indexer removes old data from your system. You might well be fine with the default scheme it uses.

However, if you're indexing large amounts of data, have specific data retention requirements, or otherwise need to carefully plan your aging policy, you've got to read this topic. Also, to back up your data, it helps to know where to find it. So, read on.

How data ages

Each of the index directories is known as a bucket. To summarize so far:

An "index" contains compressed raw data and associated index files.
An index resides across many age-designated index directories.
An index directory is called a bucket.

A bucket moves through several stages as it ages:

hot
warm
cold
frozen
thawed

As buckets age, they "roll" from one stage to the next. As data is indexed, it goes into a hot bucket. Hot buckets are both searchable and actively being written to. An index can have several hot buckets open at a time.

When certain conditions occur (for example, the hot bucket reaches a certain size or splunkd gets restarted), the hot bucket becomes a warm bucket ("rolls to warm"), and a new hot bucket is created in its place. Warm buckets are searchable, but are not actively written to. An index typically has many warm buckets.

Once further conditions are met (for example, the index reaches some maximum number of warm buckets), the indexer begins to roll the warm buckets to cold, based on their age. It always selects the oldest warm bucket to roll to cold. Buckets continue to roll to cold as they age in this manner. After a set period of time, cold buckets roll to frozen, at which point they are either archived or deleted. By editing attributes in indexes.conf, you can specify the bucket aging policy, which determines when a bucket moves from one stage to the next.
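
The warm-to-cold and cold-to-frozen rules described above can be sketched as a small simulation. This is only an illustration under stated assumptions: `Bucket`, `age_buckets`, `max_warm_buckets`, and `frozen_after_secs` are hypothetical names rather than indexes.conf attributes, and "oldest" is assumed to mean the bucket whose newest event is earliest.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    local_id: int
    newest_time: int  # epoch seconds of the most recent event in the bucket
    stage: str = "warm"


def age_buckets(buckets, max_warm_buckets, frozen_after_secs, now):
    """Apply one pass of a simplified aging policy, modifying buckets in place."""
    # Roll the oldest warm buckets to cold until the warm count is within the limit.
    warm = sorted((b for b in buckets if b.stage == "warm"),
                  key=lambda b: b.newest_time)
    excess = max(len(warm) - max_warm_buckets, 0)
    for bucket in warm[:excess]:
        bucket.stage = "cold"
    # Roll cold buckets to frozen once their newest data is older than the limit.
    for bucket in buckets:
        if bucket.stage == "cold" and now - bucket.newest_time > frozen_after_secs:
            bucket.stage = "frozen"
    return buckets


buckets = [Bucket(1, 1_000), Bucket(2, 2_000), Bucket(3, 3_000)]
age_buckets(buckets, max_warm_buckets=2, frozen_after_secs=10_000, now=5_000)
stages = [b.stage for b in buckets]
print(stages)  # ['cold', 'warm', 'warm']
```

In this sketch, only the oldest warm bucket rolls to cold, because the index holds one more warm bucket than the assumed limit allows. It stays cold until a later pass finds that its data has exceeded the assumed retention period.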

If the frozen data has been archived, it can later be thawed. Thawed data is available for searches.

Here are the stages that buckets age through:

Bucket stage	Description	Searchable?
Hot	Contains newly indexed data. Open for writing. One or more hot buckets for each index.	Yes
Warm	Data rolled from hot. There are many warm buckets. Data is not actively written to warm buckets.	Yes
Cold	Data rolled from warm. There are many cold buckets.	Yes
Frozen	Data rolled from cold. The indexer deletes frozen data by default, but you can choose to archive it instead. Archived data can later be thawed.	No
Thawed	Data restored from an archive. If you archive frozen data, you can later return it to the index by thawing it.	Yes
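
The table can also be read as a small state machine. The following sketch encodes it in Python; the `Stage` enum, the `SEARCHABLE` set, and the `next_stages` helper are illustrative names only and are not part of any Splunk API.

```python
from enum import Enum


class Stage(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"
    THAWED = "thawed"


# Stages whose data can be searched, per the table above.
SEARCHABLE = {Stage.HOT, Stage.WARM, Stage.COLD, Stage.THAWED}


def next_stages(stage, archived=False):
    """Return the stages a bucket can move to from `stage`.

    `archived` only matters for frozen buckets: data that was deleted
    rather than archived cannot be thawed.
    """
    if stage is Stage.HOT:
        return [Stage.WARM]
    if stage is Stage.WARM:
        return [Stage.COLD]
    if stage is Stage.COLD:
        return [Stage.FROZEN]
    if stage is Stage.FROZEN:
        return [Stage.THAWED] if archived else []
    return []  # this sketch does not model what happens to thawed data


print(next_stages(Stage.FROZEN))                 # []
print(next_stages(Stage.FROZEN, archived=True))  # [<Stage.THAWED: 'thawed'>]
```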

The collection of buckets in a particular stage is sometimes referred to as a database or "db": the "hot db", the "warm db", the "cold db", etc.

What the index directories look like

Each index occupies its own directory under $SPLUNK_HOME/var/lib/splunk. The name of the directory is the same as the index name. Under the index directory is a series of subdirectories that categorize the buckets by stage (hot/warm, cold, or thawed).

The buckets themselves are subdirectories within those directories. The bucket directory names are based on the age of the data.

Here is the directory structure for the default index (defaultdb):

Bucket stage	Default location	Notes
Hot	`$SPLUNK_HOME/var/lib/splunk/defaultdb/db/*`	There can be multiple hot subdirectories, one for each hot bucket. See "Bucket naming conventions".
Warm	`$SPLUNK_HOME/var/lib/splunk/defaultdb/db/*`	There are separate subdirectories for each warm bucket. See "Bucket naming conventions".
Cold	`$SPLUNK_HOME/var/lib/splunk/defaultdb/colddb/*`	There are multiple cold subdirectories. When warm buckets roll to cold, they get moved into this directory, but are not renamed.
Frozen	N/A: Frozen data gets deleted or archived into a directory location you specify.	Deletion is the default; see "Archive indexed data" for information on how to archive the data instead.
Thawed	`$SPLUNK_HOME/var/lib/splunk/defaultdb/thaweddb/*`	Location for data that has been archived and later thawed. See "Restore archived data" for information on restoring archived data to a thawed state.

The paths for the hot/warm, cold, and thawed directories are configurable, so, for example, you can store cold buckets in a separate location from hot/warm buckets. See "Configure index storage" and "Use multiple partitions for index data".
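
As a rough illustration of the default layout in the table, the sketch below builds the default directory for each stage. It covers defaults only and does not read any configured paths. The `default_stage_dir` helper and the `/opt/splunk` install location are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Subdirectory that holds each stage's buckets, per the table above.
# Frozen is absent: frozen data is deleted or archived to a location you specify.
STAGE_SUBDIRS = {"hot": "db", "warm": "db", "cold": "colddb", "thawed": "thaweddb"}


def default_stage_dir(splunk_home, index_dir, stage):
    """Return the default directory holding buckets of `stage` for one index."""
    if stage not in STAGE_SUBDIRS:
        raise ValueError(f"no default directory for stage {stage!r}")
    base = PurePosixPath(splunk_home) / "var" / "lib" / "splunk" / index_dir
    return base / STAGE_SUBDIRS[stage]


cold_dir = default_stage_dir("/opt/splunk", "defaultdb", "cold")
print(cold_dir)  # /opt/splunk/var/lib/splunk/defaultdb/colddb
```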

Important: All index locations must be writable.

Note: In pre-6.0 versions of Splunk Enterprise, replicated copies of cluster buckets always resided in the colddb directory, even if they were hot or warm buckets. Starting with 6.0, hot and warm replicated copies reside in the db directory, the same as for non-replicated copies.

Bucket naming conventions

Bucket names depend on:

The stage of the bucket: hot or warm/cold/thawed
The type of bucket directory: non-clustered, clustered originating, or clustered replicated

Important: Bucket naming conventions are subject to change.

Non-clustered buckets

A standalone indexer creates non-clustered buckets. These use one type of naming convention.

Clustered buckets

An indexer that is part of an indexer cluster creates clustered buckets. A clustered bucket has multiple exact copies. The naming convention for clustered buckets distinguishes the types of copies, originating or replicated.

Briefly, a bucket in an indexer cluster has multiple copies, according to its replication factor. When the data enters the cluster, the receiving indexer writes the data to a hot bucket. This receiving indexer is known as the source cluster peer, and the bucket where the data gets written is called the originating copy of the bucket.

As data is written to the hot copy, the source peer streams copies of the hot data, in blocks, to other indexers in the cluster. These indexers are referred to as the target peers for the bucket. The copies of the streamed data on the target peers are known as replicated copies of the bucket.

When the source peer rolls its originating hot bucket to warm, the target peers roll their replicated copies of that bucket. The warm copies are exact replicas of each other.

For an introduction to indexer cluster architecture and replicated data streaming, read "Basic indexer cluster architecture".

Bucket names

These are the naming conventions:

Bucket type	Hot bucket	Warm/cold/thawed bucket
Non-clustered	`hot_v1_<localid>`	`db_<newest_time>_<oldest_time>_<localid>`
Clustered originating	`hot_v1_<localid>`	`db_<newest_time>_<oldest_time>_<localid>_<guid>`
Clustered replicated	`<localid>_<guid>`	`rb_<newest_time>_<oldest_time>_<localid>_<guid>`

Note:

<newest_time> and <oldest_time> are timestamps indicating the age of the data in the bucket. The timestamps are expressed in UTC epoch time (in seconds). For example: db_1223658000_1223654401_2835 is a warm, non-clustered bucket containing data from October 10, 2008. Its timestamps span 16:00:01 to 17:00:00 UTC, which corresponds to the hour from 9am to 10am in a UTC-7 time zone.
<localid> is an ID for the bucket. For a clustered bucket, the originating and replicated copies of the bucket have the same <localid>.
<guid> is the guid of the source peer node. The guid is located in the peer's $SPLUNK_HOME/etc/instance.cfg file.

In an indexer cluster, the originating warm bucket and its replicated copies have identical names, except for the prefix (db for the originating bucket; rb for the replicated copies).
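
Because the copies differ only in their prefix, you can match an originating bucket to its replicated copies by comparing the rest of the name. The following is a minimal sketch; `bucket_identity` and `are_copies` are hypothetical helper names, and the GUID is made up for the example.

```python
def bucket_identity(name):
    """Return the part of a warm bucket name that all copies share."""
    prefix, _, rest = name.partition("_")
    if prefix not in ("db", "rb") or not rest:
        raise ValueError(f"not a warm/cold/thawed bucket name: {name!r}")
    return rest


def are_copies(name_a, name_b):
    """True if both names refer to copies of the same clustered bucket."""
    return bucket_identity(name_a) == bucket_identity(name_b)


# The GUID below is made up for illustration.
guid = "11111111-2222-3333-4444-555555555555"
originating = f"db_1223658000_1223654401_2835_{guid}"
replicated = f"rb_1223658000_1223654401_2835_{guid}"
print(are_copies(originating, replicated))  # True
```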

Note: In an indexer cluster, when data is streamed from the source peer to a target peer, the data first goes into a temporary directory on the target peer, identified by the hot bucket convention of <localid>_<guid>. This is true for any replicated bucket copy, whether or not the streaming bucket is a hot bucket. For example, during bucket fix-up activities, a peer might stream a warm bucket to other peers. When the replication of that bucket has completed, the <localid>_<guid> directory is rolled into a warm bucket directory, identified by the rb_ prefix.
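
To make the naming conventions concrete, here is a sketch of a parser for the directory names in the table. Treat it as an illustration, not a supported tool: `parse_bucket_name` is a hypothetical helper, the GUID pattern (hex digits and hyphens) is an assumption, and the conventions themselves are subject to change.

```python
import re
from datetime import datetime, timezone

# Assumed GUID shape: hex digits and hyphens. Adjust if your GUIDs differ.
_GUID = r"[0-9A-Fa-f-]+"

# One pattern per naming convention in the table above.
_PATTERNS = [
    ("hot", re.compile(r"^hot_v1_(?P<localid>\d+)$")),
    ("warm/cold/thawed", re.compile(
        r"^(?P<prefix>db|rb)_(?P<newest>\d+)_(?P<oldest>\d+)_(?P<localid>\d+)"
        rf"(?:_(?P<guid>{_GUID}))?$")),
    ("replicated hot or in-flight copy", re.compile(
        rf"^(?P<localid>\d+)_(?P<guid>{_GUID})$")),
]


def parse_bucket_name(name):
    """Split a bucket directory name into its parts, or return None."""
    for stage, pattern in _PATTERNS:
        match = pattern.match(name)
        if not match:
            continue
        parts = {k: v for k, v in match.groupdict().items() if v is not None}
        parts["stage"] = stage
        parts["localid"] = int(parts["localid"])
        # Convert epoch seconds to timezone-aware UTC datetimes.
        for key in ("newest", "oldest"):
            if key in parts:
                parts[key] = datetime.fromtimestamp(int(parts[key]), tz=timezone.utc)
        return parts
    return None


info = parse_bucket_name("db_1223658000_1223654401_2835")
print(info["oldest"].isoformat(), info["newest"].isoformat())
# 2008-10-10T16:00:01+00:00 2008-10-10T17:00:00+00:00
```

The parser returns the timestamps in UTC. Convert them to your local time zone if you want to compare them with event times shown in searches.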

Buckets and Splunk Enterprise administration

When you are administering Splunk Enterprise, it helps to understand how the indexer stores indexes across buckets. In particular, several admin activities require a good understanding of buckets:

For information on setting a retirement and archiving policy, see "Set a retirement and archiving policy." You can base the retirement policy on either the size or the age of data.

For information on how to archive your indexed data, see "Archive indexed data". To learn how to restore data from archive, read "Restore archived data."

To learn how to back up your data, read "Back up indexed data." That topic also discusses how to manually roll hot buckets to warm, so that you can then back them up.

For information on setting limits on disk usage, see "Set limits on disk usage."

For a list of configurable bucket settings, see "Configure index storage."

For information on configuring index size, see "Configure index size."

For information on partitioning index data, see "Use multiple partitions for index data."

For information on how buckets function in indexer clusters, see "Buckets and indexer clusters."

In addition, see "indexes.conf" in the Admin Manual.
