How the indexer stores indexes

As the indexer indexes your data, it creates a number of files:

The raw data in compressed form (the rawdata journal)
Indexes that point to the raw data (tsidx files)
Some other metadata files

Together, these files constitute the Splunk Enterprise index. The files reside in sets of directories, or buckets, organized by age. Each bucket contains a rawdata journal, along with associated tsidx and metadata files. The data in each bucket is bounded by a limited time range.

An index typically consists of many buckets, and the number of buckets grows as the index grows. As data continues to enter the system, the indexer creates new buckets to accommodate the increase in data.

Some buckets on the indexer contain newly indexed data; others contain previously indexed data. The number of buckets in an index can grow quite large, depending on how much data you're indexing and how long you retain the data.

Why the details might, or might not, matter to you

By default, the indexer ages indexed data gracefully through several states. After a long period of time, typically several years, the indexer removes old data from your system. You might well be fine with the default scheme it uses.

However, if you are indexing large amounts of data, have specific data retention requirements, or otherwise need to carefully plan your aging policy, you need to read this topic. Also, to back up your data, it helps to know where to find it.

How data ages

A bucket moves through several states as it ages:

hot
warm
cold
frozen
thawed

As buckets age, they "roll" from one state to the next. When data is first indexed, it gets written to a hot bucket. Hot buckets are buckets that are actively being written to. An index can have several hot buckets open at a time. Hot buckets are also searchable.

When certain conditions are met (for example, the hot bucket reaches a certain size or the indexer gets restarted), the hot bucket becomes a warm bucket ("rolls to warm"), and a new hot bucket is created in its place. The warm bucket is renamed but it remains in the same location as when it was a hot bucket. Warm buckets are searchable, but they are not actively written to. There can be a large number of warm buckets.

Once further conditions are met (for example, the index reaches some maximum number of warm buckets), the indexer begins to roll the warm buckets to cold, based on their age. It always selects the oldest warm bucket to roll to cold. Buckets continue to roll to cold as they age in this manner. Cold buckets reside in a different location from hot and warm buckets. You can configure the location so that cold buckets reside on cheaper storage.

Finally, after certain other time-based or size-based conditions are met, cold buckets roll to the frozen state. At that point, the indexer deletes them from the index, optionally archiving them first.

If the frozen data has been archived, it can later be thawed. Data in thawed buckets is available for searches.

Settings in indexes.conf determine when a bucket moves from one state to the next.

Here are the states that buckets age through:

Bucket state	Description	Searchable?
Hot	New data is written to hot buckets. Each index has one or more hot buckets.	Yes
Warm	Buckets rolled from hot. New data is not written to warm buckets. An index has many warm buckets.	Yes
Cold	Buckets rolled from warm and moved to a different location. An index has many cold buckets.	Yes
Frozen	Buckets rolled from cold. The indexer deletes frozen buckets, but you can choose to archive them first. Archived buckets can later be thawed.	No
Thawed	Buckets restored from an archive. If you archive frozen buckets, you can later return them to the index by thawing them.	Yes
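
The states in the table can be summarized as a small state model. The following Python sketch is illustrative only: the names BucketState, SEARCHABLE, NEXT_STATE, roll, and thaw are hypothetical and are not part of any Splunk API. It models the default lifecycle described in this topic and does not attempt to cover SmartStore indexes.

```python
from enum import Enum


class BucketState(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"
    THAWED = "thawed"


# Searchability of each state, as listed in the table.
SEARCHABLE = {
    BucketState.HOT: True,
    BucketState.WARM: True,
    BucketState.COLD: True,
    BucketState.FROZEN: False,
    BucketState.THAWED: True,
}

# The state that a bucket rolls to as it ages.
NEXT_STATE = {
    BucketState.HOT: BucketState.WARM,
    BucketState.WARM: BucketState.COLD,
    BucketState.COLD: BucketState.FROZEN,
}


def roll(state: BucketState) -> BucketState:
    """Return the state that a bucket in `state` rolls to next."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} buckets do not roll to another state")
    return NEXT_STATE[state]


def thaw(state: BucketState, archived: bool) -> BucketState:
    """Thawing is possible only for frozen buckets that were archived."""
    if state is not BucketState.FROZEN or not archived:
        raise ValueError("only archived frozen buckets can be thawed")
    return BucketState.THAWED
```

For example, roll(BucketState.WARM) returns BucketState.COLD, and thaw(BucketState.FROZEN, archived=False) raises an error because a frozen bucket that was deleted without archiving cannot be restored.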

Note: For indexes enabled with the SmartStore feature, which places data on a remote store such as S3, the cold state does not ordinarily exist. See Bucket states and SmartStore.

What the index directories look like

Each index occupies its own directory under $SPLUNK_HOME/var/lib/splunk. The name of the directory is the same as the index name. Under the index directory is a series of subdirectories that categorize the buckets by state (hot/warm, cold, or thawed).

Each bucket is a subdirectory within those directories. The bucket names indicate the age of the data they contain.

Here is the directory structure for the default index (defaultdb):

Bucket state	Default location	Notes
Hot	`$SPLUNK_HOME/var/lib/splunk/defaultdb/db/*`	Each hot bucket occupies its own subdirectory.
Warm	`$SPLUNK_HOME/var/lib/splunk/defaultdb/db/*`	Each warm bucket occupies its own subdirectory.
Cold	`$SPLUNK_HOME/var/lib/splunk/defaultdb/colddb/*`	Each cold bucket occupies its own subdirectory. When warm buckets roll to cold, they get moved to this directory.
Frozen	When buckets freeze, they get deleted or archived into a location that you specify.	Deletion is the default. See Archive indexed data for information on how to archive the data instead.
Thawed	`$SPLUNK_HOME/var/lib/splunk/defaultdb/thaweddb/*`	Buckets that are archived and later thawed reside in this directory. See Restore archived data for information on restoring archived data to a thawed state.

The paths for the hot/warm, cold, and thawed directories are configurable. See Configure index storage and Use multiple partitions for index data.
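
To make the default layout concrete, here is a minimal Python sketch that builds the default directory paths for an index and lists the bucket subdirectories found in each. The function names are hypothetical, the install path /opt/splunk in the usage note is an assumed example, and the sketch covers only the default locations, not paths that you have reconfigured. It keeps only subdirectories whose names begin with hot_, db_, or rb_, which is an assumption based on the naming conventions described later in this topic.

```python
from pathlib import Path

# Assumed name prefixes for bucket directories (see "Bucket naming conventions").
BUCKET_PREFIXES = ("hot_", "db_", "rb_")


def default_bucket_dirs(splunk_home: str, index_dir: str = "defaultdb") -> dict:
    """Map each on-disk state group to its default directory."""
    base = Path(splunk_home) / "var" / "lib" / "splunk" / index_dir
    return {
        "hot/warm": base / "db",
        "cold": base / "colddb",
        "thawed": base / "thaweddb",
    }


def list_buckets(splunk_home: str, index_dir: str = "defaultdb") -> dict:
    """List bucket subdirectory names under each default state directory."""
    found = {}
    for group, path in default_bucket_dirs(splunk_home, index_dir).items():
        if path.is_dir():
            found[group] = sorted(
                p.name
                for p in path.iterdir()
                if p.is_dir() and p.name.startswith(BUCKET_PREFIXES)
            )
        else:
            # The directory is missing or not readable at the default location.
            found[group] = []
    return found
```

Calling list_buckets("/opt/splunk") returns a dictionary with the keys "hot/warm", "cold", and "thawed", each holding a sorted list of bucket directory names. Frozen buckets do not appear because they are either deleted or archived to a location that you specify.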

All index locations must be writable.

Note: In pre-6.0 versions of Splunk Enterprise, replicated copies of indexer cluster buckets always resided in the colddb directory, even if they were hot or warm buckets. Starting with 6.0, hot and warm replicated copies reside in the db directory, the same as for non-replicated copies.

Bucket naming conventions

Bucket names depend on:

The state of the bucket: hot or warm/cold/thawed
The type of bucket directory: non-clustered, clustered originating, or clustered replicated

Important: Bucket naming conventions are subject to change.

Non-clustered buckets

A standalone indexer creates non-clustered buckets. These use one type of naming convention.

Clustered buckets

An indexer that is part of an indexer cluster creates clustered buckets. A clustered bucket has multiple exact copies. The naming convention for clustered buckets distinguishes the types of copies, originating or replicated.

Briefly, a bucket in an indexer cluster has multiple copies, according to its replication factor. When the data enters the cluster, the receiving indexer writes the data to a hot bucket. This receiving indexer is known as the source cluster peer, and the bucket where the data gets written is called the originating copy of the bucket.

As data is written to the hot copy, the source peer streams copies of the hot data, in blocks, to other indexers in the cluster. These indexers are referred to as the target peers for the bucket. The copies of the streamed data on the target peers are known as replicated copies of the bucket.

When the source peer rolls its originating hot bucket to warm, the target peers roll their replicated copies of that bucket. The warm copies are exact replicas of each other.

For an introduction to indexer cluster architecture and replicated data streaming, read Basic indexer cluster architecture.

Bucket names

These are the naming conventions:

Bucket type	Hot bucket	Warm/cold/thawed bucket
Non-clustered	`hot_v1_<localid>`	`db_<newest_time>_<oldest_time>_<localid>`
Clustered originating	`hot_v1_<localid>`	`db_<newest_time>_<oldest_time>_<localid>_<guid>`
Clustered replicated	`<localid>_<guid>`	`rb_<newest_time>_<oldest_time>_<localid>_<guid>`

Note:

<newest_time> and <oldest_time> are timestamps indicating the age of the data in the bucket. The timestamps are expressed in UTC epoch time (in seconds). For example, db_1223658000_1223654401_2835 is a warm, non-clustered bucket containing data from October 10, 2008, covering the period from 4 p.m. to 5 p.m. UTC.
<localid> is an ID for the bucket. For a clustered bucket, the originating and replicated copies of the bucket have the same <localid>.
<guid> is the guid of the source peer node. The guid is located in the peer's $SPLUNK_HOME/etc/instance.cfg file.
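
The conventions above can be decoded programmatically. The following Python sketch parses a bucket directory name into its parts and converts the timestamps to UTC datetimes. It is an illustration only: BucketName and parse_bucket_name are hypothetical names, the patterns assume that a GUID contains no underscores, and because bucket naming conventions are subject to change, the patterns might not match every version.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class BucketName(NamedTuple):
    kind: str                    # "db", "rb", "hot", or "replicated_hot"
    local_id: int
    newest: Optional[datetime]   # None for hot buckets
    oldest: Optional[datetime]   # None for hot buckets
    guid: Optional[str]          # None for non-clustered buckets


# db_/rb_ names: <prefix>_<newest_time>_<oldest_time>_<localid>[_<guid>]
_ROLLED = re.compile(r"^(db|rb)_(\d+)_(\d+)_(\d+)(?:_([^_]+))?$")
# Hot originating or non-clustered names: hot_v1_<localid>
_HOT = re.compile(r"^hot_v1_(\d+)$")
# Hot replicated names: <localid>_<guid>
_REPLICATED_HOT = re.compile(r"^(\d+)_([^_]+)$")


def _utc(seconds: str) -> datetime:
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


def parse_bucket_name(name: str) -> BucketName:
    m = _ROLLED.match(name)
    if m:
        prefix, newest, oldest, local_id, guid = m.groups()
        return BucketName(prefix, int(local_id), _utc(newest), _utc(oldest), guid)
    m = _HOT.match(name)
    if m:
        return BucketName("hot", int(m.group(1)), None, None, None)
    m = _REPLICATED_HOT.match(name)
    if m:
        return BucketName("replicated_hot", int(m.group(1)), None, None, m.group(2))
    raise ValueError(f"unrecognized bucket name: {name!r}")
```

For the example name db_1223658000_1223654401_2835, the parser returns a local ID of 2835, no GUID, an oldest time of 2008-10-10 16:00:01 UTC, and a newest time of 2008-10-10 17:00:00 UTC.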

In an indexer cluster, the originating warm bucket and its replicated copies have identical names, except for the prefix (db for the originating bucket; rb for the replicated copies).
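
Because the copies differ only in their prefix, you can check whether two warm bucket directories are copies of the same bucket by comparing everything after the prefix. This is a minimal Python sketch under that assumption. The function names are hypothetical, and the GUID in the usage note is a made-up placeholder.

```python
def bucket_identity(name: str) -> str:
    """Return the part of a warm bucket name that all of its copies share."""
    prefix, _, rest = name.partition("_")
    if prefix not in ("db", "rb") or not rest:
        raise ValueError(f"not a db_ or rb_ bucket name: {name!r}")
    return rest


def are_copies(name_a: str, name_b: str) -> bool:
    """True if the two directory names refer to copies of the same bucket."""
    return bucket_identity(name_a) == bucket_identity(name_b)
```

For example, with the placeholder GUID 11111111-2222-3333-4444-555555555555, are_copies returns True for db_1223658000_1223654401_2835_11111111-2222-3333-4444-555555555555 and rb_1223658000_1223654401_2835_11111111-2222-3333-4444-555555555555.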

Note: In an indexer cluster, when data is streamed from the source peer to a target peer, the data first goes into a temporary directory on the target peer, identified by the hot bucket convention of <localid>_<guid>. This is true for any replicated bucket copy, whether or not the streaming bucket is a hot bucket. For example, during bucket fix-up activities, a peer might stream a warm bucket to other peers. When the replication of that bucket has completed, the <localid>_<guid> directory is rolled into a warm bucket directory, identified by the rb_ prefix.

Buckets and Splunk Enterprise administration

When you are administering Splunk Enterprise, it helps to understand how the indexer stores indexes across buckets. In particular, several admin activities require a good understanding of buckets:

For information on setting a retirement and archiving policy, see Set a retirement and archiving policy. You can base the retirement policy on either the size or the age of data.

For information on how to archive your indexed data, see Archive indexed data. To learn how to restore data from archive, read Restore archived data.

To learn how to back up your data, read Back up indexed data. That topic also discusses how to manually roll hot buckets to warm, so that you can then back them up.

For information on setting limits on disk usage, see Set limits on disk usage.

For a list of configurable bucket settings, see Configure index storage.

For information on configuring index size, see Configure index size.

For information on partitioning index data, see Use multiple partitions for index data.

For information on how buckets function in indexer clusters, see Buckets and indexer clusters.

For information on buckets and SmartStore, see Bucket states and SmartStore.

In addition, see indexes.conf in the Admin Manual.
