Splunk® Enterprise

Managing Indexers and Clusters of Indexers

Download manual as PDF

Download topic as PDF

How the indexer stores indexes

As the indexer indexes your data, it creates a number of files:

Together, these files constitute the Splunk Enterprise index. The files reside in sets of directories organized by age. Some directories contain newly indexed data; others contain previously indexed data. The number of such directories can grow quite large, depending on how much data you're indexing.

Why you might care

You might not care, actually. The indexer handles indexed data by default in a way that gracefully ages the data through several states. After a long period of time, typically several years, the indexer removes old data from your system. You might well be fine with the default scheme it uses.

However, if you're indexing large amounts of data, have specific data retention requirements, or otherwise need to carefully plan your aging policy, you've got to read this topic. Also, to back up your data, it helps to know where to find it. So, read on....

How data ages

Each of the index directories is known as a bucket. To summarize so far:

  • An "index" contains compressed raw data and associated index files.
  • An index resides across many age-designated index directories.
  • An index directory is called a bucket.

A bucket moves through several states as it ages:

  • hot
  • warm
  • cold
  • frozen
  • thawed

As buckets age, they "roll" from one state to the next. As data is indexed, it goes into a hot bucket. Hot buckets are both searchable and actively being written to. An index can have several hot buckets open at a time.

When certain conditions occur (for example, the hot bucket reaches a certain size or splunkd gets restarted), the hot bucket becomes a warm bucket ("rolls to warm"), and a new hot bucket is created in its place. Warm buckets are searchable, but are not actively written to. There are many warm buckets.

Once further conditions are met (for example, the index reaches some maximum number of warm buckets), the indexer begins to roll the warm buckets to cold, based on their age. It always selects the oldest warm bucket to roll to cold. Buckets continue to roll to cold as they age in this manner. After a set period of time, cold buckets roll to frozen, at which point they are either archived or deleted. By editing attributes in indexes.conf, you can specify the bucket aging policy, which determines when a bucket moves from one state to the next.

If the frozen data has been archived, it can later be thawed. Thawed data is available for searches.

Here are the states that buckets age through:

Bucket state Description Searchable?
Hot Contains newly indexed data. Open for writing. One or more hot buckets for each index. Yes
Warm Data rolled from hot. There are many warm buckets. Data is not actively written to warm buckets. Yes
Cold Data rolled from warm. There are many cold buckets. Yes
Frozen Data rolled from cold. The indexer deletes frozen data by default, but you can choose to archive it instead. Archived data can later be thawed. No
Thawed Data restored from an archive. If you archive frozen data, you can later return it to the index by thawing it. Yes

The collection of buckets in a particular state is sometimes referred to as a database or "db": the "hot db", the "warm db", the "cold db", etc.

What the index directories look like

Each index occupies its own directory under $SPLUNK_HOME/var/lib/splunk. The name of the directory is the same as the index name. Under the index directory are a series of subdirectories that categorize the buckets by state (hot/warm, cold, or thawed).

The buckets themselves are subdirectories within those directories. The bucket directory names are based on the age of the data.

Here is the directory structure for the default index (defaultdb):

Bucket state Default location Notes
Hot $SPLUNK_HOME/var/lib/splunk/defaultdb/db/* There can be multiple hot subdirectories, one for each hot bucket. See Bucket naming conventions.
Warm $SPLUNK_HOME/var/lib/splunk/defaultdb/db/* There are separate subdirectories for each warm bucket. See Bucket naming conventions.
Cold $SPLUNK_HOME/var/lib/splunk/defaultdb/colddb/* There are multiple cold subdirectories. When warm buckets roll to cold, they get moved into this directory, but are not renamed.
Frozen N/A: Frozen data gets deleted or archived into a directory location you specify. Deletion is the default; see Archive indexed data for information on how to archive the data instead.
Thawed $SPLUNK_HOME/var/lib/splunk/defaultdb/thaweddb/* Location for data that has been archived and later thawed. See Restore archived data for information on restoring archived data to a thawed state.

The paths for the hot/warm, cold, and thawed directories are configurable, so, for example, you can store cold buckets in a separate location from hot/warm buckets. See Configure index storage and Use multiple partitions for index data.

All index locations must be writable.

Note: In pre-6.0 versions of Splunk Enterprise, replicated copies of indexer cluster buckets always resided in the colddb directory, even if they were hot or warm buckets. Starting with 6.0, hot and warm replicated copies reside in the db directory, the same as for non-replicated copies.

Bucket states and SmartStore

For indexes eanbled with the SmartStore feature, which places data on a remote store such as S3, the cold state does not ordinarily exist. See Bucket states and SmartStore.

Bucket naming conventions

Bucket names depend on:

  • The state of the bucket: hot or warm/cold/thawed
  • The type of bucket directory: non-clustered, clustered originating, or clustered replicated

Important: Bucket naming conventions are subject to change.

Non-clustered buckets

A standalone indexer creates non-clustered buckets. These use one type of naming convention.

Clustered buckets

An indexer that is part of an indexer cluster creates clustered buckets. A clustered bucket has multiple exact copies. The naming convention for clustered buckets distinguishes the types of copies, originating or replicated.

Briefly, a bucket in an indexer cluster has multiple copies, according to its replication factor. When the data enters the cluster, the receiving indexer writes the data to a hot bucket. This receiving indexer is known as the source cluster peer, and the bucket where the data gets written is called the originating copy of the bucket.

As data is written to the hot copy, the source peer streams copies of the hot data, in blocks, to other indexers in the cluster. These indexers are referred to as the target peers for the bucket. The copies of the streamed data on the target peers are known as replicated copies of the bucket.

When the source peer rolls its originating hot bucket to warm, the target peers roll their replicated copies of that bucket. The warm copies are exact replicas of each other.

For an introduction to indexer cluster architecture and replicated data streaming, read Basic indexer cluster architecture.

Bucket names

These are the naming conventions:

Bucket type Hot bucket Warm/cold/thawed bucket
Non-clustered hot_v1_<localid> db_<newest_time>_<oldest_time>_<localid>
Clustered originating hot_v1_<localid> db_<newest_time>_<oldest_time>_<localid>_<guid>
Clustered replicated <localid>_<guid> rb_<newest_time>_<oldest_time>_<localid>_<guid>

Note:

  • <newest_time> and <oldest_time> are timestamps indicating the age of the data in the bucket. The timestamps are expressed in UTC epoch time (in seconds). For example: db_1223658000_1223654401_2835 is a warm, non-clustered bucket containing data from October 10, 2008, covering the exact period of 9am-10am.
  • <localid> is an ID for the bucket. For a clustered bucket, the originating and replicated copies of the bucket have the same <localid>.
  • <guid> is the guid of the source peer node. The guid is located in the peer's $SPLUNK_HOME/etc/instance.cfg file.

In an indexer cluster, the originating warm bucket and its replicated copies have identical names, except for the prefix (db for the originating bucket; rb for the replicated copies).

Note: In an indexer cluster, when data is streamed from the source peer to a target peer, the data first goes into a temporary directory on the target peer, identified by the hot bucket convention of <localid>_<guid>. This is true for any replicated bucket copy, whether or not the streaming bucket is a hot bucket. For example, during bucket fix-up activities, a peer might stream a warm bucket to other peers. When the replication of that bucket has completed, the <localid>_<guid> directory is rolled into a warm bucket directory, identified by the rb_ prefix.

Buckets and Splunk Enterprise administration

When you are administering Splunk Enterprise, it helps to understand how the indexer stores indexes across buckets. In particular, several admin activities require a good understanding of buckets:

  • To learn how to back up your data, read Back up indexed data. That topic also discusses how to manually roll hot buckets to warm, so that you can then back them up.

In addition, see indexes.conf in the Admin Manual.

PREVIOUS
Use the monitoring console to view indexing performance
  NEXT
Configure index storage

This documentation applies to the following versions of Splunk® Enterprise: 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6


Was this documentation topic helpful?

Enter your email address, and someone from the documentation team will respond to you:

Please provide your comments here. Ask a question or make a suggestion.

You must be logged into splunk.com in order to post comments. Log in now.

Please try to keep this discussion focused on the content covered in this documentation topic. If you have a more general question about Splunk functionality or are experiencing a difficulty with Splunk, consider posting a question to Splunkbase Answers.

0 out of 1000 Characters