Back up indexed data
To decide how to back up indexed data, it helps to understand first how Splunk stores data and how the data ages once it's in Splunk. Then you can decide on a backup strategy.
Before you read this topic, you should look at "How Splunk stores indexes" to get familiar with the structure of indexes and the options for configuring them. If you want to jump right in, the next section summarizes the key points from that topic.
How data ages
Indexed data resides in database directories consisting of subdirectories called buckets. Each index has its own set of databases.
As data ages, it moves through several types of buckets. You determine how the data ages by configuring attributes in indexes.conf. Read "Configure index storage" for a description of the settings in indexes.conf that control how data ages.
Briefly, here's a somewhat simplified version of how data ages in a Splunk index:
1. When Splunk first indexes data, it goes into a "hot" bucket. Depending on your configuration, there can be several hot buckets open at one time. Hot buckets cannot be backed up because Splunk is actively writing to them, but you can take a snapshot of them.
2. The data remains in the hot bucket until the policy conditions are met for it to be reclassified as "warm" data. This is called "rolling" the data into the warm bucket. This happens when a hot bucket reaches a specified size or age, or whenever splunkd gets restarted. When a hot bucket is rolled, its directory is renamed, and it becomes a warm bucket. (You can also manually roll a bucket from hot to warm, as described below.) It is safe to back up the warm buckets.
3. When the index reaches one of several possible configurable limits, usually a specified number of warm buckets, the oldest bucket becomes a "cold" bucket. Splunk moves the bucket to the colddb directory. The default number of warm buckets is 300.
4. Finally, at a time based on your defined policy requirements, the bucket rolls from cold to "frozen". Splunk deletes frozen buckets. However, if you need to preserve the data, you can tell Splunk to archive the data before deleting the bucket. See "Archive indexed data" for more information.
You can set retirement and archiving policy by controlling several different parameters, such as the size of indexes or buckets or the age of the data.
- hot buckets - Currently being written to; do not back these up.
- warm buckets - Rolled from hot; can be safely backed up.
- cold buckets - Rolled from warm; buckets are moved to another location.
- frozen buckets - Splunk deletes these, but you can archive their contents first.
You set the locations of index databases in indexes.conf. (See below for detailed information on the database locations for the default index.) You also specify numerous other attributes there, such as the maximum size and age of hot buckets.
Locations of the index database directories
Here's the directory structure for the default index (
|Bucket type||Default location||Notes|
|Hot|| ||There can be multiple hot subdirectories. Each hot bucket occupies its own subdirectory, which uses this naming convention:|
|Warm|| ||There are separate subdirectories for each warm bucket. These are named as described below in "Warm/cold bucket naming convention".|
|Cold|| ||There are multiple cold subdirectories. When warm buckets roll to cold, they get moved into this directory, but are not renamed. Note: In a cluster, all replicated copies of buckets reside in the|
|Frozen||N/A: Frozen data gets deleted or archived into a directory location you specify.||Deletion is the default; see "Archive indexed data" for information on how to archive the data instead.|
|Thawed|| ||Location for data that has been archived and later thawed. See "Restore archived data" for information on restoring archived data to a thawed state.|
The paths for hot/warm and cold directories are configurable, so you can store cold buckets in a separate location from hot/warm buckets. See "Configure index storage" and "Use multiple partitions for index data".
Important: All index locations must be writable.
Choose your backup strategy
There are two basic backup scenarios to consider:
- Ongoing, incremental backups of warm data
- Backup of all data - for example, before upgrading Splunk
How you actually perform the backup will, of course, depend entirely on the tools and procedures in place at your organization, but this section should give you the guidelines you need to proceed.
The general recommendation is to schedule backups of any new warm buckets regularly, using the incremental backup utility of your choice. If you're rolling buckets frequently, you should also include the cold database directory in your backups, to ensure that you don't miss any buckets that have rolled to cold before they've been backed up. Since bucket directory names don't change when they roll from warm to cold, you can just filter by name.
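The name-based filtering mentioned above can be sketched in Python as follows. This is only an illustration, not a Splunk-provided tool: it assumes that warm and cold bucket directories are named `db_<newest_time>_<oldest_time>_<id>` (optionally followed by a GUID on cluster peers) and that hot bucket directories start with `hot_`. The function name `new_buckets` and the `already_backed_up` set are hypothetical.

```python
import os
import re

# Assumed warm/cold bucket directory name: db_<newest_time>_<oldest_time>_<id>,
# optionally followed by _<guid> on cluster peers. Verify against your own index.
BUCKET_NAME = re.compile(r"^db_\d+_\d+_\d+(_[0-9A-Fa-f-]+)?$")


def new_buckets(db_dirs, already_backed_up):
    """Return {bucket_name: path} for warm/cold buckets not yet backed up.

    db_dirs: directories to scan (for example, the warm and cold locations).
    already_backed_up: set of bucket directory names saved by earlier runs.
    """
    found = {}
    for db_dir in db_dirs:
        for entry in sorted(os.listdir(db_dir)):
            path = os.path.join(db_dir, entry)
            # Hot buckets (hot_*) and unrelated entries fail the pattern and are skipped.
            if not os.path.isdir(path) or not BUCKET_NAME.match(entry):
                continue
            # The name does not change on the warm-to-cold roll, so the name alone
            # tells us whether this bucket was already handled.
            if entry in already_backed_up or entry in found:
                continue
            found[entry] = path
    return found
```

A scheduled job could pass the returned paths to the backup utility and then add the returned names to its record of completed buckets.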
To back up hot buckets as well, you need to take a snapshot of the files, using a tool like VSS (on Windows/NTFS), ZFS snapshots (on ZFS), or a snapshot facility provided by the storage subsystem. If you do not have a snapshot tool available, you can manually roll a hot bucket to warm and then back it up, as described below. However, this is not generally recommended, for reasons also discussed below.
Back up all data
It is recommended that you back up all your data before upgrading Splunk. This means the hot, warm, and cold buckets.
There are obviously a number of ways to do this, depending on the size of your data and how much downtime you can afford. Here are some basic guidelines:
- For smaller amounts of data, shut down Splunk and just make a copy of your database directories before performing the upgrade.
- For larger amounts of data, you will probably instead want to snapshot your hot buckets prior to upgrade.
In any case, if you have been doing incremental backups of your warm buckets as they've rolled from hot, you should need to back up only your hot buckets at this time.
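For the shut-down-and-copy case, a minimal Python sketch might look like the following. It assumes Splunk has already been stopped; the function name `copy_index_dirs`, the `label` argument, and the target layout are hypothetical choices for illustration, and any copy tool your organization already uses would serve equally well.

```python
import shutil
from pathlib import Path


def copy_index_dirs(index_dirs, backup_root, label):
    """Copy each index directory tree into backup_root/label/<index directory name>.

    Assumes Splunk has been shut down, so no bucket is being written to.
    """
    target_root = Path(backup_root) / label
    copied = []
    for index_dir in map(Path, index_dirs):
        target = target_root / index_dir.name
        # copytree creates missing parent directories and raises FileExistsError
        # if the target already exists, so an earlier backup is never overwritten.
        shutil.copytree(index_dir, target)
        copied.append(target)
    return copied
```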
Rolling buckets manually from hot to warm
To roll the buckets of an index manually from hot to warm, use the following CLI command, replacing <index_name> with the name of the index you want to roll:
splunk _internal call /data/indexes/<index_name>/roll-hot-buckets -auth <admin_username>:<admin_password>
Important: It is ordinarily not advisable to roll hot buckets manually, as each forced roll permanently decreases search performance over the data. As a general rule, larger buckets are more efficient to search. By prematurely rolling buckets, you're producing smaller, less efficient buckets. In cases where hot data needs to be backed up, a snapshot backup is the preferred method.
Recommendations for recovery
If you experience a non-catastrophic disk failure (for example, you still have some of your data, but Splunk won't run), Splunk recommends that you move the index directory aside and restore from a backup rather than restoring on top of a partially corrupted datastore. Splunk will automatically create hot directories on startup as necessary and resume indexing. Monitored files and directories will pick up where they were at the time of the backup.
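The "move the index directory aside" step could be scripted along these lines. This is a hedged sketch rather than an official procedure: it assumes Splunk is not running, and the function name `move_index_aside` and the `damaged-<timestamp>` suffix are hypothetical.

```python
import os
import time


def move_index_aside(index_dir, suffix=None):
    """Rename index_dir to index_dir.<suffix> and return the new path.

    Assumes Splunk is not running. The original path is left free so that
    a backup can be restored into it.
    """
    if suffix is None:
        suffix = time.strftime("damaged-%Y%m%d-%H%M%S")
    index_dir = os.path.normpath(index_dir)
    aside = f"{index_dir}.{suffix}"
    # Refuse to clobber a directory that was moved aside earlier.
    if os.path.exists(aside):
        raise FileExistsError(aside)
    os.rename(index_dir, aside)
    return aside
```

Keeping the damaged copy under a new name, instead of deleting it, leaves the partially corrupted data available for later inspection.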
Index backup strategy
For an end-to-end procedure that ensures all data in your index gets backed up on a daily basis, read this Splunk blog: Index backup strategy.
Clustered data backups
Even though a cluster already contains redundant copies of data, you might also want to back up the cluster data to another location; for example, to keep a copy of the data offsite as part of an overall disaster recovery plan.
The simplest way to do this is to back up the data on each individual peer node on your cluster, in the same way that you back up data on individual, non-clustered indexers, as described earlier in this topic. However, this approach will result in backups of duplicate data. For example, if you have a cluster with a replication factor of 3, the cluster is storing three copies of all the data across its set of peer nodes. If you then back up the data residing on each individual node, you end up with backups containing, in total, three copies of the data. You cannot solve this problem by backing up just the data on a single node, since there's no certainty that a single node contains all the data in the cluster.
The solution to this would be to identify exactly one copy of each bucket on the cluster and then back up just those copies. However, in practice, it is quite a complex matter to do that. One approach is to create a script that goes through each peer's index storage and uses the bucket ID value contained in the bucket name to identify exactly one copy of each bucket. The bucket ID is the same for all copies of a bucket. For information on the bucket ID, read "Warm/cold bucket naming convention". Another thing to consider when designing a cluster backup script is whether you want to back up just the bucket's rawdata or both its rawdata and index files. If the latter, the script must also identify a searchable copy of each bucket.
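The selection step of such a script might be sketched as below. It assumes that a listing of bucket directory names has already been collected from every peer, and that clustered bucket directories are named `<db|rb>_<newest_time>_<oldest_time>_<local_id>_<guid>`, with the index name, local ID, and GUID together identifying a bucket. Both the pattern and the function name `pick_one_copy` are assumptions of this sketch. It does not check whether the chosen copy is searchable, so it fits the rawdata-only case.

```python
import re

# Assumed clustered bucket directory name:
#   <db|rb>_<newest_time>_<oldest_time>_<local_id>_<guid>
# Verify this pattern against the bucket names on your own peers.
CLUSTER_BUCKET = re.compile(
    r"^(?:db|rb)_\d+_\d+_(?P<local_id>\d+)_(?P<guid>[0-9A-Fa-f-]+)$"
)


def pick_one_copy(listings):
    """Choose exactly one copy of each bucket.

    listings: iterable of (peer, index_name, bucket_dir_name) tuples gathered
    from every peer's index storage.
    Returns {bucket_id: (peer, bucket_dir_name)}, keeping the first copy seen.
    """
    chosen = {}
    for peer, index_name, name in listings:
        match = CLUSTER_BUCKET.match(name)
        if match is None:
            continue  # hot buckets and unrelated entries are ignored
        # The ID portion is the same for all copies of a bucket.
        bucket_id = (index_name, match["local_id"], match["guid"])
        chosen.setdefault(bucket_id, (peer, name))
    return chosen
```

The returned mapping tells the backup job which peer to read each bucket from, so each bucket is copied once regardless of the replication factor.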
Because of the complications of cluster backup, it is recommended that you contact Splunk Professional Services for guidance in backing up single copies of clustered data. They can help design a solution customized to the needs of your environment.
This documentation applies to the following versions of Splunk® Enterprise: 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6, 5.0.7, 5.0.8, 5.0.9, 5.0.10, 5.0.11, 5.0.12, 5.0.13, 5.0.14, 5.0.15, 5.0.16, 5.0.17, 5.0.18