About archiving indexes with Hadoop Data Roll
Hadoop Data Roll is included with your Splunk license. It lets you copy warm, cold, and frozen index data to an archive. Archive Splunk indexed data into HDFS or S3 so that you can:
- Search archived data that is no longer available in Splunk.
- Search across archived buckets and indexes.
- Perform batch processing analysis for archived data.
- Archive indexer data to meet your data retention policies without using valuable indexer space.
Hadoop Data Roll is not supported on Windows.
Setting it up
To configure archiving, you tell Splunk Enterprise:
- Which indexes to archive.
- Where to put the archived data in HDFS or S3.
- At what age buckets should be copied to the archive in HDFS or S3.
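The three pieces of information above can be pictured as a small settings record plus an age check. The following is only an illustrative sketch: the class, field names, and destination URI are hypothetical and are not Splunk configuration keys, and measuring a bucket's age from its latest event time is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArchiveSettings:
    """Hypothetical container for the three settings listed above."""
    indexes: Tuple[str, ...]   # which indexes to archive
    destination: str           # where archived data goes (HDFS or S3 path)
    older_than_seconds: int    # age at which buckets are copied


def is_due_for_archive(bucket_latest_epoch: int, now_epoch: int,
                       settings: ArchiveSettings) -> bool:
    """Return True when the bucket has reached the configured age.

    Assumption: age is measured from the bucket's latest event time.
    """
    return now_epoch - bucket_latest_epoch >= settings.older_than_seconds


ONE_WEEK = 7 * 24 * 3600

# Example values only; the host name and path are placeholders.
settings = ArchiveSettings(
    indexes=("main",),
    destination="hdfs://namenode.example.com:8020/archive",
    older_than_seconds=ONE_WEEK,
)
```

With these example values, a bucket whose latest event is at least one week old would be treated as ready to copy.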
There are two ways to configure the above information:
Make sure you have access to at least one Hadoop cluster (with data in it) and the ability to run MapReduce jobs on that data.
Make sure you have Java 1.6 or later. For best results, use a version later than 1.6.
Hadoop Data Roll is supported on the following Hadoop distributions and versions:
- Apache Hadoop 3.2.1
- Open Apache 3.1.2
- Cloudera Distribution including Apache Hadoop v6.3
- Hortonworks Data Platform (HDP) 3.1.4
- MapR 6.1
What you need on your Hadoop nodes
On Hadoop TaskTracker nodes, you need a directory on the *nix file system of each node that meets the following requirements:
- One gigabyte of free disk space for a copy of Splunk.
- 5-10GB of free disk space for temporary storage. This storage is used by the search processes.
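As a rough way to check the disk space requirements above on a node, a sketch using Python's standard library might look like the following. The function names are illustrative, and summing the one gigabyte for Splunk with the upper end of the 5-10GB range is an assumption made to stay on the safe side.

```python
import shutil

GIB = 1024 ** 3
SPLUNK_COPY_BYTES = 1 * GIB     # space for a copy of Splunk
TEMP_STORAGE_BYTES = 10 * GIB   # upper end of the 5-10GB temporary storage range


def has_enough_space(free_bytes: int,
                     required_bytes: int = SPLUNK_COPY_BYTES + TEMP_STORAGE_BYTES) -> bool:
    """Return True if the free space covers the combined requirement."""
    return free_bytes >= required_bytes


def check_directory(path: str) -> bool:
    """Check the file system that holds `path` for sufficient free space."""
    return has_enough_space(shutil.disk_usage(path).free)
```

Calling `check_directory` with the directory you plan to use on each node returns `True` when at least 11 GiB is free under these assumptions.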
What you need on your Hadoop file system
On your Hadoop file system (HDFS or otherwise) you will need:
- A subdirectory under jobtracker.staging.root.dir (usually /user/) with the name of the user account under which Splunk Analytics for Hadoop is running on the search head. For example, if Splunk Analytics for Hadoop is started by user "BigDataUser" and jobtracker.staging.root.dir=/user/, you need a directory /user/HadoopAnalytics that is accessible by user "BigDataUser".
- A subdirectory under the above directory that can be used by this server for intermediate storage, such as
Searching archived indexes
You can search archived buckets as you normally search. Simply include the archive virtual index in your searches. See Search archived index data for information about search commands that work with indexes stored in Hadoop.
For example, you can create one search that covers both:
- Data in a Splunk Enterprise index.
- Archived data copied into HDFS or S3.
When you search archives, Splunk Enterprise performs batch searches on archived data, which is usually much slower than searches of indexed data. Because Splunk deletes cold data based on your indexes.conf settings, archived data could also still be present in Splunk Enterprise indexes. It is important to be familiar with your archive and Splunk indexer retention policies and settings so that, if you are looking for specific data that is still in Splunk, you can run more efficient searches.
To improve search time when searching archives, you can use a time range to limit the buckets that are searched. The storage path in Splunk Indexer data includes the earliest time and the latest time of the buckets. When you search within a certain time range, Splunk can use that information to narrow the search to relevant buckets, rather than searching through the entire archived index.
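The time-based narrowing described above amounts to an interval overlap test. The following sketch illustrates the idea only and is not Splunk's implementation: the `ArchivedBucket` class, the example paths, and the epoch values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ArchivedBucket:
    """Hypothetical record of an archived bucket and its time span (epoch seconds)."""
    path: str
    earliest: int
    latest: int


def buckets_in_range(buckets: List[ArchivedBucket],
                     search_earliest: int,
                     search_latest: int) -> List[ArchivedBucket]:
    """Keep only buckets whose time span overlaps the search time range."""
    return [
        b for b in buckets
        if b.earliest <= search_latest and b.latest >= search_earliest
    ]


# Example data; paths and times are placeholders.
buckets = [
    ArchivedBucket("archive/main/bucket_a", 1000, 1999),
    ArchivedBucket("archive/main/bucket_b", 2000, 2999),
    ArchivedBucket("archive/main/bucket_c", 3000, 3999),
]

# A search from 2500 to 3200 overlaps bucket_b and bucket_c only.
selected = buckets_in_range(buckets, 2500, 3200)
```

A narrower time range leaves fewer buckets to read, which is why bounding a search by time helps most with slow batch searches on archives.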
This documentation applies to the following versions of Splunk® Enterprise: 7.3.4, 7.3.5, 8.0.2, 8.0.3, 8.0.4