About archiving indexes with Hadoop Data Roll

As of Splunk Enterprise version 6.5.0, Hadoop Data Roll is included with your Splunk license. Hadoop Data Roll provides a user-friendly way for you to copy warm, cold, and frozen index data as archived data. Archive Splunk indexed data into HDFS or S3 so that you can:

Search archived data that is no longer available in Splunk.
Search across archived buckets and indexes.
Perform batch processing analysis for archived data.
Archive indexer data to meet your data retention policies without using valuable indexer space.

Hadoop Data Roll is not supported on Windows.

Setting it up

To configure archiving, you tell Splunk Enterprise:

Which indexes to archive.
Where to put the archived data in HDFS or S3.
At what age buckets should be copied to the archive in HDFS or S3.
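
The three decisions above can be pictured as a small data structure plus an age check. The following Python sketch is illustrative only: the class name, field names, destination path, and five-day threshold are all hypothetical and are not Splunk settings, and Splunk Enterprise may measure bucket age differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveSettings:
    """Hypothetical container for the three archiving decisions."""
    indexes: tuple          # which indexes to archive
    archive_path: str       # where the archived data goes (HDFS or S3)
    min_age_seconds: int    # buckets at least this old are copied


def should_archive(settings, index_name, bucket_latest_time, now):
    """Return True if the bucket belongs to an archived index and is old enough."""
    if index_name not in settings.indexes:
        return False
    return now - bucket_latest_time >= settings.min_age_seconds


# Example values only; the host name and path are placeholders.
settings = ArchiveSettings(
    indexes=("main",),
    archive_path="hdfs://namenode.example.com:8020/archive/main",
    min_age_seconds=5 * 24 * 3600,
)
```

With these example values, a bucket in the main index whose newest event is six days old qualifies for archiving, while a bucket that is one hour old does not.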

There are two ways to configure the above information:

System requirements

Make sure you have access to at least one Hadoop cluster (with data in it) and the ability to run MapReduce jobs on that data.

Make sure you have Java 1.6 or later. For best results, use a version later than 1.6.
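
One way to check the Java requirement is to parse the version string that `java -version` prints. The sketch below is a minimal Python example under that assumption; the function names are hypothetical, and it only handles version strings of the form `version "1.8.0_292"` or `version "11.0.2"`.

```python
import re


def parse_java_version(version_output):
    """Extract (major, minor) from text such as: java version "1.8.0_292"."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def meets_minimum(version_output, minimum=(1, 6)):
    """Compare the parsed version against the minimum as a tuple."""
    return parse_java_version(version_output) >= minimum
```

Note that `java -version` writes to standard error, so a caller that runs the command with `subprocess.run(["java", "-version"], capture_output=True, text=True)` would pass the `stderr` attribute of the result to `meets_minimum`.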

Hadoop Data Roll is supported on the following Hadoop distributions and versions:

Apache Hadoop
- 0.20
- 1.0.2
- 1.0.3
- 1.0.4
- 2.4
- 2.6
- 2.7
Cloudera Distribution Including Apache Hadoop
- 4
- 4.2
- 4.3.0
- 4.4 (HA NN and HA JT)
- 5.0
- 5.3
- 5.3 (HA)
- 5.4
- 5.5
- 5.6
Hortonworks Data Platform (HDP)
- 1.3
- 2.0
- 2.1
- 2.2
- 2.3
- 2.4
MapR
- 2.1
- 3.0
- 5.0
Amazon Elastic MapReduce (EMR)
IBM InfoSphere BigInsights
- 5.1
Pivotal HD

What you need on your Hadoop nodes

On each Hadoop TaskTracker node, you need a directory on the node's *nix file system that meets the following requirements:

One gigabyte of free disk space for a copy of Splunk.
5-10GB of free disk space for temporary storage. This storage is used by the search processes.
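
To verify a candidate directory against these requirements, you could compare its free space with the combined total. This Python sketch uses the standard library's `shutil.disk_usage`; the function names are hypothetical, and it assumes the upper figure of 10 GB for temporary storage and treats a gigabyte as 1024^3 bytes.

```python
import shutil

GIB = 1024 ** 3


def required_bytes(splunk_copy_gib=1, temp_gib=10):
    """Space for the Splunk copy plus temporary search storage, in bytes."""
    return (splunk_copy_gib + temp_gib) * GIB


def has_required_space(path, splunk_copy_gib=1, temp_gib=10):
    """Return True if the file system holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes(splunk_copy_gib, temp_gib)
```

Run the check on each TaskTracker node, passing the directory you intend to use.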

What you need on your Hadoop file system

On your Hadoop file system (HDFS or otherwise) you will need:

A subdirectory under jobtracker.staging.root.dir (usually /user/) with the name of the user account under which Splunk Analytics for Hadoop is running on the search head. For example, if Splunk Analytics for Hadoop is started by user "BigDataUser" and jobtracker.staging.root.dir=/user/, you need a directory /user/BigDataUser that is accessible by user "BigDataUser".

A subdirectory under the above directory that this server can use for intermediate storage, such as /user/BigDataUser/server01/.
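
The two directories follow mechanically from the staging root, the user account name, and a per-server name. The Python sketch below shows that derivation; the function name and the example user `splunkuser` and server `server01` are hypothetical placeholders.

```python
import posixpath


def staging_dirs(staging_root, user, server):
    """Build the per-user directory and the per-server directory beneath it."""
    user_dir = posixpath.join(staging_root, user)
    server_dir = posixpath.join(user_dir, server)
    return user_dir, server_dir


user_dir, server_dir = staging_dirs("/user/", "splunkuser", "server01")
```

`posixpath` is used instead of `os.path` so that the paths always use forward slashes, as HDFS paths do, regardless of the platform the script runs on.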

Searching archived indexes

You can search archived buckets as you normally search. Simply include the archive virtual index in your searches. See Search archived index data for information about search commands that work with indexes stored in Hadoop.

For example, you can create a single search that covers both:

Data in a Splunk Enterprise index.
Archived data copied into HDFS or S3.
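
As an illustration, a search that covers both locations names the regular index and the archive virtual index together. The Python sketch below only assembles such a search string; the helper name and the index names `main` and `main_archive` are hypothetical, and the archive virtual index has whatever name you gave it when you configured archiving.

```python
def combined_search(index, archive_index, terms=""):
    """Build a search string that covers an index and its archive virtual index."""
    clause = f"(index={index} OR index={archive_index})"
    return f"{clause} {terms}".strip()


query = combined_search("main", "main_archive", "error")
```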

Search performance

When you search archives, Splunk Enterprise performs batch searches on archived data, which are usually much slower than searches of indexed data. Because Splunk Enterprise deletes cold data based on your indexes.conf settings, archived data might also still be present in Splunk Enterprise indexes. Be familiar with your archive and Splunk indexer retention policies and settings so that, if you are looking for specific data that is still in Splunk Enterprise, you can run more efficient searches.

To improve search time when searching archives, use a time range to limit the buckets that are searched. The storage path for archived indexer data includes the earliest time and the latest time of each bucket. When you search within a certain time range, Splunk Enterprise uses that information to narrow the search to the relevant buckets, rather than searching through the entire archived index.
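
The narrowing step amounts to an interval overlap test between each bucket's time span and the search window. The Python sketch below illustrates the idea only; it is not how Splunk Enterprise implements it, and the `Bucket` type, bucket names, and epoch values are hypothetical.

```python
from typing import NamedTuple


class Bucket(NamedTuple):
    name: str
    earliest: int  # epoch seconds of the oldest event in the bucket
    latest: int    # epoch seconds of the newest event in the bucket


def buckets_in_range(buckets, search_earliest, search_latest):
    """Keep buckets whose [earliest, latest] span overlaps the search window."""
    return [
        b for b in buckets
        if b.earliest <= search_latest and b.latest >= search_earliest
    ]


buckets = [
    Bucket("bucket_a", 100, 200),
    Bucket("bucket_b", 150, 300),
    Bucket("bucket_c", 400, 500),
]
relevant = buckets_in_range(buckets, 250, 350)
```

Here only `bucket_b` overlaps the window from 250 to 350, so the other two buckets would never need to be read.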
