Splunk® Enterprise

Managing Indexers and Clusters of Indexers

Splunk Enterprise version 6.x is no longer supported as of October 23, 2019. See the Splunk Software Support Policy for details. For information about upgrading to a supported version, see How to upgrade Splunk Enterprise.

About archiving indexes with Hadoop Data Roll

As of Splunk Enterprise version 6.5.0, Hadoop Data Roll is included with your Splunk license. Hadoop Data Roll lets you copy warm, cold, and frozen index data to an archive. Archive Splunk indexed data into HDFS or S3 so that you can:

  • Search archived data that is no longer available in Splunk.
  • Search across archived buckets and indexes.
  • Perform batch processing analysis for archived data.
  • Archive indexer data to meet your data retention policies without using valuable indexer space.

Hadoop Data Roll is not supported on Windows.

Setting it up

To configure archiving, you tell Splunk Enterprise:

  • Which indexes to archive.
  • Where to put the archived data in HDFS or S3.
  • At what age buckets should be copied to the archive in HDFS or S3.
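
As a rough illustration of how these three pieces of information fit together, the following Python sketch renders an archive index stanza for indexes.conf. The setting names (vix.output.buckets.from.indexes, vix.output.buckets.older.than, vix.output.buckets.path, vix.provider) are written as recalled from the indexes.conf specification and should be verified against your Splunk Enterprise version. The stanza name, archive path, and provider name are hypothetical placeholders.

```python
SECONDS_PER_DAY = 86400


def archive_stanza(name, indexes, older_than_days, archive_path, provider):
    """Render an indexes.conf stanza for an archive index as text.

    The vix.* setting names are assumptions to verify against the
    indexes.conf spec for your version. All argument values passed in
    the example below are hypothetical.
    """
    # The age threshold is expressed in seconds in the rendered stanza.
    older_than_seconds = int(older_than_days * SECONDS_PER_DAY)
    lines = [
        f"[{name}]",
        # Which indexes to archive (comma-separated list).
        f"vix.output.buckets.from.indexes = {','.join(indexes)}",
        # At what age buckets are copied to the archive.
        f"vix.output.buckets.older.than = {older_than_seconds}",
        # Where to put the archived data in HDFS or S3.
        f"vix.output.buckets.path = {archive_path}",
        # The Hadoop provider stanza that this archive index uses.
        f"vix.provider = {provider}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(archive_stanza("main_archive", ["main"], 7,
                         "/archive/main_archive", "MyHadoopProvider"))
```

The sketch only produces text. Review the output before you add it to indexes.conf.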

There are two ways to configure the above information:

System requirements

Make sure you have access to at least one Hadoop cluster (with data in it) and the ability to run MapReduce jobs on that data.

Make sure you have Java 1.6 or later. For best results, use a version later than 1.6.

Hadoop Data Roll is supported on the following Hadoop distributions and versions:

  • Apache Hadoop
    • 0.20
    • 1.0.2
    • 1.0.3
    • 1.0.4
    • 2.4
    • 2.6
    • 2.7
  • Cloudera Distribution Including Apache Hadoop
    • 4
    • 4.2
    • 4.3.0
    • 4.4 (HA NN and HA JT)
    • 5.0
    • 5.3
    • 5.3 (HA)
    • 5.4
    • 5.5
    • 5.6
  • Hortonworks Data Platform (HDP)
    • 1.3
    • 2.0
    • 2.1
    • 2.2
    • 2.3
    • 2.4
  • MapR
    • 2.1
    • 3.0
    • 5.0
  • Amazon Elastic MapReduce (EMR)
  • IBM InfoSphere BigInsights
    • 5.1
  • Pivotal HD

What you need on your Hadoop nodes

On each Hadoop TaskTracker node, you need a directory on the node's *nix file system that meets the following requirements:

  • One gigabyte of free disk space for a copy of Splunk.
  • 5-10GB of free disk space for temporary storage. This storage is used by the search processes.
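
One way to check a candidate directory against these requirements is a short Python sketch like the following, which you would run on each node. It assumes the upper bound of 10 GB for temporary storage and treats a gigabyte as 1024^3 bytes. Both are conservative assumptions, and the function names are hypothetical.

```python
import shutil

GIB = 1024 ** 3  # assumption: treat "GB" in the requirements as GiB


def required_bytes(splunk_copy_gib=1, temp_gib=10):
    """Total free space needed: the Splunk copy plus temporary storage."""
    return (splunk_copy_gib + temp_gib) * GIB


def has_enough_space(path, splunk_copy_gib=1, temp_gib=10):
    """Return True if the file system that holds `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes(splunk_copy_gib, temp_gib)


if __name__ == "__main__":
    # "." stands in for the directory you plan to use on the node.
    print(has_enough_space("."))
```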

What you need on your Hadoop file system

On your Hadoop file system (HDFS or otherwise) you will need:

  • A subdirectory under jobtracker.staging.root.dir (usually /user/) with the name of the user account under which Splunk Analytics for Hadoop is running on the search head. For example, if Splunk Analytics for Hadoop is started by user "BigDataUser" and jobtracker.staging.root.dir=/user/, you need a directory /user/BigDataUser that is accessible by user "BigDataUser".
  • A subdirectory under the above directory that this server can use for intermediate storage, such as /user/BigDataUser/server01/.
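
The two directories can be derived from the staging root, the user account, and a server name. The Python sketch below builds the corresponding hdfs dfs commands as argument lists without running them, assuming the directory is named after the user account as the first requirement states. The function name and the server name "server01" are illustrative, and your cluster might need different ownership or permission settings.

```python
def staging_dir_commands(staging_root, user, server):
    """Build `hdfs dfs` commands that create the per-user and per-server directories.

    Returns a list of argument lists suitable for subprocess.run().
    Nothing is executed here, so you can review the commands first.
    """
    user_dir = f"{staging_root.rstrip('/')}/{user}"
    server_dir = f"{user_dir}/{server}"
    return [
        # -p also creates the parent user directory if it is missing.
        ["hdfs", "dfs", "-mkdir", "-p", server_dir],
        # Make the user the owner of the directory tree so it is accessible.
        ["hdfs", "dfs", "-chown", "-R", user, user_dir],
    ]


if __name__ == "__main__":
    for cmd in staging_dir_commands("/user/", "BigDataUser", "server01"):
        print(" ".join(cmd))
```

To run the commands, pass each list to subprocess.run(cmd, check=True) as an account that has permission to create directories under the staging root.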

Searching archived indexes

You can search archived buckets the same way that you normally search. Simply include the archive virtual index in your searches. See Search archived index data for information about search commands that work with indexes stored in Hadoop.

For example, you can create a single search that covers both of the following:

  • Data in a Splunk Enterprise index.
  • Archived data copied into HDFS or S3.

Search performance

When you search archives, Splunk Enterprise performs batch searches on the archived data, which are usually much slower than searches of indexed data. Because Splunk deletes cold data based on your indexes.conf settings, archived data might also still be present in Splunk Enterprise indexes. Be familiar with your archive and indexer retention policies and settings so that, when the data you are looking for is still in Splunk, you can run more efficient searches.

To improve search time when searching archives, use dates to limit the buckets that are searched. The storage path of archived indexer data includes the earliest time and the latest time of each bucket. When you search within a certain time range, Splunk uses that information to narrow the search to the relevant buckets, rather than searching through the entire archived index.
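
The idea of narrowing a search by the time range in the path can be sketched in Python as follows. This illustrates the principle only and is not how Splunk implements it. The path layout is an assumption: the sketch expects a path component of the form db_<latest>_<earliest>_<id> with epoch times, and the actual archive layout might differ. All names are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class BucketRange(NamedTuple):
    path: str
    earliest: int  # epoch seconds of the oldest event in the bucket
    latest: int    # epoch seconds of the newest event in the bucket


# Assumed layout: db_<latest>_<earliest>_<id> somewhere in the path.
_BUCKET_RE = re.compile(r"db_(\d+)_(\d+)_\d+")


def parse_bucket(path: str) -> Optional[BucketRange]:
    """Extract the time range from a bucket path, or None if it does not match."""
    match = _BUCKET_RE.search(path)
    if match is None:
        return None
    latest, earliest = int(match.group(1)), int(match.group(2))
    return BucketRange(path, earliest, latest)


def buckets_in_window(paths: List[str], search_earliest: int,
                      search_latest: int) -> List[str]:
    """Keep only the paths whose time range overlaps the search window."""
    selected = []
    for path in paths:
        bucket = parse_bucket(path)
        if bucket is None:
            continue  # skip paths that do not follow the assumed layout
        # Two closed ranges overlap when each starts before the other ends.
        if bucket.earliest <= search_latest and bucket.latest >= search_earliest:
            selected.append(path)
    return selected
```

A narrower search window excludes more buckets, which is why time-bounded searches of archives tend to finish sooner.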

Last modified on 08 February, 2020
This documentation applies to the following versions of Splunk® Enterprise: 6.5.0, 6.5.1, 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.5.10, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 6.6.6, 6.6.7, 6.6.8, 6.6.9, 6.6.10, 6.6.11, 6.6.12, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 8.0.0, 8.0.1
