Archive Splunk indexes to Hadoop on S3
For best performance and to avoid bucket size limits, use the S3A filesystem, which was introduced in Apache Hadoop 2.6.0. Configuration may differ between Hadoop distributions.
Configure archiving to Amazon S3
To enable Splunk Enterprise to archive Splunk data to S3:
1. Add the following configuration to your providers:
vix.env.HADOOP_HOME: /absolute/path/to/apache/hadoop-2.6.0
vix.fs.s3a.access.key: <AWS access key>
vix.fs.s3a.secret.key: <AWS secret key>
vix.env.HADOOP_TOOLS: $HADOOP_HOME/share/hadoop/tools/lib
vix.splunk.jars: $HADOOP_TOOLS/hadoop-aws-2.6.0.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
2. Using the above configuration, create an archive index with a path prefixed with s3a://, for example s3a://bucket/path/to/archive.
In this example,
- s3a is the implementation Hadoop will use to transfer and read files from the supplied path
- bucket is the name of your S3 bucket
- /path/to/archive are directories within the bucket
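As an illustration, the two steps above might look like this in indexes.conf. This is a sketch under stated assumptions, not a definitive configuration: the stanza names s3_archive_provider and main_archive, the source index main, and the five-day age threshold are hypothetical values chosen for the example. Other provider settings that your environment may require (for example, the Java home) are omitted.

```ini
# Hypothetical provider stanza; the name "s3_archive_provider" is an assumption.
[provider:s3_archive_provider]
vix.family = hadoop
vix.env.HADOOP_HOME = /absolute/path/to/apache/hadoop-2.6.0
vix.fs.s3a.access.key = <AWS access key>
vix.fs.s3a.secret.key = <AWS secret key>
vix.env.HADOOP_TOOLS = $HADOOP_HOME/share/hadoop/tools/lib
vix.splunk.jars = $HADOOP_TOOLS/hadoop-aws-2.6.0.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar

# Hypothetical archive index that copies buckets from the "main" index to S3.
[main_archive]
vix.output.buckets.from.indexes = main
# Example value only: archive buckets older than 5 days (432000 seconds).
vix.output.buckets.older.than = 432000
# The s3a:// prefix selects the S3A filesystem implementation.
vix.output.buckets.path = s3a://bucket/path/to/archive
vix.provider = s3_archive_provider
```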
Further configuration for unique setups
You may need to further configure Splunk Enterprise to search S3 archives depending on the specifics of your configuration.
If you are using a search head exclusively
If you're just using a search head to search your archives, set the provider's vix.mode attribute to stream:
vix.mode = stream
When vix.mode is set to stream, Splunk Enterprise streams all the data the search matches to the search head and does not spawn MapReduce jobs on Hadoop.
If you have configured a search head with a Hadoop cluster
If the Hadoop version that the search head uses for archive indexes is compatible with your Hadoop cluster, no additional configuration is necessary to search your archive indexes. Just go to the Splunk Web search bar and run a search against the archive index.
The search head will spawn Hadoop MapReduce jobs against your archive when it's appropriate to do so.
If your Hadoop cluster version is not compatible with your Hadoop Home version
You can still use Data Roll if your Hadoop cluster is not compatible with your Hadoop client libraries (the ones that include the S3A filesystem). For example, you might use Apache Hadoop 2.6.0 for archiving but Hadoop 1.2.0 for your Hadoop cluster. In this case, use the older S3N filesystem to search your archives.
To configure S3N and an older Hadoop cluster to search your archives:
1. Configure a Provider for your Hadoop cluster.
2. For every archive index, edit indexes.conf from your terminal and add a new virtual index with these properties:
[<virtual_index_name_of_your_choosing>]
vix.output.buckets.path = <archive_index_destination_path_with_s3n_instead_of_s3a>
vix.provider = <hadoop_cluster_provider>
3. Make sure the vix.output.buckets.path uses S3N so that Splunk Enterprise can search your archives using the older filesystem.
For example, given an archive index named "main_archive", destination path "s3a://my-bucket/archive_root/main_archive", and provider "hadoop_cluster", configure a virtual index like this:
vix.output.buckets.path = s3n://my-bucket/archive_root/main_archive
vix.provider = hadoop_cluster
Known Issues with S3
When using Hadoop's S3N filesystem, you can only upload files smaller than 5 GB. While it's rare, Splunk buckets can grow larger than 5 GB. Such buckets are not archived when you use the S3N filesystem.
If you use the S3N filesystem, configure your indexes to roll buckets from hot to warm at a size less than 5 GB via the maxDataSize attribute in indexes.conf.
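A minimal sketch of such a setting in indexes.conf is shown here. The index name main and the value 4000 are assumptions for illustration; maxDataSize takes a size in MB, so any value comfortably below 5 GB serves the same purpose.

```ini
# Hypothetical example: cap hot buckets of the "main" index below 5 GB.
[main]
# maxDataSize is expressed in MB; 4000 MB is an illustrative value under the 5 GB limit.
maxDataSize = 4000
```

Note that a smaller maxDataSize produces more buckets per index, so pick a value that balances the archive limit against bucket count.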
Data Roll archiving requires, at a minimum, read-after-write consistency. For the US Standard region, S3 only guarantees this when accessed via the Northern Virginia endpoint. See the Amazon AWS S3 FAQ for more details.
For more information about archiving with S3A, see the Splunk blog post about faster and limitless archiving with S3A: http://blogs.splunk.com/2015/02/11/faster-and-limitless-hunk-archiving-to-s3-with-hadoop-2-6-0/
Bucket raw data limit
Because of how Hadoop interacts with the S3 file system, Splunk Enterprise cannot currently archive buckets with raw data sets larger than 5 GB to S3.
We recommend that you use an S3FileSystem implementation that supports uploads larger than 5 GB. To ensure that all your data is archived, configure your indexes to roll buckets from hot to warm at a size less than 5 GB via the maxDataSize attribute in indexes.conf.
Data copy process
When archiving to S3, data is copied twice. This is because S3 does not support file renaming, so the FileSystem implements a file rename as follows:
- Download the file
- Upload it renamed
- Delete the original file
This process does not create duplicate data in your archive.
Bandwidth throttling limitations
Splunk Enterprise cannot guarantee that bandwidth throttling will be respected when archiving to S3. Splunk will still attempt to throttle bandwidth where possible, if configured to do so.
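For reference, a throttling configuration might look like the sketch below. It assumes the provider-level setting vix.output.buckets.max.network.bandwidth (in bits per second) described in the indexes.conf specification; the provider name and the 100 Mbit/s value are hypothetical. Verify the setting against the indexes.conf.spec file for your version before relying on it.

```ini
# Hypothetical provider stanza; the name is an assumption.
[provider:s3_archive_provider]
# Assumed setting: limit archiving traffic to roughly 100 Mbit/s (value in bits per second).
# With S3 as the destination, this limit is attempted but not guaranteed.
vix.output.buckets.max.network.bandwidth = 100000000
```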
This documentation applies to the following versions of Splunk® Enterprise: 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 6.6.6, 6.6.7, 6.6.8, 6.6.9, 6.6.10, 6.6.11, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.2.0, 7.2.1