How Hadoop Data Roll works
Beginning with version 8.3.0, Hadoop Data Roll works with buckets with journalCompression set to zstd.
After you configure an index as an Archive, a number of processes work to move aged data into archived indexes:
1. A saved search, | archivebuckets, automatically runs once an hour on the search head. This is a custom command packaged with the archiver which is implemented as the Python script
2. archivebuckets queries the local REST endpoints to discover which indexes should be archived, and where to archive the indexes.
3. archivebuckets copies the Hadoop Data Roll jars into its own app directory, then launches distributed searches for each provider for the indexes to be archived. The search used in this step is | copybuckets, which is a custom command automatically implemented by
4. The information for the index and its provider is fed to the search.
5. For each indexer, Splunk Enterprise copies the knowledge bundle needed to run the search.
6. On the indexer, copybuckets launches a Java process, with the same entry point (the SplunkMR class) used for Splunk Analytics for Hadoop searches.
7. Splunk Enterprise passes the info about providers and indexes to the Java process using
8. When the Java process sends events back to Splunk Enterprise, it writes them to stdout, and the custom search command (copybuckets.py) writes them back to the search process using
9. The Java process logs these actions to the
10. The Java process checks all buckets in the designated indexes. If the buckets are ready to be archived, the process determines whether the buckets already exist in the archive. It accesses the archive using the provider information.
11. If the bucket has not yet been archived, the bucket is copied to a temporary directory at the archive. Once the bucket is completely copied and a receipt file has been added, it is moved to the correct folder in the archive.
12. If the bucket was previously archived, any new data that has reached its archived date is copied into that bucket.
13. Archived buckets are ready to be searched in Splunk Web.
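The copy-then-move behavior in steps 10 and 11 can be sketched as follows. This is a minimal illustration only, not the Hadoop Data Roll implementation: it uses a local directory as a stand-in for the archive, and the names archive_bucket, RECEIPT_NAME, and the _tmp directory are hypothetical. It does not cover step 12 (adding new data to a previously archived bucket).

```python
import shutil
from pathlib import Path

RECEIPT_NAME = "receipt.json"  # hypothetical receipt file name


def archive_bucket(bucket_dir: Path, archive_root: Path) -> bool:
    """Copy a bucket into the archive. Return False if it is already there."""
    final_dir = archive_root / bucket_dir.name
    if final_dir.exists():
        return False  # the bucket already exists in the archive

    # Copy into a temporary area first so that a partially copied bucket
    # is never visible at its final location.
    tmp_root = archive_root / "_tmp"  # hypothetical temporary directory
    tmp_root.mkdir(parents=True, exist_ok=True)
    tmp_dir = tmp_root / bucket_dir.name
    if tmp_dir.exists():
        shutil.rmtree(tmp_dir)  # discard leftovers from an interrupted attempt
    shutil.copytree(bucket_dir, tmp_dir)

    # The receipt marks the copy as complete before the bucket is moved.
    (tmp_dir / RECEIPT_NAME).write_text('{"status": "complete"}')
    tmp_dir.rename(final_dir)
    return True
```

Calling archive_bucket a second time for the same bucket returns False, which mirrors the check for whether the bucket already exists in the archive. Moving the directory only after the copy and receipt are complete is what keeps half-copied buckets out of the searchable archive folder.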
About the Hadoop Data Roll processes
Two search commands are defined to correspond with Python scripts:
archivebuckets -> archivebuckets.py
copybuckets -> copybuckets.py
The implementation of the Hadoop Data Roll process uses the following processes:
|process action||process name||notes|
|search process on the Search Head/Search Scheduler||archivebuckets||This is the search activity, including scheduling, that occurs on the Search Head|
|Python process on the Search Head||archivebuckets.py|
|Search process on an indexer||copybuckets <JSON describing indexes>||This is all of the search activity that occurs on the indexer.|
|Python process on an indexer||copybuckets.py|
|Java Virtual Machine process on an indexer||Hunk Java code||This process ties the other processes together, and does the following:
1. Writes files to HDFS
2. Logs information to
3. Writes events to
4. Information written in the Splunk Search process becomes an event returned by the search, and you can see these events in Splunk Enterprise with the search command
How the processes work together
The Hadoop Data Roll search framework strings these processes together and pipes them as follows:
1. The copybuckets.py script pipes stdout of the Hadoop Data Roll Java code (Java Virtual Machine process on an indexer) to stdout of the Python process on an indexer.
2. stdout of the Python process on an indexer is piped to stdin for the search process on the indexer, | copybuckets <JSON describing indexes>.
3. The search process on the indexer, | copybuckets <JSON describing indexes>, pipes its results to stdin of the Python process on the Search Head.
4. The Python process on the Search Head pipes stdout of archivebuckets.py into the search scheduler for the search process on the Search Head.
At the end of this process, anything that the Hadoop Data Roll Java code (JVM process on an indexer) writes to stdout becomes events returned from the search scheduler.
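The relay role that a wrapper script plays in this chain can be illustrated with a short sketch. It is not the copybuckets.py source: the function name relay_child_stdout is hypothetical, and a small Python child process stands in for the Java process. The sketch only shows the general pattern of launching a child and forwarding everything it writes to stdout.

```python
import subprocess
import sys


def relay_child_stdout(cmd, out=sys.stdout):
    """Run cmd and forward each line of its stdout to out. Return the exit code."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as child:
        for line in child.stdout:
            out.write(line)
            out.flush()  # pass each event downstream as soon as it arrives
    # Leaving the with block waits for the child, so returncode is set here.
    return child.returncode


if __name__ == "__main__":
    # Stand-in child: any program that prints lines to stdout would work.
    stand_in = [sys.executable, "-c", "print('event 1'); print('event 2')"]
    relay_child_stdout(stand_in)
```

Because each stage only reads from its upstream pipe and writes to its downstream pipe, lines printed by the innermost process surface unchanged at the outermost one.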
Finalizing or aborting archiving process
When Hadoop Data Roll pauses or finalizes a search, this information must be passed to downstream processes.
For example, if the search process on an indexer shuts down, the search could kill the child process, which then prevents the indexer Python process from shutting down gracefully. If the Python process is using a shared resource such as a database connection, or an output stream to HDFS, this could cause failure and possible loss of data.
To resolve this, the search process lets the child process decide what to do should the search process suspend or shut down. If the search process on an indexer is paused, it stops reading from its pipe to the Python process on an indexer. When this happens, the Python process on the indexer can no longer write to the pipe once the buffer fills up.
The Python process is able to determine that the search process on an indexer still exists, but is paused.
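The paused case can be reproduced with an ordinary operating system pipe. The following is a minimal sketch, not Splunk code: it assumes a POSIX system, the function name fill_pipe_until_blocked is hypothetical, and the write end is made non-blocking only so that the example reports the full buffer instead of hanging.

```python
import os


def fill_pipe_until_blocked(chunk_size=4096):
    """Write to a pipe that nobody reads. Return the bytes accepted before it fills."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode makes a full buffer raise BlockingIOError
    # instead of suspending the writer indefinitely.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            written += os.write(write_fd, b"x" * chunk_size)
    except BlockingIOError:
        # The reader still exists (read_fd is open) but is not reading,
        # so the pipe is full and no more data can be written.
        return written
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

The number of bytes accepted depends on the operating system's pipe buffer size. With a normal blocking pipe, the writer would simply wait at this point until the reader resumes.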
If the search process on an indexer is stopped/finalized, it shuts down and the pipe to the Python search process is broken. This is how the Splunk custom search commands know that the upstream search has stopped running. This occurs whether the search is shut down cleanly due to user action, or shut down violently due to an upstream crash.
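The broken-pipe signal described above can be observed directly. This sketch is illustrative only and assumes a POSIX system. The function name is hypothetical, and closing the read end of the pipe stands in for the search process shutting down. It relies on the fact that CPython ignores SIGPIPE by default, so the failed write surfaces as a BrokenPipeError exception.

```python
import os


def write_detects_closed_reader():
    """Return True if writing to a pipe with no reader raises BrokenPipeError."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the upstream search process going away
    try:
        os.write(write_fd, b"event\n")
        return False
    except BrokenPipeError:
        # This is the point where a writer can release shared resources
        # and shut down in an orderly way.
        return True
    finally:
        os.close(write_fd)
```

Catching the exception, rather than being killed by the parent, is what gives the downstream process the chance to decide how to finish.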
If the archiving Java process on the indexer finds a broken pipe to the indexer search process, it logs that information, but continues archiving until the buffer is full. If this is not desired, kill the Java process.
This documentation applies to the following versions of Splunk® Enterprise: 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4