How Hadoop Data Roll works

Beginning with version 8.3.0, Hadoop Data Roll works with buckets with journalCompression set to zstd.

After you configure an index as an Archive, a number of processes work to move aged data into archived indexes:

1. A saved search, | archivebuckets, runs automatically once an hour on the search head. archivebuckets is a custom command packaged with the archiver and implemented as the Python script archivebuckets.py.

2. archivebuckets queries the local REST endpoints to discover which indexes should be archived, and where to archive the indexes.

3. archivebuckets copies the Hadoop Data Roll jars into its own app directory, then launches distributed searches for each provider for the indexes to be archived.

The search used in this step is | copybuckets, a custom command implemented by copybuckets.py.

4. The information for the index and its provider is fed to the search.

5. For each indexer, Splunk Enterprise copies the knowledge bundle needed to run the search.

6. On the indexer, copybuckets launches a Java process, with the same entry point (the SplunkMR class) used for Splunk Analytics for Hadoop searches.

7. Splunk Enterprise passes the information about providers and indexes to the Java process using stdin.

8. When the Java process sends events back to Splunk Enterprise, it writes them to stdout, and the custom search command (copybuckets.py) writes them back to the search process using stdout.

9. The Java process logs these actions to the splunk_archiver.log file.

10. The Java process checks all buckets in the designated indexes. If the buckets are ready to be archived, the process determines whether the buckets already exist in the archive. It accesses the archive using the provider information.

11. If the bucket has not yet been archived, the bucket is copied to a temporary directory at the archive. Once the bucket is completely copied and a receipt file is added, it is moved to the correct folder in the archive.

12. If the bucket was previously archived, any new data that has reached its archive date is copied into that bucket.

13. Archived buckets are ready to be searched in Splunk Web.
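
The copy-then-move behavior in steps 10 and 11 can be illustrated with a minimal Python sketch. This is not the archiver's actual Java code: it uses the local filesystem instead of HDFS, the function name `archive_bucket`, the `tmp` directory name, and the receipt file name are all hypothetical, and it skips step 12 (adding new data to an already archived bucket).

```python
import os
import shutil
import tempfile

RECEIPT_NAME = "receipt.json"  # hypothetical receipt file name


def archive_bucket(bucket_dir, archive_root, index_name):
    """Copy a bucket into the archive through a temporary directory.

    Returns True if the bucket was copied, False if it already exists
    in the archive.
    """
    bucket_name = os.path.basename(os.path.normpath(bucket_dir))
    final_dir = os.path.join(archive_root, index_name, bucket_name)
    if os.path.exists(final_dir):
        return False  # already archived; nothing to copy in this sketch

    # Stage the copy where searches of the archive will not look.
    tmp_root = os.path.join(archive_root, "tmp")
    os.makedirs(tmp_root, exist_ok=True)
    staging_dir = tempfile.mkdtemp(prefix=bucket_name + "-", dir=tmp_root)
    staged_bucket = os.path.join(staging_dir, bucket_name)

    shutil.copytree(bucket_dir, staged_bucket)
    # The receipt is written only after the copy has finished.
    with open(os.path.join(staged_bucket, RECEIPT_NAME), "w") as receipt:
        receipt.write('{"status": "complete"}\n')

    # Move the finished copy into place in a single rename.
    os.makedirs(os.path.dirname(final_dir), exist_ok=True)
    os.rename(staged_bucket, final_dir)
    os.rmdir(staging_dir)
    return True
```

Staging the copy first means a bucket only appears in its final folder once it is complete, so an interrupted copy leaves nothing partial in the searchable part of the archive.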

About the Hadoop Data Roll processes

Two search commands are defined, each corresponding to a Python script:

archivebuckets -> archivebuckets.py

copybuckets -> copybuckets.py

The implementation of the Hadoop Data Roll process uses the following processes:

Process action	Process name	Notes
Search process on the Search Head/Search Scheduler	archivebuckets	This is the search activity, including scheduling, that occurs on the Search Head.
Python process on the Search Head	archivebuckets.py
Search process on an indexer	copybuckets <JSON describing indexes>	This is all of the search activity that occurs on the indexer.
Python process on an indexer	copybuckets.py
Java Virtual Machine process on an indexer	Hunk Java code	This process ties the other processes together, and does the following: 1. Writes files to HDFS. 2. Logs information to `$SPLUNK_HOME/var/log/splunk/splunk_archiver.log`. 3. Writes events to `stdout`, which is piped back to the Splunk search process `| copybuckets <JSON describing indexes>`. 4. Information written to the Splunk search process becomes an event returned by the search, and you can see these events in Splunk Enterprise with the search command `| archivebuckets forcerun=1`.

How the processes work together

The Hadoop Data Roll search framework strings these processes together and pipes them as follows:

1. stdout of the Python process on an indexer (copybuckets.py) to stdin of the search process on the indexer (| copybuckets <JSON describing indexes>).

2. stdout of | copybuckets <JSON describing indexes> to stdin of the Python process on the Search Head (archivebuckets.py).

3. stdout of archivebuckets.py into the search scheduler for the search process on the Search Head (archivebuckets).

4. The Python process on the indexer (copybuckets.py) pipes stdout of the Hadoop Data Roll Java code (Java Virtual Machine process on an indexer) to stdout for the search process on the indexer (| copybuckets <JSON describing indexes>).

At the end of this process, anything that the Hadoop Data Roll Java code (JVM process on an indexer) writes to stdout becomes events returned from the search scheduler | archivebuckets.
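
The relay pattern described above, where a Python wrapper passes configuration to a child on stdin and forwards the child's stdout, can be sketched as follows. This is an illustration only, not the code in copybuckets.py: a small Python child stands in for the Java process, and the names `CHILD_CODE` and `run_child_and_relay` and the JSON layout are assumptions made for the example.

```python
import json
import subprocess
import sys

# Stand-in for the Java process: prints one line per index read from stdin.
CHILD_CODE = (
    "import json, sys\n"
    "config = json.load(sys.stdin)\n"
    "for name in config['indexes']:\n"
    "    print('archived index=' + name)\n"
)


def run_child_and_relay(config, out=sys.stdout):
    """Start the child, send config as JSON on stdin, relay its stdout."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdin.write(json.dumps(config))
    child.stdin.close()  # end of input for the child

    relayed = []
    for line in child.stdout:
        out.write(line)  # forward each line to the upstream consumer
        relayed.append(line.rstrip("\n"))
    child.stdout.close()
    child.wait()
    return relayed
```

Calling `run_child_and_relay({"indexes": ["main", "web"]})` writes one line per index to the wrapper's own stdout, which is how output from the innermost process can surface as events at the top of the chain.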

Finalizing or aborting archiving process

When Hadoop Data Roll pauses or finalizes a search, this information must be passed to downstream processes.

For example, if the search process on an indexer shuts down, the search could kill the child process, which then prevents the indexer Python process from shutting down gracefully. If the Python process is using a shared resource such as a database connection, or an output stream to HDFS, this could cause failure and possible loss of data.

To resolve this, the search process lets the child process decide what to do should the search process suspend or shut down. If the search process on an indexer is paused, it stops reading from its pipe to the Python process on an indexer. When this happens, the Python process on the indexer can no longer write to the pipe once the buffer fills up.

The Python process is able to determine that the search process on an indexer still exists, but is paused.
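
The paused case can be reproduced in isolation with an ordinary operating system pipe. The following sketch is a generic illustration on a Unix-like system, not Splunk code, and the function name `fill_pipe_until_blocked` is hypothetical. The read end stays open but is never read, so writes succeed only until the pipe buffer is full.

```python
import os


def fill_pipe_until_blocked(chunk=b"x" * 4096):
    """Write to a pipe that nobody reads; return the bytes accepted."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode makes a full buffer raise instead of hanging.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError:
        # Buffer is full: the reader still exists but is not draining it.
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written
```

A writer in blocking mode would simply wait at this point, which is the stalled state the paragraphs above describe. The number of bytes accepted depends on the platform's pipe buffer size.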

If the search process on an indexer is stopped or finalized, it shuts down and the pipe to the Python process is broken. This is how the Splunk custom search commands know that the upstream search has stopped running. This occurs whether the search is shut down cleanly due to user action, or shut down abruptly due to an upstream crash.
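
A broken pipe is visible to the writer as an error on its next write. The sketch below shows the general mechanism with a plain pipe on a Unix-like system; it is not taken from the Splunk scripts, and `reader_has_gone` and `demo` are hypothetical names. Python ignores SIGPIPE by default, so the failed write raises `BrokenPipeError` rather than terminating the process.

```python
import os


def reader_has_gone(write_fd):
    """Return True if a write fails because the read end was closed."""
    try:
        os.write(write_fd, b"event\n")
    except BrokenPipeError:
        return True
    return False


def demo():
    """Check the pipe before and after the reader closes its end."""
    read_fd, write_fd = os.pipe()
    before = reader_has_gone(write_fd)  # reader still open
    os.close(read_fd)                   # simulate the search process exiting
    after = reader_has_gone(write_fd)   # write now fails with EPIPE
    os.close(write_fd)
    return before, after
```

Catching the error gives the writer a chance to release shared resources, such as open output streams, before it exits.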

If the archiving Java process on the indexer finds a broken pipe to the indexer search process, it logs that information, but continues archiving until the buffer is full. If this is not desired, kill the Java process.
