How Splunk Analytics for Hadoop returns reports on Hadoop data

Splunk Analytics for Hadoop reaches End of Life on January 31, 2025.

When a search is initiated, Splunk Analytics for Hadoop uses the Hadoop MapReduce framework to process the data in place. All of the data parsing that is normally done at index time, including source typing, event breaking, and time stamping, is performed in Hadoop at search time. Splunk Analytics for Hadoop does not index this data. Instead, it processes the data on every request. Here's an overview of how Splunk Analytics for Hadoop searches against Hadoop virtual indexes:

1. The user initiates a report-generating search on a virtual index. See Search a virtual index for more information about report-generating searches.

2. Splunk Analytics for Hadoop recognizes that the request is for a virtual index and spawns an External Results Provider (ERP) process to help with the request. An ERP is a search helper process that carries out searches on Hadoop data. See About virtual indexes.

3. Based on your configuration, Splunk Analytics for Hadoop passes configuration and run-time data, including the parsed search string, to the ERP in JSON format.

4. If this is the first time a search is executed for a particular provider family, the ERP process sets up the necessary environment in HDFS by copying a Splunk Enterprise package and the knowledge bundles to your HDFS or NoSQL database.

5. The ERP process analyzes the request from the search. It identifies the relevant data to be processed and generates tasks to be executed on Hadoop. It then spawns a MapReduce job to perform the computation.

6. For each task, the MapReduce job first makes sure that the environment is up to date by checking for the correct Splunk package and knowledge bundle.

7. If the correct package and knowledge bundle are not found, the task copies the Splunk package from HDFS (see step 4) and then extracts it into the configured directory. It then copies the bundles from HDFS (see step 4) and expands them in the correct directory within the TaskTracker.

8. The map task spawns a search process on the TaskTracker node to handle all the data processing.

9. The map task feeds data to the search process and consumes its output, which becomes the output of the map task. This output is stored in HDFS.

10. The ERP process on the search head continuously polls HDFS to pick up the results and feeds them to the search process running on the search head.

11. The ERP search process on the search head uses these results to create the report. The report is updated continuously as new data arrives.
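
The polling behavior in steps 9 through 11 can be illustrated with a small sketch. This is not Splunk code: it assumes a local temporary directory as a stand-in for HDFS, assumes each map task writes one JSON file of partial counts, and uses hypothetical names (`poll_once`, `demo`, `task-0.json`) chosen only for illustration. It shows how a poller might pick up only the result files it has not yet seen and merge them into a report that grows as tasks finish.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def poll_once(results_dir: Path, seen: set, report: Counter) -> int:
    """Merge result files not yet seen into the running report.

    Returns the number of new files consumed in this pass.
    """
    new_files = 0
    for path in sorted(results_dir.glob("*.json")):
        if path.name in seen:
            continue  # already merged during an earlier poll
        # Assumption: each file holds partial counts from one map task.
        partial = json.loads(path.read_text())
        report.update(partial)  # adds the partial counts to the totals
        seen.add(path.name)
        new_files += 1
    return new_files


def demo() -> Counter:
    """Simulate two map tasks that finish at different times."""
    report, seen = Counter(), set()
    with tempfile.TemporaryDirectory() as tmp:
        results_dir = Path(tmp)  # stand-in for a results directory in HDFS
        # The first map task finishes and writes its output.
        (results_dir / "task-0.json").write_text(
            json.dumps({"error": 2, "info": 5})
        )
        poll_once(results_dir, seen, report)
        # The second map task finishes later; the next poll reads only its file.
        (results_dir / "task-1.json").write_text(
            json.dumps({"error": 1, "warn": 4})
        )
        poll_once(results_dir, seen, report)
    return report


if __name__ == "__main__":
    print(dict(demo()))
```

Tracking the set of file names already consumed is what lets the report update incrementally instead of being recomputed from scratch on each poll. A real ERP process would read from HDFS and hand results to the search process rather than to an in-memory counter.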
