How Splunk returns reports on Hadoop data

When a search is initiated, Hunk uses the Hadoop MapReduce framework to process the data in place. All of the data parsing that is normally done at index time, including source typing, event breaking, and time stamping, is performed in Hadoop at search time. Hunk does not index this data; it processes the data again on every request. Here's a high-level overview of how searches against Hadoop virtual indexes operate:

1. The user initiates a report-generating search on a virtual index. See "Search a virtual index" for more information about report-generating searches.

2. Hunk recognizes that the request is for a virtual index and, rather than searching a local index, spawns an External Results Provider (ERP) process to help with the request. An ERP is a search helper process that carries out searches on Hadoop data. See "About virtual indexes."

3. Based on your configuration, Hunk passes configuration and run-time data, such as the parsed search string, to the ERP in JSON format.

4. If this is the first time a search is executed for a particular provider family, the ERP process sets up the necessary Hunk environment in HDFS by copying a Hunk package and the knowledge bundles to your HDFS or NoSQL database.

5. The ERP process analyzes the request from the Hunk search. It identifies the relevant data to be processed and generates tasks to be executed on Hadoop. It then spawns a MapReduce job to perform the computation.

6. For each task, the MapReduce job first makes sure that the Hunk environment is up-to-date by checking for the correct Splunk package and knowledge bundle.

7. If the correct package and knowledge bundle are not found, the task copies the Splunk package from HDFS (see step 4) and extracts it into the configured directory. It then copies the bundles from HDFS (see step 4) and expands them in the correct directory on the TaskTracker node.

8. The map task spawns a Hunk search process on the TaskTracker node to handle all the data processing.

9. The map task feeds data to the Hunk search process and consumes that process's output, which becomes the output of the map task. This output is stored in HDFS.

10. The ERP processes on the search head continuously poll HDFS to pick up the results and feed them to the search process running on the search head.

11. The Hunk search process on the search head uses these results to create the report. The report is updated continuously as new data arrives.
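
The poll-and-update pattern in steps 10 and 11 can be illustrated with a minimal Python sketch. This is not Hunk's implementation: a local temporary directory stands in for HDFS, the "key count" line format for partial results is an assumption made for the example, and the names poll_new_results and update_report are hypothetical.

```python
import os
import tempfile
from collections import Counter


def poll_new_results(results_dir, seen):
    """Return lines from result files that have not been read yet."""
    new_lines = []
    for name in sorted(os.listdir(results_dir)):
        if name in seen:
            continue
        with open(os.path.join(results_dir, name)) as handle:
            new_lines.extend(line.strip() for line in handle if line.strip())
        seen.add(name)  # remember the file so it is not read twice
    return new_lines


def update_report(report, lines):
    """Fold partial results (assumed 'key count' per line) into the report."""
    for line in lines:
        key, count = line.rsplit(" ", 1)
        report[key] += int(count)
    return report


# A local directory stands in for the HDFS output location (assumption).
results_dir = tempfile.mkdtemp()
seen, report = set(), Counter()

# The first map task finishes and writes its output.
with open(os.path.join(results_dir, "task_0.out"), "w") as f:
    f.write("error 3\nwarn 1\n")
update_report(report, poll_new_results(results_dir, seen))
first_snapshot = dict(report)

# A second map task finishes later; the next poll picks up only its output.
with open(os.path.join(results_dir, "task_1.out"), "w") as f:
    f.write("error 2\n")
update_report(report, poll_new_results(results_dir, seen))
final_snapshot = dict(report)

print(first_snapshot)
print(final_snapshot)
```

Each poll reads only the files that appeared since the previous poll, so the report can be shown to the user after every pass while the remaining map tasks are still running.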
