Export to HDFS or a mounted file system
Overview of building an export
To export data from the Splunk platform into Hadoop Distributed File System (HDFS) or a mounted file system, you build an export job. Start by building a search as you would elsewhere in the Splunk platform, using the search language.
When you are satisfied with your search string, configure your export format and schedule. After you save the export job, the job periodically runs in the background based on the schedule you configure and exports the data that the search returns.
How data export works
When a job initiates, the Splunk platform looks for a checkpoint that marks the end of the last successful job. The Splunk platform creates a search using the information provided in the scheduled job and the timestamp provided by the checkpoint.
As the job runs, the Splunk platform processes chunks of data received from the search and creates compressed files locally on the search head. These files are moved to HDFS or the mounted file system when they reach 64 MB, when they cumulatively consume more than 1 GB, or when the search finishes successfully. You can change these limits in the Export Defaults configuration section. While a job is in progress, the compressed export files are temporarily named using an .hdfs extension. A temporary cursor continues to track the progress of the job.
The .hdfs file extension is removed after the export job finishes. At this time the files are permanently renamed. A new checkpoint (cursor) is set to the end of the job, which is where the next job begins.
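The transfer rule and temporary naming described above can be sketched in a few lines of Python. This is only an illustration, not the Splunk platform's implementation: the function names `should_flush` and `final_name` are hypothetical, the sketch assumes the 64 MB limit applies to any single file, and it assumes binary megabytes and gigabytes.

```python
# Illustrative sketch only; names and unit sizes are assumptions.
MB = 1024 * 1024
GB = 1024 * MB


def should_flush(file_sizes, search_finished,
                 per_file_limit=64 * MB, total_limit=1 * GB):
    """Return True when local export files should be moved to the destination."""
    if search_finished:
        # A successfully finished search always flushes what remains.
        return True
    if any(size >= per_file_limit for size in file_sizes):
        # At least one file has reached the per-file limit.
        return True
    # Otherwise flush only when the files together exceed the total limit.
    return sum(file_sizes) > total_limit


def final_name(temp_name, suffix=".hdfs"):
    """Strip the temporary suffix once the export job has finished."""
    if temp_name.endswith(suffix):
        return temp_name[:-len(suffix)]
    return temp_name
```

Both limits are parameters in this sketch because the documentation says they can be changed in the Export Defaults configuration section.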
If the export job fails, the Splunk platform removes the temporary files from HDFS or the mounted file system during the next export attempt. The checkpoint is reset to the original location so that the next job includes the time frame from the failed job.
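The checkpoint behavior can be modeled as follows. This is a minimal sketch under stated assumptions, not the actual implementation: `run_export` and `export_fn` are hypothetical names, timestamps are assumed to be epoch seconds, and cleanup of temporary files is left out.

```python
def run_export(checkpoint, now, export_fn):
    """Run one export over the window from checkpoint to now.

    Returns the checkpoint that the next job should start from.
    """
    try:
        # The search covers everything since the last successful job.
        export_fn(earliest=checkpoint, latest=now)
    except Exception:
        # Failed job: keep the old checkpoint so that the next job
        # includes the time frame of the failed job.
        return checkpoint
    # Successful job: the next job begins where this one ended.
    return now
```

In this model, a failure at one scheduled run means the following run exports a longer window that starts at the last successful checkpoint.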
Create an export job
1. Click Build export in the app navigation bar.
2. Build your search. For information about creating searches, see the Splunk Search Manual.
You can partition your job while creating the search or you can assign one of several default partitions when you schedule the export in the next steps. See "Creating your own partitions" in this topic.
3. After you create and validate your search, click Create scheduled HDFS export and provide the following information.
- Name: Create a name for the export job.
- Format: The format in which you want the exported fields to be saved: CSV, raw events, XML, or JSON.
- Fields: List the fields you want to include. Use comma-separated values. If you selected "_raw" as the format, this field is grayed out.
- HDFS URI: Choose one of the configured HDFS clusters or mounted file systems to which you want to export the data.
- HDFS base path: Add any subdirectory for the URI.
- Partition by: Select all or any of the default partition types. See "How does partitioning work" in this topic. If you use custom partitions when configuring your search job, those partitions override any selections you make here.
- Export from: The date from which you want to start pulling data. The format is mm/dd/yyyy. You can begin your search job retroactively, as long as the data was in the system at that time.
- Export frequency: How often the search job should be run. Every 5 minutes is the minimum frequency.
- Parallel searches: If you are exporting data from multiple indexers you can improve the export throughput by instructing the Splunk platform to use more than one search to perform the export. The number you specify determines the number of searches that are run in parallel. To run parallel searches on all of your available indexers, type
- Compression Level: Use the slider bar to set the file compression level. 0 means no compression and 9 gives you the maximum possible compression. The higher the compression level, the more slowly files are written.
4. Click Create. The search job starts running on the configured schedule and the information appears in your dashboard.
How does partitioning work?
Partitioning is a process by which the export data is placed in a dynamic directory structure based on the values of certain event fields. You can choose how exported data is partitioned. It can be partitioned by any of the fields present in the event.
Hadoop Connect exposes the following out-of-the-box partitioning variables:
When you are creating an export job, you can select one or more of these partition variables in the user interface.
Creating your own partitions
In addition to the out-of-the-box partitioning variables, you can use any field to compute a partitioning path for the events (results) to be exported. To do so, set the special field _dstpath. For example, to export your search results into the path <base-path>/<date>/<hour>/<app>, use the following search string:
search ….. | eval _dstpath=strftime(_time, "%Y%m%d/%H") + "/" + app_name
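To see what path such an expression produces for a given event, the same computation can be mirrored in Python. This is only a sketch: `dst_path` is a hypothetical helper, and it assumes UTC, whereas the Splunk platform formats `_time` according to its own time zone settings.

```python
from datetime import datetime, timezone


def dst_path(event_time, app_name):
    """Build <date>/<hour>/<app> from an epoch timestamp and an app name."""
    # Assumption: timestamps are interpreted as UTC in this sketch.
    ts = datetime.fromtimestamp(event_time, tz=timezone.utc)
    return ts.strftime("%Y%m%d/%H") + "/" + app_name


print(dst_path(1400000000, "web"))  # prints 20140513/16/web
```

An event with that timestamp and an app_name of web would therefore be placed under <base-path>/20140513/16/web.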
You can perform other types of preprocessing of data in the Splunk platform (lookups, field extractions, evaluate other fields, and so on) or choose to export it in raw format.
This documentation applies to the following versions of Splunk® Hadoop Connect: 1.2, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5