Splunk® Hadoop Connect

Deploy and Use Splunk Hadoop Connect


Configuration file reference

Splunk Hadoop Connect uses several configuration files that help control how it operates in your Splunk environment. Splunk Hadoop Connect makes changes to these files when you change your configuration from within Splunk Web.

You can also edit them manually. Whenever you make changes to a configuration file manually, always make the changes in $SPLUNK_HOME/etc/apps/HadoopConnect/local. Do not add files to or edit files in the default directory.

For information on how configuration files work in the Splunk platform, see "About configuration files."

Splunk Hadoop Connect uses the following configuration files.

clusters.conf

The clusters.conf file defines:

  • Hadoop Distributed File System (HDFS) cluster information for Splunk Hadoop Connect to interface with when importing data from or exporting data to HDFS.
  • Local file system information for Splunk Hadoop Connect to interface with when exporting data.

The file contains at least one stanza:

[<host>:<ipc port>]

  • For a remote HDFS cluster, the stanza name must contain the host name and the inter-process communication (IPC) port on which the Hadoop cluster listens. For example:

[hadoop.example.com:8020]

  • For a locally mounted file system, the stanza name must likewise contain the host name, or IP address, and the IPC port, as follows:

[<hostname or ip address>:<IPC port>]

Within the stanza, the configuration file supports the following attributes:

namenode_http_port (string)
    The TCP port on which the NameNode (the centerpiece of the HDFS system) listens for HTTP requests. Default: 50070.

uri (string)
    If mapping to a mounted file system, the full path to that file system.

hadoop_home (string)
    The path to the Hadoop command line utilities that Splunk Hadoop Connect should use when it communicates with the cluster. If you use the string '$HADOOP_HOME', and the environment variable is present on the system, then Splunk Hadoop Connect uses the contents of that variable when you start the Splunk platform.

java_home (string)
    The path to the Java installation that Splunk Hadoop Connect should use when it communicates with the cluster. If you use the string '$JAVA_HOME', and the environment variable is present on the system, then Splunk Hadoop Connect uses the contents of that variable when you start the Splunk platform.

kerberos_principal (string)
    The fully qualified name of the Kerberos principal that Splunk Hadoop Connect should use to communicate with the cluster. This attribute is valid if and only if:
      • The cluster uses Kerberos authentication.
      • The Splunk Hadoop Connect app contains a file, located at $SPLUNK_HOME/etc/apps/HadoopConnect/local/clusters/<host>_<ipc_port>/core-site.xml, that directs it to use Kerberos as the means of authentication for the cluster.

kerberos_service_principal (string)
    The fully qualified name of the Kerberos principal under which the HDFS service in this cluster runs. This value must match the value of the dfs.namenode.kerberos.principal attribute in the core-site.xml file of your Hadoop cluster.
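
To illustrate, a clusters.conf stanza for a Kerberos-secured remote cluster might look like the following sketch, saved in $SPLUNK_HOME/etc/apps/HadoopConnect/local/clusters.conf. The host names, paths, and principal names are placeholders, not values from a real deployment:

[hadoop.example.com:8020]
namenode_http_port = 50070
hadoop_home = /opt/hadoop
java_home = /usr/lib/jvm/java-8-openjdk
kerberos_principal = biguser/bigdata@EXAMPLE.COM
kerberos_service_principal = hdfs/hadoop.example.com@EXAMPLE.COM

For a locally mounted file system, the stanza instead sets the uri attribute to the mount path, assumed here to be /mnt/hdfs:

[mounted.example.com:8020]
uri = /mnt/hdfs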

export.conf

The export.conf file controls how Splunk Hadoop Connect exports data to HDFS.

The file contains one stanza, whose name can be anything you want. Within the stanza, the configuration file accepts the following attributes:

uri (string)
    The Uniform Resource Identifier (URI) of the HDFS cluster to export data to. Example: hdfs://bigdata.example.com/

base_path (string)
    The path to export the data to. This value is appended to the contents of the uri attribute to form the home location. Example: /home/data/export

partition_fields (string list)
    A comma-separated list of fields to partition the export data by. Places events with common field values in the same path. The supported fields are date, hour, host, source, and sourcetype. Example: date,host

search (string)
    The search whose events Splunk Hadoop Connect should export. Example: index=BigData

roll_size (unsigned integer)
    The maximum size, in megabytes, that the working file can reach before Splunk Hadoop Connect exports it to HDFS. This value should be less than the HDFS block size. Example: 63

minspan (unsigned integer)
    The minimum amount of time, in seconds, that must have elapsed since Splunk Hadoop Connect last triggered an export before it triggers a new export. Example: 1800

maxspan (unsigned integer)
    The maximum index time range, in seconds, that a single export job can process. Make this value less than 86400 (24 hours, 1 day) to ensure that Splunk Hadoop Connect does not roll back a lot of work. Example: 28800

starttime (unsigned integer)
    The time, in seconds since 00:00:00 UTC on January 1, 1970, at which to start an export if no cursor is present. You can populate this attribute through the Export From field of the Create Scheduled HDFS Export page. Set this value to 0 to begin exporting data from the current time. Example: 1347526800

endtime (unsigned integer)
    The export end time. Set this value to 0 so that exports never end. Example: 0

kerberos_principal (string)
    The Kerberos principal to use when communicating with the HDFS cluster. Use this attribute if the HDFS cluster uses Kerberos authentication. Example: biguser/bigdata@EXAMPLE.COM

parallel_searches (unsigned integer, or the word max)
    The number of parallel searches to spawn for exporting data to HDFS. Parallel searches search disjoint sets of data, which guarantees no overlap of processed data. Example: max

replication (integer)
    The replication factor to use for a specific export job. Use the default of 0 to denote the default replication level specified in the cluster. Example: 0

format (string)
    The format of a specific export job. The supported formats are raw, json, xml, csv, and tsv. Example: csv

fields (string list)
    A comma-separated list of fields to export. Can contain wildcard fields, represented by *, if the format allows it. The csv and tsv formats do not allow wildcard fields, and the raw format supports only the _raw field. Example: date,time

compress_level (integer)
    The compression level of the export data files. Valid values are 0 through 9, where 0 means no compression. The default is 2. Example: 2
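
Taken together, a scheduled export stanza might look like the following sketch. The stanza name and all values are illustrative, drawn from the examples above rather than from a real deployment:

[bigdata_export]
uri = hdfs://bigdata.example.com/
base_path = /home/data/export
partition_fields = date,host
search = index=BigData
roll_size = 63
minspan = 1800
maxspan = 28800
starttime = 0
endtime = 0
kerberos_principal = biguser/bigdata@EXAMPLE.COM
parallel_searches = max
replication = 0
format = csv
compress_level = 2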

Status attributes

As Splunk Hadoop Connect exports data to your HDFS cluster, it logs its progress within the following status attributes in the same file. You cannot edit these attributes, and they might change rapidly based on the status of the export job.

The status attributes and their descriptions are as follows.

status (string)
    Indicates the current phase of the export job. Possible values are initializing, searching, renaming, done, and failed. Example: searching

status.jobs (string)
    The number of export searches that Splunk Hadoop Connect has spawned.

status.jobs.psid (string)
    The search ID of the scheduled search job.

status.jobs.sids (string list)
    A comma-delimited list of the search IDs of all export searches.

status.jobs.progress (floating point)
    The overall progress of all active export searches.

status.jobs.runtime (floating point)
    The total amount of time for which all parallel export searches have been running.

status.jobs.errors (string)
    If status is failed, contains an error message.

status.jobs.earliest (string)
    The earliest possible index time, in Unix epoch time, of the search results of the current export process.

status.jobs.latest (string)
    The latest possible index time of the search results of the current export process.

status.jobs.starttime (unsigned integer)
    The start time, in Unix epoch time, of the current export process.

status.jobs.endtime (string)
    The end time of the current export process.

status.earliest (unsigned integer)
    The earliest possible index time of all export data for all time.

status.latest (unsigned integer)
    The latest possible index time of all export data for all time.

status.load (floating point)
    The load factors for the last 10 successful jobs, where load equals the total execution time of those jobs divided by the export time range.
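
As an illustration only, after a completed export the status attributes recorded in the stanza might resemble the following. Every value here is invented to show the shape of the data, not taken from a real job:

status = done
status.jobs = 2
status.jobs.progress = 1.0
status.jobs.runtime = 342.7
status.jobs.starttime = 1347526800
status.earliest = 1347440400
status.latest = 1347526800
status.load = 0.012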

inputs.conf

Splunk Hadoop Connect supports the following additional stanza in inputs.conf to gather data from an HDFS resource.

This stanza is valid on Splunk version 5.0 and later.

The stanza name must be in the format [hdfs://<hdfs resource>], where <hdfs resource> is the HDFS resource that you want the Splunk platform to index.

Within the stanza, the configuration file accepts the following attributes:

sourcetype (string)
    Override the default sourcetype value of hdfs with the contents of this attribute. This attribute is optional.

whitelist (string, regular expression)
    Monitor only the files that match the regular expression that this attribute defines. This attribute is optional.

blacklist (string, regular expression)
    Do not monitor files that match the regular expression that this attribute defines. This attribute is optional. If you define both this attribute and whitelist, Splunk Hadoop Connect processes this attribute first.
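
For example, the following sketch of an inputs.conf stanza indexes an HDFS directory, assigns a custom sourcetype, and skips temporary files. The resource path, sourcetype name, and regular expression are hypothetical:

[hdfs://hadoop.example.com:8020/home/data/export]
sourcetype = hdfs_export
blacklist = \.tmp$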

Example files

See examples of how to use these configuration files in "Configure Splunk Hadoop Connect" in this manual.


This documentation applies to the following versions of Splunk® Hadoop Connect: 1.1, 1.2, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5

