Set up a provider and virtual index in the configuration file
Splunk Analytics for Hadoop reaches End of Life on January 31, 2025.
After you install Splunk and license Splunk Analytics for Hadoop, you can modify indexes.conf
to create a provider and virtual index or use Splunk Web to add virtual indexes and providers.
- To add a virtual index in Splunk Web, see Add a virtual index in this manual.
- To add a new provider, see Add an HDFS provider in this manual.
Before you begin
To configure a provider and virtual index through the configuration files, you edit indexes.conf. Before you edit indexes.conf, confirm that Splunk Analytics for Hadoop has the proper permissions and gather the information you need to set up the provider and virtual index.
Configure your permissions
Before you set up a provider, make sure that Splunk Analytics for Hadoop has the following permissions:
- Read-only access to the HDFS directory where your virtual index data resides.
- Read-write access to the HDFS directory where your Splunk instance is installed. (This is usually your splunkMR directory, for example: User/hue/splunk_mr/dispatch). Splunk creates the following directories in this directory:
  - /dispatch (this is the directory where the temp results are stored).
  - /packages (this is the Splunk .tgz file that will get copied over to the data node).
  - /bundles (this is where the configurations are stored).
- Read-write access to the DataNode where your /tmp directory resides. This is the temp directory that you point to when you configure vix.splunk.home.datanode in your provider settings.
Gather up the following information
You'll need to know the following information about your search head, file system, and Hadoop configuration:
- The host name and port for the NameNode of the Hadoop cluster.
- The host name and port for the JobTracker of the Hadoop cluster.
- Installation directories of Hadoop client libraries and Java.
- Path to a writable directory on the DataNode/TaskTracker *nix filesystem, the one for which the Hadoop user account has read and write permission.
- Path to a writable directory in HDFS that can be used exclusively by this search head.
Edit indexes.conf
Edit indexes.conf to establish a virtual index. This is where you tell Splunk about your Hadoop cluster and about the data you want to access through virtual indexes.
Create indexes.conf
Create a copy of indexes.conf and place it into your local directory. In this example we are using:
Note: The following changes to indexes.conf become effective at search time; no restart is necessary.
Create a provider
1. For each Hadoop cluster, create a separate provider stanza. In this stanza, you provide the path to your Java installation and the path to your Hadoop library, as well as any other MapReduce configurations that you want to use when running searches against this cluster.
The attributes in the provider stanza are merged with those of the family stanza, which it inherits from. The "vix." prefix is stripped from each attribute and the values are passed to the MapReduce job configuration.
You must configure the provider first. You may configure multiple indexes for a provider.
[provider:MyHadoopProvider]
vix.family = hadoop
vix.env.JAVA_HOME = /path_to_java_home
vix.env.HADOOP_HOME = /path_to_hadoop_client_libraries
2. Tell Splunk about the cluster, including the NameNode and JobTracker as well as where to find and where to install your Splunk .tgz copy.
vix.mapred.job.tracker = jobtracker.hadoop.splunk.com:8021
vix.fs.default.name = hdfs://hdfs.hadoop.splunk.com:8020
vix.splunk.home.hdfs = /<the path in HDFS that is dedicated to this search head for temp storage>
vix.splunk.setup.package = /<the path on the search head to the package to install in the data nodes>
vix.splunk.home.datanode = /<the path on the TaskTracker's Linux filesystem on which the above Splunk package should be installed>
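The merge and prefix stripping described in step 1 can be pictured with a short Python sketch. This is an illustration only, not Splunk's implementation: the function name `to_job_conf` and the `vix.example.setting` family default are hypothetical, and the sketch assumes that a provider value overrides a family value of the same name.

```python
def to_job_conf(family: dict, provider: dict) -> dict:
    """Merge provider attributes over family attributes, then strip 'vix.'."""
    # Assumption: when both stanzas define a key, the provider value wins.
    merged = {**family, **provider}
    prefix = "vix."
    return {
        key[len(prefix):]: value
        for key, value in merged.items()
        if key.startswith(prefix)
    }


# Hypothetical family default, used only to show the merge.
family_defaults = {"vix.example.setting": "family-value"}

# Provider attributes taken from the stanza in this section.
provider = {
    "vix.mapred.job.tracker": "jobtracker.hadoop.splunk.com:8021",
    "vix.fs.default.name": "hdfs://hdfs.hadoop.splunk.com:8020",
}

job_conf = to_job_conf(family_defaults, provider)
for name, value in sorted(job_conf.items()):
    print(f"{name} = {value}")
```

In this sketch, `vix.mapred.job.tracker` ends up as `mapred.job.tracker` in the resulting dictionary, which is the name a MapReduce job configuration would expect.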
Create a virtual index
1. Define one or more virtual indexes for each provider. This is where you can specify how the data is organized into directories, which files are part of the index and some hints about the time range of the content of the files.
[hadoop]
vix.provider = MyHadoopProvider
vix.input.1.path = /home/myindex/data/${date_date}/${date_hour}/${server}/...
vix.input.1.accept = \.gz$
vix.input.1.et.regex = /home/myindex/data/(\d+)/(\d+)/
vix.input.1.et.format = yyyyMMddHH
vix.input.1.et.offset = 0
vix.input.1.lt.regex = /home/myindex/data/(\d+)/(\d+)/
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.offset = 3600
- For vix.input.1.path, provide a fully qualified path to the data that belongs in this index, including any fields you want to extract from the path.
For example:
/some/path/${date_date}/${date_hour}/${host}/${sourcetype}/${app}/...
Items enclosed in ${} are extracted as fields and added to each search result from that path. The search ignores directories that do not match the search string, which significantly improves performance.
- For vix.input.1.accept, provide a regular expression that matches the files to include.
- For vix.input.1.ignore, provide a regular expression that matches the files to ignore. Note that ignore takes precedence over accept.
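As a rough illustration of the three settings above, the following Python sketch extracts ${field} values from a path and applies accept and ignore filters. It is a simplified model rather than Splunk's actual matching logic: the function names and the sample file path are hypothetical, each placeholder is assumed to match exactly one path segment, the trailing `...` is treated as "anything below this directory", and the regular expressions are assumed to be searched against the full path.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn a path template with ${field} placeholders into a regex."""
    if template.endswith("..."):
        template = template[:-3]  # assumption: '...' means any deeper path
    # Splitting on one capturing group alternates literal text and field names.
    parts = re.split(r"\$\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)
        else:
            pattern += f"(?P<{part}>[^/]+)"  # one path segment per field
    return re.compile(pattern)


def extract_fields(template: str, path: str):
    """Return the fields encoded in the path, or None if it does not match."""
    match = template_to_regex(template).match(path)
    return match.groupdict() if match else None


def is_accepted(path: str, accept=None, ignore=None) -> bool:
    """Apply the ignore regex first, because ignore takes precedence."""
    if ignore and re.search(ignore, path):
        return False
    return accept is None or re.search(accept, path) is not None


template = "/some/path/${date_date}/${date_hour}/${host}/${sourcetype}/${app}/..."
sample = "/some/path/20240101/13/web01/access/shop/part-0001.gz"  # made-up file

fields = extract_fields(template, sample)
print(fields)
print(is_accepted(sample, accept=r"\.gz$"))
print(is_accepted(sample, accept=r"\.gz$", ignore=r"part-0001"))
```

A search that filters on one of the extracted fields can skip any directory whose path already rules out a match, which is where the performance benefit comes from.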
2. Use the regex, format, and offset values to extract a time range for the data contained in a particular path. The time range is made up of two parts: earliest time (vix.input.1.et) and latest time (vix.input.1.lt). The following configurations can be used:
- For vix.input.1.et/lt.regex, provide a regular expression that matches the portion of the directory path that contains the date and time, so that time can be interpreted from the path.
Use capturing groups to extract the parts that make up the timestamp. The values of the capturing groups are concatenated and interpreted according to the specified format. Extracting a time range from the path significantly speeds up searches over particular time windows, because directories that fall outside of the search's time range are ignored.
- For vix.input.1.et/lt.format, provide a date/time format string that specifies how to interpret the data extracted by the above regex. The format string specification can be found in the SimpleDateFormat documentation.
The following two non-standard formats are also supported: epoch, to interpret the data as an epoch time, and mtime, to use the modification time of the file rather than the data extracted by the regex.
- For vix.input.1.et/lt.offset, optionally provide an offset to account for time zone and/or safety boundaries.
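To see how regex, format, and offset combine, here is a small Python sketch that applies the values from the [hadoop] stanza to a made-up path. It is only a model of the behavior described above: the function names and the sample path are hypothetical, the offset is assumed to be in seconds, timestamps are assumed to be UTC, and `%Y%m%d%H` is used as the Python counterpart of the SimpleDateFormat pattern yyyyMMddHH.

```python
import re
from datetime import datetime, timedelta, timezone


def extract_time(path, regex, strptime_format, offset_seconds=0):
    """Concatenate the capturing groups, parse them, and add the offset."""
    match = re.search(regex, path)
    if match is None:
        return None
    stamp = "".join(match.groups())  # e.g. "20240115" + "07"
    parsed = datetime.strptime(stamp, strptime_format)
    parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: UTC
    return parsed + timedelta(seconds=offset_seconds)  # assumption: seconds


def overlaps(earliest, latest, search_start, search_end):
    """True if the directory's time range intersects the search window."""
    return earliest < search_end and latest > search_start


path = "/home/myindex/data/20240115/07/server1/events.gz"  # made-up path
regex = r"/home/myindex/data/(\d+)/(\d+)/"

earliest = extract_time(path, regex, "%Y%m%d%H", offset_seconds=0)
latest = extract_time(path, regex, "%Y%m%d%H", offset_seconds=3600)
print(earliest.isoformat(), latest.isoformat())

# A search over 09:00 to 10:00 on the same day can skip this directory.
window_start = datetime(2024, 1, 15, 9, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 15, 10, tzinfo=timezone.utc)
print(overlaps(earliest, latest, window_start, window_end))
```

With an earliest offset of 0 and a latest offset of 3600, each hourly directory in this sketch covers exactly the hour named in its path.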
Set provider configuration variables
Splunk Analytics for Hadoop also provides preset configuration variables for each provider you create. You can leave the preset variables in place or edit them as needed. If you want to edit them, see Provider Configuration Variables in the reference section of this manual.
Note: If you are configuring Splunk Analytics for Hadoop to work with YARN, you must add new settings. See Required configuration variables for YARN in this manual.
Optionally edit props.conf to define data processing
You can edit props.conf to define how to process data files. Index-time and search-time attributes are accepted for either type. The example below shows how Twitter data (JSON objects representing tweets) is processed using index-time and search-time props. The data is single-line JSON, and _time is a calculated field (note that index-time timestamping is disabled):
[source::/home/somepath/twitter/...]
priority = 100
sourcetype = twitter-hadoop
SHOULD_LINEMERGE = false
DATETIME_CONFIG = NONE

[twitter-hadoop]
KV_MODE = json
EVAL-_time = strptime(postedTime, "%Y-%m-%dT%H:%M:%S.%lZ")
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 7.3.4, 7.3.5, 7.3.6, 7.3.7, 7.3.8, 7.3.9, 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10, 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.1.13, 8.1.14, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 8.2.10, 8.2.11, 8.2.12, 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4, 9.0.5, 9.0.6, 9.0.7, 9.0.8, 9.0.9, 9.0.10, 9.1.0, 9.1.1, 9.1.2, 9.1.3, 9.1.4, 9.1.5, 9.1.6, 9.1.7, 9.2.0, 9.2.1, 9.2.2, 9.2.3, 9.2.4, 9.3.0, 9.3.1, 9.3.2