Configure Hive connectivity
By default, Hive saves data in one of several file formats, as either binary files or a set of text files delimited with special characters. Splunk Analytics for Hadoop currently supports four Hive (v0.12) file format types: Textfile, RCfile, ORC file, and Sequencefile.
Splunk Analytics for Hadoop supports different file formats through its preprocessor framework. It provides a data preprocessor called HiveSplitGenerator that lets Splunk Analytics for Hadoop access and process data stored or used by Hive.
The easiest way to configure Splunk Analytics for Hadoop to connect to Hive tables is to edit indexes.conf to:
- Provide Splunk Analytics for Hadoop with the metastore URI.
- Specify that Splunk Analytics for Hadoop use the HiveSplitGenerator to read the Hive data.
If you don't want Splunk Analytics for Hadoop to access your metastore server, you can manually configure it to access raw data files that make up your Hive tables. See "Configure Splunk Analytics for Hadoop to read your Hive tables without a metastore" in this topic.
Splunk Analytics for Hadoop currently supports the following versions of Hive:
- 0.10
- 0.11
- 0.12
- 0.13
- 0.14
- 1.2
Before you begin
To set up Splunk Analytics for Hadoop to read Hive tables, you must have already configured your indexes and providers. If you have not set them up yet, see:
- Set up a provider and virtual index in the configuration file
- Add or edit a virtual index in the user interface
- Add an HDFS provider
Configure Hive connectivity with a metastore
To configure Hive connectivity, provide the vix.hive.metastore.uris setting in your provider stanza.
Splunk Analytics for Hadoop reads the table information from the metastore server you provide, including column names, types, data location, and format, which lets it process the search request.
Here's an example of a configured provider stanza that properly enables Hive connectivity. Note that a table contains one or more files, and that each virtual index could have multiple input paths, one for each table.
[provider:BigBox]
...
vix.splunk.search.splitter = HiveSplitGenerator
vix.hive.metastore.uris = thrift://metastore.example.com:9083

[orders]
vix.provider = BigBox
vix.input.1.path = /user/hive/warehouse/user-orders/...
vix.input.1.accept = \.txt$
vix.input.1.splitter.hive.dbname = default
vix.input.1.splitter.hive.tablename = UserOrders
vix.input.2.path = /user/hive/warehouse/reseller-orders/...
vix.input.2.accept = .*
vix.input.2.splitter.hive.dbname = default
vix.input.2.splitter.hive.tablename = ResellerOrders
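Before pasting a metastore URI into the provider stanza, it can help to confirm that it has the thrift://host:port shape used in the example above. The following is a minimal Python sketch of such a check, not part of Splunk Analytics for Hadoop. The function name check_metastore_uris is hypothetical, and the sketch assumes that multiple URIs, if you use them, are separated by commas.

```python
from urllib.parse import urlsplit


def check_metastore_uris(value):
    """Return (host, port) pairs for each URI in a vix.hive.metastore.uris value.

    Assumption: multiple URIs are comma-separated. Raises ValueError for any
    entry that does not look like thrift://host:port.
    """
    endpoints = []
    for raw in value.split(","):
        uri = raw.strip()
        parts = urlsplit(uri)
        # parts.port is None when no port is given in the URI.
        if parts.scheme != "thrift" or not parts.hostname or parts.port is None:
            raise ValueError(f"not a thrift://host:port URI: {uri!r}")
        endpoints.append((parts.hostname, parts.port))
    return endpoints


# The URI from the example provider stanza.
print(check_metastore_uris("thrift://metastore.example.com:9083"))
```

A check like this only validates the shape of the value. It does not confirm that the metastore server is reachable.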
In the rare case that the split logic of the Hadoop InputFormat implementation of your table is different from that of Hadoop's FileInputFormat, the HiveSplitGenerator split logic does not work. Instead, you must implement a custom SplitGenerator and use it to replace the default SplitGenerator. See "Configure Splunk Analytics for Hadoop to use a custom file format" for more information.
Configure Splunk Analytics for Hadoop to use a custom file format
To use a custom file format, edit your provider stanza to add a .jar file that contains your custom classes, using the vix.splunk.jars setting.
Note that if you don't specify an InputFormat class, files are treated as text files and broken into records at newline characters.
Configure Splunk Analytics for Hadoop to read your Hive tables without a metastore
If you are unable or do not wish to expose your Metastore server, you can configure Hive connectivity by specifying additional configuration items. For Splunk Analytics for Hadoop, the minimum required information is:
- columnnames
- columntypes
Other information is required if you specified it when you created the table. For example, if your tables specify an InputFormat instead of using Hive's, you must tell Splunk Analytics for Hadoop.
Create a stanza in indexes.conf that provides Splunk Analytics for Hadoop with the list of column names and types of your Hive tables. These column names become the field names you see when running reports in Splunk Analytics for Hadoop:
[your-provider]
vix.splunk.search.splitter = HiveSplitGenerator

[your-vix]
vix.provider = your-provider
vix.input.1.path = /user/hive/warehouse/employees/...
vix.input.1.splitter.hive.columnnames = name,salary,subordinates,deductions,address
vix.input.1.splitter.hive.columntypes = string:float:array<string>:map<string,float>:struct<street:string,city:string,state:string,zip:int>
vix.input.1.splitter.hive.fileformat = sequencefile
vix.input.2.path = /user/hive/warehouse/employees_rc/...
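In the stanza above, column names are separated by commas while column types are separated by colons, and complex types such as map and struct contain commas and colons of their own. The following Python sketch shows one way to check that the two lists line up before you save the configuration. It is an illustration only, not the parser that Splunk Analytics for Hadoop uses, and the helper names split_top_level and pair_columns are hypothetical.

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested inside <...>."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == sep and depth == 0:
            # A separator outside any <...> ends the current type.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def pair_columns(columnnames, columntypes):
    """Map each column name to its type, or raise if the counts differ."""
    names = columnnames.split(",")
    types = split_top_level(columntypes, ":")
    if len(names) != len(types):
        raise ValueError(f"{len(names)} column names but {len(types)} column types")
    return dict(zip(names, types))


# Values copied from the example stanza.
columns = pair_columns(
    "name,salary,subordinates,deductions,address",
    "string:float:array<string>:map<string,float>:"
    "struct<street:string,city:string,state:string,zip:int>",
)
for name, hive_type in columns.items():
    print(f"{name}: {hive_type}")
```

A mismatch in the counts usually means a column was added to one list but not the other.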
Partitioning table data
When using the Hive Metastore, Splunk Analytics for Hadoop automatically analyzes the tables, preserving partition keys and values, and, based on your search criteria, pruning any unwanted partitions. This can help speed up searches.
When not using a Metastore, you can update your [virtual-index] stanza to tell Splunk Analytics for Hadoop about the partitions using key values as part of the file path. For example, the following configuration
vix.input.1.path = /apps/hive/warehouse/sdc_orc2/${server}/${date_date}/...
would extract and recognize the "server" and "date_date" partitions in the following path:
/apps/hive/warehouse/sdc_orc2/idxr01/20120101/000859_0
Splunk Analytics for Hadoop automatically recognizes the same partitions, without any extra configuration, in a partitioned path like the following:
/apps/hive/warehouse/sdc_orc2/server=idxr01/date_date=20120101/000859_0
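The two path styles above can be compared with a short Python sketch. It is an approximation for illustration, not the logic that Splunk Analytics for Hadoop runs. The function names are hypothetical, and the sketch assumes that each ${key} placeholder matches exactly one path segment and that a trailing "..." means "anything below this directory".

```python
import re


def partitions_from_template(template, path):
    """Extract ${key} values from path using a vix.input path template."""
    # Assumption: a trailing "..." matches anything below the directory.
    if template.endswith("..."):
        template = template[:-3]
    pattern, pos = "", 0
    for m in re.finditer(r"\$\{(\w+)\}", template):
        pattern += re.escape(template[pos:m.start()])
        # Assumption: each placeholder covers exactly one path segment.
        pattern += f"(?P<{m.group(1)}>[^/]+)"
        pos = m.end()
    pattern += re.escape(template[pos:])
    match = re.match(pattern, path)
    return match.groupdict() if match else {}


def partitions_from_hive_path(path):
    """Extract key=value partition segments from a Hive-style path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


template = "/apps/hive/warehouse/sdc_orc2/${server}/${date_date}/..."
plain = "/apps/hive/warehouse/sdc_orc2/idxr01/20120101/000859_0"
hive_style = "/apps/hive/warehouse/sdc_orc2/server=idxr01/date_date=20120101/000859_0"

print(partitions_from_template(template, plain))
print(partitions_from_hive_path(hive_style))
```

Both calls yield the same keys and values for the paths shown, which is why the key=value layout needs no placeholders in the configuration.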
This documentation applies to the following versions of Splunk® Enterprise: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.0.9, 7.0.10, 7.0.11, 7.0.13, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.1.8, 7.1.9, 7.1.10, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.3.0, 7.3.1, 7.3.2, 7.3.3, 8.0.0, 8.0.1