Hunk®(Legacy)

Hunk User Manual


Configure Hive connectivity

By default, Hive stores table data either as binary files or as sets of text files delimited with special characters. Hive 0.12, the latest Hive release that Hunk has been tested against, supports four file format types: Textfile, RCfile, ORC file, and Sequencefile.

Hunk supports these file formats through its preprocessor framework. Hunk 6.1 provides a data preprocessor called HiveSplitGenerator, which allows Hunk to access and process data stored by Hive.

The easiest way to configure Hunk to connect to Hive tables is to edit indexes.conf to:

  • Provide Hunk with the metastore URI.
  • Specify that Hunk use the HiveSplitGenerator to read the Hive data.

If you don't want Hunk to access your metastore server, you can manually configure Hunk to access the raw data files that make up your Hive tables. See "Configure Hunk to read your Hive tables without a metastore" in this topic.

Hunk currently supports the following versions of Hive:

  • 0.10
  • 0.11
  • 0.12
  • 0.13
  • 0.14
  • 1.2

Before you begin

To set up Hunk to read Hive tables, you must have already configured your indexes and providers. If you have not set them up yet, see the topics on configuring providers and virtual indexes in this manual.

Configure Hive connectivity with a metastore

To configure Hive connectivity, you provide Hunk with the metastore URI in the vix.hive.metastore.uris setting.

Hunk uses the metastore server to read table information, including column names, types, data location, and format, which lets it process the search request.

Here's an example of a provider stanza and virtual index stanza that properly enable Hive connectivity. Note that a table consists of one or more files, and that each virtual index can have multiple input paths, one for each table.

[provider:BigBox]
...
vix.splunk.search.splitter = HiveSplitGenerator 
vix.hive.metastore.uris = thrift://metastore.example.com:9083

[orders]
vix.provider = BigBox
vix.input.1.path = /user/hive/warehouse/user-orders/...
vix.input.1.accept = \.txt$
vix.input.1.splitter.hive.dbname = default 
vix.input.1.splitter.hive.tablename = UserOrders

vix.input.2.path = /user/hive/warehouse/reseller-orders/...
vix.input.2.accept = .*
vix.input.2.splitter.hive.dbname = default
vix.input.2.splitter.hive.tablename = ResellerOrders

In the rare case that the split logic of your table's Hadoop InputFormat implementation differs from that of Hadoop's FileInputFormat, the HiveSplitGenerator split logic does not work. Instead, you must implement a custom SplitGenerator and use it in place of the default SplitGenerator. See "Configure Hunk to use a custom file format" in this topic for more information.

Configure Hunk to use a custom file format

To use a custom file format, edit your provider stanza to point the following setting at a .jar file that contains your custom classes:

vix.splunk.jars

Note that if you don't specify an InputFormat class, files are treated as text files and broken into records at newline characters.
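For example, a provider stanza for a custom format might look like the following sketch. The jar path and class name here are hypothetical placeholders for your own custom classes, not settings or classes shipped with Hunk:

[provider:BigBox]
...
# Hypothetical custom split generator class, packaged in your own jar:
vix.splunk.search.splitter = com.example.MyCustomSplitGenerator
vix.splunk.jars = /opt/hunk/jars/my-custom-format.jar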

Configure Hunk to read your Hive tables without a metastore

If you are unable or unwilling to expose your metastore server, you can configure Hive connectivity by specifying additional configuration items directly. At a minimum, Hunk requires the following information:

  • dbname
  • tablename
  • columnnames
  • columntypes

Other information is required only if you specified it when you created the table (for example, if your table specifies a custom InputFormat rather than a Hive file format, you must tell Hunk about it).

Create a stanza in indexes.conf that provides Hunk with the list of column names and types of your Hive table(s). These column names become the field names you see when running reports in Hunk:

[your-provider]
vix.splunk.search.splitter = HiveSplitGenerator 

[your-vix]
vix.provider = your-provider
vix.input.1.path = /user/hive/warehouse/employees/...
vix.input.1.splitter.hive.columnnames = name,salary,subordinates,deductions,address
vix.input.1.splitter.hive.columntypes = string:float:array<string>:map<string,float>:struct<street:string,city:string,state:string,zip:int>
vix.input.1.splitter.hive.fileformat  = sequencefile
vix.input.2.path = /user/hive/warehouse/employees_rc/...
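Note that the columntypes string is colon-separated at the top level, but nested types such as struct<...> contain colons of their own, so a naive split on ":" would break it apart incorrectly. A minimal sketch of the top-level split (an illustration of the format only, not Hunk's parser):

```python
def split_hive_types(types_str):
    """Split a Hive columntypes string on top-level colons only,
    so colons nested inside <...> (e.g. struct fields) are preserved."""
    parts, current, depth = [], [], 0
    for ch in types_str:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == ":" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

# The columnnames/columntypes pair from the stanza above:
names = "name,salary,subordinates,deductions,address".split(",")
types = split_hive_types(
    "string:float:array<string>:map<string,float>:"
    "struct<street:string,city:string,state:string,zip:int>"
)
schema = dict(zip(names, types))
```

Pairing the names with the types this way gives one entry per column (for example, salary maps to float), which is how the column names become the field names you see in Hunk reports.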

Partitioning table data

When using the Hive Metastore, Hunk automatically analyzes the tables, preserving partition keys and values, and, based on your search criteria, pruning any unwanted partitions. This can help speed up searches.

When not using a metastore, you can update your [virtual-index] stanza to tell Hunk about the partitions by including the partition keys as part of the file path. For example, the following configuration

vix.input.1.path = /apps/hive/warehouse/sdc_orc2/${server}/${date_date}/...

would extract and recognize "server" and "date_date" partitions in the following path:

/apps/hive/warehouse/sdc_orc2/idxr01/20120101/000859_0

Hunk also recognizes Hive's key=value path convention without any extra configuration. Here is an example of a partitioned path in which Hunk automatically recognizes the same partitions:

/apps/hive/warehouse/sdc_orc2/server=idxr01/date_date=20120101/000859_0
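Both path conventions map to the same key/value pairs. The following sketch shows the extraction logic for a ${key} template and for Hive's key=value layout; it is an illustration only, not Hunk's implementation:

```python
import re

def partitions_from_template(template, path):
    """Match a path against a template whose ${key} segments become
    named regex groups, returning the extracted partition values."""
    prefix = template.rsplit("/...", 1)[0]  # drop the recursion marker
    pattern = re.escape(prefix)
    # re.escape turned ${key} into \$\{key\}; swap in a named group.
    pattern = re.sub(r"\\\$\\\{(\w+)\\\}", r"(?P<\1>[^/]+)", pattern)
    m = re.match(pattern + "/", path)
    return m.groupdict() if m else {}

def partitions_from_hive_path(path):
    """Extract key=value segments from a Hive-style partitioned path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
```

Applied to the two example paths above, both functions yield the same result: {'server': 'idxr01', 'date_date': '20120101'}.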

This documentation applies to the following versions of Hunk®(Legacy): 6.1, 6.1.1, 6.1.2, 6.1.3, 6.2, 6.2.1, 6.2.2, 6.2.3, 6.2.4, 6.2.5, 6.2.6, 6.2.7, 6.2.8, 6.2.9, 6.2.10, 6.2.11, 6.2.12, 6.2.13, 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11


Comments

Hey there,

Thanks for your question. Versions beyond 0.12 are not mentioned because we have not yet completed testing for them. That does not necessarily mean they don't work, just that we have not finished testing against them. Once testing is completed I'll update the docs accordingly and let you know personally as well.

Jworthington splunk, Splunker
July 10, 2015

"Hunk current supports the following versions of Hive:
0.10
0.11
0.12"

Is this obsolete or does it mean the latest Hunk version still doesn't support Hive 0.14 ?

IGdealing
July 10, 2015
