Working with Hive and Parquet data
Hunk's Data Preprocessors
When Hunk initializes a search for non-HDFS input data, it uses the information contained in Hunk's
FileSplitGenerator class to determine how to split data for parallel processing.
FileSplitGenerator contains the same data split logic defined in Hadoop's FileInputFormat. This means that it works for any data format that can be read by Hadoop's InputFormat implementation (which has the same split logic as FileInputFormat).
FileSplitGenerator does not work for Hive or Parquet files, so Hunk now also ships with HiveSplitGenerator and ParquetSplitGenerator for Hive and Parquet.
Any custom Hive files with file-based split logic (such as files created with Hadoop FileOutputFormat and its subclasses) work with HiveSplitGenerator. If you have custom Hive file formats that do not use file-based data split logic, you can implement a custom SplitGenerator that uses your split logic.
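To make "file-based split logic" concrete, the following is a minimal Python sketch of how a file can be cut into byte ranges in the style of Hadoop's FileInputFormat: fixed-size splits, with the last split allowed to run up to 10% over. It is an illustration only and is not Hunk's implementation. The function names (compute_split_size, file_splits) and the sample sizes are assumptions made for this example.

```python
from typing import List, Tuple

# Hadoop's FileInputFormat lets the final split exceed the split size by up to 10%.
SPLIT_SLOP = 1.1


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Clamp the block size between the configured minimum and maximum split size."""
    return max(min_size, min(max_size, block_size))


def file_splits(file_length: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) byte ranges that cover the whole file.

    Hypothetical helper for illustration; in this sketch a zero-length
    file yields no splits.
    """
    splits = []
    remaining = file_length
    # Emit full-size splits while more than 1.1 split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    # Whatever is left (at most 1.1 split sizes) becomes the last split.
    if remaining > 0:
        splits.append((file_length - remaining, remaining))
    return splits


if __name__ == "__main__":
    mb = 1024 * 1024
    size = compute_split_size(block_size=128 * mb, min_size=1, max_size=2**63 - 1)
    for offset, length in file_splits(300 * mb, size):
        print(offset // mb, length // mb)
```

Run as a script, the example splits a 300 MB file with a 128 MB split size into ranges starting at 0, 128, and 256 MB, with lengths of 128, 128, and 44 MB. Each range can then be processed in parallel. A format whose records cannot be located from byte offsets alone is the case where a custom SplitGenerator with its own split logic is needed.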
Parquet files created by all tools (including Hive) work with (and only with) ParquetSplitGenerator.
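The rules in this topic for matching a data format to a split generator can be summarized as a small decision function. This is a sketch in Python for readability only. The function name, its parameters, and the returned strings are assumptions for illustration and are not part of any Hunk API.

```python
def choose_split_generator(data_format: str, file_based_splits: bool = True) -> str:
    """Return the name of the split generator suggested by the rules in this topic.

    Hypothetical helper: data_format is a loose label such as "parquet",
    "hive", or anything else readable through FileInputFormat-style splits.
    """
    fmt = data_format.strip().lower()
    # Parquet files work only with ParquetSplitGenerator, even when Hive wrote them.
    if fmt == "parquet":
        return "ParquetSplitGenerator"
    if fmt == "hive":
        # Hive formats without file-based split logic need a custom implementation.
        return "HiveSplitGenerator" if file_based_splits else "custom SplitGenerator"
    # All other formats fall back to the default file-based generator.
    return "FileSplitGenerator"
```

For example, choose_split_generator("hive", file_based_splits=False) returns "custom SplitGenerator", which corresponds to the case where you implement your own split logic.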
This documentation applies to the following versions of Hunk®(Legacy): 6.1, 6.1.1, 6.1.2, 6.1.3, 6.2, 6.2.1, 6.2.2, 6.2.3, 6.2.4, 6.2.5, 6.2.6, 6.2.7, 6.2.8, 6.2.9, 6.2.10, 6.2.11, 6.2.12, 6.2.13, 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11