Set up virtual indexes for archived Hadoop files
For any provider, you can configure Hunk to read archived files.
Use the following procedure to set up a virtual index for archived files. See "Add or edit a virtual index" in this manual for information about adding a virtual index via the Hunk user interface.
1. Define one or more virtual indexes for each provider.
[hadoop]
vix.provider = <provider_name>
vix.input.1.path = har:///<path_to_archive_file>/<archive_file>.har/...
vix.input.1.accept = \.gz$
vix.input.1.et.regex = /home/myindex/data/(\d+)/(\d+)/
vix.input.1.et.format = yyyyMMddHH
vix.input.1.et.offset = 0
vix.input.1.lt.regex = /home/myindex/data/(\d+)/(\d+)/
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.offset = 3600
- For vix.input.1.path: Prefix the path with har: and then provide the fully qualified path to the archive files, along with any fields you want to extract from the path.
For example:
har:////some/path/<archive_file>.har/${date_date}/${date_hour}/${host}/${sourcetype}/${app}/...
Items enclosed in ${} are extracted as fields and added to each search result from that path. The search ignores directories that do not match the search string, which significantly improves performance.
- For vix.input.1.accept: Provide a regular expression whitelist of files to match.
- For vix.input.1.ignore: Provide a regular expression blacklist of files to ignore. Note that ignore takes precedence over accept.
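The path, accept, and ignore settings above can be illustrated with a minimal Python sketch. This is not Hunk's implementation: the helper names (template_to_regex, extract_fields, is_selected), the archive name logs.har, and the sample paths are hypothetical. It also assumes that each ${field} matches exactly one directory level, that the regular expressions are applied as partial matches, and that every file is accepted when no accept pattern is set.

```python
import re


def template_to_regex(template):
    """Convert a path template containing ${field} placeholders into a regex.

    Assumption: each placeholder matches exactly one directory level.
    """
    # re.split with one capturing group alternates literal text and field names.
    parts = re.split(r"\$\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)           # literal path text
        else:
            pattern += f"(?P<{part}>[^/]+)"      # named group for the field
    return re.compile(pattern)


def extract_fields(template, path):
    """Return a dict of fields extracted from the path, or None if no match."""
    match = template_to_regex(template).match(path)
    return match.groupdict() if match else None


def is_selected(path, accept=None, ignore=None):
    """Apply the accept/ignore filters; ignore takes precedence over accept."""
    if ignore and re.search(ignore, path):
        return False
    if accept:
        return re.search(accept, path) is not None
    return True  # assumption: no accept pattern means every file is accepted


# Hypothetical template and file path, for illustration only.
TEMPLATE = "/some/path/logs.har/${date_date}/${date_hour}/${host}/"
PATH = "/some/path/logs.har/20140101/13/web01/access.log.gz"

fields = extract_fields(TEMPLATE, PATH)
print(fields)
print(is_selected(PATH, accept=r"\.gz$"))
print(is_selected(PATH, accept=r"\.gz$", ignore=r"/web01/"))
```

In this sketch the last call returns False even though the file name ends in .gz, because the ignore pattern is checked first.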
2. Use the regex, format, and offset values to extract a time range for the data contained in a particular path. The time range is made up of two parts: earliest time (vix.input.1.et) and latest time (vix.input.1.lt). The following configurations can be used:
- For vix.input.1.et/lt.regex: Provide a regular expression that matches the portion of the directory path that contains the date and time, so that the time can be interpreted from the path.
Use capturing groups to extract the parts that make up the timestamp. The values of the capturing groups are concatenated and interpreted according to the specified format. Extracting a time range from the path significantly speeds up searches over particular time windows, because directories that fall outside the search's time range are ignored.
- For vix.input.1.et/lt.format: Provide a date/time format string that specifies how to interpret the data extracted by the regex above. The format string syntax is described in the Java SimpleDateFormat documentation.
The following two non-standard formats are also supported: epoch, to interpret the data as an epoch time, and mtime, to use the modification time of the file rather than the data extracted by the regex.
- For vix.input.1.et/lt.offset: Optionally provide an offset (in seconds) to account for time zone differences and/or safety boundaries.
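To show how the regex, format, and offset values combine, here is a minimal Python sketch of the time-range extraction described above. It is an illustration, not Hunk's code: the function names, the sample path, and the use of UTC are assumptions, the offset is taken to be in seconds, and the SimpleDateFormat pattern yyyyMMddHH is hand-translated to the Python strptime pattern %Y%m%d%H.

```python
import re
from datetime import datetime, timedelta, timezone

# Values taken from the example stanza in step 1.
REGEX = r"/home/myindex/data/(\d+)/(\d+)/"
# Hand-translated equivalent of the SimpleDateFormat pattern "yyyyMMddHH".
PY_FORMAT = "%Y%m%d%H"


def extract_time(path, regex, fmt, offset_seconds=0):
    """Concatenate the regex capturing groups, parse them, and add the offset."""
    match = re.search(regex, path)
    if match is None:
        return None
    stamp = "".join(match.groups())  # e.g. "20140101" + "13" -> "2014010113"
    # Assumption: timestamps in the path are treated as UTC.
    parsed = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return parsed + timedelta(seconds=offset_seconds)


def path_in_search_range(path, search_start, search_end):
    """Return True if the path's [et, lt] range overlaps the search window."""
    et = extract_time(path, REGEX, PY_FORMAT, offset_seconds=0)     # et.offset = 0
    lt = extract_time(path, REGEX, PY_FORMAT, offset_seconds=3600)  # lt.offset = 3600
    if et is None or lt is None:
        return True  # assumption: paths without a parsable time are not skipped
    return et <= search_end and lt >= search_start


# Hypothetical file path, for illustration only.
PATH = "/home/myindex/data/20140101/13/events.gz"
et = extract_time(PATH, REGEX, PY_FORMAT, 0)
lt = extract_time(PATH, REGEX, PY_FORMAT, 3600)
print(et.isoformat(), lt.isoformat())
```

With the example settings, a directory named 20140101/13 covers the hour starting at 13:00, so a search limited to 15:00 through 16:00 on the same day could skip it without reading any files.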
This documentation applies to the following versions of Hunk®(Legacy): 6.2, 6.2.1, 6.2.2, 6.2.3, 6.2.4, 6.2.5, 6.2.6, 6.2.7, 6.2.8, 6.2.9, 6.2.10, 6.2.11, 6.2.12, 6.2.13, 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.3.13, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11