Extract fields from file headers at index time
This documentation does not apply to the most recent version of Splunk. Click here for the latest version.
Certain data sources and source types, such as CSV and MS Exchange log files, can have headers that contain field information. You can configure Splunk to automatically extract these fields during index-time event processing.
For example, a legacy CSV file--which is essentially a static table--could have a header row like
name, location, message, "start date"
which behaves like a series of column headers for the values listed afterwards in the file.
Note: Automatic header-based field extraction doesn't impact index size or indexing performance because it occurs during source typing (before index time).
How automatic header-based field extraction works
When you enable automatic header-based field extraction for a specific source or source type, Splunk scans it for header field information, which it then uses for field extraction. If a source has the necessary header information, Splunk extracts fields using delimiter-based key/value extraction.
Splunk does this at index time by changing the source type of the incoming data to [original_sourcetype]-N, where N is a number. Next, it creates a stanza for this new source type in props.conf, defines a delimiter-based extraction rule for the static table header in transforms.conf, and then ties that extraction rule to the new source type in its new props.conf stanza. Finally, at search time, Splunk applies the field transform to events from the source (the static table file).
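As a rough illustration of delimiter-based key/value extraction (a sketch only, not Splunk's actual implementation), the following Python pairs the names in a header row with the values in each later row. The function name extract_fields and the sample data row are hypothetical; only the header row comes from the example earlier in this topic.

```python
import csv
import io


def extract_fields(text, delimiter=","):
    """Pair each data row's values with the field names from the header row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, skipinitialspace=True)
    header = next(reader)  # the first row supplies the field names
    return [dict(zip(header, row)) for row in reader]


# Header row from the example above; the data row is made up for illustration.
sample = 'name, location, message, "start date"\nalice, Boston, disk full, 2010-05-01\n'

for event in extract_fields(sample):
    print(event)
```

Each printed dictionary stands in for the set of fields attached to one event.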
You can use fields extracted by Splunk for filtering and reporting just like any other field by selecting them from the fields sidebar in the Search view (select Pick fields to see a complete list of available fields).
Note: Splunk records the header line of a static table in a CSV or similar file as an event. To get a count of the events in the file without including the header event, run a search that identifies the file as the source and explicitly excludes the comma-delimited list of header names that appears in the header event. Here's an example:
source=/my/file.csv NOT "header_field1,header_field2,header_field3,..." | stats count
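The logic of that search can be approximated outside Splunk. The Python sketch below counts events while skipping any event that contains the joined header names. It uses a plain substring test, which is an assumption and not necessarily how Splunk matches quoted search terms; the function name and the sample events are hypothetical.

```python
def count_events_without_header(events, header_fields, delimiter=","):
    """Count events, skipping any that contain the joined header names."""
    header_line = delimiter.join(header_fields)
    return sum(1 for event in events if header_line not in event)


# Hypothetical events as they might be indexed from a small CSV file.
events = [
    "header_field1,header_field2,header_field3",
    "a,b,c",
    "d,e,f",
]
print(count_events_without_header(
    events, ["header_field1", "header_field2", "header_field3"]))
```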
Enable automatic header-based field extraction
Enable automatic header-based field extraction for any source or source type by editing props.conf. Edit this file in $SPLUNK_HOME/etc/system/local/, or your own custom application directory in $SPLUNK_HOME/etc/apps/<app_name>/local.
Note: If you are using Splunk in a distributed environment, be sure to place the props.conf and transforms.conf files that you update for header-based field extraction on your search head, not the indexer.
For more information on configuration files in general, see "About configuration files" in the Admin manual.
To turn on automatic header-based field extraction for a source or source type, add CHECK_FOR_HEADER=TRUE under that source or source type's stanza in props.conf.
Example props.conf entry for an MS Exchange source:
[MSExchange]
CHECK_FOR_HEADER=TRUE
...
OR
[source::C:\\Program Files\\Exchsrvr\\ServerName.log]
sourcetype=MSExchange

[MSExchange]
CHECK_FOR_HEADER=TRUE
Set CHECK_FOR_HEADER=FALSE to turn off automatic header-based field extraction for a source or source type.
Important: Changes you make to props.conf (such as enabling automatic header-based field extraction) won't take effect until you restart Splunk.
Note: CHECK_FOR_HEADER must be in a source or source type stanza.
Changes Splunk makes to configuration files
If you enable automatic header-based field extraction for a source or source type, Splunk adds stanzas to copies of transforms.conf and props.conf in $SPLUNK_HOME/etc/apps/learned/local/ when it extracts fields for that source or source type.
Important: Don't edit these stanzas after Splunk adds them, or the related extracted fields won't work.
transforms.conf
Splunk creates a stanza in transforms.conf for each source type with unique header information matching a source type defined in props.conf. Splunk names each stanza it creates [AutoHeader-N], where N is an integer that increments sequentially for each source that has a unique header ([AutoHeader-1], [AutoHeader-2],...,[AutoHeader-N]). Splunk populates each stanza with transforms that extract the fields (using header information).
Here's the transforms.conf entry that Splunk would add for the MS Exchange source, which was enabled for automatic header-based field extraction in the preceding example:
...
[AutoHeader-1]
DELIMS=" "
FIELDS="time", "client-ip", "cs-method", "sc-status"
...
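To make the shape of a generated stanza concrete, here is a small Python sketch that renders an [AutoHeader-N] stanza from a list of field names and a delimiter. The function render_autoheader_stanza is hypothetical and only mirrors the layout of the entry shown above; it is not how Splunk writes the file.

```python
def render_autoheader_stanza(n, fields, delimiter):
    """Build the text of an [AutoHeader-N] stanza in the layout shown above."""
    quoted = ", ".join(f'"{name}"' for name in fields)
    return f'[AutoHeader-{n}]\nDELIMS="{delimiter}"\nFIELDS={quoted}'


print(render_autoheader_stanza(
    1, ["time", "client-ip", "cs-method", "sc-status"], " "))
```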
props.conf
Splunk then adds new source type stanzas to props.conf for each source with a unique name, set of fields, and delimiter. Splunk names the stanzas as [yoursource-N], where yoursource is the source type configured with automatic header-based field extraction, and N is an integer that increments sequentially for each transform in transforms.conf.
For example, say you're indexing a number of CSV files. If each of those files has the same set of header fields and the same delimiter in transforms.conf, Splunk maps the events indexed from those files to a source type of csv-1 in props.conf. But if that batch of CSV files also includes a couple of files with unique sets of fields and delimiters, Splunk gives the events it indexes from those files source types of csv-2 and csv-3, respectively. Events from files with the same source, field set, and delimiter in transforms.conf have the same source type value.
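The numbering rule in this example can be sketched in Python. The sketch assumes that the suffix is chosen purely by the order in which unique combinations of field set and delimiter are first seen; the function assign_sourcetypes and the sample headers are hypothetical.

```python
def assign_sourcetypes(base, headers):
    """Map each (fields, delimiter) pair to a numbered source type name.

    Files that share a field set and delimiter share a source type;
    each new combination gets the next number in sequence.
    """
    seen = {}
    assigned = []
    for fields, delimiter in headers:
        key = (tuple(fields), delimiter)
        if key not in seen:
            seen[key] = f"{base}-{len(seen) + 1}"
        assigned.append(seen[key])
    return assigned


# Two files with identical headers, then two files with unique headers.
headers = [
    (["foo", "bar"], ","),
    (["foo", "bar"], ","),
    (["foo", "bar", "anotherfoo"], ","),
    (["foo", "bar"], ";"),
]
print(assign_sourcetypes("csv", headers))
```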
Note: If you want to enable automatic header-based field extraction for a particular source, and you have already manually specified a source type value for that source (either by defining the source type in Splunk Web or by directly adding the source type to a stanza in inputs.conf) be aware that setting CHECK_FOR_HEADER=TRUE for that source allows Splunk to override the source type value you've set for it with the source types generated by the automatic header-based field extraction process. This means that even though you may have set things up in inputs.conf so that all csv files get a source type of csv, once you set CHECK_FOR_HEADER=TRUE, Splunk overrides that source type setting with the incremental source type names described above.
Here's the source type that Splunk would add to props.conf to tie the transform to the MS Exchange source mentioned earlier:
[MSExchange-1]
REPORT-AutoHeader = AutoHeader-1
...
Note about search and header-based field extraction
Use a wildcard to search for events associated with source types that Splunk generated during header-based field extraction.
For example, to find events from every source type that Splunk generated from yoursource, search with a trailing wildcard:
sourcetype=yoursource*
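To see why the wildcard is needed, the Python sketch below approximates the match with shell-style globbing from the standard fnmatch module. This is only an analogy for Splunk's wildcard behavior, and the list of source type names is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical source type names present in an index.
sourcetypes = ["yoursource", "yoursource-1", "yoursource-2", "othersource-1"]

# An exact match misses the generated names; the wildcard catches them.
exact = [s for s in sourcetypes if s == "yoursource"]
wildcard = [s for s in sourcetypes if fnmatchcase(s, "yoursource*")]

print(exact)
print(wildcard)
```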
Examples of header-based field extraction
These examples show how header-based field extraction works with common source types.
MS Exchange source file
This example shows how Splunk extracts fields from an MS Exchange file using automatic header-based field extraction.
This sample MS Exchange log file has a header containing a list of field names, delimited by spaces:
# Message Tracking Log File
# Exchange System Attendant Version 6.5.7638.1
# Fields: time client-ip cs-method sc-status
14:13:11 10.1.1.9 HELO 250
14:13:13 10.1.1.9 MAIL 250
14:13:19 10.1.1.9 RCPT 250
14:13:29 10.1.1.9 DATA 250
14:13:31 10.1.1.9 QUIT 240
Splunk creates a header and transform in transforms.conf:
[AutoHeader-1]
FIELDS="time", "client-ip", "cs-method", "sc-status"
DELIMS=" "
Note that Splunk automatically detects that the delimiter is whitespace.
Splunk then ties the transform to the source by adding this to the source type stanza in props.conf:
# Original source type stanza you create
[MSExchange]
CHECK_FOR_HEADER=TRUE
...

# source type stanza that Splunk creates
[MSExchange-1]
REPORT-AutoHeader = AutoHeader-1
...
Splunk automatically extracts the following fields from each event:
14:13:11 10.1.1.9 HELO 250
-
time="14:13:11" client-ip="10.1.1.9" cs-method="HELO" sc-status="250"
14:13:13 10.1.1.9 MAIL 250
-
time="14:13:13" client-ip="10.1.1.9" cs-method="MAIL" sc-status="250"
14:13:19 10.1.1.9 RCPT 250
-
time="14:13:19" client-ip="10.1.1.9" cs-method="RCPT" sc-status="250"
14:13:29 10.1.1.9 DATA 250
-
time="14:13:29" client-ip="10.1.1.9" cs-method="DATA" sc-status="250"
14:13:31 10.1.1.9 QUIT 240
-
time="14:13:31" client-ip="10.1.1.9" cs-method="QUIT" sc-status="240"
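The mapping above can be reproduced with a short Python sketch. It assumes the field names always appear on a comment line that starts with "# Fields:" and that values are separated by whitespace, as in the sample log; the function parse_exchange_log is hypothetical and is not part of Splunk.

```python
def parse_exchange_log(lines):
    """Read field names from the '# Fields:' line, then map each event to them."""
    prefix = "# Fields:"
    fields = None
    events = []
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            fields = line[len(prefix):].split()
        elif line and not line.startswith("#") and fields:
            # Whitespace-delimited values line up with the header names.
            events.append(dict(zip(fields, line.split())))
    return events


log = """# Message Tracking Log File
# Exchange System Attendant Version 6.5.7638.1
# Fields: time client-ip cs-method sc-status
14:13:11 10.1.1.9 HELO 250
14:13:13 10.1.1.9 MAIL 250
14:13:19 10.1.1.9 RCPT 250
14:13:29 10.1.1.9 DATA 250
14:13:31 10.1.1.9 QUIT 240
""".splitlines()

for event in parse_exchange_log(log):
    print(event)
```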
CSV file
This example shows how Splunk extracts fields from a CSV file using automatic header-based field extraction.
Example CSV file contents:
foo,bar,anotherfoo,anotherbar
100,21,this is a long file,nomore
200,22,wow,o rly?
300,12,ya rly!,no wai!
Splunk creates a header and transform in transforms.conf (located in $SPLUNK_HOME/etc/apps/learned/local/):
# Some previous automatic header-based field extraction
[AutoHeader-1]
...

# transform stanza that Splunk creates
[AutoHeader-2]
FIELDS="foo", "bar", "anotherfoo", "anotherbar"
DELIMS=","
Note that Splunk automatically detects that the delimiter is a comma.
Splunk then ties the transform to the source by adding this to a new source type stanza in props.conf:
...
[CSV-1]
REPORT-AutoHeader = AutoHeader-2
...
Splunk extracts the following fields from each event:
100,21,this is a long file,nomore
-
foo="100" bar="21" anotherfoo="this is a long file" anotherbar="nomore"
200,22,wow,o rly?
-
foo="200" bar="22" anotherfoo="wow" anotherbar="o rly?"
300,12,ya rly!,no wai!
-
foo="300" bar="12" anotherfoo="ya rly!" anotherbar="no wai!"
Answers
Have questions? Visit Splunk Answers and see what questions and answers the Splunk community has around extracting fields.
This documentation applies to the following versions of Splunk: 4.1 , 4.1.1 , 4.1.2 , 4.1.3 , 4.1.4 , 4.1.5 , 4.1.6 , 4.1.7 , 4.1.8 View the Article History for its revisions.