Extract fields from files with structured data
Many structured data files, such as comma-separated value (CSV) files and Internet Information Server (IIS) web server logs, have information in the file header that can be extracted as fields during indexing. You can configure Splunk Enterprise and the Splunk universal forwarder to automatically extract these values into fields that can be searched. For example, a CSV file starts with a header row that contains column headers for the values in subsequent rows:
host,status,message,"start date" srv1.splunk.com,error,"No space left on device",2013-06-10T06:35:00 srv2.splunk.com,ok,-,2013-06-11T06:00:00
Input types that the indexed field extraction feature supports
This feature works with the following input types:
- File-based inputs only (such as monitoring files, directories, or archives.)
- Inputs that use the
oneshotinput type (or through the "Upload" feature in Splunk Web.)
It does not work with modular inputs, network inputs, or any other type of input.
More information on source types and time stamps
- For information on how to set source types when importing structured data files, see The "Set source type" page.
- For information on how to adjust timestamps when previewing indexing results, see Adjust time stamps and event breaks.
- For more general information about configuration files, see About configuration files in the Admin manual.
Use Splunk Web to extract fields from structured data files
When you upload or monitor a structured data file, Splunk Web loads the "Set Source type" page. This page lets you preview how your data will be indexed. See The 'Set Source type' page.
- From the Add Data page in Splunk Web, choose Upload or Monitor as the method that you want to add data.
- Specify the structured data file that you want the software to monitor. Splunk Web loads the "Set Source type" page. It sets the source type of the data based on its interpretation of that data. For example, if you upload a CSV file, it sets the source type to
- Review the events in the preview pane on the right side of the page. The events are formatted based on the current source type.
- If the events appear to be formatted correctly, click "Next" to proceed to the "Modify input settings" page. Otherwise, configure event formatting by modifying the timestamp, event breaking, and delimited settings until the previewed events look the way that you want.
- If you don't want to save the settings as a new source type, return to Step 4. Otherwise, click the Save As button to save the settings as a new source type.
- In the dialog that appears, type in a name and description for the new source type.
- Select the category for the source type by selecting the category you want from the "Category" drop-down.
- Select the application context that the new source type should apply to by choosing from the entries in the "App" drop-down.
- Click "Save" to save the source type.
- Return to Step 4 to proceed to the "Modify input settings" page.
Structured data files with large numbers of columns might not display all extracted fields in Splunk Search
If you index a structured data file with a large number of columns (for example, a CSV file with 300 columns), you might experience a problem later where the Search app does not appear to return or display all of the fields for that file. While Splunk software has indexed all of the fields correctly, this anomaly occurs because of a configuration setting for how Splunk software extracts the fields at search time.
Before Splunk software displays fields in Splunk Web, it must first extract those fields by performing a search time field extraction. By default, the limit for the number of fields that can be extracted automatically at search time is 100. You can set this number higher by editing the
limits.conf file in
$SPLUNK_HOME/etc/system/local and changing the
limit setting to a number that is higher than the number of columns in the structured data file.
[kv] limit = 300
If you work with a lot of large CSV files, you might want to configure the setting to a number that reflects the largest number of columns you expect your structured data files to have.
Use configuration files to enable automatic header-based field extraction
You can also use a combination of inputs.conf and props.conf to extract fields from structured data files. Edit these files in
$SPLUNK_HOME/etc/system/local/ or in your own custom application directory in
$SPLUNK_HOME/etc/apps/<app_name>/local. Inputs.conf specifies the files you want to monitor and the source type to be applied to the events they contain, and props.conf defines the source types themselves. If you have Splunk Enterprise, you can edit the settings on indexer machines or machines where you are running the Splunk universal forwarder. You must restart Splunk Enterprise for any changes that you make to
props.conf to take effect. If you have Splunk Cloud and want configure the extraction of fields from structured data, use the Splunk universal forwarder.
Props.conf attributes for structured data
To configure field extraction for files that contain headers, modify the following attributes in props.conf. For additional attributes in props.conf, review the props.conf specification file.
||Specifies the type of file and the extraction and/or parsing method to be used on the file.
Note: If you set
|n/a (not set)|
||Some files contain preamble lines. This attribute contains a regular expression that Splunk software uses to ignore any matching lines.||n/a|
||A regular expression that specifies a pattern for prefixed header line. Splunk software parses the first matching line into header fields. Note that the actual header starts after the matching pattern, which is not included in the parsed header fields. You can specify special characters in this attribute.||n/a|
||Specifies which character delimits or separates fields in the monitored file or source. You can specify special characters in this attribute.||n/a|
||Specifies the character to use for quotes in the specified file or source. You can specify special characters in this attribute.||n/a|
||Specifies which special characters can appear in header fields. When not set, the Splunk software replaces all characters that are neither alphanumeric or a space with underscores. If this setting is configured, the Splunk software does not perform a special character replacement in header field names when the special character matches one that you specify. For example, if you configure this setting to
||Specifies which character delimits or separates field names in the header line. You can specify special characters in this attribute. If HEADER_FIELD_DELIMITER is not specified, FIELD_DELIMITER applies to the header line.||n/a|
||Specifies which character is used for quotes around field names in the header line. You can specify special characters in this attribute. If HEADER_FIELD_QUOTE is not specified, FIELD_QUOTE applies to the header line.||n/a|
||Specifies the line number of the line within the file that contains the header fields. If set to 0, Splunk attempts to locate the header fields within the file automatically.||0|
||Some CSV and structured files have their timestamp encompass multiple fields in the event separated by delimiters. This attribute tells Splunk software to specify all such fields which constitute the timestamp in a comma-separated fashion.||Splunk Enterprise tries to automatically extract the timestamp of the event.|
||Some CSV and structured files might have missing headers. This attribute specifies the header field names.||n/a|
||If Splunk software finds data that matches the specified regular expression in the structured data file, it considers the value for the field in the row to be empty.||n/a|
Special characters or values are available for some attributes
You can use special characters or values such as spaces, vertical and horizontal tabs, and form feeds in some attributes. The following table lists these characters:
|Special value||Props.conf representation|
You can use these special characters for the following attributes only:
Edit configuration files to create and reference source types
To create and reference the new source types to extract files with headers:
- Using a text editor, open the file
props.confin the appropriate location as described in Enable automatic header-based field extraction earlier in this topic. If the
props.conffile does not exist, you must create it.
- Define a new sourcetype by creating a stanza which tells Splunk Enterprise how to extract the file header and structured file data, using the attributes described above. You can define as many stanzas - and thus, as many sourcetypes - as you like in the file. For example:
[HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim] FIELD_DELIMITER=, HEADER_FIELD_DELIMITER=\s FIELD_QUOTE="
- Save the props.conf file and close it.
- Create a file
inputs.confin the same directory, if it does not already exist.
- Open the file for editing.
- Add a stanza which represents the file or files that you want Splunk Enterprise to extract file header and structured data from. You can add as many stanzas as you wish for files or directories from which you want to extract header and structured data. For example:
- Save the inputs.conf file and close it.
- Restart Splunk Enterprise or the universal forwarder for the changes to take effect.
Forward data extracted from structured data files
You can also forward fields extracted from a structured data file to a heavy forwarder or a universal forwarder.
To forward fields extracted from structured data files:
- Configure the Splunk instance that monitors the files to forward data to a heavy forwarder or a universal forwarder.\.
- Configure the receiving instance.
- On the monitoring instance, configure
inputs.confto properly handle event breaking and timestamps for your data. You can do this in one of two ways.
- To use Splunk Web, follow the instructions in Use Splunk Web to extract fields from structured data files earlier in this topic.
- To use configuration files, follow the instructions in Edit configuration files to create and reference sourcetypes earlier in this topic.
- Optionally, if you need to transform this data in any way prior to indexing it, edit
- Restart the receiving instance.
- Restart the monitoring instance.
- On the receiving instance, use the Search app to confirm that the fields have been extracted from the structured data files and properly indexed.
Caveats to extracting fields from structured data files
Splunk software does not parse structured data that has been forwarded to an indexer
When you forward structured data to an indexer, it is not parsed when it arrives at the indexer, even if you have configured
props.conf on that indexer with
INDEXED_EXTRACTIONS. Forwarded data skips the following pipelines on the indexer, which precludes any parsing of that data on the indexer:
The forwarded data must arrive at the indexer already parsed.
Field extraction settings for forwarded structured data must be configured on the forwarder
If you want to forward fields that you extract from structured data files to another Splunk instance, you must configure the
props.conf settings that define the field extractions on the forwarder that sends the data. This includes configuration of
INDEXED_EXTRACTIONS and any other parsing, filtering, anonymizing, and routing rules. Performing these actions on the instance that indexes the data will have no effect, as the forwarded data must arrive at the indexer already parsed.
When you use Splunk Web to modify event break and time stamp settings, it records all of the proposed changes as a stanza for
props.conf. You can find those settings in the "Advanced" tab on the "Set Source type" page.
Use the "Copy to clipboard" link in the "Advanced" tab to copy the proposed changes to
props.conf to the system clipboard. You can then paste this stanza into
props.conf in a text editor on Splunk instances that monitor and forward similar files.
Only header fields containing data are indexed
When Splunk software extracts header fields from structured data files, it only extracts those fields where data is present in at least one row. If the header field has no data in any row, it is skipped (that is, not indexed). Take, for example, the following csv file:
header1,header2,header3,header4,header5 one,1,won,,111 two,2,too,,222 three,3,thri,,333 four,4,fore,,444 five,5,faiv,,555
When Splunk software reads this file, it notes that the rows in the
header4 column are all empty, and does not index that header field or any of the rows in it. This means that neither
header4 nor any of the data in its row can be searched for in the index.
If, however, the
header4 field contains rows with empty strings (for example, ""), the field and all the rows underneath it are indexed.
Take care when allowing special characters for header fields
HEADER_FIELD_ACCEPTABLE_SPECIAL_CHARACTERS setting is designed to manage situations where column headers have characters like
:. If you do not use this setting, the Splunk software replaces those characters with underscores during the ingestion process.
No support for mid-file renaming of header fields
Some software, such as Internet Information Server, supports the renaming of header fields in the middle of the file. Splunk software does not recognize changes such as this. If you attempt to index a file that has header fields renamed within the file, the renamed header field is not indexed.
Example configuration and data files
Following are an example inputs.conf and props.conf to give you an idea of how to use the file header extraction attributes.
To extract the data locally, edit inputs.conf and props.conf to define inputs and sourcetypes for the structured data files, and use the attributes described above to specify how to deal with the files. To forward this data to another Splunk instance, edit inputs.conf and props.conf on the forwarding instance, and props.conf on the receiving instance.
[monitor:///opt/test/data/StructuredData/CSVWithFewHeaderFieldsWithoutAnyValues.csv] sourcetype=CSVWithFewHeaderFieldsWithoutAnyValues [monitor:///opt/test/data/StructuredData/VeryLargeCSVFile.csv] sourcetype=VeryLargeCSVFile [monitor:///opt/test/data/StructuredData/UselessLongHeaderToBeIgnored.log] sourcetype=UselessLongHeaderToBeIgnored [monitor:///opt/test/data/StructuredData/HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim.csv] sourcetype=HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim [monitor:///opt/test/data/FieldHeaderRegex.log] sourcetype=ExtractCorrectHeaders
[CSVWithFewHeaderFieldsWithoutAnyValues] FIELD_DELIMITER=, [VeryLargeCSVFile] FIELD_DELIMITER=, [UselessLongHeaderToBeIgnored] HEADER_FIELD_LINE_NUMBER=35 TIMESTAMP_FIELDS=Date,Time,TimeZone FIELD_DELIMITER=\s FIELD_QUOTE=" [HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim] FIELD_DELIMITER=, HEADER_FIELD_DELIMITER=\s FIELD_QUOTE=" [ExtractCorrectHeaders] FIELD_HEADER_REGEX=Ignore_This_Stuff:\s(.*) FIELD_DELIMITER=,
The following are snippets of the files referenced in the above inputs.conf and props.conf examples, to give you an idea of what the files look like.
You might need to scroll right quite a bit to see all of the content.
IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Location,X,Y 030203898,FRAUD,"FORGERY, CREDIT CARD",Tuesday,02/18/2003,16:30,NORTHERN,NONE,2800 Block of VAN NESS AV,-122.424612993055,37.8014488257836 000038261,WARRANTS,WARRANT ARREST,Thursday,04/17/2003,22:45,NORTHERN,"ARREST, BOOKED",POLK ST / SUTTER ST,-122.420120319211,37.7877570602182 030203901,LARCENY/THEFT,GRAND THEFT PICKPOCKET,Tuesday,02/18/2003,16:05,NORTHERN,NONE,VAN NESS AV / MCALLISTER ST,-122.42025048261,37.7800745746105 030203923,DRUG/NARCOTIC,SALE OF BASE/ROCK COCAINE,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619 030203923,OTHER OFFENSES,CONSPIRACY,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619 030203923,OTHER OFFENSES,PROBATION VIOLATION,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619
************ Start Display Current Environment ************ WebSphere Platform 6.1 [ND 184.108.40.206 cf210844.13] running with process name sammys_cell_A\fsgwws189Node_A\sammys_A_c01_s189_m06 and process id 17904 Detailed IFix information: ID: 6.1.0-WS-WASSDK-AixPPC32-FP0000021 BuildVrsn: null Desc: Software Developer Kit 220.127.116.11 ID: 6.1.0-WS-WAS-AixPPC32-FP0000021 BuildVrsn: null Desc: WebSphere Application Server 18.104.22.168 ID: 6.1.0-WS-WASSDK-AixPPC32-FP0000019 BuildVrsn: null Desc: Software Developer Kit 22.214.171.124 ID: 6.1.0-WS-WAS-AixPPC32-FP0000019 BuildVrsn: null Desc: WebSphere Application Server 126.96.36.199 ID: sdk.FP61021 BuildVrsn: null Desc: WebSphere Application Server 188.8.131.52 ID: sdk.FP61019 BuildVrsn: null Desc: WebSphere Application Server 184.108.40.206 ID: was.embed.common.FP61021 BuildVrsn: null Desc: WebSphere Application Server 220.127.116.11 ID: was.embed.FP61021 BuildVrsn: null Desc: WebSphere Application Server 18.104.22.168
"Field 1" "Field 3" "Field 4" "Field 6" Value11,Value12,Value13,Value14,Value15,Value16 Value21,Value22,Value23,Value24,Value25 Value31,Value32,Value33,Value34,Value35, Value36
Garbage Garbage Garbage Ignore_This_Stuff: Actual_Header1 Actual_Header2
Have questions? Visit Splunk Answers and see what questions and answers the Splunk community has around extracting fields.
Create custom fields at index time
Process events with ingest-time eval
This documentation applies to the following versions of Splunk® Enterprise: 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.0.4, 8.0.5, 8.0.6, 8.0.7, 8.0.8, 8.0.9, 8.0.10
Feedback submitted, thanks!