Splunk® Enterprise

Getting Data In


Extract data from files with headers

Many structured data files, such as comma-separated value (CSV) files, have information in the file header that Splunk Enterprise can use to extract fields during index-time event processing. You can configure Splunk Enterprise to automatically extract these fields.

For example, a CSV file typically starts with a header row that contains column headers for the values in subsequent rows:

host,status,message,"start date"
srv1.splunk.com,error,"No space left on device",2013-06-10T06:35:00
srv2.splunk.com,ok,-,2013-06-11T06:00:00

Use configuration files to enable automatic header-based field extraction

To extract data from files with headers (such as comma-separated value files, Internet Information Server (IIS) web server logs, and other structured data files), use a combination of inputs.conf and props.conf. Edit these files in $SPLUNK_HOME/etc/system/local/ or in your own custom application directory in $SPLUNK_HOME/etc/apps/<app_name>/local. inputs.conf specifies the files you want to monitor and the source type that Splunk Enterprise should assign to them, and props.conf defines the source types themselves.

For more information on configuration files in general, see "About configuration files" in the Admin manual.

Important: Changes that you make to props.conf (such as enabling automatic header-based field extraction) don't take effect until you restart Splunk Enterprise.
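
For example, to restart from the command line:

$SPLUNK_HOME/bin/splunk restart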

Props.conf attributes for structured data

Splunk Enterprise includes the following attributes in props.conf for working with files that contain headers. For additional attributes in props.conf, review the props.conf specification file.

INDEXED_EXTRACTIONS = [CSV|W3C|TSV|PSV|JSON]
Specifies the type of file and the extraction and/or parsing method that Splunk Enterprise should use on the file.
Note: If you set INDEXED_EXTRACTIONS = JSON, check that you have not also set KV_MODE = json for the same source type, as this causes the JSON fields to be extracted twice: once at index time and again at search time.
Default: n/a (not set)

PREAMBLE_REGEX
Some files contain preamble lines. This attribute specifies a regular expression that allows Splunk Enterprise to ignore these preamble lines, based on the pattern specified.
Default: n/a

FIELD_HEADER_REGEX
A regular expression that specifies a pattern that prefixes the header line. Splunk Enterprise looks for the first line that matches the regex and parses it as header fields. Note that the actual header starts after the matching pattern, which is not included in the parsed header fields. You can specify special characters in this attribute.
Default: n/a

FIELD_DELIMITER
Specifies the character that delimits or separates fields in the monitored file or source. You can specify special characters in this attribute.
Default: n/a

FIELD_QUOTE
Specifies the character to use for quotes in the specified file or source. You can specify special characters in this attribute.
Default: n/a

HEADER_FIELD_DELIMITER
Specifies the character that delimits or separates field names in the header line. You can specify special characters in this attribute. If HEADER_FIELD_DELIMITER is not specified, FIELD_DELIMITER applies to the header line.
Default: n/a

HEADER_FIELD_QUOTE
Specifies the character used for quotes around field names in the header line. You can specify special characters in this attribute. If HEADER_FIELD_QUOTE is not specified, FIELD_QUOTE applies to the header line.
Default: n/a

HEADER_FIELD_LINE_NUMBER
Specifies the line number of the line within the file that contains the header fields. If set to 0, Splunk Enterprise attempts to locate the header fields within the file automatically.
Default: 0

TIMESTAMP_FIELDS = field1,field2,...,fieldn
Some CSV and structured files have timestamps that span multiple fields in the event, separated by delimiters. This attribute specifies, as a comma-separated list, all of the fields that constitute the timestamp. If this attribute is not set, Splunk Enterprise tries to extract the timestamp of the event automatically.
Default: n/a

FIELD_NAMES
Some CSV and structured files might have missing headers. This attribute specifies the header field names directly.
Default: n/a

MISSING_VALUE_REGEX
Specifies a regular expression. If Splunk Enterprise finds a match for it in the structured data file, it considers the value for that field in that row to be empty.
Default: n/a
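
For instance, here is a minimal sketch of a stanza that combines several of these attributes, for a hypothetical CSV file that starts with "#" comment lines, carries its timestamp in Date and Time columns, and uses "-" as a placeholder for missing values (the stanza name, preamble pattern, and field names are all illustrative assumptions, not values from this topic's examples):

[hypothetical_csv_with_preamble]
INDEXED_EXTRACTIONS=csv
PREAMBLE_REGEX=^#
FIELD_DELIMITER=,
TIMESTAMP_FIELDS=Date,Time
MISSING_VALUE_REGEX=-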

Special characters or values are available for some attributes

You can specify special characters or values such as spaces, vertical and horizontal tabs, and form feeds in some attributes, either with a backslash escape sequence (such as \t) or with the name of the special value:

Special value: props.conf representation
form feed: \f
space: space or ' '
horizontal tab: \t or tab
vertical tab: \v
whitespace: whitespace
none: none or \0
file separator: fs or \034
group separator: gs or \035
record separator: rs or \036
unit separator: us or \037

You can specify these special characters for the following attributes:

  • FIELD_DELIMITER
  • FIELD_HEADER_REGEX
  • FIELD_QUOTE
  • HEADER_FIELD_DELIMITER
  • HEADER_FIELD_QUOTE
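
For example, here is a sketch of a stanza (the source type name is hypothetical) for a tab-delimited file whose header field names are separated by spaces:

[hypothetical_tab_delimited]
FIELD_DELIMITER=\t
HEADER_FIELD_DELIMITER=\s
FIELD_QUOTE="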

Edit configuration files to create and reference source types

To use these attributes, edit props.conf to create new source types that define how Splunk Enterprise extracts fields from files with header data in them.

After you have edited props.conf, edit inputs.conf to reference the newly created source types for the files that Splunk Enterprise indexes.

To create and reference the new source types to extract files with headers:

1. Using a text editor, open the file props.conf in the appropriate location, as described in "Use configuration files to enable automatic header-based field extraction" earlier in this topic.

Note: If the file does not exist, you must create it.

2. Define a new source type by creating a stanza that tells Splunk Enterprise how to extract the file header and structured file data, using the attributes described above. For example:

[HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim]
FIELD_DELIMITER=,
HEADER_FIELD_DELIMITER=\s
FIELD_QUOTE="

Note: You can define as many stanzas, and thus as many source types, as you like in the file.

3. Save the props.conf file and close it.

4. Create a file inputs.conf in the same directory, if it does not already exist.

5. Open the file for editing.

6. Add a stanza which represents the file or files that you want Splunk Enterprise to extract file header and structured data from. For example:

[monitor:///opt/test/data/StructuredData/HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim.csv]
sourcetype=HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim

Note: You can add as many stanzas as you wish for files or directories from which you want to extract header and structured data.

7. Save the inputs.conf file and close it.

8. Restart Splunk Enterprise for the changes to take effect.
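
After the restart, you can confirm the extraction in the Search app by searching on the new source type, for example:

sourcetype=HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim

The header fields that Splunk Enterprise extracted should appear as fields on the returned events.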

Forward data extracted from structured data files

You can also forward fields extracted from a structured data file (a file with headers) to another Splunk Enterprise instance.

To forward fields extracted from structured data files, follow this procedure:

1. On the instance that monitors the files, edit props.conf and inputs.conf, as described in "Edit configuration files to create and reference source types" earlier in this topic.

2. Optionally, if you need to transform this data in any way prior to indexing it, edit transforms.conf.

3. Configure the instance to forward data to another Splunk Enterprise instance. (A configuration sketch follows this procedure.)

Note: Read "Set up forwarding and receiving" in the Forwarding Data Manual for instructions on how to configure data forwarding and receiving.

4. If you have not already done so, configure Splunk Enterprise to be a receiver on the instance that is to receive the data.

5. Restart Splunk Enterprise on the receiving instance.

6. Restart Splunk Enterprise on the monitoring instance.

7. On the receiving instance, use the Search app to confirm that Splunk Enterprise has extracted the fields from the structured data files and indexed them properly.
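
As a sketch of steps 3 and 4, assume that the receiving indexer is reachable at receiver.example.com and listens on port 9997 (both values are illustrative). The monitoring instance forwards with an outputs.conf such as:

[tcpout]
defaultGroup=structured_data_indexers

[tcpout:structured_data_indexers]
server=receiver.example.com:9997

and the receiving instance listens with an inputs.conf stanza such as:

[splunktcp://9997]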

Caveats

Splunk Enterprise does not parse structured data that has been forwarded to an indexer

When you forward structured data to an indexer, Splunk Enterprise does not parse this data once it arrives at the indexer, even if you have configured props.conf on that indexer with INDEXED_EXTRACTIONS. Forwarded data skips the following queues on the indexer, which precludes any parsing of that data on the indexer:

  • parsing
  • aggregation
  • typing

The forwarded data must arrive at the indexer already parsed. To achieve this, you must also set up props.conf on the forwarder that sends the data. This includes configuration of INDEXED_EXTRACTIONS and any other parsing, filtering, anonymizing, and routing rules. Universal forwarders are capable of performing these tasks solely for structured data. See "Forward data extracted from structured data files" earlier in this topic.
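
For example, here is a minimal props.conf sketch for the forwarder (the source type name and timestamp field names are hypothetical; the point is that the stanza lives on the forwarder rather than the indexer):

[hypothetical_forwarded_csv]
INDEXED_EXTRACTIONS=csv
FIELD_DELIMITER=,
TIMESTAMP_FIELDS=Date,Time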

Splunk Enterprise only indexes header fields whose rows contain data

When Splunk Enterprise extracts header fields from structured data files, it only extracts those fields where data is present in at least one row. If a header field has no data in any row, Splunk Enterprise skips that field and does not index it. Take, for example, the following CSV file:

header1,header2,header3,header4,header5
one,1,won,,111
two,2,too,,222
three,3,thri,,333
four,4,fore,,444
five,5,faiv,,555

When Splunk Enterprise reads this file, it notes that the header4 column contains no data in any row, and does not index that header field or any of the values in it. This means that you cannot search the index for header4 or for any of the data in its column.

If, however, the header4 column contains explicit empty strings (for example, ""), Splunk Enterprise does index the field and all the rows underneath it.
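
For example, a hypothetical variant of the same file in which header4 holds explicit empty strings would be indexed in full, header4 included:

header1,header2,header3,header4,header5
one,1,won,"",111
two,2,too,"",222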

Splunk Enterprise does not support renaming of header fields mid-file

Some software, such as Internet Information Server, supports the renaming of header fields in the middle of the file. Splunk Enterprise does not recognize such changes. If you attempt to index a file whose header fields are renamed within the file, Splunk Enterprise does not index the renamed header field.

Splunk Enterprise does not specify fields extracted from structured data files as being indexed fields in fields.conf

Splunk Enterprise extracts fields from structured data files, but it does not specify them as indexed fields in fields.conf. Therefore we do not recommend that you use the standard search syntax of <field>=<value> for searches on these fields.

To search on structured data fields with maximum efficiency, use the <field>::<value> syntax. This syntax works best in searches for fields that Splunk Enterprise has indexed from structured data, but you can also use it to search for default and custom indexed fields. You cannot use it to search on search-time fields.
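
For example, to find events from the VeryLargeCSVFile sample shown later in this topic where the indexed Category field has the value WARRANTS:

sourcetype=VeryLargeCSVFile Category::WARRANTS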

See "Use fields to retrieve events" in the Search Manual.

Example configuration and data files

Following are an example inputs.conf and props.conf to give you an idea of how to use the file header extraction attributes.

To extract the data locally, edit inputs.conf and props.conf to define inputs and sourcetypes for the structured data files, and use the attributes described above to tell Splunk Enterprise how to deal with the files. To forward this data to another Splunk instance, edit inputs.conf and props.conf on the forwarding instance, and props.conf on the receiving instance.

Inputs.conf

[monitor:///opt/test/data/StructuredData/CSVWithFewHeaderFieldsWithoutAnyValues.csv]
sourcetype=CSVWithFewHeaderFieldsWithoutAnyValues

[monitor:///opt/test/data/StructuredData/VeryLargeCSVFile.csv]
sourcetype=VeryLargeCSVFile

[monitor:///opt/test/data/StructuredData/UselessLongHeaderToBeIgnored.log]
sourcetype=UselessLongHeaderToBeIgnored

[monitor:///opt/test/data/StructuredData/HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim.csv]
sourcetype=HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim

[monitor:///opt/test/data/FieldHeaderRegex.log]
sourcetype=ExtractCorrectHeaders

Props.conf

[CSVWithFewHeaderFieldsWithoutAnyValues]
FIELD_DELIMITER=,

[VeryLargeCSVFile]
FIELD_DELIMITER=,

[UselessLongHeaderToBeIgnored]
HEADER_FIELD_LINE_NUMBER=35
TIMESTAMP_FIELDS=Date,Time,TimeZone
FIELD_DELIMITER=\s
FIELD_QUOTE="

[HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim]
FIELD_DELIMITER=,
HEADER_FIELD_DELIMITER=\s
FIELD_QUOTE="

[ExtractCorrectHeaders]
FIELD_HEADER_REGEX=Ignore_This_Stuff:\s(.*)
FIELD_DELIMITER=,

Sample files

The following are snippets of the files referenced in the above inputs.conf and props.conf examples, to give you an idea of what the files look like.


CSVWithFewHeaderFieldsWithoutAnyValues.csv

vqmcallhistoryid,serialnumber,vqmavgjbenvdelay,vqmavgjbenvnegdelta,vqmavgjbenvposdelta,vqmbitrate,vqmburstcount,vqmburstlengthavgms,vqmburstlengthavgpkts,vqmburstlostrateavg,vqmburstrvaluelq,vqmccmid,vqmcalldurationms,vqmcallid,vqmcallstart,vqmdegradationdelay,vqmdegradationdiscard,vqmdegradationecholevel,vqmdegradationloss,vqmdegradationnoiselevel,vqmdegradationrecency,vqmdegradationsignallevel,vqmdegradationvocoder,vqmdelayavgmsec,vqmdelaycurrentmsec,vqmdelaydecreasecount,vqmdelayincreasecount,vqmdelaymaxmsec,vqmdelayminmsec,vqmdestifc,vqmdestintname,vqmdiscardrateavg,vqmdiscards,vqmdscp,vqmduplicatepkts,vqmearlypeakjitterms,vqmearlypkts,vqmearlythresholdms,vqmearlythresholdpc,vqmearlytotalcount,vqmearlyunderthresholdcount,vqmexcessburst,vqmexcessgap,vqmexternalrvaluecqin,vqmexternalrvaluecqout,vqmexternalrvaluelqin,vqmexternalrvaluelqout,vqmfrom,vqmgapcount,vqmgaplengthavgms,vqmgaplengthavgpkts,vqmgaplostrateavg,vqmgaprvalue,vqmjbmaxdelay,vqmjbmindelay,vqmjbnomdelay,vqmjbtype,vqmjbresetcount,vqmlatepeakjitterms,vqmlatepkts,vqmlatethresholdms,vqmlatethresholdpc,vqmlatetotalcount,vqmlateunderthreshold,vqmlocaldelayaveragems,vqmlocaldelaymaximumms,vqmlocaldelayminimumms,vqmloss,vqmlossrateavg,vqmmaxjbenvnegdelta,vqmmaxjbenvposdelta,vqmmeanpdvabsmaxavg,vqmmeanpdvavg,vqmmeanpdvmaxavg,vqmmeanpdvtrue,vqmminjbenvnegdelta,vqmminjbenvposdelta,vqmmoscq,vqmmoscqfixed,vqmmoslq,vqmmoslqfixed,vqmmosnominal,vqmmosnominalfixed,vqmmospq,vqmmospqfixed,vqmnetworklossrateavg,vqmonewaydelayaverage,vqmonewaydelayinstant,vqmonewaydelaymaximum,vqmoriginationdelayaverage,vqmoriginationdelayinstant,vqmoriginationdelaymaximum,vqmoutoforder,vqmoverrundiscardpkts,vqmppdvms,vqmpdvaveragems,vqmpdvmaximumms,vqmpktsrcvd,vqmrvaluecq,vqmrvalueg107,vqmrvaluelq,vqmrvaluenominal,vqmreliabilityindex,vqmresynccount,vqmrtdelayaverage,vqmrtdelayinstant,vqmrtdelaymaximum,vqmrtpdesturi,vqmrtpdestinationip,vqmrtpdestinationport,vqmrtpssrc,vqmrtpsourceip,vqmrtpsourceport,vqmrtpsourceuri,vqmsourceintname,vqmsrcifc,vqmstreamquality,vqmterminatedelayaverage,vqmterminatedelayinstant,vqmterminatedelaymaximum,vqmthroughputindex,vqmto,vqmunderrundiscardpkts,vqmvocoderclass,vqmvocodertype,created,modified
99152,CFG0730084,-3,-2,356,64000,1,280,14,14.29,36,3499,201000,BW163736844290611-173170743@10.253.143.13,2011-06-29 12:37:37.292,0,4.68,1.43,0.19,0,0,0,0,52,60,15,17,60,10,0,Loopback,0.48,48,46,0,30,1334,10,99.55,10008,9962,0,0,,,,,6096147095,2,100590,5029,0.48,87,200,10,50,1,625,487.5,8767,50,99.58,93,50,,518,,2,0.5,-60,975,488,179,192,999.3,0,0,4.07,,4.12,,4.2,,4.03,,0.02,63,76,76,,,,43,0,6.8,0,520,10054,87,87,89,93,9,79,12,12,12,6096147089,10.255.43.12,10010,706222942,10.253.136.231,25110,6096147095,eth 0/1,2,0,54,80,80,18500,6096147089,48,1,0,2011-06-29 12:41:47.303,2011-06-29 12:41:47.303
99154,CFG0730084,-3,-1,251,64000,4,195,9,20.52,28,3494,359000,BW163502270290611594566299@10.253.143.13,2011-06-29 12:35:02.324,0,2.88,1.11,3.44,0,0,0,0,40,40,26,24,50,10,0,Loopback,0.31,54,46,0,31,2455,10,99.8,17769,17732,0,0,,,,,6096147095,5,71556,3577,0.62,87,200,10,50,1,1120,496.5,15437,50,99.73,123,74,,529,,65,0.67,-62,993,496.5,126,139,3404.7,0,0,4.04,,4.07,,4.2,,3.94,,0.36,58,64,69,,,,49,0,286,0,529,17839,86,86,87,93,9,137,8,8,8,6096147089,10.255.43.12,10000,536353626,10.253.136.231,25096,6096147095,eth 0/1,2,0,48,60,70,30400,6096147089,54,1,0,2011-06-29 12:41:47.342,2011-06-29 12:41:47.342

VeryLargeCSVFile.csv

IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Location,X,Y
030203898,FRAUD,"FORGERY, CREDIT CARD",Tuesday,02/18/2003,16:30,NORTHERN,NONE,2800 Block of VAN NESS AV,-122.424612993055,37.8014488257836
000038261,WARRANTS,WARRANT ARREST,Thursday,04/17/2003,22:45,NORTHERN,"ARREST, BOOKED",POLK ST / SUTTER ST,-122.420120319211,37.7877570602182
030203901,LARCENY/THEFT,GRAND THEFT PICKPOCKET,Tuesday,02/18/2003,16:05,NORTHERN,NONE,VAN NESS AV / MCALLISTER ST,-122.42025048261,37.7800745746105
030203923,DRUG/NARCOTIC,SALE OF BASE/ROCK COCAINE,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619
030203923,OTHER OFFENSES,CONSPIRACY,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619
030203923,OTHER OFFENSES,PROBATION VIOLATION,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619

UselessLongHeaderToBeIgnored.log

************ Start Display Current Environment ************
WebSphere Platform 6.1 [ND 6.1.0.21 cf210844.13]  running with process name sammys_cell_A\fsgwws189Node_A\sammys_A_c01_s189_m06 and process id 17904
Detailed IFix information: ID: 6.1.0-WS-WASSDK-AixPPC32-FP0000021  BuildVrsn: null  Desc: Software Developer Kit 6.1.0.21
ID: 6.1.0-WS-WAS-AixPPC32-FP0000021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21
ID: 6.1.0-WS-WASSDK-AixPPC32-FP0000019  BuildVrsn: null  Desc: Software Developer Kit 6.1.0.19
ID: 6.1.0-WS-WAS-AixPPC32-FP0000019  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.19
ID: sdk.FP61021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21
ID: sdk.FP61019  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.19
ID: was.embed.common.FP61021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21
ID: was.embed.FP61021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21

HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim.csv

"Field 1"  "Field 3" "Field 4"  "Field 6"
Value11,Value12,Value13,Value14,Value15,Value16
Value21,Value22,Value23,Value24,Value25
Value31,Value32,Value33,Value34,Value35, Value36

FieldHeaderRegex.log

Garbage
Garbage
Garbage
Ignore_This_Stuff: Actual_Header1 Actual_Header2

Answers

Have questions? Visit Splunk Answers and see what questions and answers the Splunk community has around extracting fields.
