Splunk® Enterprise

Getting Data In

Download manual as PDF

Download topic as PDF

Extract fields from files with structured data

Many structured data files, such as comma-separated value (CSV) files and Internet Information Server (IIS) web server logs, have information in the file header that can be extracted as fields during indexing. You can configure Splunk Enterprise and the Splunk universal forwarder to automatically extract these values into fields that can be searched. For example, a CSV file starts with a header row that contains column headers for the values in subsequent rows:

host,status,message,"start date"
srv1.splunk.com,error,"No space left on device",2013-06-10T06:35:00
srv2.splunk.com,ok,-,2013-06-11T06:00:00

Input types that the indexed field extraction feature supports

This feature works with the following input types:

  • File-based inputs only (such as monitoring files, directories, or archives.)
  • Inputs that use the oneshot input type (or through the "Upload" feature in Splunk Web.)

It does not work with modular inputs, network inputs, or any other type of input.

More information on source types and time stamps

Use Splunk Web to extract fields from structured data files

When you upload or monitor a structured data file, Splunk Web loads the "Set Source type" page. This page lets you preview how your data will be indexed. See The 'Set Source type' page.

  1. From the Add Data page in Splunk Web, choose Upload or Monitor as the method that you want to add data.
  2. Specify the structured data file that you want the software to monitor. Splunk Web loads the "Set Source type" page. It sets the source type of the data based on its interpretation of that data. For example, if you upload a CSV file, it sets the source type to csv.
  3. Review the events in the preview pane on the right side of the page. The events are formatted based on the current source type.
  4. If the events appear to be formatted correctly, click "Next" to proceed to the "Modify input settings" page. Otherwise, configure event formatting by modifying the timestamp, event breaking, and delimited settings until the previewed events look the way that you want.
  5. If you don't want to save the settings as a new source type, return to Step 4. Otherwise, click the Save As button to save the settings as a new source type.
  6. In the dialog that appears, type in a name and description for the new source type.
  7. Select the category for the source type by selecting the category you want from the "Category" drop-down.
  8. Select the application context that the new source type should apply to by choosing from the entries in the "App" drop-down.
  9. Click "Save" to save the source type.
  10. Return to Step 4 to proceed to the "Modify input settings" page.

Structured data files with large numbers of columns might not display all extracted fields in Splunk Search

If you index a structured data file with a large number of columns (for example, a CSV file with 300 columns), you might experience a problem later where the Search app does not appear to return or display all of the fields for that file. While Splunk software has indexed all of the fields correctly, this anomaly occurs because of a configuration setting for how Splunk software extracts the fields at search time.

Before Splunk software displays fields in Splunk Web, it must first extract those fields by performing a search time field extraction. By default, the limit for the number of fields that can be extracted automatically at search time is 100. You can set this number higher by editing the limits.conf file in $SPLUNK_HOME/etc/system/local and changing the limit setting to a number that is higher than the number of columns in the structured data file.

[kv]
limit = 300

If you work with a lot of large CSV files, you might want to configure the setting to a number that reflects the largest number of columns you expect your structured data files to have.

Use configuration files to enable automatic header-based field extraction

You can also use a combination of inputs.conf and props.conf to extract fields from structured data files. Edit these files in $SPLUNK_HOME/etc/system/local/ or in your own custom application directory in $SPLUNK_HOME/etc/apps/<app_name>/local. Inputs.conf specifies the files you want to monitor and the source type to be applied to the events they contain, and props.conf defines the source types themselves. If you have Splunk Enterprise, you can edit the settings on indexer machines or machines where you are running the Splunk universal forwarder. You must restart Splunk Enterprise for any changes that you make to inputs.conf and props.conf to take effect. If you have Splunk Cloud and want configure the extraction of fields from structured data, use the Splunk universal forwarder.

Props.conf attributes for structured data

To configure field extraction for files that contain headers, modify the following attributes in props.conf. For additional attributes in props.conf, review the props.conf specification file.

Attribute Description Default
INDEXED_EXTRACTIONS = {CSV|W3C|TSV|PSV|JSON} Specifies the type of file and the extraction and/or parsing method to be used on the file.

Note: If you set INDEXED_EXTRACTIONS=JSON, check that you have not also set KV_MODE = json for the same source type, which would extract the JSON fields twice, at index time and again at search time.
n/a (not set)
PREAMBLE_REGEX Some files contain preamble lines. This attribute contains a regular expression that Splunk software uses to ignore any matching lines. n/a
FIELD_HEADER_REGEX A regular expression that specifies a pattern for prefixed header line. Splunk software parses the first matching line into header fields. Note that the actual header starts after the matching pattern, which is not included in the parsed header fields. You can specify special characters in this attribute. n/a
FIELD_DELIMITER Specifies which character delimits or separates fields in the monitored file or source. You can specify special characters in this attribute. n/a
FIELD_QUOTE Specifies the character to use for quotes in the specified file or source. You can specify special characters in this attribute. n/a
HEADER_FIELD_DELIMITER Specifies which character delimits or separates field names in the header line. You can specify special characters in this attribute. If HEADER_FIELD_DELIMITER is not specified, FIELD_DELIMITER applies to the header line. n/a
HEADER_FIELD_QUOTE Specifies which character is used for quotes around field names in the header line. You can specify special characters in this attribute. If HEADER_FIELD_QUOTE is not specified, FIELD_QUOTE applies to the header line. n/a
HEADER_FIELD_LINE_NUMBER Specifies the line number of the line within the file that contains the header fields. If set to 0, Splunk attempts to locate the header fields within the file automatically. 0
TIMESTAMP_FIELDS = field1,field2,...,fieldn Some CSV and structured files have their timestamp encompass multiple fields in the event separated by delimiters. This attribute tells Splunk software to specify all such fields which constitute the timestamp in a comma-separated fashion. Splunk Enterprise tries to automatically extract the timestamp of the event.
FIELD_NAMES Some CSV and structured files might have missing headers. This attribute specifies the header field names. n/a
MISSING_VALUE_REGEX If Splunk software finds data that matches the specified regular expression in the structured data file, it considers the value for the field in the row to be empty. n/a

Special characters or values are available for some attributes

You can use special characters or values such as spaces, vertical and horizontal tabs, and form feeds in some attributes. The following table lists these characters:

Special value Props.conf representation
form feed \f
space space or ' '
horizontal tab \t or tab
vertical tab \v
whitespace whitespace
none none or \0
file separator fs or \034
group separator gs or \035
record separator rs or \036
unit separator us or \037

You can use these special characters for the following attributes only:

  • FIELD_DELIMITER
  • FIELD_HEADER_REGEX
  • FIELD_QUOTE

Edit configuration files to create and reference source types

To create and reference the new source types to extract files with headers:

  1. Using a text editor, open the file props.conf in the appropriate location as described in Enable automatic header-based field extraction earlier in this topic. If the props.conf file does not exist, you must create it.
  2. Define a new sourcetype by creating a stanza which tells Splunk Enterprise how to extract the file header and structured file data, using the attributes described above. You can define as many stanzas - and thus, as many sourcetypes - as you like in the file. For example:

    [HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim]
    FIELD_DELIMITER=,
    HEADER_FIELD_DELIMITER=\s
    FIELD_QUOTE="
    
  3. Save the props.conf file and close it.
  4. Create a file inputs.conf in the same directory, if it does not already exist.
  5. Open the file for editing.
  6. Add a stanza which represents the file or files that you want Splunk Enterprise to extract file header and structured data from. You can add as many stanzas as you wish for files or directories from which you want to extract header and structured data. For example:

    [monitor:///opt/test/data/StructuredData/HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim.csv]
    sourcetype=HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim
    
  7. Save the inputs.conf file and close it.
  8. Restart Splunk Enterprise or the universal forwarder for the changes to take effect.

Forward data extracted from structured data files

You can also forward fields extracted from a structured data file to a heavy forwarder or a universal forwarder.

To forward fields extracted from structured data files:

  1. Configure the Splunk instance that monitors the files to forward data to a heavy forwarder or a universal forwarder.\.
  2. Configure the receiving instance.
  3. On the monitoring instance, configure props.conf and inputs.conf to properly handle event breaking and timestamps for your data. You can do this in one of two ways.

  4. Optionally, if you need to transform this data in any way prior to indexing it, edit transforms.conf.
  5. Restart the receiving instance.
  6. Restart the monitoring instance.
  7. On the receiving instance, use the Search app to confirm that the fields have been extracted from the structured data files and properly indexed.

Caveats to extracting fields from structured data files

Splunk software does not parse structured data that has been forwarded to an indexer

When you forward structured data to an indexer, it is not parsed when it arrives at the indexer, even if you have configured props.conf on that indexer with INDEXED_EXTRACTIONS. Forwarded data skips the following pipelines on the indexer, which precludes any parsing of that data on the indexer:

  • parsing
  • merging
  • typing

The forwarded data must arrive at the indexer already parsed.

Field extraction settings for forwarded structured data must be configured on the forwarder

If you want to forward fields that you extract from structured data files to another Splunk instance, you must configure the props.conf settings that define the field extractions on the forwarder that sends the data. This includes configuration of INDEXED_EXTRACTIONS and any other parsing, filtering, anonymizing, and routing rules. Performing these actions on the instance that indexes the data will have no effect, as the forwarded data must arrive at the indexer already parsed.

When you use Splunk Web to modify event break and time stamp settings, it records all of the proposed changes as a stanza for props.conf. You can find those settings in the "Advanced" tab on the "Set Source type" page.

Use the "Copy to clipboard" link in the "Advanced" tab to copy the proposed changes to props.conf to the system clipboard. You can then paste this stanza into props.conf in a text editor on Splunk instances that monitor and forward similar files.

Only header fields containing data are indexed

When Splunk software extracts header fields from structured data files, it only extracts those fields where data is present in at least one row. If the header field has no data in any row, it is skipped (that is, not indexed). Take, for example, the following csv file:

header1,header2,header3,header4,header5
one,1,won,,111
two,2,too,,222
three,3,thri,,333
four,4,fore,,444
five,5,faiv,,555

When Splunk software reads this file, it notes that the rows in the header4 column are all empty, and does not index that header field or any of the rows in it. This means that neither header4 nor any of the data in its row can be searched for in the index.

If, however, the header4 field contains rows with empty strings (for example, ""), the field and all the rows underneath it are indexed.

No support for mid-file renaming of header fields

Some software, such as Internet Information Server, supports the renaming of header fields in the middle of the file. Splunk software does not recognize changes such as this. If you attempt to index a file that has header fields renamed within the file, the renamed header field is not indexed.

Example configuration and data files

Following are an example inputs.conf and props.conf to give you an idea of how to use the file header extraction attributes.

To extract the data locally, edit inputs.conf and props.conf to define inputs and sourcetypes for the structured data files, and use the attributes described above to specify how to deal with the files. To forward this data to another Splunk instance, edit inputs.conf and props.conf on the forwarding instance, and props.conf on the receiving instance.

Inputs.conf

[monitor:///opt/test/data/StructuredData/CSVWithFewHeaderFieldsWithoutAnyValues.csv]
sourcetype=CSVWithFewHeaderFieldsWithoutAnyValues

[monitor:///opt/test/data/StructuredData/VeryLargeCSVFile.csv]
sourcetype=VeryLargeCSVFile

[monitor:///opt/test/data/StructuredData/UselessLongHeaderToBeIgnored.log]
sourcetype=UselessLongHeaderToBeIgnored

[monitor:///opt/test/data/StructuredData/HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim.csv]
sourcetype=HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim

[monitor:///opt/test/data/FieldHeaderRegex.log]
sourcetype=ExtractCorrectHeaders

Props.conf

[CSVWithFewHeaderFieldsWithoutAnyValues]
FIELD_DELIMITER=,

[VeryLargeCSVFile]
FIELD_DELIMITER=,

[UselessLongHeaderToBeIgnored]
HEADER_FIELD_LINE_NUMBER=35
TIMESTAMP_FIELDS=Date,Time,TimeZone
FIELD_DELIMITER=\s
FIELD_QUOTE="

[HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim]
FIELD_DELIMITER=,
HEADER_FIELD_DELIMITER=\s
FIELD_QUOTE="

[ExtractCorrectHeaders]
FIELD_HEADER_REGEX=Ignore_This_Stuff:\s(.*)
FIELD_DELIMITER=,

Sample files

The following are snippets of the files referenced in the above inputs.conf and props.conf examples, to give you an idea of what the files look like.

You might need to scroll right quite a bit to see all of the content.

CSVWithFewHeaderFieldsWithoutAnyValues.csv

vqmcallhistoryid,serialnumber,vqmavgjbenvdelay,vqmavgjbenvnegdelta,vqmavgjbenvposdelta,vqmbitrate,vqmburstcount,vqmburstlengthavgms,vqmburstlengthavgpkts,vqmburstlostrateavg,vqmburstrvaluelq,vqmccmid,vqmcalldurationms,vqmcallid,vqmcallstart,vqmdegradationdelay,vqmdegradationdiscard,vqmdegradationecholevel,vqmdegradationloss,vqmdegradationnoiselevel,vqmdegradationrecency,vqmdegradationsignallevel,vqmdegradationvocoder,vqmdelayavgmsec,vqmdelaycurrentmsec,vqmdelaydecreasecount,vqmdelayincreasecount,vqmdelaymaxmsec,vqmdelayminmsec,vqmdestifc,vqmdestintname,vqmdiscardrateavg,vqmdiscards,vqmdscp,vqmduplicatepkts,vqmearlypeakjitterms,vqmearlypkts,vqmearlythresholdms,vqmearlythresholdpc,vqmearlytotalcount,vqmearlyunderthresholdcount,vqmexcessburst,vqmexcessgap,vqmexternalrvaluecqin,vqmexternalrvaluecqout,vqmexternalrvaluelqin,vqmexternalrvaluelqout,vqmfrom,vqmgapcount,vqmgaplengthavgms,vqmgaplengthavgpkts,vqmgaplostrateavg,vqmgaprvalue,vqmjbmaxdelay,vqmjbmindelay,vqmjbnomdelay,vqmjbtype,vqmjbresetcount,vqmlatepeakjitterms,vqmlatepkts,vqmlatethresholdms,vqmlatethresholdpc,vqmlatetotalcount,vqmlateunderthreshold,vqmlocaldelayaveragems,vqmlocaldelaymaximumms,vqmlocaldelayminimumms,vqmloss,vqmlossrateavg,vqmmaxjbenvnegdelta,vqmmaxjbenvposdelta,vqmmeanpdvabsmaxavg,vqmmeanpdvavg,vqmmeanpdvmaxavg,vqmmeanpdvtrue,vqmminjbenvnegdelta,vqmminjbenvposdelta,vqmmoscq,vqmmoscqfixed,vqmmoslq,vqmmoslqfixed,vqmmosnominal,vqmmosnominalfixed,vqmmospq,vqmmospqfixed,vqmnetworklossrateavg,vqmonewaydelayaverage,vqmonewaydelayinstant,vqmonewaydelaymaximum,vqmoriginationdelayaverage,vqmoriginationdelayinstant,vqmoriginationdelaymaximum,vqmoutoforder,vqmoverrundiscardpkts,vqmppdvms,vqmpdvaveragems,vqmpdvmaximumms,vqmpktsrcvd,vqmrvaluecq,vqmrvalueg107,vqmrvaluelq,vqmrvaluenominal,vqmreliabilityindex,vqmresynccount,vqmrtdelayaverage,vqmrtdelayinstant,vqmrtdelaymaximum,vqmrtpdesturi,vqmrtpdestinationip,vqmrtpdestinationport,vqmrtpssrc,vqmrtpsourceip,vqmrtpsourceport,vqmrtpsourceuri,vqmsourceintname,vqmsrcifc,vqmstreamquality,vqmterminatedelayaverage,vqmterminatedelayinstant,vqmterminatedelaymaximum,vqmthroughputindex,vqmto,vqmunderrundiscardpkts,vqmvocoderclass,vqmvocodertype,created,modified
99152,CFG0730084,-3,-2,356,64000,1,280,14,14.29,36,3499,201000,BW163736844290611-173170743@10.253.143.13,2011-06-29 12:37:37.292,0,4.68,1.43,0.19,0,0,0,0,52,60,15,17,60,10,0,Loopback,0.48,48,46,0,30,1334,10,99.55,10008,9962,0,0,,,,,6096147095,2,100590,5029,0.48,87,200,10,50,1,625,487.5,8767,50,99.58,93,50,,518,,2,0.5,-60,975,488,179,192,999.3,0,0,4.07,,4.12,,4.2,,4.03,,0.02,63,76,76,,,,43,0,6.8,0,520,10054,87,87,89,93,9,79,12,12,12,6096147089,10.255.43.12,10010,706222942,10.253.136.231,25110,6096147095,eth 0/1,2,0,54,80,80,18500,6096147089,48,1,0,2011-06-29 12:41:47.303,2011-06-29 12:41:47.303
99154,CFG0730084,-3,-1,251,64000,4,195,9,20.52,28,3494,359000,BW163502270290611594566299@10.253.143.13,2011-06-29 12:35:02.324,0,2.88,1.11,3.44,0,0,0,0,40,40,26,24,50,10,0,Loopback,0.31,54,46,0,31,2455,10,99.8,17769,17732,0,0,,,,,6096147095,5,71556,3577,0.62,87,200,10,50,1,1120,496.5,15437,50,99.73,123,74,,529,,65,0.67,-62,993,496.5,126,139,3404.7,0,0,4.04,,4.07,,4.2,,3.94,,0.36,58,64,69,,,,49,0,286,0,529,17839,86,86,87,93,9,137,8,8,8,6096147089,10.255.43.12,10000,536353626,10.253.136.231,25096,6096147095,eth 0/1,2,0,48,60,70,30400,6096147089,54,1,0,2011-06-29 12:41:47.342,2011-06-29 12:41:47.342

VeryLargeCSVFile.csv

IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Location,X,Y
030203898,FRAUD,"FORGERY, CREDIT CARD",Tuesday,02/18/2003,16:30,NORTHERN,NONE,2800 Block of VAN NESS AV,-122.424612993055,37.8014488257836
000038261,WARRANTS,WARRANT ARREST,Thursday,04/17/2003,22:45,NORTHERN,"ARREST, BOOKED",POLK ST / SUTTER ST,-122.420120319211,37.7877570602182
030203901,LARCENY/THEFT,GRAND THEFT PICKPOCKET,Tuesday,02/18/2003,16:05,NORTHERN,NONE,VAN NESS AV / MCALLISTER ST,-122.42025048261,37.7800745746105
030203923,DRUG/NARCOTIC,SALE OF BASE/ROCK COCAINE,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619
030203923,OTHER OFFENSES,CONSPIRACY,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619
030203923,OTHER OFFENSES,PROBATION VIOLATION,Tuesday,02/18/2003,17:00,BAYVIEW,"ARREST, BOOKED",1600 Block of KIRKWOOD AV,-122.390718076188,37.7385560584619

UselessLongHeaderToBeIgnored.log

************ Start Display Current Environment ************
WebSphere Platform 6.1 [ND 6.1.0.21 cf210844.13]  running with process name sammys_cell_A\fsgwws189Node_A\sammys_A_c01_s189_m06 and process id 17904
Detailed IFix information: ID: 6.1.0-WS-WASSDK-AixPPC32-FP0000021  BuildVrsn: null  Desc: Software Developer Kit 6.1.0.21
ID: 6.1.0-WS-WAS-AixPPC32-FP0000021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21
ID: 6.1.0-WS-WASSDK-AixPPC32-FP0000019  BuildVrsn: null  Desc: Software Developer Kit 6.1.0.19
ID: 6.1.0-WS-WAS-AixPPC32-FP0000019  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.19
ID: sdk.FP61021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21
ID: sdk.FP61019  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.19
ID: was.embed.common.FP61021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21
ID: was.embed.FP61021  BuildVrsn: null  Desc: WebSphere Application Server 6.1.0.21

HeaderFieldsWithFewEmptyFieldNamesWithSpaceDelim.csv

"Field 1"  "Field 3" "Field 4"  "Field 6"
Value11,Value12,Value13,Value14,Value15,Value16
Value21,Value22,Value23,Value24,Value25
Value31,Value32,Value33,Value34,Value35, Value36

FieldHeaderRegex.log

Garbage
Garbage
Garbage
Ignore_This_Stuff: Actual_Header1 Actual_Header2

Answers

Have questions? Visit Splunk Answers and see what questions and answers the Splunk community has around extracting fields.

PREVIOUS
Create custom fields at index time
  NEXT
About hosts

This documentation applies to the following versions of Splunk® Enterprise: 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.3.11, 6.3.12, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 7.0.0, 7.0.1


Was this documentation topic helpful?

Enter your email address, and someone from the documentation team will respond to you:

Please provide your comments here. Ask a question or make a suggestion.

You must be logged into splunk.com in order to post comments. Log in now.

Please try to keep this discussion focused on the content covered in this documentation topic. If you have a more general question about Splunk functionality or are experiencing a difficulty with Splunk, consider posting a question to Splunkbase Answers.

0 out of 1000 Characters