Extract fields from event data using an Edge Processor
You can create a pipeline that extracts specific values from your data into fields. Field extraction lets you capture information from your data in a more visible way and configure further data processing based on those fields.
For example, when working with event data that corresponds to login attempts on an email server, you can extract the usernames from those events into a dedicated field named Username. You can then use this Username field to filter for login attempts that were made by a specific user or identify the user that made each login attempt without needing to scan through the entire event.
If you're sending data from an Edge Processor to Splunk Enterprise or Splunk Cloud Platform, be aware that some fields are extracted automatically during indexing. Additionally, be aware that indexing extracted fields can have an impact on indexing performance and search times. Consider the following best practices when configuring field extractions in your pipeline:
- Extract fields only as necessary. When possible, extract fields at search time instead.
- Avoid duplicating your data and increasing the size of your events. After extracting a value into a field, either remove the original value or drop the extracted field after you have finished using it to support a data processing action.
For more information, see When Splunk software extracts fields in the Splunk Cloud Platform Knowledge Manager Manual.
Edge Processors don't extract any fields by default. If your Edge Processor receives data from a data source that doesn't extract data values into fields, such as a universal forwarder without any add-ons, then the Edge Processor stores each event as a text string in a field named _raw.
Steps
To create a field extraction pipeline, use the Extract fields from action in the pipeline editor to specify regular expressions that identify the field names and values you want to extract.
You must write these regular expressions using Regular Expression 2 (RE2) syntax. See Regular expression syntax for Edge Processor pipelines for more information.
You can write the regular expressions manually or select from a library of prewritten regular expressions, and then preview the resulting field extractions before applying them.
- Navigate to the Pipelines page and then select New pipeline.
- Select Blank pipeline and then select Next.
- Specify a subset of the data received by the Edge Processor for this pipeline to process. To do this, you must define a partition by completing these steps:
- Select the plus icon next to Partition or select the option that matches how you would like to create your partition in the Suggestions section.
- In the Field field, specify the event field that you want the partitioning condition to be based on.
- To specify whether the pipeline includes or excludes the data that meets the criteria, select Keep or Remove.
- In the Operator field, select an operator for the partitioning condition.
- In the Value field, enter the value that your partition should filter by to create the subset. Then select Apply. You can create as many conditions for a partition in a pipeline as you need by selecting the plus icon.
- Once you have defined your partition, select Next.
- Enter or upload sample data for generating previews that show how your pipeline processes data. The sample data must contain accurate examples of the values that you want to extract into fields.
- Select Next to confirm the sample data that you want to use for your pipeline.
- Select the name of the destination that you want to send data to.
- (Optional) If you selected a Splunk platform S2S or Splunk platform HEC destination, you can configure index routing:
- Select one of the following options in the expanded destinations panel:
Option | Description |
---|---|
Default | The pipeline does not route events to a specific index. If the event metadata already specifies an index, then the event is sent to that index. Otherwise, the event is sent to the default index of the Splunk platform deployment. |
Specify index for events with no index | The pipeline only routes events to your specified index if the event metadata did not already specify an index. |
Specify index for all events | The pipeline routes all events to your specified index. |
- If you selected Specify index for events with no index or Specify index for all events, then in the Index name field, select or enter the name of the index that you want to send your data to.
Be aware that the destination index is determined by a precedence order of configurations. See How does an Edge Processor know which index to send data to? for more information.
- Select Done to confirm the data destination.
After you complete the on-screen instructions, the pipeline builder displays the SPL2 statement for your pipeline.
- To generate a preview of how your pipeline processes data based on the sample data that you provided, select the Preview Pipeline icon.
- Select the plus icon in the Actions section, then select Extract fields from _raw.
- In the Extract fields from _raw dialog box, do the following:
- In the Regular expression field, specify one or more named capture groups using RE2 syntax. The name of the capture group determines the name of the extracted field, and the matched values determine the values of the extracted field. You can select named capture groups from the Insert from library list or enter named capture groups directly in the field.
For example, to extract usernames from the sample events described previously, start by entering the phrase invalid user. Then, from the Insert from library list, select Username. The resulting regular expression looks like this:
invalid user (?P<Username>[a-zA-Z0-9._-]+)
- (Optional) By default, the regular expression matches are case sensitive. To make the matches case insensitive, uncheck the Match case check box.
- Use the Events preview pane to validate your regular expression. The events in this pane are based on the last time that you generated a pipeline preview, and the pane highlights the values that match your regular expression for extraction.
- When you are satisfied with the events highlighted in the Events preview pane, select Apply to perform the field extraction.
- To save your pipeline, do the following:
- Select Save pipeline.
- In the Name field, enter a name for your pipeline.
- (Optional) In the Description field, enter a description for your pipeline.
- Select Save.
- To apply this pipeline to an Edge Processor, do the following:
- Navigate to the Pipelines page.
- In the row that lists your pipeline, select the Actions icon and then select Apply/Remove.
- Select the Edge Processors that you want to apply the pipeline to, and then select Save.
You can only apply pipelines to Edge Processors that are in the Healthy status.
It can take a few minutes for the Edge Processor service to finish applying your pipeline to an Edge Processor. During this time, all Edge Processors that the pipeline is applied to enter the Pending status. To confirm that the process completed successfully, do the following:
- Navigate to the Edge Processors page. Then, verify that the Instance health column for the affected Edge Processors shows that all instances are back in the Healthy status.
- Navigate to the Pipelines page. Then, verify that the Applied column for the pipeline contains the "The pipeline is applied" icon.
You might need to refresh your browser to see the latest updates.
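The three index routing options described in the destination step above can be read as a small decision rule. The following Python sketch is only an illustration of that rule, not Edge Processor code: the function name, the option labels, and the `main` deployment default are all hypothetical, and the sketch ignores any other configurations in the precedence order that can affect the final index.

```python
from typing import Optional

# Hypothetical labels that mirror the three choices in the destinations panel.
DEFAULT = "default"
NO_INDEX_ONLY = "specify_index_for_events_with_no_index"
ALL_EVENTS = "specify_index_for_all_events"


def resolve_index(event_index: Optional[str], option: str,
                  pipeline_index: Optional[str] = None,
                  deployment_default: str = "main") -> str:
    """Return the index that one event would be sent to under one option.

    event_index is the index already named in the event metadata, if any.
    deployment_default is an assumed name for the deployment's default index.
    """
    if option in (NO_INDEX_ONLY, ALL_EVENTS) and not pipeline_index:
        raise ValueError("this option requires an index name")
    if option == ALL_EVENTS:
        # Every event goes to the index specified in the pipeline.
        return pipeline_index
    if option == NO_INDEX_ONLY:
        # Only events that lack an index get the pipeline's index.
        return event_index or pipeline_index
    # Default: keep the event's own index, else fall back to the deployment default.
    return event_index or deployment_default


print(resolve_index("web", NO_INDEX_ONLY, "mail"))  # event already has an index
print(resolve_index(None, NO_INDEX_ONLY, "mail"))   # event has no index
```

In this sketch, only the Specify index for all events option overrides an index that the event metadata already carries.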
For example, the following sample events represent login attempts on an email server and contain examples of how usernames can appear in event data.
Wed Feb 14 2023 23:16:57 mailsv1 sshd[4590]: Failed password for apache from 78.111.167.117 port 3801 ssh2
Wed Feb 14 2023 15:51:38 mailsv1 sshd[1991]: Failed password for grumpy from 76.169.7.252 port 1244 ssh2
Mon Feb 12 2023 09:31:03 mailsv1 sshd[5800]: Failed password for invalid user guest from 66.69.195.226 port 2903 ssh2
Sun Feb 11 2023 14:12:56 mailsv1 sshd[1565]: Failed password for invalid user noone from 187.231.45.62 port 1092 ssh2
Sun Feb 11 2023 07:09:29 mailsv1 sshd[3560]: Failed password for games from 187.231.45.62 port 3752 ssh2
Sat Feb 10 2023 03:25:43 mailsv1 sshd[2442]: Failed password for invalid user admin from 211.166.11.101 port 1797 ssh2
Fri Feb 09 2023 21:45:20 mailsv1 sshd[1689]: Failed password for invalid user guest from 222.41.213.238 port 2658 ssh2
Fri Feb 09 2023 06:27:34 mailsv1 sshd[2226]: Failed password for invalid user noone from 199.15.234.66 port 3366 ssh2
Fri Feb 09 2023 18:32:51 mailsv1 sshd[5710]: Failed password for agarcia from 209.160.24.63 port 1775 ssh2
Thu Feb 08 2023 08:42:11 mailsv1 sshd[3202]: Failed password for invalid user noone from 175.44.1.172 port 2394 ssh2
A rex command is added to the SPL2 statement of your pipeline, and the new field appears in the Fields list.
You now have a pipeline that extracts specific values from your data into event fields.
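To see what the username extraction does outside the pipeline editor, the following Python sketch applies the same named capture group to two of the sample events. It is an approximation only: Edge Processor pipelines use RE2, while this sketch uses Python's `re` module, which happens to accept the same syntax for this pattern. The `extract_fields` helper and its `match_case` parameter are hypothetical names chosen to mirror the Match case check box.

```python
import re

# The regular expression from the example step, unchanged.
USERNAME_PATTERN = r"invalid user (?P<Username>[a-zA-Z0-9._-]+)"

# Two of the sample events, copied verbatim.
events = [
    "Wed Feb 14 2023 23:16:57 mailsv1 sshd[4590]: Failed password for apache from 78.111.167.117 port 3801 ssh2",
    "Mon Feb 12 2023 09:31:03 mailsv1 sshd[5800]: Failed password for invalid user guest from 66.69.195.226 port 2903 ssh2",
]


def extract_fields(raw, pattern, match_case=True):
    """Return the event as a dict: _raw plus one key per named capture group."""
    flags = 0 if match_case else re.IGNORECASE
    fields = {"_raw": raw}
    match = re.search(pattern, raw, flags)
    if match:
        # The capture group name becomes the field name.
        fields.update(match.groupdict())
    return fields


for event in events:
    print(extract_fields(event, USERNAME_PATTERN).get("Username"))
```

The first event does not contain the phrase invalid user, so it gets no Username field, and the second event yields guest. If you also need usernames from events such as the first one, the regular expression must be broadened accordingly.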
For more examples of how you can extract data into fields, see the sections that follow.
Examples of field extractions
The following are examples of field extractions based on sample events. Review these examples to learn more about how you can use field extractions to support further data processing.
- Example: Filter data based on extracted fields
- Example: Extract desired fields and drop all other values
- Example: Extract index names and route events to those indexes
Example: Filter data based on extracted fields
Consider the following events from Linux auditd logs:
type=ADD_USER msg=audit(1610965369.990:2891): pid=10647 uid=0 auid=2023 ses=196 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=add-user id=2024 exe="/usr/sbin/useradd" hostname=so1 addr=? terminal=pts/0 res=success
type=USER_LOGIN msg=audit(1610965163.030:2885): pid=10080 uid=0 auid=2023 ses=196 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=login acct="splunker" exe="/usr/bin/login" hostname=so1 addr=? terminal=pts/0 res=failed
type=USER_MGMT msg=audit(1610965370.024:2893): pid=10647 uid=0 auid=2023 ses=196 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=add-home-dir id=2024 exe="/usr/sbin/useradd" hostname=so1 addr=? terminal=pts/0 res=success
type=GRP_MGMT msg=audit(1610964902.738:2876): pid=9147 uid=0 auid=2023 ses=196 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=add-shadow-group id=2025 exe="/usr/sbin/groupadd" hostname=so1 addr=? terminal=pts/0 res=success
type=USER_LOGIN msg=audit(1611059696.891:3537): pid=17201 uid=2026 auid=2023 ses=222 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=login acct="splunker" exe="/usr/bin/login" hostname=so1 addr=? terminal=pts/0 res=success
You want to filter the data so that only audit logs that indicate failed login attempts remain. To do this, you must first extract the record types and result values from the logs.
- In the Extract fields from _raw dialog box, enter the following in the Regular expression field:
type=(?P<RecordType>[A-Z_]+).*res=(?P<Result>\w+)
This compound regular expression combines two named capture groups: type=(?P<RecordType>[A-Z_]+) for extracting the RecordType field, and res=(?P<Result>\w+) for extracting the Result field.
- To filter for events related to login attempts, keep only the events where the RecordType field is USER_LOGIN.
- To filter for login attempts that failed, keep only the events where the Result field is failed.
- (Optional) If you don't want to keep RecordType and Result as extracted fields after filtering the events, you can add the following fields command in the SPL2 statement of the pipeline to drop those fields. Make sure to place this command after the where commands that are used for filtering.
| fields - RecordType, Result
After completing these steps, you have a pipeline with the following SPL2 statement:
$pipeline = | from $source | rex field=_raw /type=(?P<RecordType>[A-Z_]+).*res=(?P<Result>\w+)/ | where RecordType = "USER_LOGIN" | where Result = "failed" | fields - RecordType, Result | into $destination;
The preview results show only the following event from the sample data:
type=USER_LOGIN msg=audit(1610965163.030:2885): pid=10080 uid=0 auid=2023 ses=196 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=login acct="splunker" exe="/usr/bin/login" hostname=so1 addr=? terminal=pts/0 res=failed
Data that does not meet filtering conditions is dropped.
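The following Python sketch imitates the three stages of this pipeline (extract, filter, drop the helper fields) on three of the sample events, so you can check the regular expression before pasting it into the pipeline editor. It is a rough stand-in rather than SPL2: it uses Python's `re` module instead of RE2, and `run_pipeline` is a hypothetical helper name.

```python
import re

# The compound regular expression from this example, unchanged.
PATTERN = re.compile(r"type=(?P<RecordType>[A-Z_]+).*res=(?P<Result>\w+)")

# Three of the sample auditd events, copied verbatim.
events = [
    """type=ADD_USER msg=audit(1610965369.990:2891): pid=10647 uid=0 auid=2023 ses=196 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=add-user id=2024 exe="/usr/sbin/useradd" hostname=so1 addr=? terminal=pts/0 res=success""",
    """type=USER_LOGIN msg=audit(1610965163.030:2885): pid=10080 uid=0 auid=2023 ses=196 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=login acct="splunker" exe="/usr/bin/login" hostname=so1 addr=? terminal=pts/0 res=failed""",
    """type=USER_LOGIN msg=audit(1611059696.891:3537): pid=17201 uid=2026 auid=2023 ses=222 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=login acct="splunker" exe="/usr/bin/login" hostname=so1 addr=? terminal=pts/0 res=success""",
]


def run_pipeline(raw_events):
    """Imitate: rex, two where filters, then dropping the extracted fields."""
    kept = []
    for raw in raw_events:
        match = PATTERN.search(raw)
        fields = match.groupdict() if match else {}
        # Equivalent of: where RecordType = "USER_LOGIN" | where Result = "failed"
        if fields.get("RecordType") == "USER_LOGIN" and fields.get("Result") == "failed":
            # Equivalent of: fields - RecordType, Result (only _raw remains)
            kept.append({"_raw": raw})
    return kept


for event in run_pipeline(events):
    print(event["_raw"])
```

Because `.*` is greedy, the Result group captures the value of the last res= pair in each event, which is the one at the end of these sample events.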
Example: Extract desired fields and drop all other values
Consider the following events representing purchases made at a store:
E9FF471F36A91031FE5B6D6228674089, 72E0B04464AD6513F6A613AABB04E701, Credit Card, 7.7, 2018-01-13 04:41:00, 2018-01-13 04:45:00, -73.997292, 40.720982, 4532038713619608
A5D125F5550BE7822FC6EE156E37733A, 08DB3F9FCF01530D6F7E70EB88C3AE5B, Credit Card, 14, 2018-01-13 04:37:00, 2018-01-13 04:47:00, -73.966843,40.756741, 4539385381557252
1E65B7E2D1297CF3B2CA87888C05FE43,F9ABCCCC4483152C248634ADE2435CF0, Game Card, 16.5, 2018-01-13 04:26:00, 2018-01-13 04:46:00, -73.956451, 40.771442
You want to keep the information about when the purchase was made, the price, and the payment method that the customer used. You want to drop the rest of the data, which includes confidential information such as credit card numbers.
- In the Extract fields from _raw dialog box, enter the following in the Regular expression field:
(?P<PaymentMethod>(Credit Card|Game Card)),\s(?P<Price>\d*\.?\d).*(?P<PurchaseTime>(?:\d\d\d\d){1,2}-(?:0?[1-9]|1[0-2])-(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ](?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?)
This compound regular expression combines three named capture groups:
- (?P<PaymentMethod>(Credit Card|Game Card)) for extracting the PaymentMethod field.
- (?P<Price>\d*\.?\d) for extracting the Price field.
- (?P<PurchaseTime>(?:\d\d\d\d){1,2}-(?:0?[1-9]|1[0-2])-(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ](?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?) for extracting the PurchaseTime field.
- To drop the _raw field and only keep the PurchaseTime, Price, and PaymentMethod fields that you extracted, add the following fields command in the SPL2 statement of the pipeline.
| fields - _raw
After completing these steps, you have a pipeline with the following SPL2 statement:
$pipeline = | from $source | rex field=_raw /(?P<PaymentMethod>(Credit Card|Game Card)),\s(?P<Price>\d*\.?\d).*(?P<PurchaseTime>(?:\d\d\d\d){1,2}-(?:0?[1-9]|1[0-2])-(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ](?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?)/ | fields - _raw | into $destination;
The preview results show the following field-extracted events:
PaymentMethod | Price | PurchaseTime |
---|---|---|
Credit Card | 7.7 | 2018-01-13 04:45:00 |
Credit Card | 14 | 2018-01-13 04:47:00 |
Game Card | 16.5 | 2018-01-13 04:46:00 |
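As a cross-check of the table above, the following Python sketch runs the same regular expression over the three sample events and keeps only the named capture groups, which corresponds to dropping _raw. It assumes that Python's `re` module matches this pattern the same way RE2 does, and `extract_and_drop_raw` is a hypothetical helper name. The pattern is split across several string literals only for readability.

```python
import re

# The compound regular expression from this example, split across lines.
PATTERN = re.compile(
    r"(?P<PaymentMethod>(Credit Card|Game Card)),\s(?P<Price>\d*\.?\d).*"
    r"(?P<PurchaseTime>(?:\d\d\d\d){1,2}-(?:0?[1-9]|1[0-2])-"
    r"(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ]"
    r"(?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])"
    r"(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?"
    r"(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?)"
)

# The three sample purchase events, copied verbatim.
events = [
    "E9FF471F36A91031FE5B6D6228674089, 72E0B04464AD6513F6A613AABB04E701, Credit Card, 7.7, 2018-01-13 04:41:00, 2018-01-13 04:45:00, -73.997292, 40.720982, 4532038713619608",
    "A5D125F5550BE7822FC6EE156E37733A, 08DB3F9FCF01530D6F7E70EB88C3AE5B, Credit Card, 14, 2018-01-13 04:37:00, 2018-01-13 04:47:00, -73.966843,40.756741, 4539385381557252",
    "1E65B7E2D1297CF3B2CA87888C05FE43,F9ABCCCC4483152C248634ADE2435CF0, Game Card, 16.5, 2018-01-13 04:26:00, 2018-01-13 04:46:00, -73.956451, 40.771442",
]


def extract_and_drop_raw(raw):
    """Return only the named capture groups, leaving out the original text."""
    match = PATTERN.search(raw)
    return match.groupdict() if match else {}


for event in events:
    print(extract_and_drop_raw(event))
```

Each event contains two timestamps. The greedy `.*` before the PurchaseTime group pushes the match to the last one, which is why the table shows 04:45:00 rather than 04:41:00 for the first event. Note also that the Price group ends after one digit following the decimal point, so a price with two decimal places would be truncated.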
Example: Extract index names and route events to those indexes
Consider the following events representing login attempts across multiple email servers:
Wed Feb 14 2023 23:16:57 mailsv1 sshd[4590]: Failed password for apache from 78.111.167.117 port 3801 ssh2
Wed Feb 14 2023 15:51:38 mailsv2 sshd[1991]: Failed password for grumpy from 76.169.7.252 port 1244 ssh2
Mon Feb 12 2023 09:31:03 mailsv2 sshd[5800]: Failed password for invalid user guest from 66.69.195.226 port 2903 ssh2
Sun Feb 11 2023 14:12:56 mailsv1 sshd[1565]: Failed password for invalid user noone from 187.231.45.62 port 1092 ssh2
Sun Feb 11 2023 07:09:29 mailsv3 sshd[3560]: Failed password for games from 187.231.45.62 port 3752 ssh2
Sat Feb 10 2023 03:25:43 mailsv1 sshd[2442]: Failed password for invalid user admin from 211.166.11.101 port 1797 ssh2
Fri Feb 09 2023 21:45:20 mailsv2 sshd[1689]: Failed password for invalid user guest from 222.41.213.238 port 2658 ssh2
Fri Feb 09 2023 06:27:34 mailsv2 sshd[2226]: Failed password for invalid user noone from 199.15.234.66 port 3366 ssh2
Fri Feb 09 2023 18:32:51 mailsv1 sshd[5710]: Failed password for agarcia from 209.160.24.63 port 1775 ssh2
Thu Feb 08 2023 08:42:11 mailsv1 sshd[3202]: Failed password for invalid user noone from 175.44.1.172 port 2394 ssh2
Your Splunk platform deployment has dedicated indexes for each email server, so you want to route these events to the appropriate index depending on whether they're associated with mailsv1, mailsv2, or mailsv3.
To extract these mailsv values into a field named index, in the Extract fields from _raw dialog box, enter the following in the Regular expression field:
(?P<index>(mailsv\d))
A field named index containing the extracted values is added to your events. When you send these events to a Splunk platform destination, the events are routed to the index indicated by the index field.
Data routing configurations in the Splunk platform deployment can override the destination index specified by the index field. See How does an Edge Processor know which index to send data to? for more information.
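The following Python sketch shows which index value the regular expression produces for three of the sample events and groups the events by that value. It only illustrates the extraction. The actual routing happens in the Splunk platform and remains subject to the precedence order mentioned above. The `group_by_index` helper is a hypothetical name, and the sketch uses Python's `re` module rather than RE2.

```python
import re
from collections import defaultdict

# The regular expression from this example, unchanged.
INDEX_PATTERN = re.compile(r"(?P<index>(mailsv\d))")

# Three of the sample events, one per email server, copied verbatim.
events = [
    "Wed Feb 14 2023 23:16:57 mailsv1 sshd[4590]: Failed password for apache from 78.111.167.117 port 3801 ssh2",
    "Wed Feb 14 2023 15:51:38 mailsv2 sshd[1991]: Failed password for grumpy from 76.169.7.252 port 1244 ssh2",
    "Sun Feb 11 2023 07:09:29 mailsv3 sshd[3560]: Failed password for games from 187.231.45.62 port 3752 ssh2",
]


def group_by_index(raw_events):
    """Group events by the extracted index value (None when nothing matches)."""
    grouped = defaultdict(list)
    for raw in raw_events:
        match = INDEX_PATTERN.search(raw)
        index = match.group("index") if match else None
        grouped[index].append(raw)
    return dict(grouped)


for index, members in group_by_index(events).items():
    print(index, len(members))
```

Events that contain no mailsv value end up under the None key in this sketch. In a real pipeline, such events would have no extracted index field.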
This documentation applies to the following versions of Splunk Cloud Platform™: 9.0.2209, 9.0.2303, 9.0.2305, 9.1.2308, 9.1.2312, 9.2.2403, 9.2.2406 (latest FedRAMP release)