About regular expressions
Unlike Splunk Enterprise, the Splunk Data Stream Processor uses Java 8 regular expressions for its functions.
To learn more about Java 8 regular expressions and the differences between Java 8 regular expressions and PCRE regular expressions, see the Java 8 regular expressions page in the Oracle documentation.
Regular expressions terminology and syntax
The following table describes common regular expression terminology and syntax. It is not an exhaustive list. See Java 8 regular expressions in the Oracle documentation for a full list of terminology and syntax.
Term | Description |
---|---|
literal | The exact text of characters to match using a regular expression. |
regular expression | The metacharacters that define the pattern that Splunk software uses to match against the literal. |
groups | Regular expressions allow groupings indicated by the type of bracket used to enclose the regular expression characters. Groups can define character classes, repetition matches, named capture groups, modular regular expressions, and more. You can apply quantifiers to and use alternation within enclosed groups. For example, see the section "Groups, quantifiers, and alternation" or "Capture groups in regular expressions". |
character class | A character class is a set of characters enclosed within square brackets. It specifies the characters that will successfully match a single character from a given input string. To set up a character class, define a range with a hyphen, such as [A-Z], to match any uppercase letter. Begin the character class with a caret (^) to define a negative match, such as [^A-Z] to match any character that is not an uppercase letter. |
character type | Similar to a wildcard, character types represent specific literal matches. For example, a period . matches any character, \w matches any word character (a letter, a digit, or an underscore), and so on. |
anchor | Character types that define text formatting positions, such as return (\r) and newline (\n). Anchors assert that the engine's current position in the string matches a well-determined location, such as the beginning of a string or the end of a line. To assert linebreak characters, use the multiline modifier flag, described in the "Regular expression modifier flags" section. |
alternation | Refers to supplying alternate match patterns in the regular expression. Use a vertical bar or pipe character ( | ) to separate the alternate patterns, which can include full regular expressions. For example, grey|gray matches either grey or gray. |
quantifiers, or repetitions | Use ( *, +, ? ) to define how to match the groups to the literal pattern. For example, * matches 0 or more, + matches 1 or more, and ? matches 0 or 1. |
backreferences | Literal groups that you can recall for later use. To indicate a backreference to the value, provide a backslash symbol (\) and a positive number. For example, in (\d\d)\1 the group matches two digits and \1 refers to the text that the group matched, so the two digits must appear twice in a row. Therefore, this regular expression matches strings that look like "abab", where a and b are both digits. |
lookarounds | Match characters, but then give up the match without consuming them. Lookarounds are zero-length assertions, similar to start-of-string and end-of-string anchors, but they test for the given characters. Use (?...) with the appropriate indicator (=, !, <=, or <!) to specify a lookaround. |
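The backreference and lookaround rows above can be tried out directly with java.util.regex. The following is a minimal sketch rather than anything taken from the product: the class name, method names, and sample strings are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class LookaroundSketch {
    // Backreference: \1 must repeat exactly what group 1 captured.
    static boolean isRepeatedPair(String s) {
        return Pattern.matches("(\\d\\d)\\1", s);
    }

    // Positive lookahead: digits that are followed by "ms".
    // The "ms" is checked but is not part of the returned match.
    static String digitsBeforeMs(String s) {
        Matcher m = Pattern.compile("\\d+(?=ms)").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Positive lookbehind: digits that are preceded by a dollar sign.
    // The "$" is checked but is not part of the returned match.
    static String digitsAfterDollar(String s) {
        Matcher m = Pattern.compile("(?<=\\$)\\d+").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPair("1212"));
        System.out.println(digitsBeforeMs("took 250ms"));
        System.out.println(digitsAfterDollar("cost $42"));
    }
}
```

Because lookarounds are zero-length, the text they test ("ms" or "$" in this sketch) never appears in the value returned by group().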
Character types
Character types are shorthand for literal matches. For more information about character types, see Java 8 regular expressions.
Term | Description | Example | Explanation |
---|---|---|---|
\w | Match a word character (a letter, number, or underscore character). | \w\w\w | Matches any three word characters. |
\W | Match a non-word character. | \W\W\W | Matches any three non-word characters. |
\d | Match a digit character. | \d\d\d-\d\d-\d\d\d\d | Matches a Social Security number, or a similar 3-2-4 number string. |
\D | Match a non-digit character. | \D\D\D | Matches any three non-digit characters. |
\s | Match a whitespace character. | \d\s\d | Matches a sequence of a digit, a whitespace, and then another digit. |
\S | Match a non-whitespace character. | \d\S\d | Matches a sequence of a digit, a non-whitespace character, and another digit. |
. | Match any character. Use sparingly. | \d\d.\d\d.\d\d | Matches a date string such as 12/31/14 or 01.01.15, but can also match 99A99B99. |
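The caution about the period is easy to verify. The sketch below compares the date pattern from the table with a stricter variant. The stricter pattern, the class name, and the method names are assumptions made for illustration and do not come from the product documentation.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CharacterTypeSketch {
    // The pattern from the table: "." accepts any separator character.
    static final Pattern LOOSE_DATE = Pattern.compile("\\d\\d.\\d\\d.\\d\\d");

    // An assumed stricter alternative: only "/" or "." may separate the parts.
    static final Pattern STRICT_DATE = Pattern.compile("\\d\\d[/.]\\d\\d[/.]\\d\\d");

    static boolean looseMatch(String s) {
        return LOOSE_DATE.matcher(s).matches();
    }

    static boolean strictMatch(String s) {
        return STRICT_DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"12/31/14", "01.01.15", "99A99B99"}) {
            System.out.println(s + " loose=" + looseMatch(s) + " strict=" + strictMatch(s));
        }
    }
}
```

Replacing the period with a small character class is one way to keep the pattern short while rejecting strings such as 99A99B99.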
Groups, quantifiers, and alternation
Regular expressions allow groupings indicated by the type of bracket used to enclose the regular expression characters. You can apply quantifiers (*, +, ?) to the enclosed group and use alternation within the group. For more information about groups and quantifiers, see Java 8 regular expressions.
Term | Description | Example | Explanation |
---|---|---|---|
* | Match zero or more times. | \w* | Matches zero or more word characters. |
+ | Match one or more times. | \d+ | Matches at least one digit. |
? | Match zero or one time. | \d\d\d-?\d\d-?\d\d\d\d | Matches a Social Security Number with or without dashes. |
( ) | Parentheses define match or capture groups, atomic groups, and lookarounds. | (H..).(o..) | When given the string Hello World, this matches Hel and o W. |
[ ] | Square brackets define character classes. | [a-z0-9#] | Matches any character that is a through z, 0 through 9, or #. |
{ } | Curly brackets define repetitions. | \d{3,5} | Matches a string of 3 to 5 digits in length. |
< > | Angle brackets define named capture groups. Use the syntax (?<var>...) to set up a named field extraction. See the "Capture groups in regular expressions" section, on this page, for more information. | (?<ssn>\d\d\d-\d\d-\d\d\d\d) | Pulls out a Social Security Number and assigns it to the ssn field. |
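A few rows of this table can be checked with plain java.util.regex. This is a minimal sketch that uses the patterns and sample strings from the table. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class GroupSketch {
    // Returns the text captured by the two groups in (H..).(o..),
    // or an empty array when the pattern is not found.
    static String[] helloGroups(String s) {
        Matcher m = Pattern.compile("(H..).(o..)").matcher(s);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // "-?" makes each dash optional, so both dashed and undashed forms match.
    static boolean looksLikeSsn(String s) {
        return Pattern.matches("\\d\\d\\d-?\\d\\d-?\\d\\d\\d\\d", s);
    }

    // {3,5} requires between three and five digits in total.
    static boolean threeToFiveDigits(String s) {
        return Pattern.matches("\\d{3,5}", s);
    }

    public static void main(String[] args) {
        String[] groups = helloGroups("Hello World");
        System.out.println("[" + groups[0] + "] [" + groups[1] + "]");
        System.out.println(looksLikeSsn("123456789"));
        System.out.println(threeToFiveDigits("12"));
    }
}
```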
Using regular expressions in the Canvas Builder vs the SPL2 Pipeline Builder
Regular expressions make liberal use of the backslash character. In the SPL2 Pipeline Builder, you write the regular expression directly as a string, so each literal backslash in the string must be written as \\. In the Canvas Builder, string fields are escaped automatically, so enter the backslash character without escaping. For example, enter the regular expression \d as \d in the Canvas Builder, but write it as \\d in the SPL2 Pipeline Builder.
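Java string literals use the same doubling convention, so a Java analogy may help show what the regular expression engine actually receives. This sketch is plain Java, not SPL2, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EscapeSketch {
    // Written with two backslashes in source code, but the resulting
    // string holds only two characters: a backslash and the letter d.
    static final String DIGIT = "\\d";

    static int patternLength() {
        return DIGIT.length();
    }

    static boolean matchesDigit(String s) {
        return Pattern.matches(DIGIT, s);
    }

    public static void main(String[] args) {
        System.out.println(DIGIT);
        System.out.println(patternLength());
        System.out.println(matchesDigit("7"));
    }
}
```

The doubled backslash exists only in the source text. By the time the pattern reaches the regular expression engine, it is the two-character sequence \d.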
Capture groups in regular expressions
A named capture group is a regular expression grouping that extracts a field value when a regular expression matches an event. Capture groups include the name of the field and are notated with angle brackets as follows:
some text (?<fieldName>regular expression capture pattern) more text
After a capture group is defined within a function, a map of all extracted, matched fields is returned in the format: {"capture_group_1": "matching_expression_1", "capture_group_N":"matching_expression_N"}. If you do not name the capturing group, the group names are returned as "1", "2", "3", "N", etc.
Underscores are not supported in capture group names, for example, <ip_address> is an invalid capturing group name.
For example, suppose you have the following event text and you want to extract the IP address from it.
131.253.24.135 fail admin_user
You can use the following regular expression and capture group to extract the IP address from the event.
(?<ip>\d+\.\d+\.\d+\.\d+)
This returns a map with the key ip whose value is the value of the extracted capture group.
For an unnamed capture group, a function with the regular expression (\d+\.\d+\.\d+\.\d+) returns a map with the key 1 whose value is the value of the extracted capture group.
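The behavior described in this section can be approximated with plain java.util.regex. The sketch below is not the product's implementation: the class name, the method names, and the way the map is built are assumptions. It also shows that the Java engine rejects a group name containing an underscore at compile time.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, for illustration only.
final class CaptureGroupSketch {
    // Named group: the map key is the group name.
    static Map<String, String> extractNamed(String event) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile("(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+)").matcher(event);
        if (m.find()) {
            fields.put("ip", m.group("ip"));
        }
        return fields;
    }

    // Unnamed groups: the map keys are the group numbers "1", "2", and so on.
    static Map<String, String> extractNumbered(String event) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile("(\\d+\\.\\d+\\.\\d+\\.\\d+)").matcher(event);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                fields.put(String.valueOf(i), m.group(i));
            }
        }
        return fields;
    }

    // Reports whether the Java engine accepts the pattern at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String event = "131.253.24.135 fail admin_user";
        System.out.println(extractNamed(event));
        System.out.println(extractNumbered(event));
        System.out.println(compiles("(?<ip_address>\\d+)"));
    }
}
```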
Regular expression modifier flags
The following regular expression modifier flags are available. Use a modifier flag to change the default regular expression behavior. To use a modifier, place it at the beginning of the regular expression pattern in the format /(?MODIFIER)my regular expression/ or at the end of the regular expression pattern in the format /my regular expression/MODIFIER.
Modifier | Description |
---|---|
c | CANON_EQ. Enables canonical equivalence. |
d | UNIX_LINES. Enables Unix lines mode. |
i | CASE_INSENSITIVE. Enables case-insensitive matching. |
l | LITERAL. Enables literal parsing of the pattern. |
m | MULTILINE. Enables multiline mode. |
s | DOTALL. Enables dotall mode. |
u | UNICODE_CASE. Enables Unicode-aware case folding. |
U | UNICODE_CHARACTER_CLASS. Enables the Unicode version of predefined character classes and POSIX character classes. |
x | COMMENTS. Permits whitespace and comments in the pattern. |
As an example, suppose you have the following body text:
Jul 20 17:07:55 93.227.214.209 %ASA-6-302014: Teardown TCP connection 304488019 for Outside:151.185.159.199/50867(LOCALxNora) to Inside:10.179.121.51/88 duration 0:00:00 bytes 4514 TCP FINs
You can use either of the following regular expressions, each with a modifier flag, to extract the ASA field from that text.
/(?i)(?<ASA>ASA-\d-\d{6})/
/(?<ASA>ASA-\d-\d{6})/i
The first regular expression uses the i modifier flag to enable case-insensitive matching. The pattern-matching characters used for the named capturing group ASA are specific. \d means "digit" and {6} means "match the preceding element exactly six times", so \d{6} matches a string of six digits.
The capture group for ASA matches the characters "ASA", followed by a dash, followed by one digit, followed by another dash, and then followed by six digits. This describes the syntax for a Cisco ASA syslog message.
The second regular expression does the same as the first, except the modifier flag is placed at the end instead of the beginning.
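In plain java.util.regex there are no slash delimiters. The sketch below therefore assumes that the leading form corresponds to the embedded flag (?i) and that the trailing form corresponds to passing Pattern.CASE_INSENSITIVE to Pattern.compile. The class and method names are hypothetical, and the body text is the sample from this section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ModifierFlagSketch {
    static final String BODY =
        "Jul 20 17:07:55 93.227.214.209 %ASA-6-302014: Teardown TCP connection 304488019 "
        + "for Outside:151.185.159.199/50867(LOCALxNora) to Inside:10.179.121.51/88 "
        + "duration 0:00:00 bytes 4514 TCP FINs";

    // Embedded flag: (?i) at the start of the pattern.
    static String withInlineFlag(String body) {
        Matcher m = Pattern.compile("(?i)(?<ASA>ASA-\\d-\\d{6})").matcher(body);
        return m.find() ? m.group("ASA") : null;
    }

    // Assumed equivalent of the trailing form: the flag is passed to compile().
    static String withCompileFlag(String body) {
        Matcher m = Pattern.compile("(?<ASA>ASA-\\d-\\d{6})", Pattern.CASE_INSENSITIVE).matcher(body);
        return m.find() ? m.group("ASA") : null;
    }

    // No flag: the match is case-sensitive.
    static String caseSensitive(String body) {
        Matcher m = Pattern.compile("(?<ASA>ASA-\\d-\\d{6})").matcher(body);
        return m.find() ? m.group("ASA") : null;
    }

    public static void main(String[] args) {
        System.out.println(withInlineFlag(BODY));
        System.out.println(withCompileFlag(BODY.toLowerCase()));
        System.out.println(caseSensitive(BODY.toLowerCase()));
    }
}
```

With either flag placement, a lowercased copy of the body still yields a match. Without the flag, the lowercased copy yields no match.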
See more
Visit the following pages for resources on how to write Java 8 regular expressions.
This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1-patch02, 1.2.1, 1.2.2-patch02, 1.2.4, 1.2.5