Lookup
This topic describes how to use the lookup function in the Splunk Data Stream Processor.
Description
Use the lookup function to enrich your streaming data with related information from a lookup dataset. Field-value pairs in your DSP records are matched against field-value pairs in a lookup dataset.
To use this function, you must first upload a lookup file or connect to a Splunk Enterprise KV Store. Use a CSV lookup if you are using relatively small, static sets of data and a KV Store lookup if you are using large sets of data that you plan to modify frequently.
You must restart any active pipelines that use a lookup if any of the following applies to you:
- You are modifying the CSV lookup file being used in an active pipeline.
- You are modifying the KV Store lookup connection being used in an active pipeline.
- You are making any changes to the schema of the lookup as represented in transforms.conf for a KV Store lookup connection.
You do not need to restart any active pipelines that are using a KV Store lookup if you are adding, updating, or deleting data in the KV Store collection. However, you do need to restart any active pipelines if you are modifying the schema of the KV Store collection.
DSP caches KV Store lookup data for a period of time. See Connect to the Splunk Enterprise KV Store using the Streams API for information on how to tune caching in your DSP lookups.
Function Input/Output Schema
- Function Input
collection<record<R>>
- This function takes in collections of records with schema R.
- Function Output
collection<record<S>>
- This function outputs the same collection of records but with a different schema S.
Syntax
The required syntax is in bold.
- lookup <lookup_dataset> (<lookup_field> [AS <event_field>] )...
- [ (OUTPUT | OUTPUTNEW) ( <lookup_destfield> [AS <event_destfield>] )...]
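Putting the syntax together, a complete invocation looks like the following. The dataset and field names here are illustrative, matching the examples later in this topic:

```
... | lookup 'store_info' store_number AS Store OUTPUT name AS store_name | ...;
```

Here store_number is the lookup_field, Store is the event_field, name is the lookup_destfield, and store_name is the event_destfield.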
Required arguments
- lookup_dataset
- Display name in the Canvas View: Lookup Name
- Syntax: <string>
- Description: The name of the lookup connection. Before using this function, you must either upload a CSV table or connect to a Splunk Enterprise KV Store. See About lookups.
- Example in Canvas View: The Lookup Name field is set to store_info.
- lookup_field
- Display name in the Canvas View: Lookup Field
- Syntax: <string>
- Description: A field in the lookup dataset to match against incoming data. You can specify multiple lookup_field values, separated by commas.
- Example in Canvas View: The Lookup Field field is set to store_number.
Optional arguments
- event_field
- Display name in the Canvas View: Incoming Stream Field
- Syntax: AS <string>
- Description: The field in the incoming data that contains the values to match against the lookup_field in the lookup dataset. You can specify multiple event_field values. If the name of the event_field is the same as the lookup_field, you don't need to specify the event_field; the name of the lookup_field is used.
- Default: Empty. The name specified in the lookup_field argument is used.
- Example in Canvas View: The Incoming Stream Field field is set to Store.
- OUTPUT | OUTPUTNEW
- Display name in the Canvas View: Overwrite Record Fields with Lookup Values
- Syntax: OUTPUT | OUTPUTNEW
- Description: Select an output mode. In OUTPUTNEW mode, values from the lookup dataset are only added to the DSP record if that record has a null value for that field or is missing that field entirely. In OUTPUT mode, all field names and values from the lookup dataset are added to the outgoing DSP records. Therefore, output fields that were already in the records are overwritten with values from the lookup.
- Default: OUTPUT
- Example in Canvas View:
- For OUTPUT mode: The Overwrite Record Fields with Lookup Values check box is selected.
- For OUTPUTNEW mode: The Overwrite Record Fields with Lookup Values check box is cleared.
- lookup_destfield
- Display name in the Canvas View: Lookup Field to Add
- Syntax: <string>
- Description: A field in the lookup table to apply to the outgoing data. You can specify multiple lookup_destfield values. Used with OUTPUT | OUTPUTNEW to replace or append field values.
- Default: Empty. All fields in the lookup dataset that are not specified in the lookup_field list are added to the outgoing DSP records.
- Example in Canvas View: The Lookup Field to Add field is set to name.
- event_destfield
- Display name in the Canvas View: Name of the New Field
- Syntax: AS <string>
- Description: A field in the outgoing data. You can specify multiple event_destfield values. If the name of the event_destfield is the same as the lookup_destfield, you don't need to specify the event_destfield; the name of the lookup_destfield is used. Used with OUTPUT | OUTPUTNEW to replace or append field values.
- Default: Empty. The name specified in the lookup_destfield argument is used.
- Example in Canvas View: The Name of the New Field field is set to store_name.
Usage
Limitations for CSV lookups
The cumulative size of all CSV lookups in a single pipeline cannot exceed 50 MB. For example, in a single pipeline, you can use one 50 MB CSV lookup file or five 10 MB files.
How DSP handles data types for CSV lookups
Each result of a CSV lookup can be either a single string or a list of strings. Therefore, the type of each result is union<string, collection<string>>.
For example, the following is a snippet of the lookup table whois, which contains the columns host, ip, and user.
host | ip | user |
---|---|---|
mailsv1 | 127.0.0.1 | Elliot Cooley |
mailsv2 | 192.0.0.7 | Tabitha Hagan |
After uploading the lookup table to DSP, you can then use this lookup in a pipeline.
| from splunk_firehose() | lookup 'whois' host | eval ip_addresses = mvappend(ip) | eval connections=for_each(iterator(ip_addresses, "i"), concat(cast(user, "string"), "@", i)) | ...;
The user and ip fields are added as new top-level fields in your outgoing data. Because both of these fields came from the uploaded CSV lookup table and all values in a CSV lookup table are strings, the new data type of both fields is union<string, collection<string>>. The data type is a union because the value of each field could be either a string or a list of strings. If you want to reference the new user and ip fields downstream in your pipeline, you'll need to convert the fields from the union data type to a more specific data type. In the preceding SPL2, the user and ip fields are converted to type string.
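To make union-typed lookup results usable downstream, you can convert them inline with eval. The following is a minimal sketch based on the cast function used in the preceding pipeline, keeping only the single-string interpretation of each field:

```
... | lookup 'whois' host | eval user = cast(user, "string"), ip = cast(ip, "string") | ...;
```

If a field can legitimately contain multiple values, use mvappend, as in the earlier example, to treat it as a list of strings instead.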
How DSP handles data types for Splunk Enterprise KV Store lookups
The Splunk Enterprise KV Store is a generic store of key-value data that supports a limited set of types. Because DSP has a more robust typing system, a mapping must be made from the KV Store types to DSP types. To start, a DSP lookup must determine the schema (data types) of the KV Store fields. It can do this in one of two ways:
- (Recommended) DSP checks the Splunk Enterprise KV Store to see if a schema was defined for the collection in collections.conf.
- As a fallback, if the KV Store does not have a user-defined schema, DSP samples 10 records from the KV Store collection to infer the data types of the fields in the collection.
Data types are defined in the KV Store collection
To ensure the proper mapping of KV Store types to DSP types, best practices are to define and enforce data types for fields in the KV Store collection. To learn how to define and enforce data types for the KV Store collection, see Use configuration files to create a KV Store collection in the Splunk Developer Portal.
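As a sketch, a collections.conf stanza that defines and enforces field types might look like the following, where the store_info collection and its field names are illustrative:

```
[store_info]
enforceTypes = true
field.store_number = string
field.revenue = number
field.last_updated = time
```

With enforceTypes set to true, the KV Store rejects records whose values do not match the declared types, which keeps the type mapping to DSP predictable.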
The following table describes the supported Splunk Enterprise KV Store data types and their DSP data type equivalents.
Splunk Enterprise KV Store Data Type | Splunk Data Stream Processor Data Type | Notes |
---|---|---|
string | string | |
number | double | By default, DSP maps KV Store type number to DSP type double. However, if the DSP record field you are mapping to has a different numeric type, such as long or integer, then DSP uses that numeric type instead. Be careful when using numeric fields in a KV Store as your lookup key: numbers in the Splunk Enterprise KV Store are stored as floating-point numbers, and when queried as a lookup, the database query attempts an exact match. Due to the numerical precision of floating-point numbers, semantically equivalent operations may yield different results (for example, a key computed as 0.1 + 0.2 may not exactly match a stored value of 0.3). |
bool | boolean | |
time | double | DSP and Splunk Enterprise both use Unix epoch time, but represent and store timestamps in different ways. In Splunk Enterprise, time is defined in Unix epoch time and stored in a fixed-point numeric format. When you store Splunk Enterprise time values in the Splunk Enterprise KV Store, they are saved as floating-point numbers. In DSP, timestamp values are stored in the timestamp field in epoch time format in milliseconds. To convert the Splunk Enterprise time format to the DSP time format, use the parse_millis and parse_nanos functions. Because time is also a numeric data type, the note in the number row applies here as well. |
array | collection<any> | |
cidr | string | CIDR matching is not currently supported in DSP. |
The Splunk Enterprise KV Store defines two built-in fields for all records: _key and _user, both of type string. If you want to use either of those fields in a DSP lookup, you'll need to include them in your schema definition in the collections.conf file.
Data types are not defined in the KV Store collection
If you do not define data types in the KV Store collection, DSP attempts to infer them. The DSP lookup function samples 10 records from the KV Store collection to infer the data type of each field in the collection. Because the KV Store collection supports a limited set of data types, DSP only attempts to infer whether the data type should be a string, number, boolean, or array. If there are any type conflicts, an error is shown when you activate a pipeline containing the lookup.
If all of the sampled values for a field are null, DSP makes a subsequent query to the KV Store for a row where that field has a non-null value, and infers the type from the returned row. If no rows are returned, an error is shown when you activate a pipeline containing the lookup.
In order for DSP to infer the schema of the KV Store collection, the collection must have the following properties:
- The data in the collection must have a consistent schema.
- Fields in the collection must be consistently typed. If at least two rows contain the same field with different types, DSP cannot infer the schema.
- Fields in the collection must have at least one non-null value.
- The collection cannot be empty.
How to convert from Splunk Enterprise time to DSP time
In a Splunk Enterprise KV Store collection, time is a special data type. When you use the time data type, you are using the Splunk Enterprise time format, which is Unix epoch time in a fixed-point format: seconds.[millis][nanos].
- seconds is the number of seconds since the epoch.
- millis is the milliseconds modifier and can be up to 3 digits.
- nanos is the nanoseconds modifier and can be up to 6 digits.
Because time is not a recognized data type in DSP, you'll need to convert Splunk Enterprise time into the DSP time format. DSP stores time in two fields: timestamp and nanos.
- timestamp is milliseconds since epoch.
- nanos is the nanoseconds modifier of timestamp.
Timestamps are stored in Splunk Enterprise KV Store collections as float-64 values, and therefore, you may lose precision when converting from a very small or very large number.
If you are performing lookups to a field that uses the Splunk Enterprise time format, and you want to convert that to DSP time format, use the parse_millis and parse_nanos scalar functions after the lookup function.
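For example, a Splunk Enterprise time value of 1589474463.634 corresponds to a DSP timestamp of 1589474463634 (milliseconds since the epoch) with nanos of 0. A hypothetical pipeline that performs this conversion after the lookup might look like the following, where the asset_info dataset and last_seen field are illustrative and the exact arguments accepted by parse_millis and parse_nanos are described in their own documentation:

```
... | lookup 'asset_info' host OUTPUT last_seen | eval timestamp = parse_millis(last_seen) | ...;
```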
SPL2 examples
The following examples of common use cases assume that you are in the SPL View. When working in the SPL View, you can write the function by providing the arguments in the exact order shown in each use case.
1. Put corresponding information from a lookup dataset into your events
This example enriches incoming streaming data with the data contained in the store_info lookup dataset, matching on the Store field.
... | lookup store_info Store AS Store;
The store_info lookup dataset contains this data:
Store | Name | State |
---|---|---|
store1 | Tom's Diner | CA |
store2 | Jill's Diner | CA |
store3 | Fred's Diner | FL |
The incoming DSP records look something like this:
Timestamp | Body | Store | Revenue |
---|---|---|---|
1589474463634 | store1, 10 | store1 | 10 |
1589474463634 | store2, 15 | store2 | 15 |
1589474463634 | store3, 20 | store3 | 20 |
After configuring the lookup function, the outgoing data is now enriched with the Name and State fields from the store_info lookup dataset.
Name | State | Timestamp | Body | Store | Revenue |
---|---|---|---|---|---|
Tom's Diner | CA | 1589474463634 | store1, 10 | store1 | 10 |
Jill's Diner | CA | 1589474463634 | store2, 15 | store2 | 15 |
Fred's Diner | FL | 1589474463634 | store3, 20 | store3 | 20 |
2. Rename and add a new field to your events from a lookup dataset
This example adds a new top-level field called name to your data from the lookup table file and renames that field to store_name.
... | lookup 'store_info' 'store_number' AS store OUTPUT name AS store_name, state AS state | ...;
The store_info lookup dataset contains this data:
store_number | name | state |
---|---|---|
store1 | Tom's Diner | CA |
store2 | Jill's Diner | CA |
store3 | Fred's Diner | FL |
The incoming DSP records look something like this:
Store | Revenue |
---|---|
store1 | 10 |
store2 | 15 |
store3 | 20 |
After configuring the lookup function, the outgoing data includes the name field from the lookup dataset, renamed to store_name.
store_name | Revenue | Store | state |
---|---|---|---|
Tom's Diner | 10 | store1 | CA |
Jill's Diner | 15 | store2 | CA |
Fred's Diner | 20 | store3 | FL |
3. Lookup stores and return the corresponding state the store is in
In this example, one of the fields in the incoming data is missing its value. This example looks up the missing field value from the lookup file, and uses OUTPUTNEW to fill in the missing value. In addition, this lookup adds the store name to the outgoing data.
... | lookup 'store_info' 'Store Number' AS Store OUTPUTNEW State, Name;
The store_info lookup dataset contains this data:
Store Number | Name | State |
---|---|---|
store1 | Tom's Diner | CA |
store2 | Jill's Diner | CA |
store3 | Fred's Diner | FL |
The incoming DSP records look like:
Store | Revenue | State |
---|---|---|
store1 | 10 | CA |
store2 | 15 | IL |
store3 | 20 | |
After configuring the lookup function, the outgoing data is now enriched with the Name field from the store_info lookup dataset, and the missing State values are filled in. Note that even though the CSV lookup dataset contains CA as the State for store2, the lookup function using OUTPUTNEW only updates fields in your streaming data if the field value is empty or null. It does not replace any existing field values, even if they differ from the field values in the lookup dataset. Thus, in your outgoing DSP records, the State for store2 is still IL.
Store | Revenue | State | Name |
---|---|---|---|
store1 | 10 | CA | Tom's Diner |
store2 | 15 | IL | Jill's Diner |
store3 | 20 | FL | Fred's Diner |
4. Replace data in your events with data from a lookup dataset
This example replaces fields in your incoming data with data in the lookup dataset.
... | lookup 'store_info' 'Store Number' AS Store OUTPUT State | ...;
The store_info lookup dataset contains this data:
Store Number | Name | State |
---|---|---|
store1 | Tom's Diner | CA |
store2 | Jill's Diner | CA |
store3 | Fred's Diner | FL |
The incoming DSP records look like:
Store | Revenue | State |
---|---|---|
store1 | 10 | WI |
store2 | 15 | IL |
store3 | 20 | FL |
After configuring the lookup function, the values of the State field are replaced with the values of the State field from the lookup table.
Store | Revenue | State |
---|---|---|
store1 | 10 | CA |
store2 | 15 | CA |
store3 | 20 | FL |
This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1-patch02, 1.2.1, 1.2.2-patch02, 1.2.4, 1.2.5, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.4.5