Lookup

This topic describes how to use the lookup function in the Splunk Data Stream Processor.

Description

Use the lookup function to enrich your streaming data with related information that is in a lookup dataset. Field-value pairs in your DSP records are matched with field-value pairs in a lookup dataset.

To use this function, you must first upload a lookup file or connect to a Splunk Enterprise KV Store. Use a CSV lookup if you are using relatively small, static sets of data and a KV Store lookup if you are using large sets of data that you plan to modify frequently.

You must restart any active pipelines that use a lookup if one of the following options applies to you.

You are modifying the CSV lookup file being used in an active pipeline.
You are modifying the KV Store lookup connection being used in an active pipeline.
You are making any changes to the schema of the lookup as represented in transforms.conf that is being used in a KV store lookup connection.

You do not need to restart any active pipelines that are using a KV Store lookup if you are adding, updating, or deleting data in the KV Store collection. However, you do need to restart any active pipelines if you are modifying the schema of the KV Store collection.

DSP caches KV Store lookup data for a period of time. See Connect to the Splunk Enterprise KV Store using the Streams API page for information on how to tune caching in your DSP lookups.

Function Input/Output Schema

Function Input: collection<record<R>>; This function takes in collections of records with schema R.
Function Output: collection<record<S>>; This function outputs the same collection of records but with a different schema S.

Syntax

The required syntax is in bold.

lookup <lookup_dataset> (<lookup_field> [AS <event_field>] )...: [ (OUTPUT | OUTPUTNEW) ( <lookup_destfield> [AS <event_destfield>] )...]

Required arguments

lookup_dataset: Display name in the Canvas View: Lookup Name; Syntax: <string>; Description: The name of the lookup connection. Before using this function, you must either upload a CSV table or connect to a Splunk Enterprise KV Store. See About lookups.; Example in Canvas View: The Lookup Name field is set to store_info.

lookup_field: Display name in the Canvas View: Lookup Field; Syntax: <string>; Description: A field in the lookup dataset to match against incoming data. You can specify multiple lookup_field values, separated by commas.; Example in Canvas View: The Lookup Field field is set to store_number.

Optional arguments

event_field: Display name in the Canvas View: Incoming Stream Field; Syntax: AS <string>; Description: The field that contains the values to match in the lookup_field in the lookup dataset. You can specify multiple event_field values. If the name of the event_field is the same as the lookup_field, you don't need to specify the event_field. The name of the lookup_field is used.; Default: Empty. The name specified in the lookup_field argument is used.; Example in Canvas View: The Incoming Stream Field field is set to Store.

OUTPUT | OUTPUTNEW

Display name in the Canvas View: Overwrite Record Fields with Lookup Values

Syntax: OUTPUT | OUTPUTNEW

Description: Select an output mode. In OUTPUTNEW mode, values from the lookup dataset are only added to the DSP record if that record has a null value for that field or is missing that field entirely. In OUTPUT mode, all field names and values from the lookup dataset are added to the outgoing DSP records. Therefore, output fields that were already in the records are overwritten with values from the lookup.

Default: OUTPUT

Example in Canvas View:

For OUTPUT mode: The Overwrite Record Fields with Lookup Values check box is selected.
For OUTPUTNEW mode: The Overwrite Record Fields with Lookup Values check box is clear.

lookup_destfield: Display name in the Canvas View: Lookup Field to Add; Syntax: <string>; Description: A field in the lookup table to be applied to the outgoing data. You can specify multiple lookup_destfield values. Used with OUTPUT | OUTPUTNEW to replace or append field values. If no lookup destination fields are specified, all fields in the lookup dataset that you did not put in the lookup_field list are added to the outgoing DSP records.; Default: Empty. All fields in the lookup dataset that are not specified in the lookup_field list are added to the outgoing DSP records.; Example in Canvas View: The Lookup Field to Add field is set to name.

event_destfield: Display name in the Canvas View: Name of the New Field; Syntax: AS <string>; Description: A field in the outgoing data. You can specify multiple event_destfield values. If the name of the event_destfield is the same as the lookup_destfield, you don't need to specify the event_destfield. The name of the lookup_destfield is used. Used with OUTPUT | OUTPUTNEW to replace or append field values.; Default: Empty. The name specified in the lookup_destfield argument is used.; Example in Canvas View: The Name of the New Field field is set to store_name.

Usage

Limitations for CSV lookups

The cumulative size of all CSV lookups in a single pipeline cannot exceed 50MB. For example, in a single pipeline, you can use one 50 MB CSV lookup file or five 10MB files.

How DSP handles data types for CSV lookups

Each result of a lookup can be either a string element, or a list of string elements. Therefore the types for each of the results are a union<string, collection<string>>.

For example, the following is a snippet of the lookup table whois that contains the columns host, ip, and user.

host	ip	user
mailsv1	127.0.0.1	Elliot Cooley
mailsv2	192.0.0.7	Tabitha Hagan

After uploading the lookup table to DSP, you can then use this lookup in a pipeline.

| from splunk_firehose()
| lookup 'whois' host
| eval ip_addresses = mvappend(ip)
| eval connections=for_each(iterator(ip_addresses, "i"), concat(cast(user, "string"), "@", i))
| ...;

The user and ip fields are added as new top-level fields in your outgoing data. Because both of these fields came from the uploaded CSV lookup table and all values in a CSV lookup table are strings, the new data type of both of these fields are union<string, collection<string>>. The data type is a union, because the value of this field could be either a string or a list of strings. If you want to reference the new user and ip fields downstream in your pipeline, you'll need to convert the fields from a union data type to a more specific data type. In the above SPL2, the user and ip fields are turned into type string.

How DSP handles data types for Splunk Enterprise KV Store lookups

The Splunk Enterprise KV Store is a generic store of key-value data where you can store data with a limited set of types. Since DSP has a more robust typing system, a mapping must be made from the KV Store types to DSP types. To start, a DSP lookup must determine the schema (data types) of the KV Store fields. It can do that by one of two ways.

(Recommended) DSP checks the Splunk Enterprise KV Store to see if a schema was defined for the KV Store in collections.conf.
As an emergency fallback, if the KV Store does not have a user-defined schema, then DSP will sample 10 records from the KV Store collection to try and infer the data type of the fields in the KV Store collection.

Data types are defined in the KV Store collection

To ensure the proper mapping of KV Store types to DSP types, best practices are to define and enforce data types for fields in the KV Store collection. To learn how to define and enforce data types for the KV Store collection, see Use configuration files to create a KV Store collection in the Splunk Developer Portal.

The following table describes supported Splunk Enterprise KV Store data types and its DSP data type equivalent.

Splunk Enterprise KV Store Data Type	Splunk Data Stream Processor Data Type	Notes
string	string
number	double	By default, DSP maps KV Store type number to DSP type double. However, if the DSP record field you are mapping to has a different numeric type, such as long or integer, then DSP uses that numeric type instead. Be careful when using numeric fields in a KV store as your lookup key. Numbers in the Splunk Enterprise KV store are stored as floating point numbers and when queried as a lookup, the database query attempts an exact match. Due to numerical precision of floating points, semantically equivalent operations may yield different results (eg. `x+y != y+x` in some cases). Values as 1.0 and 1.000002 are considered not equal. Best practices are to avoid using fields using floating point numbers as your lookup key by casting it to a different numeric type when possible, such as a long.
bool	boolean
time	double	DSP and Splunk Enterprise both use Unix epoch time, but represent and store timestamps in different ways. In Splunk Enterprise, time is defined in Unix epoch time and stored in a fixed point numeric format. When you store Splunk Enterprise time values in the Splunk Enterprise KV store, it is saved as a floating point number. In DSP, timestamp values are stored in the `timestamp` field in epoch time format in milliseconds. To convert the Splunk Enterprise time format to the DSP time format, use the parse_millis and parse_nanos functions. Because time is also a numeric data type, the note in the `number` section about exact matching for floating point numbers also applies here.
array	collection<any>
cidr	string	CIDR matching is not currently supported in DSP.

The Splunk Enterprise KV Store defines two built-in fields for all records: _key and _user, both of type string. If you want to use either of those fields in a DSP lookup, you'll need to include them in your schema definition in the collections.conf file.

Data types are not defined in the KV Store collection

If you do not define data types in the KV Store collection, DSP attempts to infer the data type of your KV Store collection fields. In order to check that the KV Store schema is valid with DSP, the DSP lookup function samples 10 records from the KV Store collection in order to infer the data type of the KV Store collection fields. Because the KV Store collection supports a limited set of data types, DSP only attempts to infer whether the data type should be a string, number, boolean, or array. If there are any type conflicts, an error is shown when activating a pipeline containing this lookup.

If there are any fields where all of the values are null, then a subsequent query is done to the KV store asking for a row where that field has a non-null value. The type is inferred from the row returned. If no rows are returned, then an error is shown when activating a pipeline containing this lookup.

In order for DSP to infer the schema of the KV Store collection, the collection must have the following properties:

The data in the collection must have a consistent schema.
Fields in the collection must be consistently typed, and a field in the collection should not have two different types. If there are at least two rows containing the same fields that have different types, then DSP cannot infer the schema.
Fields in the collection must have at least one non-null value.
The collection cannot be empty.

How to convert from Splunk Enterprise time to DSP time

In a Splunk Enterprise KV Store collection, time is a special data type. When you are using the time data type, you are using the Splunk Enterprise time format which is in Unix epoch time. The Splunk Enterprise format is a fixed point format of seconds.[millis][nanos].

seconds is seconds since epoch.
millis is the milliseconds modifier and can be up to 3 digits.
nanos is the nanoseconds modifier and can be up to 6 digits.

Because time is not a recognized data type in DSP, you'll need to convert Splunk Enterprise time into a DSP time format. DSP stores time in two fields: timestamp and nanos.

timestamp is milliseconds since epoch.
nanos is the nanoseconds modifier of timestamp.

Timestamps are stored in Splunk Enterprise KV Store collections as float-64 values, and therefore, you may lose precision when converting from a very small or very large number.

If you are performing lookups to a field that uses the Splunk Enterprise time format, and you want to convert that to DSP time format, use the parse_millis and parse_nanos scalar functions after the lookup function.

SPL2 examples

Examples of common use cases follow. The following examples in this section assume that you are in the SPL View.

When working in the SPL View, you can write the function by providing the arguments in the exact order shown in each use case.

1. Put corresponding information from a lookup dataset into your events

This example enriches incoming streaming data with the data contained in the store_info lookup dataset using the store field.

... | lookup store_info Store AS Store;

The store_info lookup dataset contains this data:

Store	Name	State
store1	Tom's Diner	CA
store2	Jill's Diner	CA
store3	Fred's Diner	FL

The incoming DSP records look something like this:

Timestamp	Body	Store	Revenue
1589474463634	store1, 10	store1	10
1589474463634	store2, 15	store2	15
1589474463634	store3, 20	store3	20

After configuring the lookup function, the outgoing data is now enriched with the name and state fields from the store_info lookup dataset.

Name	State	Timestamp	Body	Store	Revenue
Tom's Diner	CA	1589474463634	store1, 10	store1	10
Jill's Diner	CA	1589474463634	store2, 15	store2	15
Fred's Diner	FL	1589474463634	store3, 20	store3	20

2. Rename and add a new field to your events from a lookup dataset

This example adds a new top-level field called name to your data from the lookup table file, and renames that field to store_name.

... | lookup 'store_info' 'store_number' AS store OUTPUT name AS store_name, state AS state | ...;

The store_info lookup dataset contains this data:

store_number	name	state
store1	Tom's Diner	CA
store2	Jill's Diner	CA
store3	Fred's Diner	FL

The incoming DSP records look something like this:

Store	Revenue
store1	10
store2	15
store3	20

After configuring the lookup function, the outgoing data adds the Name field to your data and renames the field to store_name.

store_name	Revenue	Store	state
Tom's Diner	10	store1	CA
Jill's Diner	15	store2	CA
Fred's Diner	20	store3	FL

3. Lookup stores and return the corresponding state the store is in

In this example, one of the fields in the incoming data is missing its value. This example looks up the missing field value from the lookup file, and uses OUTPUTNEW to fill in the missing value. In addition, this lookup adds the store name to the outgoing data.

... | lookup 'store_info' 'Store Number' AS Store OUTPUTNEW State, Name;

The store_info lookup dataset contains this data:

Store Number	Name	State
store1	Tom's Diner	CA
store2	Jill's Diner	CA
store3	Fred's Diner	FL

The incoming DSP records look like:

Store	Revenue	State
store1	10	CA
store2	15	IL
store3	20

After configuring the lookup function, the outgoing data is now enriched with the name fields from the store_info lookup dataset and the missing state values are filled in. Note that even though the CSV lookup dataset contains "CA" as the State for store2, the lookup function using OUTPUTNEW only updates fields in your streaming data if the field value is empty or null. It does not replace any existing field values, even if they differ from the field values in the lookup dataset. Thus, in your outgoing DSP records, the State for store2 is still IL.

Store	Revenue	State	Name
store1	10	CA	Tom's Diner
store2	15	IL	Jill's Diner
store3	20	FL	Fred's Diner

4. Replace data in your events with data from a lookup dataset

This example replaces fields in your incoming data with data in the lookup dataset.

... | lookup 'store_info' 'Store Number' AS Store OUTPUT State | ...;

The store_info lookup dataset contains this data:

Store Number	Name	State
store1	Tom's Diner	CA
store2	Jill's Diner	CA
store3	Fred's Diner	FL

The incoming DSP records look like:

Store	Revenue	State
store1	10	WI
store2	15	IL
store3	20	FL

After configuring the lookup function, the values of the State field have been replaced with the values of the State field from the lookup table.

Store	Revenue	State
store1	10	CA
store2	15	CA
store3	20	FL

Related answers from Splunk Community

Lookup

Description

Function Input/Output Schema

Syntax

Required arguments

Optional arguments

Usage

Limitations for CSV lookups

How DSP handles data types for CSV lookups

How DSP handles data types for Splunk Enterprise KV Store lookups

Data types are defined in the KV Store collection

Data types are not defined in the KV Store collection

How to convert from Splunk Enterprise time to DSP time

SPL2 examples

1. Put corresponding information from a lookup dataset into your events

2. Rename and add a new field to your events from a lookup dataset

3. Lookup stores and return the corresponding state the store is in

4. Replace data in your events with data from a lookup dataset

Comments

Lookup

Was this topic useful?