Splunk® Data Stream Processor

Function Reference

On October 30, 2022, all 1.2.x versions of the Splunk Data Stream Processor will reach its end of support date. See the Splunk Software Support Policy for details. For information about upgrading to a supported version, see the Upgrade the Splunk Data Stream Processor topic.

Lookup

This topic describes how to use the lookup function in the Splunk Data Stream Processor.

Description

Use the lookup function to enrich your streaming data with related information that is in a lookup dataset. Field-value pairs in your DSP records are matched with field-value pairs in a lookup dataset.

To use this function, you must first upload a lookup file or connect to a Splunk Enterprise KV Store. Use a CSV lookup if you are using relatively small, static sets of data and a KV Store lookup if you are using large sets of data that you plan to modify frequently.

You must restart any active pipelines that use a lookup if one of the following options applies to you.

  • You are modifying the CSV lookup file being used in an active pipeline.
  • You are modifying the KV Store lookup connection being used in an active pipeline.
  • You are making any changes to the schema of the lookup as represented in transforms.conf that is being used in a KV store lookup connection.

You do not need to restart any active pipelines that are using a KV Store lookup if you are adding, updating, or deleting data in the KV Store collection. However, you do need to restart any active pipelines if you are modifying the schema of the KV Store collection.

DSP caches KV Store lookup data for a period of time. See Connect to the Splunk Enterprise KV Store using the Streams API page for information on how to tune caching in your DSP lookups.

Function Input/Output Schema

Function Input
collection<record<R>>
This function takes in collections of records with schema R.
Function Output
collection<record<S>>
This function outputs the same collection of records but with a different schema S.

Syntax

The required syntax is in bold.

lookup <lookup_dataset> (<lookup_field> [AS <event_field>] )...
[ (OUTPUT | OUTPUTNEW) ( <lookup_destfield> [AS <event_destfield>] )...]

Required arguments

lookup_dataset
Display name in the Canvas View: Lookup Name
Syntax: <string>
Description: The name of the lookup connection. Before using this function, you must either upload a CSV table or connect to a Splunk Enterprise KV Store. See About lookups.
Example in Canvas View: The Lookup Name field is set to store_info.
lookup_field
Display name in the Canvas View: Lookup Field
Syntax: <string>
Description: A field in the lookup dataset to match against incoming data. You can specify multiple lookup_field values, separated by commas.
Example in Canvas View: The Lookup Field field is set to store_number.

Optional arguments

event_field
Display name in the Canvas View: Incoming Stream Field
Syntax: AS <string>
Description: The field that contains the values to match in the lookup_field in the lookup dataset. You can specify multiple event_field values. If the name of the event_field is the same as the lookup_field, you don't need to specify the event_field. The name of the lookup_field is used.
Default: Empty. The name specified in the lookup_field argument is used.
Example in Canvas View: The Incoming Stream Field field is set to Store.
OUTPUT | OUTPUTNEW
Display name in the Canvas View: Overwrite Record Fields with Lookup Values
Syntax: OUTPUT | OUTPUTNEW
Description: Select an output mode. In OUTPUTNEW mode, values from the lookup dataset are only added to the DSP record if that record has a null value for that field or is missing that field entirely. In OUTPUT mode, all field names and values from the lookup dataset are added to the outgoing DSP records. Therefore, output fields that were already in the records are overwritten with values from the lookup.
Default: OUTPUT
Example in Canvas View:
  • For OUTPUT mode: The Overwrite Record Fields with Lookup Values check box is selected.
  • For OUTPUTNEW mode: The Overwrite Record Fields with Lookup Values check box is clear.
lookup_destfield
Display name in the Canvas View: Lookup Field to Add
Syntax: <string>
Description: A field in the lookup table to be applied to the outgoing data. You can specify multiple lookup_destfield values. Used with OUTPUT | OUTPUTNEW to replace or append field values. If no lookup destination fields are specified, all fields in the lookup dataset that you did not put in the lookup_field list are added to the outgoing DSP records.
Default: Empty. All fields in the lookup dataset that are not specified in the lookup_field list are added to the outgoing DSP records.
Example in Canvas View: The Lookup Field to Add field is set to name.
event_destfield
Display name in the Canvas View: Name of the New Field
Syntax: AS <string>
Description: A field in the outgoing data. You can specify multiple event_destfield values. If the name of the event_destfield is the same as the lookup_destfield, you don't need to specify the event_destfield. The name of the lookup_destfield is used. Used with OUTPUT | OUTPUTNEW to replace or append field values.
Default: Empty. The name specified in the lookup_destfield argument is used.
Example in Canvas View: The Name of the New Field field is set to store_name.

Usage

Limitations for CSV lookups

The cumulative size of all CSV lookups in a single pipeline cannot exceed 50MB. For example, in a single pipeline, you can use one 50 MB CSV lookup file or five 10MB files.

How DSP handles data types for CSV lookups

Each result of a lookup can be either a string element, or a list of string elements. Therefore the types for each of the results are a union<string, collection<string>>.

For example, the following is a snippet of the lookup table whois that contains the columns host, ip, and user.

host ip user
mailsv1 127.0.0.1 Elliot Cooley
mailsv2 192.0.0.7 Tabitha Hagan

After uploading the lookup table to DSP, you can then use this lookup in a pipeline.

| from splunk_firehose()
| lookup 'whois' host
| eval ip_addresses = mvappend(ip)
| eval connections=for_each(iterator(ip_addresses, "i"), concat(cast(user, "string"), "@", i))
| ...;

The user and ip fields are added as new top-level fields in your outgoing data. Because both of these fields came from the uploaded CSV lookup table and all values in a CSV lookup table are strings, the new data type of both of these fields are union<string, collection<string>>. The data type is a union, because the value of this field could be either a string or a list of strings. If you want to reference the new user and ip fields downstream in your pipeline, you'll need to convert the fields from a union data type to a more specific data type. In the above SPL2, the user and ip fields are turned into type string.

How DSP handles data types for Splunk Enterprise KV Store lookups

The Splunk Enterprise KV Store is a generic store of key-value data where you can store data with a limited set of types. Since DSP has a more robust typing system, a mapping must be made from the KV Store types to DSP types. To start, a DSP lookup must determine the schema (data types) of the KV Store fields. It can do that by one of two ways.

  • (Recommended) DSP checks the Splunk Enterprise KV Store to see if a schema was defined for the KV Store in collections.conf.
  • As an emergency fallback, if the KV Store does not have a user-defined schema, then DSP will sample 10 records from the KV Store collection to try and infer the data type of the fields in the KV Store collection.

Data types are defined in the KV Store collection

To ensure the proper mapping of KV Store types to DSP types, best practices are to define and enforce data types for fields in the KV Store collection. To learn how to define and enforce data types for the KV Store collection, see Use configuration files to create a KV Store collection in the Splunk Developer Portal.

The following table describes supported Splunk Enterprise KV Store data types and its DSP data type equivalent.

Splunk Enterprise KV Store Data Type Splunk Data Stream Processor Data Type Notes
string string
number double By default, DSP maps KV Store type number to DSP type double. However, if the DSP record field you are mapping to has a different numeric type, such as long or integer, then DSP uses that numeric type instead.

Be careful when using numeric fields in a KV store as your lookup key. Numbers in the Splunk Enterprise KV store are stored as floating point numbers and when queried as a lookup, the database query attempts an exact match. Due to numerical precision of floating points, semantically equivalent operations may yield different results (eg. x+y != y+x in some cases). Values as 1.0 and 1.000002 are considered not equal. Best practices are to avoid using fields using floating point numbers as your lookup key by casting it to a different numeric type when possible, such as a long.

bool boolean
time double DSP and Splunk Enterprise both use Unix epoch time, but represent and store timestamps in different ways. In Splunk Enterprise, time is defined in Unix epoch time and stored in a fixed point numeric format. When you store Splunk Enterprise time values in the Splunk Enterprise KV store, it is saved as a floating point number. In DSP, timestamp values are stored in the timestamp field in epoch time format in milliseconds.

To convert the Splunk Enterprise time format to the DSP time format, use the parse_millis and parse_nanos functions.

Because time is also a numeric data type, the note in the number section about exact matching for floating point numbers also applies here.

array collection<any>
cidr string CIDR matching is not currently supported in DSP.

The Splunk Enterprise KV Store defines two built-in fields for all records: _key and _user, both of type string. If you want to use either of those fields in a DSP lookup, you'll need to include them in your schema definition in the collections.conf file.

Data types are not defined in the KV Store collection

If you do not define data types in the KV Store collection, DSP attempts to infer the data type of your KV Store collection fields. In order to check that the KV Store schema is valid with DSP, the DSP lookup function samples 10 records from the KV Store collection in order to infer the data type of the KV Store collection fields. Because the KV Store collection supports a limited set of data types, DSP only attempts to infer whether the data type should be a string, number, boolean, or array. If there are any type conflicts, an error is shown when activating a pipeline containing this lookup.

If there are any fields where all of the values are null, then a subsequent query is done to the KV store asking for a row where that field has a non-null value. The type is inferred from the row returned. If no rows are returned, then an error is shown when activating a pipeline containing this lookup.

In order for DSP to infer the schema of the KV Store collection, the collection must have the following properties:

  • The data in the collection must have a consistent schema.
  • Fields in the collection must be consistently typed, and a field in the collection should not have two different types. If there are at least two rows containing the same fields that have different types, then DSP cannot infer the schema.
  • Fields in the collection must have at least one non-null value.
  • The collection cannot be empty.

How to convert from Splunk Enterprise time to DSP time

In a Splunk Enterprise KV Store collection, time is a special data type. When you are using the time data type, you are using the Splunk Enterprise time format which is in Unix epoch time. The Splunk Enterprise format is a fixed point format of seconds.[millis][nanos].

  • seconds is seconds since epoch.
  • millis is the milliseconds modifier and can be up to 3 digits.
  • nanos is the nanoseconds modifier and can be up to 6 digits.

Because time is not a recognized data type in DSP, you'll need to convert Splunk Enterprise time into a DSP time format. DSP stores time in two fields: timestamp and nanos.

  • timestamp is milliseconds since epoch.
  • nanos is the nanoseconds modifier of timestamp.

Timestamps are stored in Splunk Enterprise KV Store collections as float-64 values, and therefore, you may lose precision when converting from a very small or very large number.

If you are performing lookups to a field that uses the Splunk Enterprise time format, and you want to convert that to DSP time format, use the parse_millis and parse_nanos scalar functions after the lookup function.

SPL2 examples

Examples of common use cases follow. The following examples in this section assume that you are in the SPL View.

When working in the SPL View, you can write the function by providing the arguments in the exact order shown in each use case.

1. Put corresponding information from a lookup dataset into your events

This example enriches incoming streaming data with the data contained in the store_info lookup dataset using the store field.

... | lookup store_info Store AS Store;

The store_info lookup dataset contains this data:

Store Name State
store1 Tom's Diner CA
store2 Jill's Diner CA
store3 Fred's Diner FL

The incoming DSP records look something like this:

Timestamp Body Store Revenue
1589474463634 store1, 10 store1 10
1589474463634 store2, 15 store2 15
1589474463634 store3, 20 store3 20

After configuring the lookup function, the outgoing data is now enriched with the name and state fields from the store_info lookup dataset.

Name State Timestamp Body Store Revenue
Tom's Diner CA 1589474463634 store1, 10 store1 10
Jill's Diner CA 1589474463634 store2, 15 store2 15
Fred's Diner FL 1589474463634 store3, 20 store3 20

2. Rename and add a new field to your events from a lookup dataset

This example adds a new top-level field called name to your data from the lookup table file, and renames that field to store_name.

... | lookup 'store_info' 'store_number' AS store OUTPUT name AS store_name, state AS state | ...;

The store_info lookup dataset contains this data:

store_number name state
store1 Tom's Diner CA
store2 Jill's Diner CA
store3 Fred's Diner FL

The incoming DSP records look something like this:

Store Revenue
store1 10
store2 15
store3 20

After configuring the lookup function, the outgoing data adds the Name field to your data and renames the field to store_name.

store_name Revenue Store state
Tom's Diner 10 store1 CA
Jill's Diner 15 store2 CA
Fred's Diner 20 store3 FL

3. Lookup stores and return the corresponding state the store is in

In this example, one of the fields in the incoming data is missing its value. This example looks up the missing field value from the lookup file, and uses OUTPUTNEW to fill in the missing value. In addition, this lookup adds the store name to the outgoing data.

... | lookup 'store_info' 'Store Number' AS Store OUTPUTNEW State, Name;

The store_info lookup dataset contains this data:

Store Number Name State
store1 Tom's Diner CA
store2 Jill's Diner CA
store3 Fred's Diner FL

The incoming DSP records look like:

Store Revenue State
store1 10 CA
store2 15 IL
store3 20

After configuring the lookup function, the outgoing data is now enriched with the name fields from the store_info lookup dataset and the missing state values are filled in. Note that even though the CSV lookup dataset contains "CA" as the State for store2, the lookup function using OUTPUTNEW only updates fields in your streaming data if the field value is empty or null. It does not replace any existing field values, even if they differ from the field values in the lookup dataset. Thus, in your outgoing DSP records, the State for store2 is still IL.

Store Revenue State Name
store1 10 CA Tom's Diner
store2 15 IL Jill's Diner
store3 20 FL Fred's Diner

4. Replace data in your events with data from a lookup dataset

This example replaces fields in your incoming data with data in the lookup dataset.

... | lookup 'store_info' 'Store Number' AS Store OUTPUT State | ...;

The store_info lookup dataset contains this data:

Store Number Name State
store1 Tom's Diner CA
store2 Jill's Diner CA
store3 Fred's Diner FL

The incoming DSP records look like:

Store Revenue State
store1 10 WI
store2 15 IL
store3 20 FL

After configuring the lookup function, the values of the State field have been replaced with the values of the State field from the lookup table.

Store Revenue State
store1 10 CA
store2 15 CA
store3 20 FL
Last modified on 09 February, 2022
Key_by   Merge Events

This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1-patch02, 1.2.1, 1.2.2-patch02, 1.2.4, 1.2.5, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.4.5, 1.4.6


Was this topic useful?







You must be logged into splunk.com in order to post comments. Log in now.

Please try to keep this discussion focused on the content covered in this documentation topic. If you have a more general question about Splunk functionality or are experiencing a difficulty with Splunk, consider posting a question to Splunkbase Answers.

0 out of 1000 Characters