Splunk® Data Stream Processor

Function Reference

Datagen (beta)

Datagen is included with the ML plugin but is still under active development. It is not visible in the Streaming ML user interface. Limit your use of Datagen to experimentation.

Datagen is a Flink-native data source that generates various types of events. Use Datagen to generate synthetic data points such as timestamps, normally distributed variables, hash values, and ranged values. You can optionally instruct Datagen to stop after it generates a set number of synthetic data points.

You can define a template string that indicates which fields to generate and the type of each field. Datagen then generates events and adds a field named value that contains the entire rendered string, along with an individual field for each parameter in the template string. Datagen offers an optional params argument to configure the fields defined in the template string.
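
The relationship between the template string, the params map, and the output fields can be sketched in Python. This is only an illustration of the behavior described above, not Datagen's implementation: the render_event helper and the single toy generator are hypothetical, and only the "value" generator type and the value output field come from this topic.

```python
import re

# Hypothetical stand-in for Datagen generators, keyed by type name.
# Only the "value" type is modeled here.
GENERATORS = {
    "value": lambda cfg: cfg["value"],  # returns the fixed value from params
}

def render_event(template, params):
    """Render one event: a 'value' field holding the full string,
    plus one field per parameter found in the template."""
    fields = {}

    def substitute(match):
        name = match.group(1)
        # Collect "name.key" entries from params into a per-field config.
        prefix = name + "."
        cfg = {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}
        generated = GENERATORS[cfg["type"]](cfg)
        fields[name] = generated
        return generated

    rendered = re.sub(r"\{(\w+)\}", substitute, template)
    return {"value": rendered, **fields}

event = render_event(
    "Look at this number: {number}",
    {"number.type": "value", "number.value": "10"},
)
print(event)
```

The sketch returns a dictionary with the rendered string under value and the generated number under its own key, which mirrors the output shape described above.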

Function Input/Output Schema

Function Output
collection<record<R>>
This function outputs collections of records with schema R.

Syntax

from datagen("{timestamp} {norm} {hash}", {}, 100);

Required arguments

format
Syntax: string
Description: Required template string used to generate synthetic events. Parameters are enclosed in {}. Datagen replaces each parameter with a value generated according to the configuration provided in the params argument.
Example: | from datagen("Look at this number: {number}", {"number.type": "value", "number.value": "10"}, 1);

Optional arguments

fieldgen
Syntax: type
Description: A scalar function that applies any of the generators listed under Generators to non-Datagen sources.
Example: fieldgen(type, params)
interval
Syntax: integer
Description: Defines how frequently (in milliseconds) events should be emitted.
Example: {"interval": 1000}
n
Syntax: long
Description: If defined, Datagen generates "n" events and terminates afterwards.
p
Syntax: integer
Description: If defined, Datagen creates "p" parallel instances. Each parallel instance generates "n" events, if n is defined. Parallelism set for the entire job takes precedence over this setting.
Example: {"p": 4}
params
Syntax: map (string, any)
Description: Map that configures how the parameters defined in format are replaced. Datagen replaces each parameter with a value generated according to the configuration provided in this map.
seed
Syntax: integer
Description: If defined, this value is used as the seed for all random operations. Set a seed to get deterministic behavior, for example in tests.
Example: 42
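
The effect of a seed can be illustrated outside Datagen with Python's standard random module. This sketch only demonstrates the general idea of seeded, repeatable random generation. It assumes nothing about which random number generator Datagen uses, and gaussian_samples is a hypothetical helper.

```python
import random

def gaussian_samples(seed, count):
    """Return `count` Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.gauss(0.0, 1.0) for _ in range(count)]

first_run = gaussian_samples(42, 5)
second_run = gaussian_samples(42, 5)
other_seed = gaussian_samples(43, 5)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```

Fixing the seed makes repeated runs comparable, which is why a seed is useful in tests.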

Generators

You do not have to define field.type for every field you want to generate. As shorthand, you can use the type of a field as the parameter name. The following two statements are equivalent:

| from datagen("{timestamp}");
| from datagen("{field}", {"field.type": "timestamp"});

You can also set additional configurations with the following shorthand notation. This shorthand is applicable to all Generator fields listed:
| from datagen("{range}", {"range.max": 1024});
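
One way to picture the shorthand is as a lookup with a default: if params has no explicit type entry for a parameter, the parameter name itself is treated as the type. The Python sketch below is an assumption about how to model that rule for illustration, not Datagen's actual resolution logic, and resolve_type is a hypothetical name.

```python
def resolve_type(name, params):
    """Return the generator type for a template parameter.

    Falls back to the parameter name when "<name>.type" is absent,
    mirroring the shorthand described above.
    """
    return params.get(name + ".type", name)

# Explicit form: {field} with "field.type" set to "timestamp".
print(resolve_type("field", {"field.type": "timestamp"}))
# Shorthand form: {timestamp} with no params.
print(resolve_type("timestamp", {}))
# Shorthand with extra configuration: {range} with "range.max".
print(resolve_type("range", {"range.max": 1024}))
```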

eps
Syntax: long
Description: Experimental feature that generates timestamps at a defined rate. If defined, timestamps are spread uniformly across each second. This is a simulated "eps" that works at the normal rate of event generation. Note that "eps=2" does not mean that only two events are generated per second in real time. It means that the output timestamps are spaced 500 ms apart.
Example: 2
hash
Syntax: integer
Description: Generates a random alphanumeric string of a chosen length.
Example: {"field.type": "hash", "field.length": 64}
ipv4
Syntax: string
Description: Replaces the parameter with an IPv4 address.
Example: {"field.type": "ipv4"}
ipv6
Syntax: string
Description: Replaces the parameter with an IPv6 address.
Example: {"field.type": "ipv6"}
integerid
Syntax: integer
Description: Assigns an incremental integer value.
Example: {"field.type": "integerid", "field.start": 1000}
list
Syntax: string
Description: Outputs a random value from a provided comma-separated list of values.
Example: {"field.type": "list", "field.values": "debug,info,warning"}
norm
Syntax: string
Description: Generates a random Gaussian variable.
Example: {"field.type": "norm"}
range
Syntax: integer
Description: Replaces the parameter with an integer or float value within the provided range.
Example: {"field.type": "range"}
seqlist
Syntax: collection<string>
Description: Picks elements from the list sequentially and returns to the first element (index 0) once the list is exhausted.
Example: {"field.type": "seqlist", "field.values": "debug,info,warning"}
timestamp
Syntax: string
Description: Generates a timestamp. The format setting defines how the time is formatted and follows "strptime"/"strftime" semantics. To output a Unix epoch with millisecond precision, use "%s".
Example: {"field.type": "timestamp", "field.format": "%b %d %H:%M:%S"}
value
Syntax: string
Description: Replaces the parameter with the chosen value.
Example: {"field.type": "value", "field.value": "10"}
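
Two of the generators above have behavior that is easy to misread: eps controls the spacing of the generated timestamps rather than the real-time emission rate, and seqlist walks its list in order and wraps around. The following Python sketch models both rules as described in this topic. It is an illustration under those stated assumptions, not Datagen code, and the names eps_timestamps and seqlist_values are hypothetical.

```python
from itertools import cycle, islice

def eps_timestamps(start_ms, eps, count):
    """Return `count` timestamps in milliseconds, spaced 1000 / eps ms apart."""
    step_ms = 1000 / eps
    return [start_ms + i * step_ms for i in range(count)]

def seqlist_values(values, count):
    """Return `count` values taken in order from a comma-separated list,
    wrapping back to index 0 once the list is exhausted."""
    return list(islice(cycle(values.split(",")), count))

# With eps=2, consecutive timestamps are 500 ms apart.
print(eps_timestamps(0, 2, 4))
# With three values and five draws, the sequence wraps after "warning".
print(seqlist_values("debug,info,warning", 5))
```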

Usage

You can set up parallel instances using the p parameter to increase overall throughput.

SPL2 example

The following example uses Datagen on a test set. The number 100 in the example tells Datagen how many events to create before stopping.

| from datagen("{timestamp} {norm} {hash}", {}, 100);
Last modified on 02 September, 2021

This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1