Datagen (beta)

Datagen is included with the ML plugin but is still under active development. Datagen is not visible in the Streaming ML user interface. Limit your use of Datagen to experimentation.

Datagen is a Flink-native data source that generates various types of events. Use Datagen to generate synthetic data points such as timestamps, normally distributed variables, hash values, and ranged values. You can optionally instruct Datagen to stop after generating a set number of synthetic data points.

You can define a template string that indicates which fields to generate and the type of each field. Datagen then generates events, adding a value field that contains the entire generated string along with an individual field for each parameter in the template string. Datagen also offers an optional params argument for configuring the fields defined in the template string.
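The following Python sketch illustrates the record shape described above: one value field holding the whole rendered string, plus one field per template parameter. It is not Datagen's implementation. The render_event function and the generator callables are hypothetical names introduced only for this illustration.

```python
import re

def render_event(template, generators):
    """Expand {name} placeholders and return a record (hypothetical helper).

    The record holds the full rendered string under "value" plus one
    entry per placeholder, mirroring the behavior described in the text.
    """
    fields = {}

    def substitute(match):
        name = match.group(1)
        # Call the generator registered for this placeholder and keep the result.
        fields[name] = str(generators[name]())
        return fields[name]

    rendered = re.sub(r"\{(\w+)\}", substitute, template)
    return {"value": rendered, **fields}

# A constant generator stands in for a field of type "value".
event = render_event("Look at this number: {number}", {"number": lambda: 10})
print(event)
```

In this sketch every generated field is stored as a string. Whether Datagen itself keeps typed values is not stated in this topic.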

Function Input/Output Schema

Function Output: collection<record<R>>; This function outputs collections of records with schema R.

Syntax

| from datagen("{timestamp} {norm} {hash}", {}, 100);

Required arguments

format: Syntax: string; Description: Required template string used to generate synthetic events. Parameters are enclosed in {}. Datagen replaces each parameter with a value generated according to the configuration provided in the params argument.; Example: | from datagen("Look at this number: {number}", {"number.type": "value", "number.value": "10"}, 1);

Optional arguments

fieldgen: Syntax: type; Description: A scalar function that applies any of the generators listed in the Generators section to non-Datagen sources.; Example: fieldgen(type, params)

interval: Syntax: integer; Description: Defines how frequently (in milliseconds) events are emitted.; Example: {"interval": 1000}

n: Syntax: long; Description: If defined, Datagen generates "n" events and terminates afterwards.

p: Syntax: integer; Description: If defined, Datagen creates "p" parallel instances. Each parallel instance generates "n" events, if "n" is defined. The parallelism set for the entire job takes precedence over this setting.; Example: {"p": 4}

params: Syntax: map (string, any); Description: Map that configures the parameters defined in format. Datagen replaces each parameter with a value generated according to this configuration.

seed: Syntax: integer; Description: If defined, Datagen uses this value as the seed for all random operations. Set a seed to get deterministic behavior in tests.; Example: 42
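As a rough illustration of why a fixed seed makes test runs repeatable, the following Python sketch draws Gaussian values from a seeded random number generator. It uses Python's standard library rather than Datagen. The generate_norms function is a hypothetical name, and the mean of 0 and standard deviation of 1 are assumptions made for the example.

```python
import random

def generate_norms(n, seed=None):
    """Return n Gaussian samples (hypothetical helper, not part of Datagen).

    With a fixed seed the sequence is identical on every call; with
    seed=None the generator is seeded from system entropy instead.
    """
    rng = random.Random(seed)
    # Mean 0.0 and standard deviation 1.0 are assumptions for this sketch.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

first = generate_norms(3, seed=42)
second = generate_norms(3, seed=42)
print(first == second)  # the two seeded runs produce the same values
```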

Generators

You do not need to define field.type for every field that you want to generate. You can use the name of a type as the field name as shorthand. The following two statements are equivalent:

| from datagen("{timestamp}");
| from datagen("{field}", {"field.type": "timestamp"});

You can also set additional configurations with the following shorthand notation. This shorthand is applicable to all Generator fields listed:
| from datagen("{range}", {"range.max": 1024});

eps: Syntax: long; Description: Experimental feature that generates timestamps at a defined rate. If defined, timestamps are spread uniformly within each second. This is a simulated "eps" that works at the normal rate of event generation: "eps=2" does not mean that only two events are generated per second in real time, but that the output timestamps are spaced 500 ms apart.; Example: 2
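The spacing arithmetic behind eps can be sketched in a few lines of Python. This illustrates the description above and is not Datagen's code. The simulated_timestamps function, the start value, and the rounding to whole milliseconds are all assumptions.

```python
def simulated_timestamps(start_ms, eps, count):
    """Return count timestamps spaced 1000 / eps milliseconds apart.

    Hypothetical helper: the spacing depends only on eps, not on how
    fast the events are actually produced.
    """
    step_ms = 1000 / eps
    return [start_ms + round(i * step_ms) for i in range(count)]

# With eps=2, consecutive timestamps are 500 ms apart.
stamps = simulated_timestamps(0, 2, 4)
print(stamps)  # [0, 500, 1000, 1500]
```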

hash: Syntax: integer; Description: Generates a random alphanumeric string of a chosen length.; Example: {"field.type": "hash", "field.length": 64}

ipv4: Syntax: string; Description: Replaces the parameter with an IPv4 value.; Example: {"field.type": "ipv4"}

ipv6: Syntax: string; Description: Replaces the parameter with an IPv6 value.; Example: {"field.type": "ipv6"}

integerid: Syntax: integer; Description: Assigns an incremental integer value.; Example: {"field.type": "integerid", "field.start": 1000}

list: Syntax: string; Description: Outputs a random value from a provided comma-separated list of values.; Example: {"field.type": "list", "field.values": "debug,info,warning"}

norm: Syntax: string; Description: Generates a random Gaussian variable.; Example: {"field.type": "norm"}

range: Syntax: integer; Description: Replaces the parameter with an integer or float value within a provided range.; Example: {"field.type": "range"}

seqlist: Syntax: collection<string>; Description: Picks elements from the list sequentially and returns to index 0 once the list is exhausted.; Example: {"field.type": "seqlist", "field.values": "debug,info,warning"}
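The difference between list (random pick) and seqlist (sequential pick with wraparound) can be illustrated with a short Python sketch. This is an analogy built on Python's standard library, not Datagen's implementation. The seqlist and random_list functions are hypothetical names.

```python
import itertools
import random

values = "debug,info,warning".split(",")

def seqlist(items):
    """Yield items in order, wrapping to index 0 when exhausted (hypothetical)."""
    return itertools.cycle(items)

def random_list(items, seed=None):
    """Yield a randomly chosen item on every step (hypothetical)."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(items)

seq = seqlist(values)
first_five = [next(seq) for _ in range(5)]
print(first_five)  # ['debug', 'info', 'warning', 'debug', 'info']

rnd = random_list(values, seed=42)
random_five = [next(rnd) for _ in range(5)]  # order depends on the seed
```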

timestamp: Syntax: string; Description: Generates a timestamp. The format setting defines how to format the time and follows "strptime"/"strftime" semantics. To output a Unix epoch with millisecond precision, use "%s".; Example: {"field.type": "timestamp", "field.format": "%b %d %H:%M:%S"}
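To preview what a strftime-style format string produces, you can try it in Python, whose datetime.strftime method follows the same family of directives. This sketch does not call Datagen, and the sample date is arbitrary. Python's handling of "%s" is platform-dependent, so the sketch computes epoch milliseconds directly instead of relying on that directive.

```python
from datetime import datetime, timezone

# Arbitrary sample instant, chosen only for illustration.
moment = datetime(2024, 1, 5, 13, 4, 9, tzinfo=timezone.utc)

# Same format string as the timestamp example above.
formatted = moment.strftime("%b %d %H:%M:%S")
print(formatted)  # Jan 05 13:04:09 (month name assumes the default C locale)

# Unix epoch in milliseconds, computed without the "%s" directive.
epoch_ms = int(moment.timestamp() * 1000)
print(epoch_ms)
```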

value: Syntax: string; Description: Replaces the parameter with a chosen value.; Example: {"field.type": "value", "field.value": "10"}

Usage

You can set up parallel instances using the p parameter to increase overall throughput.

SPL2 example

The following example uses Datagen on a test set. The number 100 in the example tells Datagen how many events to create before stopping.

| from datagen("{timestamp} {norm} {hash}", {}, 100);
