On October 30, 2022, all 1.2.x versions of the Splunk Data Stream Processor will reach their end of support date. See the Splunk Software Support Policy for details.
Drift Detection (beta)
Drift Detection identifies large-scale shifts and abrupt changes in a time-series data stream. It is useful for understanding trends in data and for detecting the point in time when the distribution of the data changes. This function may also be referred to as "changepoint detection."
The Drift Detection function identifies distributional change in a time series, like a metric or KPI. Examples of sudden changes that can be identified by Drift Detection include:
- Shift in mean or trend of a signal
- Increase or decrease in variance or noise of observed data
- Change in periodicity such as the interval between observed data points
This function requires a "timestamp" field in the incoming data stream. This requirement does not appear in the Streaming ML user interface because it is not configurable. For more information, see the Required arguments section.
Function Input/Output Schema
- Function Input
collection<record<R>>
- This function takes in collections of records with schema R.
- Function Output
collection<record<S>>
- This function outputs collections of records with schema S.
Syntax
- detect_drift
- value="input"
Required arguments
- input
- Syntax: double
- Description: The field in your dataset to analyze for drift. Drift Detection analyzes the "input" field unless you override it with the optional "value" argument.
- timestamp
- Syntax: long
- Description: The timestamp corresponding to the observed value in the data stream. Drift Detection requires a "timestamp" field in the dataset. This argument does not appear in the Streaming ML user interface because it is not configurable.
Optional arguments
- key
- Syntax: string
- Description: The field on which to partition the dataset so that a separate model is applied per key. For example, if you are ingesting CPU metrics from 100 hosts and want to learn a drift model per host, then "host" is the key. If the dataset has a column titled "key", it is used by default. You can override the default by specifying "key=host" to choose a different "key" input.
- value
- Syntax: double
- Description: Set the "value" argument if you want Drift Detection to analyze a field other than "input". When set, the "value" argument overrides "input".
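As a minimal sketch of the two optional arguments used together, assume the incoming records carry a hypothetical "host" field and a hypothetical numeric "cpu_pct" field; neither name comes from this documentation. The quoted argument form mirrors the value="input" pattern shown under Syntax. Whether "key" takes the same quoting is an assumption, so verify it against your DSP version before relying on it.

```
| detect_drift key="host" value="cpu_pct";
```

Under these assumptions, a separate drift model would be learned for each distinct "host" value, and "cpu_pct" would be analyzed in place of "input". A "timestamp" field is still required in every record.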
Usage
Drift Detection monitors the time series for drift. For each observed data point, Drift Detection outputs two values:
- Label
- Output
Label is returned as True or False and indicates whether a data point represents a change point. A value of True indicates that the algorithm has detected drift and that the data point is the observed change point.
Output acts as a measure of confidence. Output is a probability score between 0 and 1.0. The closer output is to 1.0, the more confident the algorithm is in its predicted label.
Generally, when label=True, the confidence is high. In some noisy signals, this may not be the case. In those scenarios, you can filter the output of the algorithm by the following condition:
| where output > threshold and label=True
A threshold between 0.7 and 0.9 is typically applied to select the high-confidence change points.
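To illustrate the filter condition, the following sketch appends it to a detect_drift step with a concrete threshold. The value 0.8 is an assumption picked from within the 0.7 to 0.9 range, not a recommended default; tune it against your own data.

```
| detect_drift value="input"
| where output > 0.8 and label=True;
```

Lowering the threshold lets through more change points, including less confident ones. Raising it keeps only the change points the algorithm is most confident about.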
SPL2 example
The following example uses Drift Detection on Bytes Sent by Source Address:
| from splunk_firehose()
| eval json=cast(body, "string"),
    'json-map'=from_json_object(json),
    input=parse_double(ucast(map_get('json-map', "Bytes Sent"), "string", null)),
    key=ucast(map_get('json-map', "Source Address"), "string", null),
    time=ucast(map_get('json-map', "Start Time"), "string", null),
    timestamp=cast(div(cast(get("time"), "long"), 1000000), "long")
| detect_drift value="input";
This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1-patch02, 1.2.1, 1.2.2-patch02, 1.2.4, 1.2.5