Pairwise Categorical Outlier Detection (beta)
Pairwise Categorical Outlier Detection detects anomalous combinations of values from two categorical attributes that may depend on each other (given "A", is "B" anomalous?). Input variables should be categorical, or non-categorical values drawn from a finite set, such as application name, IP, username, port, and hour of day.
Pairwise Categorical Outlier Detection predicts a rarity score that indicates if an observation is anomalous or not. Smaller values of rarity scores indicate that an observation is more likely to be anomalous.
This function requires two categorical fields in the input dataset. If the dataset contains columns titled "conditional" and "target" they will be applied by default. You can override these defaults and specify custom input fields. For more information, see the optional arguments section.
Function Input/Output Schema
- Function Input
- This function takes in collections of records with schema R.
- Function Output
- This function outputs collections of records with schema S.
If the dataset contains columns labelled "conditional" and "target", the data in those columns is applied by default. To learn how to override the default inputs and specify different fields for "conditional" and "target" inputs, see the Optional arguments section.
- conditional
- Syntax: string
- Description: This field corresponds to the categorical attribute of the dataset that represents the condition on which to measure the rarity of the target variable. For example, if assessing the rarity of actions executed by users, the "conditional" field may be username or location. In mathematical notation P(A|B), this field represents B. If the dataset contains a column titled "conditional", the function uses it by default. To override this default, set the "conditional" argument to a different field.
- target
- Syntax: string
- Description: This field corresponds to the categorical attribute of the dataset on which you want to compute the rarity. In mathematical notation P(A|B), this field represents A. If the dataset contains a column titled "target", the function uses it by default. To override this default, set the "target" argument to a different field.
This algorithm detects anomalies in categorical data. For each observed data point, Pairwise Categorical Outlier Detection outputs a rarity score. The rarity score is a positive real value. The closer the rarity score is to 0, the more likely the data point is anomalous. Higher values indicate that an observed data point is less rare (that is, not anomalous).
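To make the P(A|B) notation concrete, the following Python sketch scores each (conditional, target) pair by its empirical conditional frequency. This is an illustration only: the documentation does not state how the function computes its rarity score, so the formula, the function name `conditional_rarity`, and the sample data are all assumptions. The sketch only shares the documented property that values closer to 0 indicate rarer observations.

```python
from collections import Counter

def conditional_rarity(pairs):
    """Score each (conditional, target) pair by the empirical P(target | conditional).

    Assumption: this stands in for the rarity score for illustration only;
    it is not the documented scoring formula of the function.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)                      # count of each (B, A) pair
    cond_counts = Counter(cond for cond, _ in pairs)  # count of each B
    return [pair_counts[(c, t)] / cond_counts[c] for c, t in pairs]

# Hypothetical data: (username, action)
observations = [("alice", "login")] * 9 + [("alice", "delete_db")]
scores = conditional_rarity(observations)
print(scores[0], scores[-1])  # the common pair scores high, the rare pair scores near 0
```

In this toy data, "delete_db" occurs once in ten of alice's actions, so it receives the lowest score and would be the first candidate for review.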
Pairwise Categorical Outlier Detection is useful for identifying anomalies in non-numeric data, or in use cases where contextual information about an observation is important for identifying an anomaly.
For example, transaction data can be monitored for fraud. Standard numeric outlier detection algorithms may erroneously flag all high-value transactions as suspicious. Instead, Pairwise Categorical Outlier Detection identifies anomalous transactions that are conditional on context, like User ID, Location, or Application. Using conditional fields establishes the baseline for normal observations within the reference population.
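The role of the conditional field as a baseline can be sketched by comparing how often a value occurs overall with how often it occurs within one context. The Python example below is a hedged illustration with made-up data; the helper `marginal_and_conditional` and the user and location values are hypothetical and are not part of the product.

```python
def marginal_and_conditional(pairs, conditional, target):
    """Return (overall frequency of target, frequency of target given conditional)."""
    pairs = list(pairs)
    total = len(pairs)
    target_count = sum(1 for _, t in pairs if t == target)
    cond_count = sum(1 for c, _ in pairs if c == conditional)
    pair_count = sum(1 for c, t in pairs if c == conditional and t == target)
    marginal = target_count / total if total else 0.0
    conditional_freq = pair_count / cond_count if cond_count else 0.0
    return marginal, conditional_freq

# Hypothetical transactions: (user_id, location)
transactions = (
    [("u1", "Berlin")] * 19
    + [("u1", "Tokyo")]
    + [("u2", "Tokyo")] * 20
)

marginal, conditional_freq = marginal_and_conditional(transactions, "u1", "Tokyo")
print(f"overall: {marginal:.3f}, for u1: {conditional_freq:.3f}")
```

Here "Tokyo" is common across all transactions but rare for user u1, so only a score conditioned on the user would single out that one transaction.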
Security users may use Pairwise Categorical Outlier Detection to find anomalous access to a port by an application. To do so, set
The following example uses Pairwise Categorical Outlier Detection on a test set:
| from splunk_firehose()
| eval json=cast(body, "string"),
    'json-map'=from_json_object(json),
    conditional=ucast(map_get('json-map', "conditional"), "string", null),
    target=ucast(map_get('json-map', "target"), "string", null),
    key=""
| conditional_anomaly conditional="conditional" target="target";
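The eval step in the pipeline above appears to parse the event body as JSON and pull out the "conditional" and "target" values as strings, falling back to null when a value is unavailable. A rough Python analogue of that preprocessing is sketched below for readers who want to prepare test data outside the pipeline. The function `extract_fields` is hypothetical, and treating non-string values as missing is an assumption of this sketch, not documented `ucast` behavior.

```python
import json

def extract_fields(body):
    """Parse a JSON body and return (conditional, target) as strings, or None if absent.

    Assumption: non-string values are treated as missing; the pipeline's
    ucast function may handle them differently.
    """
    try:
        record = json.loads(body)
    except (TypeError, ValueError):
        # Body is not valid JSON (or not text at all)
        return None, None
    if not isinstance(record, dict):
        return None, None

    def as_string(value):
        return value if isinstance(value, str) else None

    return as_string(record.get("conditional")), as_string(record.get("target"))

print(extract_fields('{"conditional": "app1", "target": "443"}'))
print(extract_fields("not json"))
```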
This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1