Splunk® Machine Learning Toolkit

User Guide



This documentation does not apply to the most recent version of Splunk® Machine Learning Toolkit. For documentation on the most recent version, go to the latest release.

Using the score command

In the Splunk Machine Learning Toolkit, the score command runs statistical tests to validate model outcomes. Use the score command for robust model validation and statistical testing in any use case.

The score command is only available in version 4.0.0 or later of the MLTK. Version 4.0.0 of the toolkit also requires version 1.3 of the Python for Scientific Computing add-on.

The Splunk Machine Learning Toolkit provides the following classes of the score command, each with its own set of methods (for example accuracy, F1-score, and t-test): Classification, Clustering, Regression, Statistical functions (statsfunctions), and Statistical testing (statstest).

Score commands are not customizable within the Splunk Machine Learning Toolkit.

The Splunk Machine Learning Toolkit also enables the examination of how well your model might generalize on unseen data by using folds of the training set. This method is known as K-fold scoring.

The kfold_cv command does not use the score command, but operates like a special kind of scoring.

Classification

Classification scoring metrics are used for evaluating the predictive power of a classification learning algorithm.

Classification scoring in the Splunk Machine Learning Toolkit includes the following methods:

  • Accuracy scoring
  • Confusion matrix
  • F1-score
  • Precision
  • Precision-Recall-F1-Support
  • Recall
  • ROC-AUC-score
  • ROC-curve

Overview

The most common use of classification scoring is to evaluate how well a classification model performs on the test set, as in the following example:

  1. Load data, split it into training (<=70 partitions) and test (>70 partitions) sets, fit a classification model, and save it as a model object.
  Image is a screen capture of the results on the Statistics tab of the toolkit. The iris dataset is in use, and a LogisticRegression algorithm.

  2. Select the test set, apply the model to get predictions on unseen data, perform scoring, and analyze the results.
  Image is a screen capture of the results on the Visualization tab of the toolkit. Metrics for precision, recall, fbeta score and support are graphed.
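The two steps above could look like the following sketch, assuming the iris.csv sample dataset with a species target field and a hypothetical model name LR_species_model:

| inputlookup iris.csv
| sample partitions=100 seed=1234
| search partition_number <= 70
| fit LogisticRegression species from petal_length petal_width sepal_length sepal_width into LR_species_model

| inputlookup iris.csv
| sample partitions=100 seed=1234
| search partition_number > 70
| apply LR_species_model as LR_prediction
| score precision_recall_fscore_support species against LR_prediction average=None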

The syntax also supports multi-field comparisons. This is useful when evaluating which classification model is best suited for your data.

  1. Train multiple models on the same field.
  Image is a screen capture of the results on the Statistics tab of the toolkit. The iris dataset is running the fit command on four different classification algorithms, including GaussianNB and RandomForestClassifier.

  2. Evaluate the ground-truth field against multiple predictions. Informative warning messages appear if they occur.
  Image is a screen capture of the results on the Visualization tab of the toolkit. Iris species are graphed against each algorithm included in the prior step. The content of a warning message for this job also displays.

The inputs to the classification scoring methods are actual and predicted field(s), corresponding to ground-truth-labels and predicted-labels, respectively.

Classification scoring methods will only work on categorical data (integers and string-types, but not floats), and are expected to be used to evaluate the output of classification algorithms, such as logistic regression.

Attempting to score on numeric (float-type) data throws the error message: "Value error: field field_name is not a valid categorical field since it contains floats with ambiguous categorical representation."

Preprocessing

All classification scoring methods follow the same preprocessing steps:

  1. Search results are pulled into memory.

  2. Data preparation including:
    1. All rows containing NAN values are removed prior to computing the score.

    2. Errors are raised if any fields contain float values, since classification scoring requires categorical data.


Common parameters

The pos_label parameter:

  • Must be an element of all actual (ground-truth) data
  • Is used to specify the positive class when average=binary
  • Will throw the following error if not present:
    • ""Value error: value <pos_label> for pos_label not found in actual field actual_field. Please specify a valid value for pos_label."
  • Is completely ignored when the average is not binary and throws the following warning:
    • "Warning: pos_label will be ignored since average is not binary (found average=<average>)"

Uses of the pos_label parameter:

  • When average=binary and the combined cardinality of actual/predicted field is <= 2 (applicable for accuracy, precision, recall, f1, precision_recall_f1_support)
    • Report results for class=pos_label only
  • When data is multiclass but we would like to apply binary scoring methods (applicable for ROC-curve and ROC-AUC-score)
    • ROC-curve and ROC-AUC-score require binary data – the pos_label allows multiclass data to be cast to binary by specifying the class identified by "pos_label" as the positive class, and all other classes negative.
    • For example if original multiclass data is [a, b, c, a, e] and pos_label=a, then resulting binary data is [1, 0, 0, 1, 0]

The average parameter options include:

  1. None:
    • Return the scoring metric for each unique class in the union of actual-field + predicted_field
  2. Binary:
    • To report results for the class specified by pos_label
    • Parameter only works for binary data. If applied to a multiclass problem, the following error message generates: "Value error: Target is multiclass but average=binary. Please choose another average setting."
  3. Micro:
    • Calculate metrics globally by counting the total true positives, false negatives and false positives.
  4. Macro:
    • Calculate metrics for each label, and find their unweighted mean. Does not take label imbalance into account.
  5. Weighted:
    • Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance and can result in an F-score that is not between precision and recall.
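For example, the following sketch reports precision for the positive class only, using the churn.csv sample dataset and a hypothetical saved classifier named churn_model (the positive class value "True." is taken from the ROC-curve example later in this topic):

| inputlookup churn.csv
| sample partitions=100 seed=1234
| search partition_number > 70
| apply churn_model as churn_prediction
| score precision_score Churn? against churn_prediction average=binary pos_label="True."

Changing average=binary to macro or weighted would instead report the unweighted or support-weighted mean across all classes, and pos_label would then be ignored.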

Syntax
As with all scoring methods, classification methods support pairwise (i.e. comparing a_field_i with b_field_i) comparisons between two sets of fields (arrays). The general syntax is as follows:

.. | score <scoring-method-name> array_a against array_b [options]
  • against separates the ground-truth fields (on the left) from the predicted fields (on the right)
    • "~" is equivalent
  • array_a represents the ground-truth fields, and is specified by fields actual_field_1 ... actual_field_n
  • array_b represents the predicted fields, and is specified by fields predicted_field_1 ... predicted_field_n

SPL syntax

.. | score <scoring-method-name> <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> [options]


Accuracy scoring

Get the prediction accuracy between actual-labels and predicted-labels.

Implements sklearn.metrics.accuracy_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

Further reading: https://en.wikipedia.org/wiki/Accuracy_and_precision

Parameters:

  • The normalize parameter default is true.
  • The normalize parameter dictates whether to return the raw count of correctly classified samples (normalize=False) or the fraction of correctly classified samples (normalize=True).
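For example, to return the raw count of correctly classified samples instead of the fraction, a sketch along the following lines could be used (assuming the LR_model trained in the example later in this section):

| inputlookup track_day.csv
| sample partitions=100 seed=1234
| search partition_number > 70
| apply LR_model as LR_prediction
| score accuracy_score vehicleType against LR_prediction normalize=False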

Syntax

...|score accuracy_score <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> normalize=<True|False>

Syntax constraints

Accuracy supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Example

In the following example, we manually specify the feature fields (particularly for the second call to fit and onwards) because after the first fit a predicted field exists in the data.

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number <= 70
 
| fit LogisticRegression vehicleType from batteryVoltage engineCoolantTemperature engineSpeed into LR_model
| fit DecisionTreeClassifier vehicleType from batteryVoltage engineCoolantTemperature engineSpeed into DT_model

After training a classifier to predict vehicle type, we can analyze our test-set accuracy.

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply LR_model as LR_prediction
| apply DT_model as DT_prediction
| score accuracy_score vehicleType against LR_prediction DT_prediction

Image is a screen capture of the results on the Statistics tab of the toolkit. We can see accuracy scores for vehicle type prediction using both the LogisticRegression and DecisionTree algorithms.

Confusion matrix

Get the confusion matrix for the actual-labels and predicted-labels.

Implements sklearn.metrics.confusion_matrix. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Further reading: https://en.wikipedia.org/wiki/Confusion_matrix

Parameters:

  • Confusion matrix takes no parameters.

Syntax

...| score confusion_matrix <actual_field> against <predicted_field>

Syntax constraints

  • The ground-truth-labels map along the event-axis (vertical), the predicted-labels map along the field-axis (horizontal).
  • Only works for 1-to-1 comparisons, since the output of confusion_matrix is already 2d

Although order is not preserved in the output fields/events, the correspondence of fields and events is preserved. For example the third field will correspond to the third event.

Example

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply DT_model as DT_prediction
 
| score confusion_matrix vehicleType against DT_prediction

Example visualization

In the following visualization of the confusion matrix, we can see which classes were most and least successfully predicted, as well as what they were mistaken for.

This screen capture shows which classes were most and least successfully predicted, as well as what they were mistaken for.

F1-score

Get the f1-score between true-labels and predicted-labels.

Implements sklearn.metrics.f1_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Further reading: https://en.wikipedia.org/wiki/F1_score

Parameters:

  • The pos_label parameter default is 1.
  • The average parameter default is binary.

Syntax

|score f1_score <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> average=<binary(default) | micro | macro | weighted> pos_label=<str | int>

Syntax constraints

F1-score supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Example

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply LR_model as LR_prediction
| apply DT_model as DT_prediction
 
| score f1_score vehicleType against LR_prediction DT_prediction average=micro

Example visualization

The following plot shows model f1 score on the test set. LogisticRegression results on the left, and DecisionTree results on the right. We see the average across all vehicle types (average=micro).

This screen capture shows the F1 model score on the test set, for each vehicle type. LogisticRegression results on the left, and DecisionTree results on the right. We see the average across all vehicle types is micro.

Precision

Get the precision between actual-labels and predicted-labels.

Implements sklearn.metrics.precision_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

Further reading: https://en.wikipedia.org/wiki/Accuracy_and_precision

Parameters:

  • The pos_label parameter default is 1.
  • The average parameter default is binary.

Syntax

...|score precision_score <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> average=<binary(default)|micro|macro|weighted> pos_label=<str|int>

Syntax constraints

Precision supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Example

In the following example, after training a classifier to predict vehicle type, we evaluate the model's precision on the test set for each vehicle type.

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply LR_model as LR_prediction
| apply DT_model as DT_prediction
 
| score precision_score vehicleType against LR_prediction DT_prediction average=None

Example visualization

The following visualization plot shows the model precision on the test set, for each vehicle type. Precision using LogisticRegression is on the left, DecisionTree on the right. A warning shows that 2% of the total rows contained NAN values, and have been removed.

This image shows the model precision on the test set, for each vehicle type. Precision using LogisticRegression is on the left, and DecisionTree on the right. A warning shows that 2% of the total rows contained NAN values, and have been removed.

Precision-Recall-F1-Support

Get the precision, recall, f1-score and support between actual-fields and predicted-fields. A common example is a classification report.

Implements sklearn.metrics.precision_recall_fscore_support. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

Parameters:

  • The pos_label parameter default is 1.
  • The average parameter default is None.
  • The beta parameter default is 1.0.
  • The beta parameter determines the strength of recall versus precision in the f-score.

Syntax

...| score precision_recall_fscore_support <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> pos_label=<str> average=<str> beta=<float>

Syntax constraints

  • When average is not None: works for all syntax constraints (i.e. 1-to-1, n-to-n, and 1-to-n).
  • When average=None: only works for 1-to-1 comparisons, since the output of precision_recall_fscore_support (when average=None) is already 2d.

Support is only defined when average=None because support cannot be averaged.
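Because support is only reported when average=None, and average=None requires a 1-to-1 comparison, a sketch such as the following (reusing the DT_model trained earlier in this topic) could be used to see per-class support:

| inputlookup track_day.csv
| sample partitions=100 seed=1234
| search partition_number > 70
| apply DT_model as DT_prediction
| score precision_recall_fscore_support vehicleType against DT_prediction average=None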

Example

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply LR_model as LR_prediction
| apply DT_model as DT_prediction
 
| score precision_recall_fscore_support vehicleType against LR_prediction DT_prediction average=weighted

Example visualization

In the following visualization, the plot shows the precision, recall and f_beta scores for the prediction of vehicle type, under a weighted averaging scheme.

This screen capture shows the precision, recall and f_beta scores for the prediction of vehicle type, under a weighted averaging scheme.

Recall

Get the recall between actual-labels and predicted-labels.

Implements sklearn.metrics.recall_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

Further reading: https://en.wikipedia.org/wiki/Precision_and_recall

Parameters:

  • The pos_label parameter default is 1.
  • The average parameter default is binary.

Syntax

|score recall_score <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> average=<binary(default) | micro | macro | weighted> pos_label=<str | int>

Syntax constraints

Recall supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Example

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply LR_model as LR_prediction
| apply DT_model as DT_prediction
 
| score recall_score vehicleType against LR_prediction DT_prediction average=weighted

Example visualization

The following plot shows model recall on the test set. LogisticRegression is on the left, DecisionTree on the right. We see the average across all vehicle types, where the average is weighted by the support.

This image shows model recall on the test set. LogisticRegression results are on the left, DecisionTree results are on the right. We see the average weighted by the support.

ROC-AUC-score

Get the ROC-AUC-score for the actual-labels and predicted-scores.

Implements sklearn.metrics.roc_auc_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

For further reading see the AUC references here: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Parameters:

  • Although sklearn.metrics.roc_auc_score supports an average parameter, this is currently disabled as the toolkit does not support the label-indicator format.
    • For multiclass problems, binarize the data using pos_label.
  • The pos_label parameter default is 1.
    • Specifies the class to treat as positive (when the target is multiclass, all other classes are treated as negative).
  • The predicted field contains target scores, which can be probability estimates of the positive class, confidence values, or a non-thresholded measure of decisions.

Requirements:

  • ROC_AUC_score only applies to binary data. To support multi-class problems, convert into binary with the pos_label parameter.
    • See table below for more details.
  • The predicted field must be numeric (float or integer type), corresponding to probability estimates of the positive class, confidence values, or a non-thresholded measure of decisions (as returned by decision_function on some classifiers).
    • If the predicted field does not meet the numeric criteria, this error message will display: "Expected field <predicted> to be numeric and correspond to probability estimates or confidence intervals of the positive class, but field contains non-numeric events."
  • If binary data is given, it must be true-binary, i.e. {0,1} or {-1,1}.
  • If the data is not binary (i.e. multiclass), the pos_label must be specified and contained in the ground_truth field.
  • If the binary data is not true-binary, the pos_label must be specified and contained in the ground_truth field.

If the pos_label is not in the ground_truth field, the following error will be raised: "Value <pos_label> not found in ground-truth field <actual_field>. Please specify a valid value for pos_label."

If the ground-truth data is multiclass and the pos_label is properly specified, the user is warned of the conversion: "Found multiclass ground-truth field <actual_field>. Converting to true binary by setting <pos_label> as positive and all other classes as negative."

Syntax

...| score roc_auc_score <a1> <a2> ... <an> against <b1> <b2> ... <bn> [pos_label=<str | int>]

Syntax constraints

ROC AUC score supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Example

In the following example we obtain the area under the ROC curve for predicting the vehicle type of 2013 Audi RS5.

| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number <= 70
 
| fit LogisticRegression vehicleType from * probabilities=True into LR_model
| inputlookup track_day.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply LR_model probabilities=True
| score roc_auc_score vehicleType against "probability(vehicleType=2013 Audi RS5)" pos_label="2013 Audi RS5"

Example output

This screen capture of the Statistics tab in the toolkit shows one row of results for Actual field, Predicted field and ROC AUC score.

ROC-curve

Get the ROC-curve for actual-fields and predicted-fields.

Implements sklearn.metrics.roc_curve. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html

Further reading: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Parameters:

  • The pos_label parameter default is 1.
    • Specifies the class to treat as positive (when the target is multiclass, all other classes are treated as negative). See examples 1 and 2 below.
  • The drop_intermediate parameter default is true.
    • Whether to drop some sub-optimal thresholds which would not appear on a plotted ROC curve. This is useful in order to create lighter ROC curves.

Example 1: When binary data not true binary

  1. Data = <Y N Y N Y N >
  2. pos_label= Y
  3. Data transformed to true binary as <1 0 1 0 1 0>

Example 2: When multiclass data not true binary

  1. Data = <A B C B A B >
  2. pos_label = A
  3. Data transformed to true binary as <1 0 0 0 1 0>

Requirements:

  • ROC_curve only applies to binary data. To support multi-class problems, convert into binary with the pos_label parameter.
    • See table below for more details.
  • The predicted field must be numeric (float or integer type), corresponding to probability estimates of the positive class, confidence values, or a non-thresholded measure of decisions (i.e. as returned by LogisticRegression with probabilities=True).
    • If the predicted field does not meet the numeric criteria, this error message will display: "Expected field <predicted> to be numeric and correspond to probability estimates or confidence intervals of the positive class, but field contains non-numeric events."
  • If binary data is given, it must be true-binary, i.e. {0,1} or {-1,1}.
  • If the data is not binary (i.e. multiclass), the pos_label must be specified and contained in the ground_truth field.
  • If the binary data is not true-binary, the pos_label must be specified and contained in the ground_truth field.

If the pos_label is not in the ground_truth field, the following error will be raised: "Value <pos_label> not found in ground-truth field <actual_field>. Please specify a valid value for pos_label."

If the ground-truth data is multiclass and the pos_label is properly specified, the user is warned of the conversion: "Found multiclass ground-truth field <actual_field>. Converting to true binary by setting <pos_label> as positive and all other classes as negative."

Syntax

...| score roc_curve <actual_field> against <predicted_field> pos_label=<str|int> drop_intermediate=<True|False>

Syntax constraints

Only works for 1-to-1 comparisons.

Example

| inputlookup churn.csv
 
| sample partitions=100 seed=1234
| search partition_number <= 70
 
| fit LogisticRegression Churn? from * probabilities=True into LR_model


| inputlookup churn.csv
 
| sample partitions=100 seed=1234
| search partition_number > 70
 
| apply LR_model probabilities=True
| score roc_curve Churn? against "probability(Churn?=True.)" pos_label="True."

Example visualization

In the following visualization, the plot shows how the true-positive-rate (tpr) varies with the false-positive-rate (fpr), along with the corresponding probability thresholds.

This image shows a plot of the true positive rate against the false positive rate, as well as the corresponding probability thresholds.

Clustering scoring

Clustering scoring is used to evaluate the output of a clustering model. The inputs to the clustering scoring methods are arrays of data specified by an ordered sequence of fields.

Clustering scoring in the Splunk Machine Learning Toolkit includes the following method:

  • Silhouette score

Overview

Clustering scoring methods may operate on two arrays:

  • We adopt the notation of label and features which are specified by the ordered sequence of fields <label_field> and feature_field_1 feature_field_2 ... feature_field_n respectively.
  • We've introduced the against clause to separate the arrays.
  • label_field against feature_field_1 feature_field_2 ... feature_field_n correspond to label (ground-truth or predicted labels) against features (features used by clustering algorithm), respectively.

Preprocessing

Clustering scoring methods follow the same preprocessing steps:

  1. Search results are pulled into memory.
  2. Data preparation including:
    1. All rows containing NAN values are removed prior to computing the score.
    2. Label field is converted to categorical.
    3. Errors are raised if any categorical fields are found in feature fields.

Parameters

  • The metric parameter default is euclidean.
    • Supported metric values include: cityblock, cosine, euclidean, l1, l2, manhattan, braycurtis, canberra, chebyshev, correlation, hamming, matching, minkowski, and sqeuclidean.

Example

 ..| score silhouette_score <label_field> against <feature_field_1> ... <feature_field_n>

Clustering scoring methods will only work on numerical data, and are expected to be used to evaluate the output of clustering models (such as KMeans or Spectral Clustering).

Attempting to score on categorical data will result in the error: "Non-numeric fields found in the data: categorical_fields. Please provide only numerical fields"

The number of fields in label_array is limited to one and it has to be categorical.

Parameters that take a list or array as input, and metrics that calculate the distance between categorical arrays, are not currently supported.

Silhouette score

Calculate the silhouette score between label_array and feature_array.

Implements sklearn.metrics.silhouette_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

Further reading: https://en.wikipedia.org/wiki/Silhouette_(clustering)

Parameters

  • Supports only common parameters.

Syntax

...|score silhouette_score <label_field> against <feature_field_1> ... <feature_field_n> metric=<euclidean(default) | cityblock | cosine | l1 | l2 | manhattan | braycurtis | canberra | chebyshev | correlation | hamming | matching | minkowski | sqeuclidean>

Syntax constraints

Silhouette score works for the following syntax constraints:

  • label_field must be a single field.
  • feature_fields can be a single field or multiple fields.

Example

| inputlookup iris.csv
| fit StandardScaler petal_length petal_width sepal_length sepal_width as scaled
| fit KMeans scaled_* k=3 as KMeans_predicted_species
| score silhouette_score KMeans_predicted_species against scaled_petal_length scaled_petal_width scaled_sepal_length scaled_sepal_width

This screen capture shows the Statistics tab of the toolkit and results using the iris dataset. The Silhouette score is displayed.
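To score with a different distance metric, add the metric parameter to the same search. The following sketch reuses the example above with cosine distance (the choice of metric here is purely illustrative):

| inputlookup iris.csv
| fit StandardScaler petal_length petal_width sepal_length sepal_width as scaled
| fit KMeans scaled_* k=3 as KMeans_predicted_species
| score silhouette_score KMeans_predicted_species against scaled_petal_length scaled_petal_width scaled_sepal_length scaled_sepal_width metric=cosine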

Regression scoring

Regression scoring metrics are used for evaluating the predictive power of a regression learning algorithm. The most common use of regression scoring is to evaluate how well a regression model performs on the test set.

Regression scoring in the Splunk Machine Learning Toolkit includes the following methods:

  • Explained variance score
  • Mean absolute error score
  • Mean squared error
  • R2 score

The inputs to the regression scoring methods are arrays of data specified by an ordered sequence of field(s).

Regression scoring methods may operate on two arrays. For example

 ..| score r2_score <actual> against <predicted>
  • We adopt the notation of actual and predicted, which are specified by an ordered sequence of fields actual_field_1 ...actual_field_n and predicted_field_1 ... predicted_field_n.
  • We've introduced the against clause to separate the arrays.
  • actual_field_1 ... actual_field_n against predicted_field_1 ... predicted_field_n corresponds to actual (ground-truth target values) against predicted (predicted target values), respectively.

These scoring methods will only work on numerical data, and are expected to be used to evaluate the output of regression algorithms (such as Gradient Boosting Regressor or Linear Regression).

Attempting to score on categorical data returns the error: "Non-numeric fields found in the data actual_field/predicted_field. Please provide only numerical fields."

If there are no numerical fields in any of the arrays, an error is returned: "Value error: arrays cannot be empty. No valid numerical fields exist to populate" or "Value error: No valid numerical fields exist for scoring method."

The number of fields in actual_fields and predicted_fields must either be equal, or one of the arrays must contain only one field. If one array has a single field and the other has multiple fields, the multioutput parameter is set to raw_values and the regression score is calculated between each field of the multi-field array and the single field of the other array. In this case, if the multioutput parameter was previously set to a different value, the following message is returned: "Updating multioutput value from <X> to raw_values since <a_array/b_array> has a single column."
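For example, the following sketch (reusing the server_power.csv dataset from the R2 score example later in this topic) compares one actual field against two predicted fields, so one score per prediction is returned as raw_values:

| inputlookup server_power.csv
| fit LinearRegression ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization as LR_prediction
| fit Lasso ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization as LASSO_prediction
| score mean_squared_error ac_power against LR_prediction LASSO_prediction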

Preprocessing

All regression scoring methods follow the same preprocessing steps:

  1. Search results are pulled into memory.
  2. Data preparation including:
    1. All rows containing NAN values are removed prior to computing the score.
    2. Errors raised should any categorical fields be found.

Common parameters:

  • The multioutput parameter default is raw_values.
  • raw_values returns a full set of regression scores or errors, one for each pair of fields in the actual and predicted arrays respectively.

Parameters that take a list or array as input are not supported.

Explained variance score

Calculate the explained variance regression score.

Implements sklearn.metrics.explained_variance_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score

Further reading: https://en.wikipedia.org/wiki/Explained_variation

Parameters:

  • The multioutput parameter default is raw_values.
  • The variance_weighted option averages the scores of all outputs, weighted by the variance of each individual output.
  • To see each explained variance score compared to the actual score, set the multioutput parameter to raw_values.

Syntax

...|score explained_variance_score <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> multioutput=<raw_values(default) | uniform_average | variance_weighted>

Syntax constraints

Explained variance score supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Explained variance is not symmetric.

Example

In the following example, we manually specify fields (particularly for the second call to fit and onwards) since a predicted field exists in the data.

| inputlookup server_power.csv
| fit LinearRegression ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization as ac_power_LR
| fit RandomForestRegressor ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization as ac_power_RFR
| score explained_variance_score ac_power against ac_power_LR ac_power_RFR

To see each explained variance score compared to the actual score, set multioutput to raw_values.

This screen capture is of the Statistics tab of the toolkit. We see 2 results listed with values for Actual field, predicted field and explained variance score.

Mean absolute error score

Calculate the mean absolute error (regression loss) between actual_fields and predicted_fields.

Implements sklearn.metrics.mean_absolute_error. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error

Further reading: https://en.wikipedia.org/wiki/Mean_absolute_error

Parameters:

  • Only supports common parameters.

Syntax

...|score mean_absolute_error <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> multioutput=<raw_values(default) | uniform_average>

Syntax constraints

Mean absolute error supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Example

In the following example, we manually specify fields (particularly for the second call to fit and onwards) since a predicted field exists in the data.

To see each mean absolute error score compared to the actual score, multioutput must be set to raw_values. If set to another value, a warning message is returned.

| inputlookup power_plant.csv 
| fit LinearRegression Energy_Output from Temperature Pressure Humidity Vacuum fit_intercept=true as energy_output_LR 
| fit Lasso Energy_Output from Temperature Pressure Humidity Vacuum as energy_output_LASSO 
| score mean_absolute_error Energy_Output energy_output_LR against energy_output_LASSO

In this screen capture of the Statistics tab of the toolkit, we see 2 results listed with values for Actual field, Predicted field and mean absolute error.

Mean squared error

Calculate the mean square error (regression loss) between actual_fields and predicted_fields.

Implements sklearn.metrics.mean_squared_error. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error

Further reading: https://en.wikipedia.org/wiki/Mean_squared_error

Parameters:

  • Only supports common parameters.

Syntax

...|score mean_squared_error <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> multioutput=<raw_values(default) | uniform_average>

Syntax constraints

Mean squared error supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

Example

In the following example, we manually specify fields (particularly for the second call to fit and onwards) since a predicted field exists in the data.

| inputlookup power_plant.csv 
| fit LinearRegression Energy_Output from Temperature Pressure Humidity Vacuum fit_intercept=true as energy_output_LR 
| fit Lasso Energy_Output from Temperature Pressure Humidity Vacuum as energy_output_LASSO 
| score mean_squared_error Energy_Output against energy_output_LR energy_output_LASSO

We see a screen capture of the Statistics tab of the toolkit and 2 results listed with values for Actual field, Predicted field and mean squared error.

R2 score

Calculate the r2 score between actual_fields and predicted_fields.

Implements sklearn.metrics.r2_score. Learn more here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score

Further reading: https://en.wikipedia.org/wiki/Coefficient_of_determination

Parameters:

  • The multioutput parameter default is raw_values.
  • The variance_weighted option averages the scores of all outputs, weighted by the variance of each individual output.
  • The None option acts the same as the uniform_average option.

Syntax

...|score r2_score <actual_field_1> ... <actual_field_n> against <predicted_field_1> ... <predicted_field_n> multioutput=<raw_values(default) | uniform_average | variance_weighted | None >

Syntax constraints

R2 score supports 1-to-1, n-to-n and 1-to-n comparison syntaxes.

R2 score is not symmetric.

Example

In the following example, we manually specify fields (particularly for the second call to fit and onwards) since a predicted field exists in the data.

| inputlookup server_power.csv
| fit LinearRegression ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization into LR_model
| fit RandomForestRegressor ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization into RFR_model
| fit Lasso ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization into LASSO_model
| fit DecisionTreeRegressor ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization into DTR_model

After training several regressors to predict ac_power, we can analyze their predictions compared to the ground-truth.

| inputlookup server_power.csv 
| apply LR_model as LR_prediction
| apply RFR_model as RFR_prediction
| apply LASSO_model as LASSO_prediction
| apply DTR_model as DTR_prediction
| score r2_score ac_power against LR_prediction RFR_prediction LASSO_prediction DTR_prediction

We see a screen capture of the Statistics tab of the toolkit and 4 results for Actual field along with results for Predicted field and R2 score.

Statistical functions (statsfunctions)

Statsfunctions are general statistical methods that either provide statistical information about data, or perform a statistical test on data. The output is not necessarily a test statistic and p-value pair.

Statistical functions scoring in the Splunk Machine Learning Toolkit includes the following methods:

  • Describe
  • Moment
  • Pearsonr
  • Spearmanr
  • Tmean
  • Trim
  • Tvar

Preprocessing

All statistical functions scoring methods follow the same preprocessing steps:

  1. Search results are pulled into memory.
  2. Data preparation including:
    1. All rows containing NAN values removed prior to computing the score.
    2. Errors raised should any categorical fields be found.

Describe

Compute several descriptive statistics of the passed array.

Implements scipy.stats.describe. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.describe.html

Parameters

  • The ddof parameter default is 1.
    • ddof stands for delta degrees of freedom.
    • Delta degrees of freedom is used only for the variance calculation.
  • If the bias parameter is False, the skewness and kurtosis calculations are corrected for statistical bias.

Syntax

|score describe <a_field_1> <a_field_2> ... <a_field_n> ddof=<int> bias=<true|false>

Syntax constraints

A single sequence of fields.

Example

| inputlookup diabetes.csv
| score describe blood_pressure diabetes_pedigree glucose_concentration

Example output

This screen capture shows the Statistics tab of the toolkit. We see results returned in several columns including kurtosis, max, mean, nobs and skewness.

Moment

A moment is a specific quantitative measure of the shape of a set of points. It is often used to calculate coefficients of skewness and kurtosis due to its close relationship with them.

Implements scipy.stats.moment. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.moment.html

Further reading: https://en.wikipedia.org/wiki/Moment_(mathematics)

Parameters

  • The moment parameter default is 1.

Syntax

|score moment  <a_field_1> <a_field_2> ... <a_field_n> moment=<int>

Syntax constraints

A sequence of fields.

Example

The following example is set-up to calculate the third moment of the given data.

| inputlookup diabetes.csv
| score moment blood_pressure diabetes_pedigree glucose_concentration moment=3

Example output

This screen capture shows the Statistics tab of the toolkit. We see moment results for three compared fields from the dataset.

Pearsonr

Calculate a Pearson correlation coefficient and the p-value for testing non-correlation.

Implements scipy.stats.pearsonr. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

Further reading: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Parameters

  • No parameters.

Syntax

|score pearsonr <a_field> against <b_field>

Syntax constraints

A pair of fields (1-to-1 comparison).

Returns

Pearson correlation coefficient and the p-value for testing non-correlation.

Example

| inputlookup track_day.csv
| score pearsonr engineSpeed against speed

Example output

This screen capture shows the Statistics tab of the toolkit. We see the "r" result for the example SPL.

Spearmanr

Calculate a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.

Implements scipy.stats.spearmanr. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

Further reading: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

Parameters

  • No parameters.

Syntax

|score spearmanr <a_field> against <b_field>

Syntax constraints

A pair of fields (1-to-1 comparison).

Returns

Spearman correlation coefficient and the p-value to test for non-correlation.

Example

| inputlookup track_day.csv
| score spearmanr engineSpeed against speed

Example output

This screen capture shows the Statistics tab of the toolkit. We see the correlation results from the SPL example.

Tmean

This function finds the arithmetic mean of given values, ignoring values outside the given limits.

Implements scipy.stats.tmean. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tmean.html

Further reading: https://en.wikipedia.org/wiki/Truncated_mean

Parameters

  • The optional lower_limit parameter default is None.
    • The lower bound of data to include. If None, there is no lower bound.
  • The optional upper_limit parameter default is None.
    • The upper bound of data to include. If None, there is no upper bound.

Syntax

|score tmean <a_field_1> ... <a_field_n> lower_limit=<float|None> upper_limit=<float|None>

A global trimmed-mean is calculated across all fields.

Syntax constraints

A sequence of fields.

Returns

Returns a single value representing the trimmed mean of the data (i.e. the mean ignoring samples outside of the given bounds).

Example

| inputlookup diabetes.csv
| score tmean blood_pressure diabetes_pedigree glucose_concentration lower_limit=-1 upper_limit=1

Example output

This screen capture shows the Statistics tab of the toolkit. We see the trimmed mean result for the dataset.

Trim

Slices off a proportion of items from both ends of an array.

Implements scipy.stats.trimboth. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.trimboth.html

Parameters

  • The tail parameter default is both.
    • Whether to cut off (trim) data from the left, right or both sides of the distribution.
  • The proportiontocut parameter (the fraction of data to cut off) has no default and must be specified as a float.

Syntax

|score trim <a_field_1> ... <a_field_n> proportiontocut=<float> tail=<left|right|both>

Syntax constraints

A sequence of fields.

Returns

Trimmed version of the data. Note that the order of the trimmed content is undefined.

Example

| inputlookup diabetes.csv
| score trim glucose_concentration tail=both proportiontocut=0.1

Example output

This screen capture shows the Statistics tab of the toolkit. We see a column of trimmed events from the dataset.

Tvar

Compute the trimmed variance. This function computes the sample variance of an array of values, while ignoring values which are outside of given limits.

Implements scipy.stats.tvar. Learn more here: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.tvar.html

Parameters

  • The optional lower_limit parameter default is None.
    • The lower bound of data to include. If None, there is no lower bound.
  • The optional upper_limit parameter default is None.
    • The upper bound of data to include. If None, there is no upper bound.

Syntax

|score tvar <a1> <a2> ... <an> lower_limit=<float|None> upper_limit=<float|None>

A global trimmed-variance is calculated across all fields.

Syntax constraints

A sequence of fields.

Returns

Returns a single value representing the trimmed variance of the data (i.e. the variance ignoring samples outside of the given bounds).

Example

| inputlookup diabetes.csv
| score tvar blood_pressure diabetes_pedigree glucose_concentration lower_limit=-1 upper_limit=1

Example output

This screen capture shows the Statistics tab of the toolkit. We see the trimmed variance result for the dataset.

Statistical testing (statstest)

Statistical testing (statstest) scoring is used to validate/invalidate a statistical hypothesis. The output of statstest scoring methods is a test-specific statistic, and a corresponding p-value.

All statistical-testing methods support the parameter alpha, which indicates the alpha level (significance level) for the statistical test (i.e. the probability of rejecting the null hypothesis when the null hypothesis is true). The default value is 0.05.

Statistical testing in the Splunk Machine Learning Toolkit includes the following methods:

  • Augmented Dickey-Fuller (Adfuller)
  • KS-test (1 sample)
  • KS-test (2 samples)
  • Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
  • MannWhitneyU
  • Normal-test
  • One-way ANOVA
  • T-test (1 sample)
  • T-test (2 independent samples)
  • T-test (2 related samples)
  • Wilcoxon

Overview

  • The inputs to the statstest scoring methods are array(s) of data specified by an ordered sequence of fields.
  • The arrays for statistical testing methods are referred to as a_array and b_array.
    • In general, statistical testing methods are commutative (i.e. a_field against b_field is equivalent to b_field against a_field).
    • Arrays array_a and array_b are specified by a sequence of fields a_field_1 ... a_field_n and b_field_1 ... b_field_n.
  • Statstest scoring methods may operate on a single array (for example, ..| score describe <array_a>) or on two arrays (for example, ..| score ks_2samp <array_a> against <array_b>).

Preprocessing

All statistical testing scoring methods follow the same preprocessing steps:

  1. Search results are pulled into memory.
  2. Data preparation including:
    1. All rows containing NAN values removed prior to computing the score.
    2. Errors raised should any categorical fields be found.

For scoring methods requiring 2 arrays, use the against (equivalently, ~) clause to separate the arrays.
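For example, the following two searches are equivalent (using the app_usage.csv dataset from the KS-test example later in this topic):

| inputlookup app_usage.csv
| score ks_2samp HR1 against HR2

| inputlookup app_usage.csv
| score ks_2samp HR1 ~ HR2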

Augmented Dickey-Fuller (Adfuller)

The Augmented Dickey-Fuller test can be used to test for a unit root in a univariate process in the presence of serial correlation.

Implements statsmodels.tsa.stattools.adfuller. Learn more here: https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html

Further reading: https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test

Parameters

  • The maxlag parameter default is 10.
  • The maxlag parameter determines the maximum lag included in the test.
  • The regression parameter default is c.
  • The regression parameter determines the constant and trend order to include in the regression.
    • c: constant only (default).
    • ct: constant and trend.
    • ctt: constant, and linear and quadratic trend.
    • nc: no constant, no trend.
  • The autolag parameter default is AIC.
    • If None, then maxlag lags are used.
    • If AIC or BIC, then the number of lags is chosen to minimize the corresponding information criterion.
    • If t-stat, the test starts with maxlag and drops a lag until the t-statistic on the last lag length is significant using a 5%-sized test.
  • The alpha parameter default is 0.05.

Null hypothesis

The null hypothesis of the Augmented Dickey-Fuller is that there is a unit root, with the alternative that there is no unit root.

Syntax

|score adfuller <field> autolag=<aic|bic|t-stat|none> regression=<c|ct|ctt|nc> maxlag=<int> alpha=<float>

Syntax constraints

A single field.

Example

| inputlookup app_usage.csv
| score adfuller HR1

Example output

In this screen capture of the toolkit we see the Statistics tab with one result that includes values for Test decision, critical values, nobs and others.

KS-test (1 sample)

Test whether the specified field is statistically identical to the specified cumulative distribution function (cdf).

Implements scipy.stats.kstest. Learn more here: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html

Further reading: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

Parameters

Each cdf has a required set of parameters that must be specified.

  • cdf=chi2 requires df <int>, loc <float>, and scale <float>.
  • cdf=lognorm requires s <float>, loc <float>, and scale <float>.
  • cdf=norm requires loc <float> and scale <float>.

Null hypothesis

The sample (<field>) distribution is identical to the specified distribution (cdf, with cdf params).

Syntax

|score kstest <field> cdf=<norm | lognorm | chi2> <required_cdf_parameters> alpha=<float>

All required cdf parameters must be supplied. Review the Parameters list above.
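For example, a sketch testing a field against a chi-squared distribution could look like the following (the df, loc, and scale values here are purely illustrative):

| inputlookup power_plant.csv
| score kstest Humidity cdf=chi2 df=4 loc=0 scale=1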

Syntax constraints

A single field.

Example

| inputlookup power_plant.csv
| score kstest Humidity cdf=norm loc=65 scale=2

Example output

In the following visualization we can reject the hypothesis that the field Humidity is identical to a normal distribution with mean 65 and standard deviation 2.

In this screen capture of the toolkit we see the Statistics tab with one result of a p-value and a statistic.

KS-test (2 samples)

Computes the Kolmogorov-Smirnov statistic on two samples to test if two independent samples are drawn from the same distribution.

Implements scipy.stats.ks_2samp. Learn more here: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ks_2samp.html

Further reading: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample%20Kolmogorov%E2%80%93Smirnov%20test

Parameters

  • The alpha parameter default is 0.05.

Null hypothesis

This is a two-sided test for the null hypothesis that two independent samples are drawn from the same continuous distribution.

Syntax

|score ks_2samp <a_field> against <b_field> alpha=<float>

Syntax constraints

A single pair of fields (1-to-1 comparison).

Example

In the following example we want to see if the two measurements of the HR field are drawn from the same distribution.

| inputlookup app_usage.csv
| score ks_2samp HR1 against HR2

Example output

In the following output we reject the null hypothesis that the two samples were drawn from the same distribution.

In this screen capture of the toolkit we see the Statistics tab with one result of a p-value and a statistic.

Kwiatkowski-Phillips-Schmidt-Shin (KPSS)

Computes the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test for the null hypothesis that the field (<field>) is level or trend stationary.

Implements statsmodels.tsa.stattools.kpss. Learn more here: https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.kpss.html

Further reading: https://en.wikipedia.org/wiki/KPSS_test

Parameters

  • The regression parameter default is c.
    • The regression parameter indicates the null hypothesis for the KPSS test.
    • The c option indicates that the data is stationary around a constant.
    • The ct option indicates that the data is stationary around a trend.
  • The lags parameter default is None.
    • The lags parameter indicates the number of lags to be used. If None, it is set to int(12 * (n / 100)**(1 / 4)), where n is the number of samples.
  • The alpha parameter default is 0.05.

Null hypothesis

The null hypothesis of the Kwiatkowski-Phillips-Schmidt-Shin is that the field (field) is level or trend stationary.

Syntax

|score kpss <field> regression=<c | ct> lags=<int> alpha=<float>

Syntax constraints

A single field.

Example

| inputlookup app_usage.csv
| score kpss HR1

Example output

In this screen capture of the toolkit we see the Statistics tab with one result including values for Test decision, critical values, lags, and others.

MannWhitneyU

Used to test whether a randomly selected value from one sample will be less or greater than a randomly selected value from another sample.

Implements scipy.stats.mannwhitneyu. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html

Further reading: https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

Parameters

  • The use_continuity parameter determines whether a continuity correction (1/2) should be taken into account.
  • The use_continuity parameter default is true.
  • The alternative parameter determines whether to get the p-value for the one-sided hypothesis (less or greater) or for the two-sided hypothesis (two-sided).
  • The alternative parameter default is two-sided.
  • The alpha parameter default is 0.05.

Null hypothesis

It is equally likely that a randomly selected value from one sample is less than or greater than a randomly selected value from another sample.

Syntax

|score mannwhitneyu <a_field> against <b_field> use_continuity=<true|false> alternative=<less|two-sided|greater> alpha=<float>

Syntax constraints

A single pair of fields (1-to-1).

Example

| inputlookup churn.csv
| score mannwhitneyu "Day Charge" against "Eve Charge" alternative=greater

Example output

In the following visualization we can conclude that a random sample from Day Charge is likely greater than a random sample in Eve Charge.

This screen capture of the Statistics tab of the toolkit shows one result, including a p-value and statistic.

Normal-test

Used to test whether a sample differs from a normal distribution.

Implements scipy.stats.normaltest. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html

Further reading:

  • D’Agostino, R. B. (1971), “An omnibus test of normality for moderate and large sample size”, Biometrika, 58, 341-348
  • D’Agostino, R. and Pearson, E. S. (1973), “Tests for departure from normality”, Biometrika, 60, 613-622

Parameters:

  • The alpha parameter default is 0.05.

Null hypothesis

The sample (<a_field_1>, ..., <a_field_n>) comes from a normal distribution.

Syntax

...|score normaltest <field_1> <field_2> ... <field_n> alpha=<float>

An array (array_a) is specified by the ordered sequence of fields (<field_1>, <field_2>, ..., <field_n>).

Syntax constraints

A single array (set of fields).

Example

| inputlookup diabetes.csv
| score normaltest BMI age blood_pressure diabetes_pedigree glucose_concentration skin_thickness
| table field pvalue

Example output

The following visualization tests whether each field differs from a normal distribution. From the plot we can deduce the following:

  • glucose_concentration is the field most likely to come from a normal distribution.

In this screen capture we see the Visualization tab of the toolkit. There is a bar graph chart for different values including BMI, age and blood pressure.

One-way ANOVA

The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean.

Implements scipy.stats.f_oneway. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html

Further reading: https://en.wikipedia.org/wiki/One-way_analysis_of_variance

Parameters

  • The alpha parameter default is 0.05.

Null hypothesis

The specified groups (field_1, ..., field_n) have the same population mean.

Syntax

|score f_oneway <field_1> <field_2> ... <field_n> alpha=<float>

Syntax constraints

A single array (set of fields).

Example

| inputlookup app_usage.csv 
| score f_oneway HR1 HR2

Example output

In this visualization we see the Statistics tab of the toolkit. There is one result with both a statistic value and p-value.

T-test (1 sample)

Test whether the expected value (mean) of a sample of independent observations is equal to the specified population mean.

Implements scipy.stats.ttest_1samp. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html

Further reading: http://www.biostathandbook.com/onesamplettest.html

Parameters:

  • The popmean parameter has no default, and must be specified.
    • Population mean in null hypothesis.
  • The alpha parameter default is 0.05.

Null hypothesis

The expected value (mean) of the specified samples of independent observations (field_1, ... ,field_n) are equal to the given population mean (popmean).

Syntax

...|score ttest_1samp <field_1> ... <field_n> popmean=<float> alpha=<float>

Syntax constraints

A single array (set of fields).

Example

The following example tests whether the sample mean differs from an expected population mean.

| inputlookup power_plant.csv
| score ttest_1samp Temperature popmean=20

Example output

The negative statistic indicates that the sample mean is less than the hypothesized mean of 20.

In this screen capture of the Visualization tab of the toolkit, we see a two-sided test showing a negative statistic.

T-test (2 independent samples)

Used to test if two independent samples come from the same distribution.

Implements scipy.stats.ttest_ind. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

Further reading: https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test

Parameters

  • The equal_var parameter default is True.
    • If True (default), perform a standard independent 2 sample test that assumes equal population variances.
    • If False, perform Welch’s t-test, which does not assume equal population variance.
  • The alpha default is 0.05.

Null hypothesis

The null hypothesis is that the pairs a_field_i and b_field_i (independent samples) have identical average (expected) values. This test assumes that the fields have identical variances by default.

Syntax

|score ttest_ind <a_field_1> ... <a_field_n> against <b_field_1>... <b_field_n> equal_var=<true|false> alpha=<float>

Syntax constraints

Two arrays specified by two ordered sequences of fields (1-to-1, n-to-n and 1-to-n comparison syntaxes).

Example

In the following example we want to analyze disk failures to see if disks are equally likely to fail, or some disks are more likely to cause failure.

| inputlookup disk_failures.csv
| score ttest_ind SMART_1_Raw against SMART_2_Raw SMART_3_Raw SMART_4_Raw

Disk failures are assumed to be independent across disks.

Example output

In the following output, with an alpha of 0.05, we cannot reject the null hypothesis. It does not seem that disks 2, 3 and 4 are failing more than disk 1. All are close to each other.

In this screen capture of the Statistics tab of the toolkit, we see the results of three compared fields with their corresponding p-values and statistic.

T-test (2 related samples)

Used to test if two related samples come from the same distribution.

Implements scipy.stats.ttest_ind. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html

Further reading: https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent%20t-test%20for%20paired%20samples

Parameters

  • The alpha parameter default is 0.05.

Null hypothesis

The null hypothesis is that the pairs a_field_i and b_field_i (related samples, as in two measurements of the same thing) have identical average (expected) values.

Syntax

|score ttest_rel <a_field_1> ... <a_field_n> against <b_field_1> ... <b_field_n> alpha=<float>

Syntax constraints

Two arrays specified by two ordered sequences of fields (1-to-1, n-to-n and 1-to-n comparison syntaxes).

Example

In the following example we want to see if the two measurements of the HR field taken at the same time are statistically identical or not.

| inputlookup app_usage.csv
| score ttest_rel HR1 against HR2

Example output

In the following output we can reject the null hypothesis and conclude that the two measurements are statistically different, potentially indicating a shift from equilibrium.

In this screen capture of the Statistics tab of the toolkit, we see the results of the HR1 against the HR2 and the corresponding p-value and statistic.

Wilcoxon

Used to test if two related samples come from the same distribution.

Implements scipy.stats.wilcoxon. Learn more here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html

Further reading: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

Parameters

  • The zero_method parameter default is wilcox.
    • pratt: include zero-differences in the ranking process (more conservative).
    • wilcox: discard zero-differences.
    • zsplit: split zero ranks between positive and negative.
  • The correction parameter default is False.
    • If the correction parameter is True, apply continuity correction by adjusting the Wilcoxon rank statistic by 0.5 towards the mean value when computing the z-statistic.
  • The alpha parameter default is 0.05.
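For example, to use the more conservative Pratt treatment of zero-differences, the example search later in this section could be adjusted as follows (a sketch, not a required setting):

| inputlookup churn.csv
| score wilcoxon "Night Mins" against "Eve Mins" zero_method=pratt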

Null hypothesis

The null hypothesis is that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences x - y is symmetric about zero.

Syntax

|score wilcoxon <a_field> against <b_field> zero_method=<pratt | wilcox | zsplit> correction=<true | false> alpha=<float>

Syntax constraints

A single pair of fields (1-to-1).

Example

In the following example we want to see if the distribution of nighttime minutes used differs from the distribution of evening minutes:

| inputlookup churn.csv
| score wilcoxon "Night Mins" against "Eve Mins"

Example output

In this screen capture of the Statistics tab of the toolkit, we see one result listed with both a p-value and statistic value.

K-fold scoring

Cross-validation is a general technique aimed to assess how well a statistical model generalizes on an independent dataset. That is, cross-validation will tell you how well your machine learning model is expected to perform on data that it has not been trained on. The scores obtained from K-fold cross-validation are generally a less biased / less optimistic estimate of the model performance than a standard train/test split.

There are many types of cross validation, but k-fold cross validation (kfold_cv) is one of the most common.

The kfold_cv command does not use the score command, but operates like a scoring method.

Cross validation is typically used for:

  • Comparing two or more algorithms against each other for selecting the best choice on a particular dataset.
  • Comparing different choices of hyper-parameters on the same algorithm for choosing the best hyper-parameters for a particular dataset.
  • An improved method over a train/test split for quantifying model generalization.

Cross validation is not suited for time-series:

  • In situations where the data is ordered (e.g. time-series), cross validation is not well suited because the training data is shuffled. In these situations, other methods such as forward chaining are better suited.

In kfold_cv, the training set is randomly partitioned into k equal-sized subsamples. Each subsample then takes a turn as the validation (test) set, with the model trained on the remaining k-1 subsamples. Each subsample is used exactly once as the validation set, and the variance of the resulting estimate is reduced as k is increased. The disadvantage of kfold_cv is that k different models have to be trained, leading to long execution times for large datasets and complex models.

The image is a representative diagram of how kfold works. There are 5 rows representing iterations or folds.  Each fold contains equal subsamples that each take a turn as testing and training data.

In this way, we can obtain k performance metrics (one for each train/test split). These k performance metrics can then be averaged over to obtain a single estimate of how well the model will generalize to unseen data.

Syntax

The kfold_cv command is applicable to all classification and regression algorithms, and can be appended to the end of an SPL search as:

..| fit <classification | regression algo> <targetVariable> from <featureVariables> [options] kfold_cv=<int>

Where kfold_cv=<int> specifies that k=<int> folds will be used. When a classification algorithm is specified, stratified k-fold will be used instead of k-fold. In stratified k-fold, each fold contains approximately the same percentage of samples for each class.

kfold_cv cannot be used when saving a model.

Output

The kfold_cv command returns performance metrics on each fold using the same model (i.e. the same algorithm and hyper-parameters) specified in the SPL. Its only function is to give the user insight into how well the model generalizes; it does not perform any model selection or hyper-parameter tuning. In this way, the current implementation can be seen as a scoring method.

Example 1

The kfold_cv command used in classification. Note that the output is a set of metrics for each fold (accuracy, f1_weighted, precision_weighted, recall_weighted), as in the sketch below.
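A search along the following lines could produce this kind of output (a sketch, assuming the track_day.csv dataset and the features used earlier in this topic):

| inputlookup track_day.csv
| fit LogisticRegression vehicleType from batteryVoltage engineCoolantTemperature engineSpeed kfold_cv=3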

This is a screen capture of the Statistics tab of the toolkit. There are three rows of results displayed.

Example 2

The kfold_cv command used in regression. Note that the output is a set of metrics for each fold (neg_mean_squared_error, r^2), as in the sketch below.
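A regression counterpart could look like the following sketch (assuming the server_power.csv dataset used earlier in this topic):

| inputlookup server_power.csv
| fit LinearRegression ac_power from total-cpu-utilization total-disk-accesses total-disk-utilization kfold_cv=3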

This is a screen capture of the Statistics tab of the toolkit. There are three rows of results displayed.

Last modified on 14 November, 2018

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 4.0.0

