Algorithms
The Splunk Machine Learning Toolkit supports the algorithms listed here. In addition to the examples included in the Splunk Machine Learning Toolkit, you can find more examples of these algorithms on the scikit-learn website at http://scikit-learn.org/stable/auto_examples/index.html.
ML-SPL Quick Reference Guide
Download the Machine Learning Toolkit Quick Reference Guide for a handy cheat sheet of ML-SPL commands and machine learning algorithms used in the Splunk Machine Learning Toolkit.
Extend the algorithms you can use for your models
The 27 algorithms listed on this page and in the ML-SPL Quick Reference Guide are available natively in the Splunk Machine Learning Toolkit. You can also base your models on over 300 open source Python algorithms from the scikit-learn, pandas, statsmodels, NumPy, and SciPy libraries available through the Python for Scientific Computing add-on. For information on how to import an algorithm from the Python for Scientific Computing add-on into the Splunk Machine Learning Toolkit, see the ML-SPL API Guide.
Classifiers
Classifier algorithms predict the value of a categorical field.
DecisionTreeClassifier
The DecisionTreeClassifier algorithm uses the scikit-learn DecisionTreeClassifier estimator to fit a model to predict the value of categorical fields.
Syntax
fit DecisionTreeClassifier <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<N>] [max_features=<str>] [min_samples_split=<N>] [max_leaf_nodes=<N>] [criterion=<gini|entropy>] [splitter=<best|random>]
Example
... | fit DecisionTreeClassifier SLA_violation from * into sla_model | ...
For descriptions of the max_depth, max_features, min_samples_split, max_leaf_nodes, criterion, and splitter parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
You can save DecisionTreeClassifier models by using the into keyword and apply them to new data later by using the apply command (for example, ... | apply model_DTC).
You can inspect the decision tree learned by DecisionTreeClassifier with the summary command (for example, | summary model_DTC). You can also get a JSON representation of the tree by giving json=t as an argument to the summary command (for example, | summary model_DTC json=t). To specify the maximum depth of the tree to summarize, use the limit argument (for example, | summary model_DTC limit=10). The default value for the limit argument is 5.
LogisticRegression
The LogisticRegression algorithm uses the scikit-learn LogisticRegression estimator to fit a model to predict the value of categorical fields.
Syntax
fit LogisticRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [probabilities=<true|false>]
Example
... | fit LogisticRegression SLA_violation from IO_wait_time into sla_model | ...
The fit_intercept parameter specifies whether the model includes an implicit intercept term (the default value is true).
The probabilities parameter specifies whether probabilities for each possible field value should be returned alongside the predicted value (the default value is false).
You can save LogisticRegression models using the into keyword and apply them to new data later using the apply command (for example, ... | apply sla_model).
You can inspect the coefficients learned by LogisticRegression with the summary command (for example, | summary sla_model).
RandomForestClassifier
The RandomForestClassifier algorithm uses the scikit-learn RandomForestClassifier estimator to fit a model to predict the value of categorical fields.
Syntax
fit RandomForestClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [n_estimators=<N>] [max_depth=<N>] [max_features=<str>] [min_samples_split=<N>] [max_leaf_nodes=<N>]
Example
... | fit RandomForestClassifier SLA_violation from * into sla_model | ...
For descriptions of the n_estimators, max_depth, max_features, min_samples_split, and max_leaf_nodes parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
You can save RandomForestClassifier models using the into keyword and apply them to new data later using the apply command (for example, ... | apply sla_model).
You can list the features that were used to fit the model, as well as their relative importance or influence, with the summary command (for example, | summary sla_model).
SVM
The SVM algorithm uses the scikit-learn kernel-based SVC estimator to fit a model to predict the value of categorical fields. It uses the radial basis function (rbf) kernel by default.
Syntax
fit SVM <field_to_predict> from <explanatory_fields> [into <model name>] [C=<number>] [gamma=<number>]
Example
... | fit SVM SLA_violation from * into sla_model | ...
The gamma parameter controls the width of the rbf kernel (the default value is 1 / <number of fields>), and the C parameter controls the degree of regularization when fitting the model (the default value is 1.0).
For descriptions of the C and gamma parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.
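To make the role of gamma more concrete, the following is a minimal Python sketch of the radial basis function kernel, K(x, y) = exp(-gamma * ||x - y||^2). The function name rbf_kernel and the sample vectors are illustrative assumptions and are not part of the toolkit; the fallback of 1 / <number of fields> mirrors the default described above.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """Return exp(-gamma * ||x - y||^2) for two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of fields")
    if gamma is None:
        # Assumption taken from the text: default gamma is 1 / <number of fields>.
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical points have similarity 1.0; similarity decays with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
near = rbf_kernel([1.0, 2.0], [1.5, 2.0], gamma=0.5)
far = rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma=0.5)

# A larger gamma makes the kernel narrower, so the same pair looks less similar.
narrow = rbf_kernel([1.0, 2.0], [1.5, 2.0], gamma=2.0)
```

Because the kernel depends on raw distances between field values, fields with large numeric ranges tend to dominate the result, which is one reason scaling the data first is usually recommended.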
You can save SVM models using the into keyword and apply them to new data later using the apply command (for example, ... | apply sla_model).
You cannot inspect the model learned by SVM with the summary command.
Note: Kernel-based methods such as the scikit-learn SVC tend to work best when the data is scaled, for example, by using the StandardScaler algorithm:
| fit StandardScaler <fields> into scaling_model | fit SVM <response> from <fields> into svm_model
For details, see A Practical Guide to Support Vector Classification at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
BernoulliNB
The BernoulliNB algorithm uses the scikit-learn BernoulliNB estimator (an implementation of the Naive Bayes classification algorithm) to fit a model to predict the value of categorical fields, where explanatory variables are assumed to be binary-valued. This algorithm supports incremental fit.
Syntax
fit BernoulliNB <field_to_predict> from <explanatory_fields> [into <model name>] [alpha=<float>] [binarize=<float>] [fit_prior=<true|false>] [partial_fit=<true|false>]
Example
... | fit BernoulliNB type from * into TESTMODEL_BernoulliNB alpha=0.5 binarize=0 fit_prior=f
- The alpha parameter controls Laplace/Lidstone smoothing (the default value is 1.0).
- The binarize parameter is a threshold that can be used for converting numeric field values to the binary values expected by BernoulliNB (the default value is 0). For instance, if binarize=0 is specified (the default), values > 0 are assumed to be 1, and values <= 0 are assumed to be 0.
- The fit_prior Boolean parameter specifies whether to learn class prior probabilities (the default value is "true"). If fit_prior=f is specified, classes are assumed to have uniform prior probabilities.
- The partial_fit parameter controls whether an existing model should be incrementally updated or not (the default value is "false"). This allows you to update an existing model using only new data without having to retrain it on the full training data set.
Example using partial_fit:
| inputlookup iris.csv | fit BernoulliNB species from * partial_fit=true into My_Incremental_Model
In the example above, if My_Incremental_Model does not exist, the command creates it and saves the model under that name. If My_Incremental_Model exists and was trained using BernoulliNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by BernoulliNB, the command returns an error message. Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=false or partial_fit is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.
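The thresholding performed by the binarize parameter can be sketched in a few lines of Python. This is only an illustration of the rule described above (values greater than the threshold become 1, all others become 0); the function name binarize_values and the sample data are assumptions, not toolkit code.

```python
def binarize_values(values, threshold=0.0):
    """Map each numeric value to 1 if it is strictly greater than threshold, else 0."""
    return [1 if value > threshold else 0 for value in values]


# With the default threshold of 0, zero and negative values map to 0.
default_result = binarize_values([-2.0, 0.0, 0.3, 5.0])

# Raising the threshold to 1.0 turns the small positive value 0.3 into 0 as well.
raised_result = binarize_values([-2.0, 0.0, 0.3, 5.0], threshold=1.0)
```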
You can save BernoulliNB models using the into keyword and apply the saved model later to new data using the apply command (for example, ... | apply TESTMODEL_BernoulliNB).
You cannot inspect the model learned by BernoulliNB with the summary command.
GaussianNB
The GaussianNB algorithm uses the scikit-learn GaussianNB estimator (an implementation of the Gaussian Naive Bayes classification algorithm) to fit a model to predict the value of categorical fields, where the likelihood of explanatory variables is assumed to be Gaussian. This algorithm supports incremental fit.
Syntax
fit GaussianNB <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>]
Example
... | fit GaussianNB species from * into TESTMODEL_GaussianNB
The partial_fit parameter controls whether an existing model should be incrementally updated or not (default is "false"). This allows you to update an existing model using only new data without having to retrain it on the full training data set.
Example using partial_fit:
| inputlookup iris.csv | fit GaussianNB species from * partial_fit=true into My_Incremental_Model
In the example above, if My_Incremental_Model does not exist, the command creates it and saves the model under that name. If My_Incremental_Model exists and was trained using GaussianNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by GaussianNB, the command returns an error message. If partial_fit=false or partial_fit is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.
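The partial_fit rules above amount to a small decision table. The Python sketch below restates them for illustration only; the function name fit_action, its arguments, and the returned labels are hypothetical and do not correspond to any toolkit API.

```python
def fit_action(existing_algorithm, requested_algorithm, partial_fit=False):
    """Describe what a fit call does to a named model, following the rules in the text.

    existing_algorithm is the name of the algorithm that trained the saved model,
    or None if no model with that name exists yet (hypothetical representation).
    """
    if not partial_fit:
        # Without partial_fit, the model is always created and replaces any old one.
        return "create_or_replace"
    if existing_algorithm is None:
        # No saved model yet: the first incremental fit creates it.
        return "create"
    if existing_algorithm == requested_algorithm:
        # Same algorithm: the saved model is updated with the new data only.
        return "update"
    raise ValueError(
        "model was trained by %s, not %s" % (existing_algorithm, requested_algorithm)
    )


first_run = fit_action(None, "GaussianNB", partial_fit=True)
second_run = fit_action("GaussianNB", "GaussianNB", partial_fit=True)
full_refit = fit_action("GaussianNB", "GaussianNB", partial_fit=False)
```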
You can save GaussianNB models using the into keyword and apply the saved model later to new data using the apply command (for example, ... | apply TESTMODEL_GaussianNB).
You cannot inspect the model learned by GaussianNB with the summary command.
SGDClassifier
The SGDClassifier algorithm uses the scikit-learn SGDClassifier estimator to fit a model to predict the value of categorical fields. This algorithm supports incremental fit.
Syntax
fit SGDClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>] [loss=<hinge|log|modified_huber|squared_hinge|perceptron>] [fit_intercept=<true|false>] [random_state=<N>] [n_iter=<N>] [l1_ratio=<float>] [alpha=<float>] [eta0=<float>] [power_t=<float>] [penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>]
Example
... | fit SGDClassifier SLA_violation from * into sla_model
The SGDClassifier algorithm supports the following parameters:
- partial_fit=<true|false>: Specifies whether an existing model should be incrementally updated or not (default "false").
  Example using partial_fit:
  | inputlookup iris.csv | fit SGDClassifier species from * partial_fit=true into My_Incremental_Model
  In the example above, if My_Incremental_Model does not exist, the command creates it and saves the model under that name. If My_Incremental_Model exists and was trained using SGDClassifier, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by SGDClassifier, the command returns an error message. Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=false or partial_fit is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.
- loss=<hinge|log|modified_huber|squared_hinge|perceptron>: The loss function to be used. Defaults to "hinge", which gives a linear SVM. The "log" loss gives logistic regression, a probabilistic classifier. "modified_huber" is another smooth loss that brings tolerance to outliers as well as probability estimates. "squared_hinge" is like hinge but is quadratically penalized. "perceptron" is the linear loss used by the perceptron algorithm.
- fit_intercept=<true|false>: Specifies whether the intercept should be estimated or not (default "true").
- n_iter=<int>: The number of passes over the training data (also known as epochs) (default 5). The number of iterations is set to 1 if using partial_fit.
- penalty=<l2|l1|elasticnet>: The penalty (also known as the regularization term) to be used (default "l2").
- learning_rate=<constant|optimal|invscaling>: The learning rate schedule. "constant": eta = eta0, "optimal": eta = 1.0/(alpha * t), "invscaling": eta = eta0 / pow(t, power_t) (default "invscaling").
- l1_ratio=<float>: The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1 (default 0.15). l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
- alpha=<float>: Constant that multiplies the regularization term (default 0.0001). Also used to compute learning_rate when set to "optimal".
- eta0=<float>: The initial learning rate (default 0.01).
- power_t=<float>: The exponent for inverse scaling learning rate (default 0.25).
- random_state=<int>: The seed of the pseudo random number generator to use when shuffling the data.
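The three learning_rate schedules can be compared with a short Python sketch. It implements the formulas exactly as written in the parameter list above, using the stated defaults for eta0, alpha, and power_t; the function name learning_rate_at and the step counter t starting at 1 are assumptions, and the underlying scikit-learn implementation may differ in detail.

```python
def learning_rate_at(schedule, t, eta0=0.01, alpha=0.0001, power_t=0.25):
    """Return the step size eta at update step t (t >= 1) for the named schedule."""
    if t < 1:
        raise ValueError("t must be a positive step count")
    if schedule == "constant":
        return eta0
    if schedule == "optimal":
        # Formula as given in the text: eta = 1.0 / (alpha * t).
        return 1.0 / (alpha * t)
    if schedule == "invscaling":
        # Formula as given in the text: eta = eta0 / pow(t, power_t).
        return eta0 / pow(t, power_t)
    raise ValueError("unknown schedule: %s" % schedule)


constant_eta = learning_rate_at("constant", t=10)
optimal_eta = learning_rate_at("optimal", t=100)
invscaling_eta = learning_rate_at("invscaling", t=16)
```

With "constant" the step size never changes, while "optimal" and "invscaling" both shrink the step size as more updates are applied.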
You can save SGDClassifier models using the into keyword and apply the saved model later to new data using the apply command (for example, ... | apply sla_model).
You can inspect the model learned by SGDClassifier with the summary command (for example, | summary sla_model).
Regressors
Regressor algorithms predict the value of a numeric field.
DecisionTreeRegressor
The DecisionTreeRegressor algorithm uses the scikit-learn DecisionTreeRegressor estimator to fit a model to predict the value of numeric fields.
Syntax
fit DecisionTreeRegressor <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<N>] [max_features=<str>] [min_samples_split=<N>] [max_leaf_nodes=<N>] [splitter=<best|random>]
Example
... | fit DecisionTreeRegressor temperature from date_month date_hour into temperature_model | ...
For descriptions of the max_depth, max_features, min_samples_split, max_leaf_nodes, and splitter parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html.
You can save DecisionTreeRegressor models using the into keyword and apply them to new data later using the apply command (for example, ... | apply model_DTR).
You can inspect the decision tree learned by DecisionTreeRegressor with the summary command (for example, | summary model_DTR). You can also get a JSON representation of the tree by giving json=t as an argument to the summary command (for example, | summary model_DTR json=t). To specify the maximum depth of the tree to summarize, use the limit argument (for example, | summary model_DTR limit=10). The default value for the limit argument is 5.
KernelRidge
The KernelRidge algorithm uses the scikit-learn KernelRidge algorithm to fit a model to predict numeric fields. This algorithm uses the radial basis function (rbf) kernel by default.
Syntax
fit KernelRidge <field_to_predict> from <explanatory_fields> [into <model_name>] [gamma=<number>]
Example
... | fit KernelRidge temperature from date_month date_hour into temperature_model | ...
The gamma parameter controls the width of the rbf kernel (the default value is 1 / <number of fields>). For details, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html.
You can save KernelRidge models using the into keyword and apply them to new data later using the apply command (for example, ... | apply temperature_model).
You cannot inspect the model learned by KernelRidge with the summary command.
LinearRegression
The LinearRegression algorithm uses the scikit-learn LinearRegression estimator to fit a model to predict the value of numeric fields.
Syntax
fit LinearRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>]
Example
... | fit LinearRegression temperature from date_month date_hour into temperature_model | ...
The fit_intercept parameter specifies whether the model should include an implicit intercept term (the default value is "true").
You can save LinearRegression models using the into keyword and apply them to new data later using the apply command (for example, ... | apply temperature_model).
You can inspect the coefficients learned by LinearRegression with the summary command (for example, | summary temperature_model).
RandomForestRegressor
The RandomForestRegressor algorithm uses the scikit-learn RandomForestRegressor estimator to fit a model to predict the value of numeric fields.
Syntax
fit RandomForestRegressor <field_to_predict> from <explanatory_fields> [into <model name>] [n_estimators=<N>] [max_depth=<N>] [max_features=<str>] [min_samples_split=<N>] [max_leaf_nodes=<N>]
Example
... | fit RandomForestRegressor temperature from date_month date_hour into temperature_model | ...
For descriptions of the n_estimators, max_depth, max_features, min_samples_split, and max_leaf_nodes parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
You can save RandomForestRegressor models using the into keyword and apply them to new data later using the apply command (for example, ... | apply temperature_model).
You can list the features that were used to fit the model, as well as their relative importance or influence, with the summary command (for example, | summary temperature_model).
Lasso
The Lasso algorithm uses the scikit-learn Lasso estimator to fit a model to predict the value of numeric fields. Lasso is like LinearRegression, but it uses L1 regularization to learn linear models with fewer and smaller coefficients. Lasso models are consequently more robust to noise and resilient against overfitting.
Syntax
fit Lasso <field_to_predict> from <explanatory_fields> [into <model name>] [alpha=<float>]
Example
... | fit Lasso temperature from date_month date_hour | ...
The alpha parameter controls the degree of L1 regularization. For details, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.
You can save Lasso models using the into keyword and apply them to new data later using the apply command (for example, ... | apply temperature_model).
You can inspect the coefficients learned by Lasso with the summary command (for example, | summary temperature_model).
ElasticNet
The ElasticNet algorithm uses the scikit-learn ElasticNet estimator to fit a model to predict the value of numeric fields. ElasticNet is a linear regression model that includes both L1 and L2 regularization (it is a generalization of Lasso and Ridge).
Syntax
fit ElasticNet <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>] [alpha=<N>] [l1_ratio=<N>]
Example
... | fit ElasticNet temperature from date_month date_hour normalize=true alpha=0.5 | ...
For descriptions of the fit_intercept, normalize, alpha, and l1_ratio parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html.
You can save ElasticNet models using the into keyword and apply them to new data later using the apply command (for example, ... | apply temperature_model).
You can inspect the coefficients learned by ElasticNet with the summary command (for example, | summary temperature_model).
Ridge
The Ridge algorithm uses the scikit-learn Ridge estimator to fit a model to predict the value of numeric fields. Ridge is like LinearRegression, but it uses L2 regularization to learn linear models with smaller coefficients, making the algorithm more robust to collinearity.
Syntax
fit Ridge <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>] [alpha=<N>]
Example
... | fit Ridge temperature from date_month date_hour normalize=true alpha=0.5 | ...
The alpha parameter specifies the degree of regularization (the default value is 1.0). For descriptions of the fit_intercept, normalize, and alpha parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html.
You can save Ridge models using the into keyword and apply them to new data later using the apply command (for example, ... | apply temperature_model).
You can inspect the coefficients learned by Ridge with the summary command (for example, | summary temperature_model).
SGDRegressor
The SGDRegressor algorithm uses the scikit-learn SGDRegressor estimator to fit a model to predict the value of numeric fields. This algorithm supports incremental fit.
Syntax
fit SGDRegressor <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>] [fit_intercept=<true|false>] [random_state=<N>] [n_iter=<N>] [l1_ratio=<float>] [alpha=<float>] [eta0=<float>] [power_t=<float>] [penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>]
Example
... | fit SGDRegressor temperature from date_month date_hour into temperature_model | ...
The SGDRegressor algorithm supports the following parameters:
- partial_fit=<true|false>: Specifies whether an existing model should be incrementally updated or not (default "false").
  Example using partial_fit:
  | inputlookup server_power.csv | fit SGDRegressor "ac_power" from "total-cpu-utilization" "total-disk-accesses" partial_fit=true into My_Incremental_Model
  In the example above, if My_Incremental_Model does not exist, the command creates it and saves the model under that name. If My_Incremental_Model exists and was trained using SGDRegressor, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by SGDRegressor, the command returns an error message. Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=false or partial_fit is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.
- fit_intercept=<true|false>: Specifies whether the intercept should be estimated or not (default "true").
- n_iter=<int>: The number of passes over the training data (also known as epochs) (default 5). The number of iterations is set to 1 if using partial_fit.
- penalty=<l2|l1|elasticnet>: The penalty (also known as the regularization term) to be used (default "l2").
- learning_rate=<constant|optimal|invscaling>: The learning rate schedule. "constant": eta = eta0, "optimal": eta = 1.0/(alpha * t), "invscaling": eta = eta0 / pow(t, power_t) (default "invscaling").
- l1_ratio=<float>: The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1 (default 0.15). l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
- alpha=<float>: Constant that multiplies the regularization term (default 0.0001). Also used to compute learning_rate when set to "optimal".
- eta0=<float>: The initial learning rate (default 0.01).
- power_t=<float>: The exponent for inverse scaling learning rate (default 0.25).
- random_state=<int>: The seed of the pseudo random number generator to use when shuffling the data.
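To illustrate how l1_ratio mixes the two penalties, the following Python sketch computes one common form of the elastic-net regularization term, alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2). The function name elastic_net_penalty and the sample weights are illustrative, and the 0.5 factor on the squared L2 term is an assumption about the convention; the exact scaling used internally may differ.

```python
def elastic_net_penalty(weights, alpha=0.0001, l1_ratio=0.15):
    """Return alpha * (l1_ratio * L1 + 0.5 * (1 - l1_ratio) * squared L2) for the weights."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must be between 0 and 1")
    l1_norm = sum(abs(w) for w in weights)
    squared_l2_norm = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1_norm + 0.5 * (1.0 - l1_ratio) * squared_l2_norm)


weights = [3.0, -4.0]  # L1 norm is 7, squared L2 norm is 25

pure_l1 = elastic_net_penalty(weights, alpha=1.0, l1_ratio=1.0)  # only the L1 term remains
pure_l2 = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.0)  # only the L2 term remains
mixed = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5)    # equal mix of both terms
```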
You can save SGDRegressor models using the into keyword and apply them to new data later using the apply command (for example, ... | apply temperature_model).
You can inspect the coefficients learned by SGDRegressor with the summary command (for example, | summary temperature_model).
Feature Extraction
Feature extraction algorithms transform fields for better prediction accuracy.
FieldSelector
The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.
Syntax
fit FieldSelector <field_to_predict> from <explanatory_fields> [into <model name>] [type=<categorical, numeric>] [mode=<k_best, fpr, fdr, fwe, percentile>] [param=<N>]
Example
... | fit FieldSelector type=categorical SLA_violation from * into sla_model | ...
The type parameter specifies whether the field to predict is categorical or numeric. For descriptions of the mode and param parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html.
You can save FieldSelector models using the into keyword and apply them to new data later using the apply command (for example, ... | apply sla_model).
You cannot inspect the model learned by FieldSelector with the summary command.
TFIDF
The TFIDF algorithm uses the scikit-learn TfidfVectorizer to convert raw text data into a matrix, which makes it possible to use other machine learning estimators on the data.
Syntax
fit TFIDF <field_to_convert> [into <model name>] [max_features=<N>] [max_df=<N>] [min_df=<N>] [ngram_range=<str>] [analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]
Example
... | fit TFIDF Reviews into user_feedback_model max_df=0.6 min_df=0.2 | ...
For descriptions of the max_features, max_df, min_df, ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
The default for max_features is 100.
To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set stop_words to english. For other languages (for example, machine language) you can ignore the common words by setting max_df to a value greater than or equal to 0.7 and less than 1.0.
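The effect of max_df and min_df can be sketched in plain Python. The example below treats both values as fractions of documents and keeps only terms whose document frequency falls inside that range, which is how a max_df of 0.7 drops words that appear in nearly every event. The function name filter_terms_by_document_frequency, the whitespace tokenization, and the sample documents are simplifying assumptions, not the toolkit's implementation.

```python
def filter_terms_by_document_frequency(documents, min_df=0.0, max_df=1.0):
    """Return the sorted terms whose share of documents lies within [min_df, max_df]."""
    total = len(documents)
    document_counts = {}
    for document in documents:
        # Count each term once per document, using naive whitespace tokenization.
        for term in set(document.lower().split()):
            document_counts[term] = document_counts.get(term, 0) + 1
    return sorted(
        term
        for term, count in document_counts.items()
        if min_df <= count / total <= max_df
    )


documents = [
    "the disk is full",
    "the cpu is busy",
    "the disk failed",
    "the network is down",
]

# "the" appears in 100% of documents and "is" in 75%, so max_df=0.7 removes both.
without_common = filter_terms_by_document_frequency(documents, max_df=0.7)

# Adding min_df=0.5 also removes rare terms, leaving only "disk" (50% of documents).
mid_frequency = filter_terms_by_document_frequency(documents, min_df=0.5, max_df=0.7)
```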
You can save TFIDF models using the into keyword and apply them to new data later using the apply command (for example, ... | apply user_feedback_model).
You cannot inspect the model learned by TFIDF with the summary command.
PCA
The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data.
Syntax
fit PCA <fields> [into <model name>] [k=<N>]
Example
... | fit PCA * k=3 | ...
The k parameter specifies the number of features to be extracted from the data.
You can save PCA models using the into keyword and apply them to new data later using the apply command (for example, ... | apply user_feedback_model).
You cannot inspect the model learned by PCA with the summary command.
KernelPCA
The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of data. The difference between KernelPCA and PCA is the use of kernels in the former, which helps with finding nonlinear dependencies among the fields. Currently, KernelPCA only supports the Radial Basis Function (rbf) kernel.
Syntax
fit KernelPCA <fields> [into <model name>] [k=<N>] [gamma=<N>] [tolerance=<N>] [max_iteration=<N>]
Example
... | fit KernelPCA * k=3 gamma=0.001 | ...
The k parameter specifies the number of features to be extracted from the data. The other parameters are for fine-tuning the kernel. For descriptions of the gamma, tolerance, and max_iteration parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html.
You can save KernelPCA models using the into keyword and apply them to new data later using the apply command (for example, ... | apply user_feedback_model).
You cannot inspect the model learned by KernelPCA with the summary command.
Note: Kernel-based methods such as the scikit-learn KernelPCA tend to work best when the data is scaled, for example, by using the StandardScaler algorithm:
| fit StandardScaler <fields> into scaling_model | fit KernelPCA <fields> into kpca_model
For details, see A Practical Guide to Support Vector Classification at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Anomaly Detectors
Anomaly detection algorithms detect anomalies and outliers in numerical or categorical fields.
OneClassSVM
The OneClassSVM algorithm uses the scikit-learn OneClassSVM (an unsupervised outlier detection method) to fit a model from the set of features (i.e. fields) for detecting anomalies/outliers, where features are expected to contain numerical values.
Syntax
fit OneClassSVM <fields> [into <model name>] [kernel=<str>] [nu=<float>] [coef0=<float>] [gamma=<float>] [tol=<float>] [degree=<number>] [shrinking=<true|false>]
Example
... | fit OneClassSVM * kernel="poly" nu=0.5 coef0=0.5 gamma=0.5 tol=1 degree=3 shrinking=f into TESTMODEL_OneClassSVM
The kernel parameter specifies the kernel type ("linear", "rbf", "poly", or "sigmoid") to use in the algorithm (the default value is "rbf"). The nu parameter specifies an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the default value is 0.5). The degree parameter is ignored by all kernels except the polynomial kernel (the default value is 3). The gamma parameter is the kernel coefficient that specifies how much influence a single data instance has (the default value is 1 / <number of features>). The coef0 parameter is the independent term in the kernel function, which is only significant for the polynomial and sigmoid kernels. The tol parameter is the tolerance for the stopping criterion. The shrinking parameter specifies whether to use the shrinking heuristic. For details, see http://scikit-learn.org/stable/modules/svm.html#kernel-functions.
You can save OneClassSVM models using the into keyword and apply the saved model later to new data using the apply command (for example, ... | apply TESTMODEL_OneClassSVM). After running the fit or apply command, a new field named "isNormal" is generated that defines whether a particular record (row) is normal ("isNormal"=1) or anomalous ("isNormal"=-1).
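As a small illustration of how the isNormal field might be consumed downstream, the Python sketch below selects the anomalous records from a list of result rows. The rows, the host field, and the function name anomalous_rows are made up for the example; only the isNormal values of 1 and -1 come from the description above.

```python
def anomalous_rows(rows, field="isNormal"):
    """Return the rows flagged as anomalous, that is, rows whose field equals -1."""
    return [row for row in rows if int(row[field]) == -1]


# Hypothetical results after running fit or apply with OneClassSVM.
results = [
    {"host": "web-01", "isNormal": 1},
    {"host": "web-02", "isNormal": -1},
    {"host": "web-03", "isNormal": 1},
]

outliers = anomalous_rows(results)
outlier_hosts = [row["host"] for row in outliers]
```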
You cannot inspect the model learned by OneClassSVM with the summary command.
Clustering Algorithms
Clustering algorithms separate results into clusters.
KMeans
The KMeans algorithm uses the scikit-learn KMeans clustering algorithm to divide a result set into "k" distinct clusters. The cluster for each event is set in a new field named "cluster".
Syntax
fit KMeans <fields> [k=<N>] [as <output_field>]
Example
... | fit KMeans * k=3 | stats count by cluster
The k parameter specifies the number of clusters to divide the data into. By default, the cluster label is assigned to a field named "cluster", but you can change that behavior by using the as keyword to specify a different field name.
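The cluster label that ends up in the "cluster" field can be thought of as the index of the nearest cluster center. The Python sketch below shows only that assignment step for centers that are already known; it does not implement the iterative fitting that finds the centers. The function names, points, and centers are illustrative assumptions.

```python
def squared_distance(a, b):
    """Return the squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_clusters(points, centers):
    """Return, for each point, the index of the closest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))
        for point in points
    ]


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (9.0, 11.0)]
centers = [(0.5, 0.0), (9.5, 10.5)]  # k=2 centers, assumed to be already fitted

labels = assign_clusters(points, centers)
```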
You can save KMeans models using the into keyword and apply them to new data later using the apply command (for example, ... | apply cluster_model).
You can inspect the model learned by KMeans with the summary command (for example, | summary cluster_model).
For the default value of k, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html.
DBSCAN
The DBSCAN algorithm uses the scikit-learn DBSCAN clusterer to divide a result set into distinct clusters. The cluster for each event is set in a new field named "cluster". DBSCAN is distinct from KMeans in that it clusters results based on local density, and uncovers a variable number of clusters, whereas KMeans finds a precise number of clusters (for example, k=5 finds 5 clusters).
Syntax
fit DBSCAN <fields> [eps=<number>] [as <output_field>]
Example
... | fit DBSCAN * | stats count by cluster
The eps parameter specifies the maximum distance between two samples for them to be considered in the same cluster. By default, the cluster label is assigned to a field named "cluster", but you can change that behavior by using the as keyword to specify a different field name.
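The distance test controlled by eps can be sketched as a neighborhood lookup. The Python example below finds, for one sample, all other samples within eps of it. This is only the building block behind density-based clustering, not the full DBSCAN procedure, which also decides which samples are dense enough to seed a cluster. The function name eps_neighbors and the sample points are assumptions.

```python
import math


def eps_neighbors(points, index, eps):
    """Return the indices of all other points within distance eps of points[index]."""
    origin = points[index]
    return [
        other
        for other, point in enumerate(points)
        if other != index and math.dist(origin, point) <= eps
    ]


points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.4), (5.0, 5.0)]

close_to_first = eps_neighbors(points, 0, eps=0.5)   # the two nearby points
close_to_last = eps_neighbors(points, 3, eps=0.5)    # isolated point has no neighbors
wider_search = eps_neighbors(points, 3, eps=10.0)    # a large eps reaches every point
```

A small eps leaves isolated samples without neighbors, while a very large eps tends to merge everything into one group, so the value usually has to be tuned to the scale of the data.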
You cannot save DBSCAN models using the into keyword. If you want to be able to predict cluster assignments for future data, you can combine the DBSCAN algorithm with any classifier algorithm (for example, first cluster the data using DBSCAN, then fit a classifier to predict the cluster using RandomForestClassifier).
Birch
The Birch algorithm uses the scikit-learn Birch clustering algorithm to divide a result set into a set of distinct clusters. The cluster for each event is set in a new field named "cluster". This algorithm supports incremental fit.
Syntax
fit Birch <fields> [into <model name>] [k=<N>] [as <output_field>] [partial_fit=<true|false>]
Example
... | fit Birch * k=3 | stats count by cluster
- The k parameter specifies the number of clusters to divide the data into after the final clustering step, which treats the subclusters from the leaves of the CF tree as new samples. By default, the cluster label is assigned to a field named "cluster", but you can change that behavior by using the as keyword to specify a different field name.
- The partial_fit parameter controls whether an existing model should be incrementally updated or not (default is "false"). This allows you to update an existing model using only new data without having to retrain it on the full training data set.
Example using partial_fit:
| inputlookup track_day.csv | fit Birch * k=6 partial_fit=true into My_Incremental_Model
In the example above, if My_Incremental_Model does not exist, the command creates it and saves the model under that name. If My_Incremental_Model exists and was trained using Birch, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by Birch, the command returns an error message. Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=false or partial_fit is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.
You can save Birch models using the into keyword and apply them to new data later using the apply command (for example, ... | apply Birch_model).
You cannot inspect the model learned by Birch with the summary
command.
SpectralClustering
The SpectralClustering algorithm uses the scikit-learn SpectralClustering clustering algorithm to divide a result set into a set of distinct clusters. SpectralClustering first transforms the input data using the Radial Basis Function (rbf) kernel, and then performs KMeans clustering on the result. Consequently, SpectralClustering can learn clusters with a non-convex shape. The cluster for each event is set in a new field named "cluster".
Syntax
fit SpectralClustering <fields> [k=<N>] [gamma=<float>] [as <output_field>]
Example
... | fit SpectralClustering * k=3 | stats count by cluster
The k parameter specifies the number of clusters to divide the data into after the kernel step. By default, the cluster label is assigned to a field named "cluster", but you can change that behavior by using the as keyword to specify a different field name.
You cannot save SpectralClustering models using the into keyword. If you want to be able to predict cluster assignments for future data, you can combine the SpectralClustering algorithm with any classifier algorithm (for example, first cluster the data using SpectralClustering, then fit a classifier to predict the cluster using RandomForestClassifier).
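The kernel step mentioned above can be illustrated with a short pure-Python sketch of the rbf similarity, exp(-gamma * ||x - y||^2). This is only an illustration of the formula, not the toolkit's code; the function names and the sample points are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Similarity between two points: 1.0 when identical, near 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def affinity_matrix(points, gamma=1.0):
    """Pairwise rbf similarities; the later clustering step works on this matrix."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


# Hypothetical sample: two nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
for row in affinity_matrix(points, gamma=0.5):
    print([round(value, 4) for value in row])
```

A larger gamma makes the similarity fall off faster with distance, so only very close events count as similar; a smaller gamma treats more distant events as related.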
Preprocessing
Preprocessing algorithms are used for preparing data.
StandardScaler
The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize the data fields by scaling their mean and standard deviation to 0 and 1, respectively. This preprocessing step helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms, and is practically required for some algorithms, such as KernelPCA and SVM. This algorithm supports incremental fit.
Syntax
fit StandardScaler <fields> [into <model_name>] [with_mean=<true|false>] [with_std=<true|false>] [partial_fit=<true|false>]
Example
... | fit StandardScaler * | ...
- The with_mean and with_std parameters specify whether the fields should be standardized with respect to their mean and standard deviation, respectively.
- The partial_fit parameter controls whether an existing model should be incrementally updated (default is "false"). This allows you to update an existing model using only new data without having to retrain it on the full training data set. Example using partial_fit:
| inputlookup track_day.csv | fit StandardScaler "batteryVoltage", "engineCoolantTemperature", "engineSpeed" partial_fit=true into My_Incremental_Model
In the example above, if My_Incremental_Model does not exist, a new model is created and saved under that name. If My_Incremental_Model exists and was trained using StandardScaler, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by StandardScaler, the command returns an error message. Using partial_fit=true on an existing model ignores the newly supplied parameters; the parameters supplied at model creation are used instead. If partial_fit=false or partial_fit is not specified, the specified model is created and replaces the pre-trained model if one exists.
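The idea behind an incremental fit can be sketched in pure Python: keep a running count, mean, and sum of squared deviations, and update them batch by batch. This is a simplified illustration under stated assumptions, not the toolkit's implementation; the class name RunningScaler and the sample batches are hypothetical, and the sketch uses the population standard deviation, as scikit-learn's StandardScaler does.

```python
import math


class RunningScaler:
    """Tracks mean and standard deviation of one field across batches."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def partial_fit(self, values):
        # Welford's online update: one pass, no need to keep earlier batches.
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
        return self

    @property
    def std(self):
        # Population standard deviation; undefined before any data is seen.
        return math.sqrt(self.m2 / self.count)

    def transform(self, values):
        # Assumes std > 0, that is, the field is not constant.
        return [(x - self.mean) / self.std for x in values]


# Two hypothetical batches arriving at different times.
scaler = RunningScaler().partial_fit([1.0, 2.0, 3.0]).partial_fit([4.0, 5.0])
print(scaler.mean, scaler.std)
```

Updating in two batches yields the same mean and standard deviation as a single fit over all five values, which is what makes retraining on the full data set unnecessary.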
You can save StandardScaler models using the into keyword and apply them to new data later using the apply command (for example, ... | apply scaling_model).
You can inspect the statistics extracted by StandardScaler with the summary
command (for example, | summary scaling_model
).
Time Series Analysis
Time series analysis algorithms provide methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data and forecast its future values.
ARIMA
The Autoregressive Integrated Moving Average (ARIMA) algorithm uses the StatsModels ARIMA algorithm to fit a model on a time series in order to better understand it or forecast its future values. An ARIMA model can consist of autoregressive terms, moving average terms, and differencing operations. The autoregressive terms express the dependency of the current value of the time series on its previous values. The moving average terms model the effect of previous forecast errors (also called random shocks or white noise) on the current value. If the time series is non-stationary, differencing operations are used to make it stationary. A stationary process is a stochastic process whose probability distribution does not change over time.
Syntax
fit ARIMA [_time] <field_to_forecast> order=<N>-<N>-<N> [forecast_k=<N>] [conf_interval=<N>] [holdback=<N>]
Example
... | fit ARIMA Voltage order=4-0-1 holdback=10 forecast_k=10
ARIMA requires the order parameter to be specified at fitting time. order takes three values, separated by hyphens (for example, order=4-0-1):
- Number of autoregressive (AR) parameters
- Number of differencing operations (D)
- Number of moving average (MA) parameters
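The middle value of order counts differencing operations. As a rough illustration of what one differencing operation does to a series, here is a minimal pure-Python sketch; the function name difference and the sample series are hypothetical, and this is not how the toolkit or StatsModels implements it internally.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: each value minus its predecessor."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


# A linear trend becomes constant after one differencing operation.
print(difference([3, 5, 7, 9, 11]))        # [2, 2, 2, 2]
# A quadratic trend needs two.
print(difference([1, 4, 9, 16, 25], d=2))  # [2, 2, 2]
```

Each differencing operation shortens the series by one value and removes one degree of polynomial trend, which is how a non-stationary series can be made stationary.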
It also supports the following parameters:
- forecast_k=<int>: Specifies how many points into the future ARIMA should forecast. If _time is specified during fitting along with the <field_to_forecast>, ARIMA also generates the timestamps for the forecasted values. By default, forecast_k is zero.
- conf_interval=<1..99>: Specifies the confidence interval, as a percentage, around the forecasted values. By default, it is set to 95%.
- holdback=<int>: Specifies the number of data points held back from the ARIMA model. This can be useful when you want to compare the forecast against known data points. By default, holdback is zero.
Best Practices
- It is highly recommended that you send the time series through timechart before sending it into ARIMA to avoid non-uniform sampling intervals. If _time is not specified, using timechart is not necessary.
- The time series should not have any gaps or missing data; otherwise, ARIMA returns an error. If there are missing samples in the data, use a bigger span in timechart or use streamstats to fill in the gaps with average values.
- ARIMA supports one time series at a time.
- ARIMA models cannot be saved and used at a later time in the current version.
- When chaining ARIMA output to another algorithm (for example, ARIMA itself), keep in mind that the length of the data is the length of the original data plus forecast_k. If you want to maintain the holdback position, you need to add the value of forecast_k to your holdback value.
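The chaining rule in the last item is simple arithmetic, sketched below in Python. The function name chained_holdback and the numbers are hypothetical and only restate the rule above.

```python
def chained_holdback(original_length, holdback, forecast_k):
    """Return the output length of the first ARIMA stage and the holdback
    to pass to the next stage so the holdback boundary stays in place."""
    output_length = original_length + forecast_k
    next_holdback = holdback + forecast_k
    return output_length, next_holdback


# 100 events, holdback=10, forecast_k=10 in the first stage.
output_length, next_holdback = chained_holdback(100, holdback=10, forecast_k=10)
print(output_length, next_holdback)  # 110 20
```

In both stages the holdback boundary falls after the 90th event (100 - 10 and 110 - 20), which is what maintaining the holdback position means.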
See the StatsModels documentation at http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.arima_model.ARIMA.html for more information.
Utility Algorithms
Utility algorithms are not machine learning algorithms, but they provide methods to calculate data characteristics. These algorithms facilitate the process of algorithm selection and parameter selection.
Autocorrelation Function
Autocorrelation Function (ACF) calculates the correlation between a sequence and a shifted copy of itself, as a function of shift. Shift is also called lag or delay.
Syntax
fit ACF <field> [k=<N>] [fft=<true|false>] [conf_interval=<N>]
Example
... | fit ACF logins k=50 fft=true conf_interval=90
- The k parameter specifies the number of lags to return autocorrelation for. By default, k is 40.
- The fft parameter specifies whether ACF is computed via Fast Fourier Transform (FFT). By default, fft is false.
- The conf_interval parameter specifies the confidence interval in percentage to return. By default, conf_interval is set to 95.
See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html for more information.
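As an informal illustration of what the autocorrelation at each lag measures, here is a minimal pure-Python sketch of the direct (non-FFT) calculation. The function name acf and the sample series are hypothetical; this is a sketch of the standard estimator, not the toolkit's code, and it does not compute confidence intervals.

```python
def acf(series, k=40):
    """Autocorrelation of a series with itself for lags 0..k."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    # Lag-0 sum of squares; zero for a constant series, which this sketch
    # does not handle.
    total = sum(c * c for c in centered)
    max_lag = min(k, n - 1)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / total
        for lag in range(max_lag + 1)
    ]


# Hypothetical series that alternates every event.
series = [1, -1] * 10
print([round(r, 2) for r in acf(series, k=3)])
```

For the alternating series, lag 1 is strongly negative and lag 2 strongly positive, reflecting that each value is the opposite of its neighbor and the same as the value two steps back.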
Partial Autocorrelation Function
Partial Autocorrelation Function (PACF) gives the partial correlation between a sequence and its lagged values, controlling for the values of lags that are shorter than its own.
Syntax
fit PACF <field> [k=<N>] [method=<ywunbiased|ywmle|ols>] [conf_interval=<N>]
Example
... | fit PACF logins k=20 conf_interval=90
- The k parameter specifies the number of lags to return partial autocorrelation for. By default, k is 40.
- The method parameter specifies which method to use for the calculation. By default, method is ywunbiased.
- The conf_interval parameter specifies the confidence interval in percentage to return. By default, conf_interval is set to 95.
See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.pacf.html for more information.
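To show how partial autocorrelation "controls for" shorter lags, the sketch below derives it from a list of autocorrelations using the Durbin-Levinson recursion. This is a swapped-in textbook technique chosen for brevity, not one of the method options listed above and not the toolkit's code; the function name pacf_from_acf and the sample values are hypothetical.

```python
def pacf_from_acf(r, k):
    """Partial autocorrelations for lags 0..k from autocorrelations r.

    r[0] must be 1.0 and len(r) must be greater than k.
    """
    pacf = [1.0]
    prev = []  # coefficients of the order-(lag - 1) autoregressive fit
    for lag in range(1, k + 1):
        numerator = r[lag] - sum(prev[j] * r[lag - 1 - j] for j in range(lag - 1))
        denominator = 1.0 - sum(prev[j] * r[j + 1] for j in range(lag - 1))
        phi = numerator / denominator
        # Update the lower-order coefficients, then append the new one.
        current = [prev[j] - phi * prev[lag - 2 - j] for j in range(lag - 1)]
        current.append(phi)
        pacf.append(phi)
        prev = current
    return pacf


# Theoretical autocorrelations of a process where each value depends only
# on the previous one with coefficient 0.5: r[lag] = 0.5 ** lag.
r = [0.5 ** lag for lag in range(5)]
print(pacf_from_acf(r, k=3))  # [1.0, 0.5, 0.0, 0.0]
```

The autocorrelation decays slowly (0.5, 0.25, 0.125), but the partial autocorrelation drops to zero after lag 1, because once the previous value is accounted for, older values add no further information.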
This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 2.3.0