Algorithms

The Machine Learning Toolkit and Showcase App supports the following algorithms. In addition to the examples included in this app, you can find more examples of these algorithms on the Examples page on the scikit-learn website.

Classifiers

Classifier algorithms predict the value of a categorical field.

DecisionTreeClassifier

The DecisionTreeClassifier algorithm uses scikit-learn's DecisionTreeClassifier estimator to fit a model to predict the value of categorical fields.

Syntax

```   fit DecisionTreeClassifier <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<N>] [max_features=<str>] [min_samples_split=<N>] [max_leaf_nodes=<N>] [criterion=<gini|entropy>] [splitter=<best|random>]
```

Example

```   ... | fit DecisionTreeClassifier SLA_violation from * into sla_model | ...
```

For descriptions of the `max_depth`, `max_features`, `min_samples_split`, `max_leaf_nodes`, `criterion`, and `splitter` parameters, see the scikit-learn documentation.

You can save DecisionTreeClassifier models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply model_DTC`).

You can inspect the decision tree learned by DecisionTreeClassifier with the `summary` command (for example, `| summary model_DTC`). Furthermore, you can get a JSON representation of the tree by giving `json=t` as an argument to the `summary` command (for example, `| summary model_DTC json=t`). To specify the maximum depth of the tree to summarize, use the `limit` argument (for example, `| summary model_DTC limit=10`).

LogisticRegression

The LogisticRegression algorithm uses scikit-learn's LogisticRegression estimator to fit a model to predict the value of categorical fields.

Syntax

```   fit LogisticRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>]
```

Example

```   ... | fit LogisticRegression SLA_violation from IO_wait_time into sla_model | ...
```

The `fit_intercept` parameter specifies whether the model should include an implicit intercept term (the default value is "true").

You can save LogisticRegression models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply sla_model`).

You can inspect the coefficients learned by LogisticRegression with the `summary` command (for example, `| summary sla_model`).

RandomForestClassifier

The RandomForestClassifier algorithm uses scikit-learn's RandomForestClassifier estimator to fit a model to predict the value of categorical fields.

Syntax

```   fit RandomForestClassifier <field_to_predict> from <explanatory_fields>
[into <model name>] [n_estimators=<N>] [max_depth=<N>]
[max_features=<str>] [min_samples_split=<N>]
```

Example

```   ... | fit RandomForestClassifier SLA_violation from * into sla_model | ...
```

For descriptions of the `n_estimators`, `max_depth`, `max_features`, and `min_samples_split` parameters, see the scikit-learn documentation.

You can save RandomForestClassifier models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply sla_model`).

You cannot inspect the model learned by RandomForestClassifier with the `summary` command.

SVM

The SVM algorithm uses scikit-learn's kernel-based SVC estimator to fit a model to predict the value of categorical fields. It uses the radial basis function (rbf) kernel by default.

Syntax

```   fit SVM <field_to_predict> from <explanatory_fields> [into <model name>] [C=<number>] [gamma=<number>]
```

Example

```   ... | fit SVM SLA_violation from * into sla_model | ...
```

The `gamma` parameter controls the width of the rbf kernel (the default value is 1 / <number of fields>), and the `C` parameter controls the degree of regularization when fitting the model (the default value is 1.0).

For descriptions of the `C` and `gamma` parameters, see the scikit-learn documentation.
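
To make the role of `gamma` concrete, the following Python sketch evaluates the rbf kernel, exp(-gamma * ||x - y||^2), for two points. It is an illustration only and not the toolkit's implementation; the function name `rbf_kernel` and the sample values are hypothetical. When `gamma` is omitted, the sketch falls back to 1 / <number of fields>, matching the default described above.

```python
import math

def rbf_kernel(x, y, gamma=None):
    """Return exp(-gamma * squared Euclidean distance between x and y)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of fields")
    if gamma is None:
        # Assumed default, taken from the text above: 1 / <number of fields>.
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a = [0.0, 0.0]
b = [1.0, 1.0]
print(rbf_kernel(a, b))             # default gamma = 1/2
print(rbf_kernel(a, b, gamma=5.0))  # larger gamma: similarity decays faster
```

A larger `gamma` makes the kernel narrower, so two points must be closer together to be treated as similar.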

Note: Kernel-based methods such as scikit-learn's SVC tend to work best when the data is scaled (for example, using the StandardScaler algorithm: `| fit StandardScaler <fields> into scaling_model | fit SVM <response> from <fields> into svm_model`). For details, see A Practical Guide to Support Vector Classification.

You can save SVM models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply sla_model`).

You cannot inspect the model learned by SVM with the `summary` command.

BernoulliNB

The BernoulliNB algorithm uses scikit-learn's BernoulliNB estimator (an implementation of the Naive Bayes classification algorithm) to fit a model to predict the value of categorical fields, where explanatory variables are assumed to be binary-valued.

Syntax

```   fit BernoulliNB <field_to_predict> from <explanatory_fields> [into <model name>]
[alpha=<float>] [binarize=<float>] [fit_prior=<true|false>]
```

Example

```   ... | fit BernoulliNB type from * into TESTMODEL_BernoulliNB alpha=0.5 binarize=0 fit_prior=f
```

The `alpha` parameter controls Laplace/Lidstone smoothing (the default value is 1.0). The `binarize` parameter is a threshold that can be used for converting numeric field values to the binary values expected by BernoulliNB (the default value is 0). For instance, if `binarize=0` is specified (the default), values > 0 are assumed to be 1, and values <= 0 are assumed to be 0. The `fit_prior` Boolean parameter specifies whether to learn class prior probabilities (the default value is "true"). If `fit_prior=f` is specified, all classes are assumed to have the same prior probability.
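
The thresholding that `binarize` performs can be sketched in a few lines of Python. This is an illustration of the rule stated above (values greater than the threshold become 1, all others become 0), not the toolkit's code; the function name and sample values are hypothetical.

```python
def binarize(values, threshold=0.0):
    """Map each numeric value to 1 if it is strictly above threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

print(binarize([-2.0, 0.0, 0.3, 5.0]))       # [0, 0, 1, 1]
print(binarize([-2.0, 0.0, 0.3, 5.0], 0.5))  # [0, 0, 0, 1]
```

Raising the threshold moves borderline values such as 0.3 from 1 to 0, which changes the binary features that the classifier sees.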

You can save BernoulliNB models using the `into` keyword and apply the saved model later to new data using the `apply` command (for example, `... | apply TESTMODEL_BernoulliNB`).

You cannot inspect the model learned by BernoulliNB with the `summary` command.

GaussianNB

The GaussianNB algorithm uses scikit-learn's GaussianNB estimator (an implementation of the Gaussian Naive Bayes classification algorithm) to fit a model to predict the value of categorical fields, where the likelihood of explanatory variables is assumed to be Gaussian.

Syntax

```   fit GaussianNB <field_to_predict> from <explanatory_fields> [into <model name>]
```

Example

```   ... | fit GaussianNB species from * into TESTMODEL_GaussianNB
```

The GaussianNB algorithm does not require any input parameters. You can save GaussianNB models using the `into` keyword and apply the saved model later to new data using the `apply` command (for example, `... | apply TESTMODEL_GaussianNB`).

You cannot inspect the model learned by GaussianNB with the `summary` command.

Regressors

Regressor algorithms predict the value of a numeric field.

DecisionTreeRegressor

The DecisionTreeRegressor algorithm uses scikit-learn's DecisionTreeRegressor estimator to fit a model to predict the value of numeric fields.

Syntax

```   fit DecisionTreeRegressor <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<N>] [max_features=<str>] [min_samples_split=<N>] [max_leaf_nodes=<N>] [splitter=<best|random>]
```

Example

```   ... | fit DecisionTreeRegressor temperature from date_month date_hour into temperature_model | ...
```

For descriptions of the `max_depth`, `max_features`, `min_samples_split`, `max_leaf_nodes`, and `splitter` parameters, see the scikit-learn documentation.

You can save DecisionTreeRegressor models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply model_DTR`).

You can inspect the decision tree learned by DecisionTreeRegressor with the `summary` command (for example, `| summary model_DTR`). Furthermore, you can get a JSON representation of the tree by giving `json=t` as an argument to the `summary` command (for example, `| summary model_DTR json=t`). To specify the maximum depth of the tree to summarize, use the `limit` argument (for example, `| summary model_DTR limit=10`).

KernelRidge

The KernelRidge algorithm uses scikit-learn's KernelRidge algorithm to fit a model to predict numeric fields. This algorithm uses the radial basis function (rbf) kernel by default.

Syntax

```   fit KernelRidge <field_to_predict> from <explanatory_fields> [into <model_name>] [gamma=<number>]
```

Example

```   ... | fit KernelRidge temperature from date_month date_hour into temperature_model | ...
```

The `gamma` parameter controls the width of the rbf kernel (the default value is 1/number of fields). For details, see the scikit-learn documentation.

You can save KernelRidge models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply temperature_model`).

You cannot inspect the model learned by KernelRidge with the `summary` command.

LinearRegression

The LinearRegression algorithm uses scikit-learn's LinearRegression estimator to fit a model to predict the value of numeric fields.

Syntax

```   fit LinearRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>]
```

Example

```   ... | fit LinearRegression temperature from date_month date_hour into temperature_model | ...
```

The `fit_intercept` parameter specifies whether the model should include an implicit intercept term (the default value is "true").

You can save LinearRegression models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply temperature_model`).

You can inspect the coefficients learned by LinearRegression with the `summary` command (for example, `| summary temperature_model`).

RandomForestRegressor

The RandomForestRegressor algorithm uses scikit-learn's RandomForestRegressor estimator to fit a model to predict the value of numeric fields.

Syntax

```   fit RandomForestRegressor <field_to_predict> from <explanatory_fields>
[into <model name>] [n_estimators=<N>] [max_depth=<N>]
[max_features=<str>] [min_samples_split=<N>]
```

Example

```   ... | fit RandomForestRegressor temperature from date_month date_hour into temperature_model | ...
```

For descriptions of the `n_estimators`, `max_depth`, `max_features`, and `min_samples_split` parameters, see the scikit-learn documentation.

You can save RandomForestRegressor models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply temperature_model`).

You cannot inspect the model learned by RandomForestRegressor with the `summary` command.

Lasso

The Lasso algorithm uses scikit-learn's Lasso estimator to fit a model to predict the value of numeric fields. Lasso is like LinearRegression, but it uses L1 regularization to learn a linear model with fewer and smaller coefficients. Consequently, Lasso models are more robust to noise and more resilient against overfitting.

Syntax

```   fit Lasso <field_to_predict> from <explanatory_fields>
[into <model name>] [alpha=<float>]
```

Example

```   ... | fit Lasso temperature from date_month date_hour  | ...
```

The `alpha` parameter controls the degree of L1 regularization. For details, see the scikit-learn documentation.

You can save Lasso models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply temperature_model`).

You can inspect the coefficients learned by Lasso with the `summary` command (for example, `| summary temperature_model`).

ElasticNet

The ElasticNet algorithm uses scikit-learn's ElasticNet estimator to fit a model to predict the value of numeric fields. ElasticNet is a linear regression model that includes both L1 and L2 regularization (it is a generalization of Lasso and Ridge).

Syntax

```   fit ElasticNet <field_to_predict> from <explanatory_fields>
[into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>]
[alpha=<N>] [l1_ratio=<N>]
```

Example

```   ... | fit ElasticNet temperature from date_month date_hour normalize=true alpha=0.5 | ...
```

For descriptions of the `fit_intercept`, `normalize`, `alpha`, and `l1_ratio` parameters, see the scikit-learn documentation.
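
As a rough illustration of how `alpha` and `l1_ratio` interact, the following Python sketch computes the regularization penalty that scikit-learn documents for ElasticNet: `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function name, default arguments, and coefficient values are hypothetical and chosen only for the example.

```python
def elastic_net_penalty(coefficients, alpha=1.0, l1_ratio=0.5):
    """Penalty added to the squared-error loss for a given coefficient vector."""
    l1 = sum(abs(w) for w in coefficients)          # L1 norm (Lasso part)
    l2_squared = sum(w * w for w in coefficients)   # squared L2 norm (Ridge part)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

w = [3.0, -4.0]
print(elastic_net_penalty(w, alpha=0.5, l1_ratio=1.0))  # pure L1 penalty: 3.5
print(elastic_net_penalty(w, alpha=0.5, l1_ratio=0.0))  # pure L2 penalty: 6.25
print(elastic_net_penalty(w, alpha=0.5, l1_ratio=0.5))  # mixture: 4.875
```

Setting `l1_ratio` to 1 leaves only the L1 term, as in Lasso, and setting it to 0 leaves only the L2 term, as in Ridge; `alpha` scales the whole penalty.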

You can save ElasticNet models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply temperature_model`).

You can inspect the coefficients learned by ElasticNet with the `summary` command (for example, `| summary temperature_model`).

Ridge

The Ridge algorithm uses scikit-learn's Ridge estimator to fit a model to predict the value of numeric fields. Ridge is like LinearRegression, but it uses L2 regularization to learn a linear model with smaller coefficients, making the algorithm more robust to collinearity.

Syntax

```   fit Ridge <field_to_predict> from <explanatory_fields>
[into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>]
[alpha=<N>]
```

Example

```   ... | fit Ridge temperature from date_month date_hour normalize=true alpha=0.5 | ...
```

The `alpha` parameter specifies the degree of regularization (the default value is 1.0). For descriptions of the `fit_intercept`, `normalize`, and `alpha` parameters, see the scikit-learn documentation.
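
The shrinking effect of `alpha` can be seen in the simplest possible case. The Python sketch below assumes a single explanatory field and no intercept, and minimizes the sum of squared residuals plus `alpha` times the squared coefficient, which has the closed-form solution sum(x*y) / (sum(x*x) + alpha). The function name and data are hypothetical; this is not how the toolkit fits models.

```python
def ridge_slope(x, y, alpha=1.0):
    """Slope w minimizing sum((y_i - w * x_i)^2) + alpha * w^2 (no intercept)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    return sum_xy / (sum_xx + alpha)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]                   # exactly y = 2 * x
print(ridge_slope(x, y, alpha=0.0))   # 2.0, the unregularized slope
print(ridge_slope(x, y, alpha=14.0))  # 1.0, shrunk toward zero
```

With `alpha=0` the sketch recovers the ordinary least-squares slope; increasing `alpha` pulls the coefficient toward zero.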

You can save Ridge models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply temperature_model`).

You can inspect the coefficients learned by Ridge with the `summary` command (for example, `| summary temperature_model`).

Feature Extraction

Feature extraction algorithms transform fields for better prediction accuracy.

FieldSelector

The FieldSelector algorithm uses scikit-learn's GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.

Syntax

```   fit FieldSelector <field_to_predict> from <explanatory_fields>
[into <model name>] [type=<'categorical', 'numerical'>]
[mode=<'k_best', 'fpr', 'fdr', 'fwe', 'percentile'>] [param=<N>]
```

Example

```   ... | fit FieldSelector type=categorical SLA_violation from * into sla_model | ...
```

The `type` parameter specifies if the field to predict is categorical or numerical. For descriptions of the `mode` and `param` parameters, see the scikit-learn documentation.

You can save FieldSelector models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply sla_model`).

You cannot inspect the model learned by FieldSelector with the `summary` command.

TFIDF

The TFIDF algorithm uses scikit-learn's TfidfVectorizer to convert raw text data into a matrix, making it possible to use other machine learning estimators on the data.

Syntax

```   fit TFIDF <field_to_convert> [into <model name>] [max_features=<N>]
[max_df=<N>] [min_df=<N>] [ngram_range=<str>]
[analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=<None|english>]
```

Example

```   ... | fit TFIDF Reviews into user_feedback_model max_df=0.6 min_df=0.2 | ...
```

For descriptions of the `max_features`, `max_df`, `min_df`, `ngram_range`, `analyzer`, `norm`, and `token_pattern` parameters, see the scikit-learn documentation.

To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set `stop_words` to `english`. For other languages (for example, machine language) you can ignore the common words by setting `max_df` to a value greater than or equal to 0.7 and less than 1.0.
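
The effect of `max_df` and `min_df` can be sketched as a document-frequency filter: a term is kept only if the fraction of documents containing it lies between the two thresholds. The Python example below is a simplified illustration that splits text on whitespace instead of using a token pattern; the function name and sample documents are hypothetical.

```python
def filter_by_document_frequency(documents, max_df=1.0, min_df=0.0):
    """Return sorted terms whose document-frequency fraction is in [min_df, max_df]."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    kept = []
    for term in sorted(vocabulary):
        # Fraction of documents that contain the term at least once.
        df = sum(term in tokens for tokens in tokenized) / n_docs
        if min_df <= df <= max_df:
            kept.append(term)
    return kept

docs = ["the disk is full", "the cpu is hot", "the fan failed"]
print(filter_by_document_frequency(docs, max_df=0.7))
# ['cpu', 'disk', 'failed', 'fan', 'full', 'hot', 'is']
```

Here "the" appears in every document, so a `max_df` of 0.7 removes it without needing a stop-word list.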

You can save TFIDF models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply user_feedback_model`).

You cannot inspect the model learned by TFIDF with the `summary` command.

PCA

The PCA algorithm uses scikit-learn's PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data.

Syntax

```   fit PCA <fields> [into <model name>] [k=<N>]
```

Example

```   ... | fit PCA * k=3 | ...
```

The `k` parameter specifies the number of features to be extracted from the data.

You can save PCA models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply user_feedback_model`).

You cannot inspect the model learned by PCA with the `summary` command.

KernelPCA

The KernelPCA algorithm uses scikit-learn's KernelPCA to reduce the number of fields by extracting uncorrelated new features out of data. The difference between KernelPCA and PCA is the use of kernels in the former, which helps with finding nonlinear dependencies among the fields. Currently, KernelPCA only supports the Radial Basis Function (rbf) kernel.

Syntax

```   fit KernelPCA <fields> [into <model name>] [k=<N>] [gamma=<N>] [tolerance=<N>] [max_iteration=<N>]
```

Example

```   ... | fit KernelPCA * k=3 gamma=0.001 | ...
```

The `k` parameter specifies the number of features to be extracted from the data. The other parameters are for fine-tuning the kernel. For descriptions of the `gamma`, `tolerance`, and `max_iteration` parameters, see the scikit-learn documentation.

You can save KernelPCA models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply user_feedback_model`).

You cannot inspect the model learned by KernelPCA with the `summary` command.

Anomaly Detectors

Anomaly detection algorithms detect anomalies and outliers in numerical or categorical fields.

OneClassSVM

The OneClassSVM algorithm uses scikit-learn's OneClassSVM (an unsupervised outlier detection method) to fit a model from a set of features (that is, fields) for detecting anomalies and outliers. The features are expected to contain numerical values.

Syntax

```   fit OneClassSVM <fields> [into <model name>]
[kernel=<str>] [nu=<float>] [coef0=<float>]
[gamma=<float>] [tol=<float>] [degree=<number>] [shrinking=<true|false>]
```

Example

```   ... | fit OneClassSVM * kernel="poly" nu=0.5 coef0=0.5 gamma=0.5 tol=1 degree=3 shrinking=f into TESTMODEL_OneClassSVM
```

The `kernel` parameter specifies the kernel type to use in the algorithm ("linear", "rbf", "poly", or "sigmoid"; the default value is "rbf"). The `nu` parameter sets an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the default value is 0.5). The `degree` parameter is used only by the polynomial kernel and is ignored by all other kernels (the default value is 3). The `gamma` parameter is the kernel coefficient, which specifies how much influence a single data instance has (the default value is 1 / <number of features>). The `coef0` parameter is the independent term in the kernel function and is significant only for the polynomial and sigmoid kernels. The `tol` parameter is the tolerance for the stopping criterion. The `shrinking` parameter specifies whether to use the shrinking heuristic. For details, see Kernel functions.

You can save OneClassSVM models using the `into` keyword and apply the saved model later to new data using the `apply` command (for example, `... | apply TESTMODEL_OneClassSVM`). After running the `fit` or `apply` command, a new field named "isNormal" is generated that defines whether a particular record (row) is normal ("isNormal"=1) or anomalous ("isNormal"=-1).

You cannot inspect the model learned by OneClassSVM with the `summary` command.

Clustering Algorithms

Clustering algorithms separate results into clusters.

KMeans

The KMeans algorithm uses scikit-learn's KMeans clustering algorithm to divide a result set into "k" distinct clusters. The cluster for each event is set in a new field named "cluster".

Syntax

```   fit KMeans <fields> [into <model name>] [k=<N>] [as <output_field>]
```

Example

```   ... | fit KMeans * k=3 | stats count by cluster
```

The `k` parameter specifies the number of clusters to divide the data into. By default, the cluster label is assigned to a field named "cluster", but you may change that behavior by using the `as` keyword to specify a different field name.

You can save KMeans models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply cluster_model`).

You cannot inspect the model learned by KMeans with the `summary` command.

DBSCAN

The DBSCAN algorithm uses scikit-learn's DBSCAN clusterer to divide a result set into distinct clusters. The cluster for each event is set in a new field named "cluster". DBSCAN is distinct from KMeans in that it clusters results based on local density, and uncovers a variable number of clusters, whereas KMeans finds a precise number of clusters (for example, k=5 finds 5 clusters).

Syntax

```   fit DBSCAN <fields> [eps=<number>] [as <output_field>]
```

Example

```   ... | fit DBSCAN * | stats count by cluster
```

The `eps` parameter specifies the maximum distance between two samples for them to be considered in the same cluster. By default, the cluster label is assigned to a field named "cluster", but you may change that behavior by using the `as` keyword to specify a different field name.

You cannot save DBSCAN models using the `into` keyword. If you want to be able to predict cluster assignments for future data, you can combine the DBSCAN algorithm with any classifier algorithm (for example, first cluster the data using DBSCAN, then fit a classifier to predict the cluster using RandomForestClassifier).

Birch

The Birch algorithm uses scikit-learn's Birch clustering algorithm to divide a result set into a set of distinct clusters. The cluster for each event is set in a new field named "cluster".

Syntax

```   fit Birch <fields> [into <model name>] [k=<N>] [as <output_field>]
```

Example

```   ... | fit Birch * k=3 | stats count by cluster
```

The `k` parameter specifies the number of clusters to divide the data into after the final clustering step, which treats the subclusters from the leaves of the CF tree as new samples. By default, the cluster label is assigned to a field named "cluster", but you can change that behavior by using the `as` keyword to specify a different field name.

You can save Birch models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply Birch_model`).

You cannot inspect the model learned by Birch with the `summary` command.

SpectralClustering

The SpectralClustering algorithm uses scikit-learn's SpectralClustering algorithm to divide a result set into a set of distinct clusters. SpectralClustering first transforms the input data using the Radial Basis Function (rbf) kernel, and then performs KMeans clustering on the result. Consequently, SpectralClustering can learn clusters with a non-convex shape. The cluster for each event is set in a new field named "cluster".

Syntax

```   fit SpectralClustering <fields> [k=<N>] [gamma=<float>] [as <output_field>]
```

Example

```   ... | fit SpectralClustering * k=3 | stats count by cluster
```

The `k` parameter specifies the number of clusters to divide the data into after the kernel step. The `gamma` parameter controls the width of the rbf kernel. By default, the cluster label is assigned to a field named "cluster", but you can change that behavior by using the `as` keyword to specify a different field name.

You cannot save SpectralClustering models using the `into` keyword. If you want to be able to predict cluster assignments for future data, you can combine the SpectralClustering algorithm with any classifier algorithm (for example, first cluster the data using SpectralClustering, then fit a classifier to predict the cluster using RandomForestClassifier).

Preprocessing

Preprocessing algorithms are used for preparing data.

StandardScaler

The StandardScaler algorithm uses scikit-learn's StandardScaler algorithm to standardize the data fields by scaling their mean and standard deviation to 0 and 1, respectively. This preprocessing step helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms, and is practically required for some algorithms, such as SVM.

Syntax

```   fit StandardScaler <fields> [into <model name>] [with_mean=<true|false>] [with_std=<true|false>]
```

Example

```   ... | fit StandardScaler *  | ...
```

The `with_mean` and `with_std` parameters specify if the fields should be standardized with respect to their mean and standard deviation, respectively.
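
The transformation can be sketched for a single field in plain Python. The example assumes the population standard deviation (dividing by the number of values) and leaves constant fields unscaled to avoid dividing by zero; both choices, along with the function name and sample data, are assumptions of this sketch rather than a statement of the toolkit's internals.

```python
import math

def standardize(values, with_mean=True, with_std=True):
    """Center and/or scale one field so it has mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation, always computed around the true mean.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    center = mean if with_mean else 0.0
    scale = std if with_std and std > 0.0 else 1.0
    return [(v - center) / scale for v in values]

temps = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
print(standardize(temps))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(standardize(temps, with_mean=False))
# [1.0, 2.0, 2.0, 2.0, 2.5, 2.5, 3.5, 4.5]
```

With `with_mean=False` the values are only divided by the standard deviation, and with `with_std=False` they are only centered.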

You can save StandardScaler models using the `into` keyword and apply them to new data later using the `apply` command (for example, `... | apply scaling_model`).

You can inspect the statistics extracted by StandardScaler with the `summary` command (for example, `| summary scaling_model`).

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 1.2.0, 1.3.0