
# Algorithms

The Splunk Machine Learning Toolkit supports the algorithms listed here. In addition to the examples included in the Splunk Machine Learning Toolkit, you can find more examples of these algorithms on the scikit-learn website.

The following algorithms use the `fit` and `apply` commands within the Splunk Machine Learning Toolkit. For information on the steps taken by these commands, see the Understanding the fit and apply commands document.

For information on using the `score` command, see the score command documentation.

### ML-SPL Quick Reference Guide

Download the Machine Learning Toolkit Quick Reference Guide for a handy cheat sheet of ML-SPL commands and machine learning algorithms used in the Splunk Machine Learning Toolkit. This document is also available in Japanese.

### ML-SPL Performance App

Download the ML-SPL Performance App for the Machine Learning Toolkit to use performance results for guidance and benchmarking purposes in your own environment.

### Extend the algorithms you can use for your models

The algorithms listed here and in the ML-SPL Quick Reference Guide are available natively in the Splunk Machine Learning Toolkit. You can also base your algorithm on over 300 open source Python algorithms from the scikit-learn, pandas, statsmodels, NumPy, and SciPy libraries, available through the Python for Scientific Computing add-on on Splunkbase.

For information on how to import an algorithm from the Python for Scientific Computing add-on into the Splunk Machine Learning Toolkit, see the ML-SPL API Guide.

## Anomaly Detection

Anomaly detection algorithms detect anomalies and outliers in numerical or categorical fields.

### LocalOutlierFactor

The LocalOutlierFactor algorithm uses the scikit-learn Local Outlier Factor (LOF) to measure the local deviation of density of a given sample with respect to its neighbors. LocalOutlierFactor is an unsupervised outlier detection method. The anomaly score depends on how isolated the object is with respect to its neighbors.

**Syntax**

fit LocalOutlierFactor <fields> [n_neighbors=<int>] [leaf_size=<int>] [p=<int>] [contamination=<float>] [metric=<str>] [algorithm=<str>] [anomaly_score=<true|false>]

- The `anomaly_score` parameter default is `True`. Disable this default by adding the `False` keyword to the command.
- The `n_neighbors` parameter default is 20.
- The `leaf_size` parameter default is 30.
- The `p` parameter is limited to `p >= 1`.
- The `contamination` parameter must be within the range of 0.0 (not included) to 0.5 (included).
- The `contamination` parameter default is 0.1.
- Valid algorithms include: `brute`, `kd_tree`, `ball_tree`, and `auto`.
- The `algorithm` default is `auto`.
- The `metric` default is `minkowski`.
- Valid metrics for `kd_tree` include: `cityblock`, `euclidean`, `l1`, `l2`, `manhattan`, `chebyshev`, `minkowski`.
- Valid metrics for `ball_tree`: `cityblock`, `euclidean`, `l1`, `l2`, `manhattan`, `chebyshev`, `minkowski`, `braycurtis`, `canberra`, `dice`, `hamming`, `jaccard`, `kulsinski`, `matching`, `rogerstanimoto`, `russellrao`, `sokalmichener`, `sokalsneath`.
- Valid metrics for `brute`: `cityblock`, `euclidean`, `l1`, `l2`, `manhattan`, `chebyshev`, `minkowski`, `braycurtis`, `canberra`, `dice`, `hamming`, `jaccard`, `kulsinski`, `matching`, `rogerstanimoto`, `russellrao`, `sokalmichener`, `sokalsneath`, `cosine`, `correlation`, `sqeuclidean`, `yule`.
- Output: A list of labels titled `is_outlier`, assigned `1` for outliers and `-1` for inliers.

**Example**

| inputlookup iris.csv | fit LocalOutlierFactor petal_length petal_width n_neighbors=10 algorithm=kd_tree metric=minkowski p=1 contamination=0.14 leaf_size=10

For descriptions of the `n_neighbors`, `leaf_size`, and other parameters, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html
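As a rough illustration of how `p` interacts with `metric=minkowski`, the following Python sketch computes the Minkowski distance by hand. It assumes the scikit-learn convention that `p=1` corresponds to the Manhattan distance and `p=2` to the Euclidean distance. The function name `minkowski_distance` and the sample points are hypothetical and are not part of the toolkit.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length numeric sequences."""
    if p < 1:
        # Mirrors the documented restriction p >= 1.
        raise ValueError("p must be >= 1")
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Hypothetical sample points (for example, petal_length and petal_width).
point_a = (1.0, 2.0)
point_b = (4.0, 6.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
```

With `p=1`, as in the example search above, neighbors are ranked by the sum of absolute differences per field rather than by straight-line distance.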

You cannot save LocalOutlierFactor models using the `into` keyword. This algorithm does not support saving models and does not include the `predict` method.

Using the LocalOutlierFactor algorithm requires running version 1.3 of Python for Scientific Computing.

### OneClassSVM

The OneClassSVM algorithm uses the scikit-learn OneClassSVM (an unsupervised outlier detection method) to fit a model from the set of features (i.e. fields) for detecting anomalies/outliers, where features are expected to contain numerical values.

**Syntax**

fit OneClassSVM <fields> [into <model name>] [kernel=<str>] [nu=<float>] [coef0=<float>] [gamma=<float>] [tol=<float>] [degree=<int>] [shrinking=<true|false>]

- The `kernel` parameter specifies the kernel type (`linear`, `rbf`, `poly`, `sigmoid`) to use in the algorithm. The default value is `rbf`.
- The `nu` parameter specifies an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. The default value is 0.5.
- The `degree` parameter is ignored by all kernels except the polynomial kernel. The default value is 3.
- The `gamma` parameter is the kernel coefficient that specifies how much influence a single data instance has. The default value is 1/numberOfFeatures.
- The `coef0` parameter is the independent term in the kernel function. It is only significant for the polynomial and sigmoid kernels.
- The `tol` parameter is the tolerance for the stopping criterion.
- The `shrinking` parameter determines whether to use the shrinking heuristic.

For details, see http://scikit-learn.org/stable/modules/svm.html#kernel-functions.

**Example**

... | fit OneClassSVM * kernel="poly" nu=0.5 coef0=0.5 gamma=0.5 tol=1 degree=3 shrinking=f into TESTMODEL_OneClassSVM

You can save OneClassSVM models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example.

**Example**

... | apply TESTMODEL_OneClassSVM

After running the `fit` or `apply` command, a new field named `isNormal` is generated. This field defines whether a particular record (row) is normal (`isNormal=1`) or anomalous (`isNormal=-1`).

You cannot inspect the model learned by OneClassSVM with the `summary` command.

## Classifiers

Classifier algorithms predict the value of a categorical field.

The `kfold` cross-validation command is available for all Classifier algorithms.

### BernoulliNB

The BernoulliNB algorithm uses the scikit-learn BernoulliNB estimator (an implementation of the Naive Bayes classification algorithm) to fit a model to predict the value of categorical fields where explanatory variables are assumed to be binary-valued. This algorithm supports incremental fit.

**Syntax**

fit BernoulliNB <field_to_predict> from <explanatory_fields> [into <model name>] [alpha=<float>] [binarize=<float>] [fit_prior=<true|false>] [partial_fit=<true|false>]

- The `alpha` parameter controls Laplace/Lidstone smoothing. The default value is 1.0.
- The `binarize` parameter is a threshold that can be used for converting numeric field values to the binary values expected by BernoulliNB. The default value is 0. For instance, with the default `binarize=0`, values > 0 are assumed to be 1, and values <= 0 are assumed to be 0.
- The `fit_prior` Boolean parameter specifies whether to learn class prior probabilities. The default value is `true`. If `fit_prior=f` is specified, classes are assumed to have uniform popularity.
- The `partial_fit` parameter controls whether an existing model should be incrementally updated or not. The default value is `false`, meaning it will not be incrementally updated. Choosing `partial_fit=true` allows you to update an existing model using only new data without having to retrain it on the full training data set.
  - Using `partial_fit=true` on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If `partial_fit=false` or `partial_fit` is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.
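The thresholding rule described for `binarize` can be sketched in a few lines of Python. This is only an illustration of the rule as stated above (values greater than the threshold become 1, all others become 0), not the toolkit's implementation. The function name `binarize_values` and the sample values are hypothetical.

```python
def binarize_values(values, threshold=0.0):
    """Map numeric values to 0/1: 1 if strictly above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in values]


# Hypothetical numeric field values.
field_values = [-2.0, 0.0, 0.5, 3.0]

default_result = binarize_values(field_values)                # threshold 0
shifted_result = binarize_values(field_values, threshold=1.0)
```

Note that a value exactly equal to the threshold maps to 0, which matches the `<=` rule in the parameter description.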

**Example 1**

... | fit BernoulliNB type from * into TESTMODEL_BernoulliNB alpha=0.5 binarize=0 fit_prior=f

**Example 2**

In the following example, if `My_Incremental_Model` does not exist, the command saves the model data under the model name `My_Incremental_Model`. If `My_Incremental_Model` exists and was trained using BernoulliNB, the command updates the existing model with the new input. If `My_Incremental_Model` exists but was not trained by BernoulliNB, an error message is thrown.

| inputlookup iris.csv | fit BernoulliNB species from * partial_fit=true into My_Incremental_Model

You can save BernoulliNB models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example.

**Example**

... | apply TESTMODEL_BernoulliNB

You can inspect the model learned by BernoulliNB with the `summary` command, which also shows the class and log probability information calculated from the dataset. See the following example:

**Example**

| summary My_Incremental_Model

### DecisionTreeClassifier

The DecisionTreeClassifier algorithm uses the scikit-learn DecisionTreeClassifier estimator to fit a model to predict the value of categorical fields.

**Syntax**

fit DecisionTreeClassifier <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>] [criterion=<gini|entropy>] [splitter=<best|random>] [random_state=<int>]

**Example**

... | fit DecisionTreeClassifier SLA_violation from * into sla_model | ...

For descriptions of the `max_depth`, `max_features`, `min_samples_split`, `max_leaf_nodes`, `criterion`, `random_state`, and `splitter` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.

You can save DecisionTreeClassifier models by using the `into` keyword and apply them to new data later by using the `apply` command. See the following example.

**Example**

... | apply model_DTC

You can inspect the decision tree learned by DecisionTreeClassifier with the `summary` command. See the following example.

**Example**

| summary model_DTC

You can also get a JSON representation of the tree by giving `json=t` as an argument to the `summary` command. See the following example.

| summary model_DTC json=t

To specify the maximum depth of the tree to summarize, use the `limit` argument. The default value for the `limit` argument is 5. See the following example.

| summary model_DTC limit=10

### GaussianNB

The GaussianNB algorithm uses the scikit-learn GaussianNB estimator (an implementation of Gaussian Naive Bayes classification algorithm) to fit a model to predict the value of categorical fields, where the likelihood of explanatory variables is assumed to be Gaussian. This algorithm supports incremental fit.

**Syntax**

fit GaussianNB <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>]

**Example**

... | fit GaussianNB species from * into TESTMODEL_GaussianNB

The `partial_fit` parameter controls whether an existing model should be incrementally updated or not (default is `false`). This allows you to update an existing model using only new data without having to retrain it on the full training data set.

**Example**

The following example uses the `partial_fit` parameter. In this example, if `My_Incremental_Model` does not exist, the command saves the model data under the model name `My_Incremental_Model`. If `My_Incremental_Model` exists and was trained using GaussianNB, the command updates the existing model with the new input. If `My_Incremental_Model` exists but was not trained by GaussianNB, an error message is thrown.

| inputlookup iris.csv | fit GaussianNB species from * partial_fit=true into My_Incremental_Model

If `partial_fit=false` or `partial_fit` is not specified (default is `false`), the model specified is created and replaces the pre-trained model if one exists.
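The `partial_fit` rules described in this section can be summarized as a small decision procedure. The Python sketch below models only that decision logic with an in-memory dictionary. The `fit_model` function, the dictionary layout, and the `events_seen` counter are hypothetical illustrations and do not reflect how the toolkit stores or trains models.

```python
def fit_model(store, name, algorithm, n_new_events, partial_fit=False):
    """Decide whether a saved model is created, replaced, or updated.

    `store` is a plain dict standing in for saved models (an assumption).
    """
    existing = store.get(name)
    if not partial_fit or existing is None:
        # Without partial_fit, or when no model exists yet, a fresh model
        # is created and replaces any pre-trained model of the same name.
        store[name] = {"algorithm": algorithm, "events_seen": n_new_events}
        return "created"
    if existing["algorithm"] != algorithm:
        # The existing model was trained by a different algorithm.
        raise ValueError(
            f"{name} was trained with {existing['algorithm']}, not {algorithm}"
        )
    # Same algorithm: update the existing model using only the new data.
    existing["events_seen"] += n_new_events
    return "updated"


models = {}
first = fit_model(models, "My_Incremental_Model", "GaussianNB", 150, partial_fit=True)
second = fit_model(models, "My_Incremental_Model", "GaussianNB", 50, partial_fit=True)
```

In this sketch the first call creates the model because none exists, and the second call updates it in place.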

You can save GaussianNB models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example:

**Example**

... | apply TESTMODEL_GaussianNB

You can inspect models learned by GaussianNB with the `summary` command. See the following example.

**Example**

| summary My_Incremental_Model

### GradientBoostingClassifier

This algorithm uses the GradientBoostingClassifier from scikit-learn to build a classification model by fitting regression trees on the negative gradient of a deviance loss function. For documentation on the parameters, see GradientBoostingClassifier from scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.

**Syntax**

fit GradientBoostingClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [loss=<deviance | exponential>] [max_features=<str>] [learning_rate=<float>] [min_weight_fraction_leaf=<float>] [n_estimators=<int>] [max_depth=<int>] [min_samples_split=<int>] [min_samples_leaf=<int>] [max_leaf_nodes=<int>] [random_state=<int>]

**Example**

The following example uses the GradientBoostingClassifier algorithm to fit a model and save it as `TESTMODEL_GradientBoostingClassifier`.

... | fit GradientBoostingClassifier target from * into TESTMODEL_GradientBoostingClassifier

Use the `apply` command to apply the trained model to new data.

... | apply TESTMODEL_GradientBoostingClassifier

To inspect the features learned by GradientBoostingClassifier, use the `summary` command.

| summary TESTMODEL_GradientBoostingClassifier

### LogisticRegression

The LogisticRegression algorithm uses the scikit-learn LogisticRegression estimator to fit a model to predict the value of categorical fields.

**Syntax**

fit LogisticRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [probabilities=<true|false>]

**Example**

... | fit LogisticRegression SLA_violation from IO_wait_time into sla_model | ...

The `fit_intercept` parameter specifies whether the model includes an implicit intercept term (the default value is `true`).

The `probabilities` parameter specifies whether probabilities for each possible field value should be returned alongside the predicted value (the default value is `false`).

You can save LogisticRegression models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example.

**Example**

... | apply sla_model

You can inspect the coefficients learned by LogisticRegression with the `summary` command (for example, `| summary sla_model`).

### MLPClassifier

The MLPClassifier algorithm uses the scikit-learn Multi-layer Perceptron estimator for classification. MLPClassifier uses a feedforward artificial neural network model that trains using backpropagation. This algorithm supports incremental fit.

**Syntax**

fit MLPClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [batch_size=<int>] [max_iter=<int>] [random_state=<int>] [hidden_layer_sizes=<int>-<int>-<int>] [activation=<str>] [solver=<str>] [learning_rate=<str>] [tol=<float>] [momentum=<float>]

**Example**

| inputlookup diabetes.csv | fit MLPClassifier response from * into MLP_example_model hidden_layer_sizes='100-100-80' | ...

**Example**

The following example uses the `partial_fit` parameter:

| inputlookup iris.csv | fit MLPClassifier species from * partial_fit=true into My_Example_Model

The `partial_fit` parameter controls whether an existing model should be incrementally updated or not (default is `false`). This allows you to update an existing model using only new data without having to retrain it on the full training data set.

- If `My_Example_Model` does not exist, the model is saved to it.
- If `My_Example_Model` exists and was trained using MLPClassifier, the command updates the existing model with the new input.
- If `My_Example_Model` exists but was not trained using MLPClassifier, an error message is thrown.

For descriptions of the `batch_size`, `random_state`, and `max_iter` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

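The format of the `hidden_layer_sizes` parameter varies based on the number of hidden layers: specify one `int` per hidden layer, separated by hyphens (for example, `hidden_layer_sizes='100-100-80'` for three hidden layers).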

You can save MLPClassifier models by using the `into` keyword and apply them to new data later by using the `apply` command.

You can inspect models learned by MLPClassifier with the `summary` command. See the following example.

**Example**

| summary My_Example_Model

Using the MLPClassifier algorithm requires running version 1.3 of Python for Scientific Computing.

### RandomForestClassifier

The RandomForestClassifier algorithm uses the scikit-learn RandomForestClassifier estimator to fit a model to predict the value of categorical fields.

**Syntax**

fit RandomForestClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [n_estimators=<int>] [max_depth=<int>] [criterion=<gini | entropy>] [random_state=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]

**Example**

... | fit RandomForestClassifier SLA_violation from * into sla_model | ...

For descriptions of the `n_estimators`, `max_depth`, `criterion`, `random_state`, `max_features`, `min_samples_split`, and `max_leaf_nodes` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

You can save RandomForestClassifier models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example.

**Example**

... | apply sla_model

You can list the features that were used to fit the model, as well as their relative importance or influence, with the `summary` command (for example, `| summary sla_model`).

### SGDClassifier

The SGDClassifier algorithm uses the scikit-learn SGDClassifier estimator to fit a model to predict the value of categorical fields. This algorithm supports incremental fit.

**Syntax**

fit SGDClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>] [loss=<hinge|log|modified_huber|squared_hinge|perceptron>] [fit_intercept=<true|false>] [random_state=<int>] [n_iter=<int>] [l1_ratio=<float>] [alpha=<float>] [eta0=<float>] [power_t=<float>] [penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>]

**Example**

... | fit SGDClassifier SLA_violation from * into sla_model

The SGDClassifier algorithm supports the following parameters:

- `loss=<hinge|log|modified_huber|squared_hinge|perceptron>`: The loss function to be used.
  - Defaults to `hinge`, which gives a linear SVM.
  - The `log` loss gives logistic regression, a probabilistic classifier.
  - `modified_huber` is another smooth loss that brings tolerance to outliers as well as probability estimates.
  - `squared_hinge` is like hinge but is quadratically penalized.
  - `perceptron` is the linear loss used by the perceptron algorithm.
- `fit_intercept=<true|false>`: Specifies whether the intercept should be estimated or not (default `true`).
- `n_iter=<int>`: The number of passes over the training data (aka epochs) (default 5). The number of iterations is set to 1 if using `partial_fit`.
- `penalty=<l2|l1|elasticnet>`: The penalty (aka regularization term) to be used (default `l2`).
- `learning_rate=<constant|optimal|invscaling>`: The learning rate schedule (default `invscaling`).
  - `constant`: eta = eta0
  - `optimal`: eta = 1.0/(alpha * t)
  - `invscaling`: eta = eta0 / pow(t, power_t)
- `l1_ratio=<float>`: The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1 (default 0.15). `l1_ratio=0` corresponds to L2 penalty, `l1_ratio=1` to L1.
- `alpha=<float>`: Constant that multiplies the regularization term (default 0.0001). Also used to compute the learning rate when `learning_rate` is set to `optimal`.
- `eta0=<float>`: The initial learning rate (default 0.01).
- `power_t=<float>`: The exponent for inverse scaling learning rate (default 0.25).
- `random_state=<int>`: The seed of the pseudo random number generator to use when shuffling the data.
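The three `learning_rate` schedules listed above can be written out as a short Python function to show how `eta0`, `alpha`, and `power_t` enter each formula. This sketch uses the formulas and default values exactly as stated in this section. The function name `sgd_learning_rate` and the step counter `t` starting at 1 are assumptions made for illustration, and the actual scikit-learn implementation may differ in detail.

```python
def sgd_learning_rate(schedule, t, eta0=0.01, alpha=0.0001, power_t=0.25):
    """Return the step size eta at update step t (t >= 1) for a schedule."""
    if t < 1:
        raise ValueError("t must be >= 1")
    if schedule == "constant":
        return eta0                      # eta = eta0
    if schedule == "optimal":
        return 1.0 / (alpha * t)         # eta = 1.0 / (alpha * t)
    if schedule == "invscaling":
        return eta0 / pow(t, power_t)    # eta = eta0 / pow(t, power_t)
    raise ValueError(f"unknown schedule: {schedule}")


# Step sizes at the 16th update for each schedule, using the listed defaults.
etas = {
    name: sgd_learning_rate(name, t=16)
    for name in ("constant", "optimal", "invscaling")
}
```

Under `invscaling` the step size shrinks slowly as `t` grows, whereas `constant` never decays and `optimal` depends on `alpha` rather than `eta0`.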

**Example**

The following example uses the `partial_fit=<true|false>` parameter, which specifies whether an existing model should be incrementally updated or not (default setting is false). If `My_Incremental_Model` does not exist, the command saves the model data under the model name `My_Incremental_Model`. If `My_Incremental_Model` exists and was trained using SGDClassifier, the command updates the existing model with the new input. If `My_Incremental_Model` exists but was not trained by SGDClassifier, an error message is thrown.

| inputlookup iris.csv | fit SGDClassifier species from * partial_fit=true into My_Incremental_Model

Using `partial_fit=true` on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If `partial_fit=false` or `partial_fit` is not specified (default is `false`), the model specified is created and replaces the pre-trained model if one exists.

You can save SGDClassifier models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example:

**Example**

... | apply sla_model

You can inspect the model learned by SGDClassifier with the `summary` command. See the following example:

**Example**

| summary sla_model

### SVM

The SVM algorithm uses the scikit-learn kernel-based SVC estimator to fit a model to predict the value of categorical fields. It uses the radial basis function (rbf) kernel by default.

**Syntax**

fit SVM <field_to_predict> from <explanatory_fields> [into <model name>] [C=<float>] [gamma=<float>]

**Example**

... | fit SVM SLA_violation from * into sla_model | ...

The `gamma` parameter controls the width of the rbf kernel (the default value is 1 / `number of fields`), and the `C` parameter controls the degree of regularization when fitting the model (the default value is 1.0).

For descriptions of the `C` and `gamma` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.
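To make the role of `gamma` more concrete, the following Python sketch evaluates the standard rbf kernel, exp(-gamma * ||x - y||^2), for a pair of points. It assumes the default of 1 / number of fields stated above. The function name `rbf_kernel_value` and the sample points are hypothetical, and this is not the toolkit's or scikit-learn's code.

```python
import math


def rbf_kernel_value(x, y, gamma=None):
    """Similarity of two points under the rbf kernel: exp(-gamma * ||x - y||^2)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of fields")
    if gamma is None:
        gamma = 1.0 / len(x)  # default stated in this section: 1 / number of fields
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical events with two numeric fields each.
event_a = (0.0, 0.0)
event_b = (1.0, 1.0)

default_similarity = rbf_kernel_value(event_a, event_b)            # gamma = 0.5
narrow_similarity = rbf_kernel_value(event_a, event_b, gamma=5.0)  # narrower kernel
```

A larger `gamma` makes the kernel narrower, so the same pair of points is treated as less similar. Because the value depends on raw distances, fields with large numeric ranges can dominate it.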

You can save SVM models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example.

**Example**

... | apply sla_model

You cannot inspect the model learned by SVM with the `summary` command.

Kernel-based methods such as the scikit-learn SVC tend to work best when the data is scaled, for example, using the StandardScaler algorithm: `| fit StandardScaler <fields>`. For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

## Clustering Algorithms

Clustering is the grouping of data points. Results vary depending on the clustering algorithm used, because clustering algorithms differ in how they determine whether data points are similar and should be grouped. For example, the K-means algorithm clusters based on points in space, whereas the DBSCAN algorithm clusters based on local density.

### BIRCH

The BIRCH algorithm uses the scikit-learn BIRCH clustering algorithm to divide data points into a set of distinct clusters. The cluster for each event is set in a new field named `cluster`. This algorithm supports incremental fit.

**Syntax**

fit BIRCH <fields> [into <model name>] [k=<int>] [partial_fit=<true|false>]

**Example**

... | fit BIRCH * k=3 | stats count by cluster

- The `k` parameter specifies the number of clusters to divide the data into after the final clustering step, which treats the sub-clusters from the leaves of the CF tree as new samples.
  - By default, the cluster label field name is `cluster`. Change that behavior by using the `as` keyword to specify a different field name.
- The `partial_fit` parameter controls whether an existing model should be incrementally updated or not (default is `false`).
  - This allows you to update an existing model using only new data, without having to retrain the model on the full training data set.

**Example**

The following example uses the `partial_fit` parameter. If `My_Incremental_Model` does not exist, the command saves the model data under the model name `My_Incremental_Model`. If `My_Incremental_Model` exists and was trained using BIRCH, the command updates the existing model with the new input. If `My_Incremental_Model` exists but was not trained by BIRCH, an error message is thrown.

| inputlookup track_day.csv | fit BIRCH * k=6 partial_fit=true into My_Incremental_Model

Using `partial_fit=true` on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If `partial_fit=false` or `partial_fit` is not specified (default is `false`), the model specified is created and replaces the pre-trained model if one exists.

You can save BIRCH models using the `into` keyword and apply the saved model later to new data using the `apply` command. See the following example:

**Example**

... | apply BIRCH_model

You cannot inspect the model learned by BIRCH with the `summary` command.

### DBSCAN

The DBSCAN algorithm uses the scikit-learn DBSCAN clustering algorithm to divide a result set into distinct clusters. The cluster for each event is set in a new field named "cluster". DBSCAN is distinct from K-Means in that it clusters results based on local density, and uncovers a variable number of clusters, whereas K-Means finds a precise number of clusters (for example, k=5 finds 5 clusters).

**Syntax**

fit DBSCAN <fields> [eps=<number>]

**Example**

... | fit DBSCAN * | stats count by cluster

The `eps` parameter specifies the maximum distance between two samples for them to be considered in the same cluster. By default, the cluster label is assigned to a field named `cluster`, but you can change that behavior by using the `as` keyword to specify a different field name.

You cannot save DBSCAN models using the `into` keyword. To predict cluster assignments for future data, combine the DBSCAN algorithm with any classifier algorithm. For example, first cluster the data using DBSCAN, then fit RandomForestClassifier to predict the cluster.

### K-means

K-means clustering is a type of unsupervised learning. It is a clustering algorithm that groups similar data points, with the number of groups represented by the variable `k`. The K-means algorithm uses the scikit-learn K-means implementation. The cluster for each event is set in a new field named `cluster`. Use the K-means algorithm when you have unlabeled data and have at least approximate knowledge of the total number of groups into which the data can be divided.

Using the K-means algorithm has the following advantages:

- Computationally faster than most other clustering algorithms.
- Simple algorithm to explain and understand.
- Normally produces tighter clusters than hierarchical clustering.

Using the K-means algorithm has the following disadvantages:

- Difficult to determine optimal or true value of `k`. See X-means.
- Sensitive to scaling. See StandardScaler.
- Each clustering may be slightly different, unless you specify the `random_state` parameter.
- Does not work well with clusters of different sizes and density.

**Syntax**

fit KMeans <fields> [into <model name>] [k=<int>] [random_state=<int>]

**Example**

... | fit KMeans * k=3 | stats count by cluster

The `k` parameter specifies the number of clusters to divide the data into. By default, the cluster label is assigned to a field named `cluster`, but you can change that behavior by using the `as` keyword to specify a different field name.
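For readers who want to see what `k` and `random_state` control, here is a minimal pure-Python sketch of the classic K-means procedure (Lloyd's algorithm): pick `k` starting centroids, assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. It is an illustration under simplifying assumptions (random initial centroids drawn from the data, squared Euclidean distance), not the scikit-learn implementation used by the toolkit. The function names and sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, random_state=None, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(random_state)   # seeding makes the run repeatable
    centroids = rng.sample(points, k)   # start from k points drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break                       # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two hypothetical, well-separated groups of events with two numeric fields.
sample_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                 (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cluster_labels, cluster_centroids = kmeans(sample_points, k=2, random_state=0)
```

Because the starting centroids are chosen at random, the numeric label given to each group can differ between runs unless the seed is fixed, which is the behavior that the `random_state` parameter addresses.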

You can save K-means models using the `into` keyword when using the `fit` command.

**Example**

In a separate search, you can apply the model to new data using the `apply` command.

... | apply cluster_model

**Example**

You can inspect the model with the `summary` command.

| summary cluster_model

For a description of the default value of `k`, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

### SpectralClustering

The SpectralClustering algorithm uses the scikit-learn SpectralClustering clustering algorithm to divide a result set into a set of distinct clusters. SpectralClustering first transforms the input data using the Radial Basis Function (rbf) kernel, and then performs K-Means clustering on the result. Consequently, SpectralClustering can learn clusters with a non-convex shape. The cluster for each event is set in a new field named `cluster`.

**Syntax**

fit SpectralClustering <fields> [k=<int>] [gamma=<float>] [random_state=<int>]

**Example**

... | fit SpectralClustering * k=3 | stats count by cluster

The `k` parameter specifies the number of clusters to divide the data into after the kernel step. By default, the cluster label is assigned to a field named `cluster`, but you can change that behavior by using the `as` keyword to specify a different field name.

You cannot save SpectralClustering models using the `into` keyword. To predict cluster assignments for future data, combine the SpectralClustering algorithm with any classifier algorithm. For example, first cluster the data using SpectralClustering, then fit RandomForestClassifier to predict the cluster.

### X-means

Use the X-means algorithm when you have unlabeled data and no prior knowledge of the total number of labels into which that data could be divided. The X-means clustering algorithm is an extended K-means that automatically determines the number of clusters based on Bayesian Information Criterion (BIC) scores. Starting with a single cluster, the X-means algorithm goes into action after each run of K-means, making local decisions about which subset of the current centroids should split themselves in order to fit the data better. The splitting decision is made by computing the BIC. The cluster for each event is set in a new field named `cluster`, and the total number of clusters is set in a new field named `n_clusters`.

Using the X-means algorithm has the following advantages:

- Eliminates the requirement of having to provide the value of `k` (the total number of labels/clusters in the data).
- Normally produces tighter clusters than hierarchical clustering.

Using the X-means algorithm has the following disadvantages:

- Sensitive to scaling. See StandardScaler.
- Different initializations might result in different final clusters.
- Does not work well with clusters of different sizes and density.

**Syntax**

fit XMeans <fields> [into <model name>]

**Example**

... | fit XMeans * | stats count by cluster

By default, the cluster label is assigned to a field named `cluster`, but you can change that behavior by using the `as` keyword to specify a different field name. You can save X-means models using the `into` keyword.

**Example**

You can then apply the saved X-means model to new data using the `apply` command.

... | apply cluster_model

**Example**

You can inspect the model learned by X-means with the `summary` command.

| summary cluster_model

## Feature Extraction

Feature extraction algorithms transform fields for better prediction accuracy.

### FieldSelector

The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.

**Syntax**

fit FieldSelector <field_to_predict> from <explanatory_fields> [into <model name>] [type=<categorical, numeric>] [mode=<k_best, fpr, fdr, fwe, percentile>] [param=<int>]

**Example**

... | fit FieldSelector type=categorical SLA_violation from * into sla_model | ...

The `type` parameter specifies if the field to predict is categorical or numeric. For descriptions of the `mode` and `param` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html.

You can save FieldSelector models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply sla_model

You cannot inspect the model learned by FieldSelector with the `summary` command.

### HashingVectorizer

The HashingVectorizer algorithm converts text documents to a matrix of token occurrences. It uses a feature hashing strategy to allow for hash collisions when measuring the occurrence of tokens. It is a stateless transformer, meaning that it does not require building a vocabulary of the seen tokens. This reduces the memory footprint and allows for larger feature spaces. It can be compared most easily with the TFIDF algorithm, as they share many of the same parameters.

**Syntax**

fit HashingVectorizer <field_to_convert> [max_features=<int>] [n_iters=<int>] [reduce=<bool>] [k=<int>] [ngram_range=<int>-<int>] [analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]

**Example**

... | fit HashingVectorizer Reviews ngram_range=1-2 k=200 analyzer=char | ...

- The `reduce` parameter is either `true` or `false` and determines whether or not to reduce the output to a smaller dimension using TruncatedSVD. Default is `True`.
- The `k=<int>` parameter sets the number of dimensions to reduce when the `reduce` parameter is set to `true`. Default is 100.
- The default for the `max_features` parameter is 10,000.
- The `n_iters` parameter specifies the number of iterations to use when performing dimensionality reduction. This is only used when the `reduce` parameter is set to `True`. Default is 5.

For descriptions of the `max_features`, `max_df`, `min_df`, `ngram_range`, `analyzer`, `norm`, and `token_pattern` parameters, see the scikit-learn documentation at https://scikit-learn.org/0.19/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html.

HashingVectorizer does not support saving models, incremental fit, or K-fold cross validation.
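The stateless hashing strategy described above can be illustrated with a minimal Python sketch. It is not the toolkit's or scikit-learn's implementation: the helper names `hash_token` and `hashing_vectorize` are hypothetical, MD5 stands in for the hash function the real library uses, and whitespace tokenization is an assumption.

```python
import hashlib
from collections import Counter


def hash_token(token: str, n_features: int) -> int:
    """Map a token to a column index with a stable hash (hypothetical helper)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hashing_vectorize(text: str, n_features: int = 16) -> list:
    """Count token occurrences per hashed column. No vocabulary is stored,
    so two different tokens may collide in the same column."""
    row = [0] * n_features
    for token, count in Counter(text.lower().split()).items():
        row[hash_token(token, n_features)] += count
    return row


# Illustrative input; the output width is fixed by n_features alone.
row = hashing_vectorize("great product great price")
```

Because the column index depends only on the token and the number of columns, the same text always yields the same row without any fitted state, which is why there is no model to save.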

### KernelPCA

The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of data. The difference between KernelPCA and PCA is the use of kernels in the former, which helps with finding nonlinear dependencies among the fields. Currently, KernelPCA only supports the Radial Basis Function (rbf) kernel.

**Syntax**

fit KernelPCA <fields> [into <model name>] [degree=<int>] [k=<int>] [gamma=<float>] [tolerance=<float>] [max_iteration=<int>]

**Example**

... | fit KernelPCA * k=3 gamma=0.001 | ...

The `k` parameter specifies the number of features to be extracted from the data. The other parameters are for fine tuning of the kernel. For descriptions of the `gamma`, `degree`, `tolerance`, and `max_iteration` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html.

You can save KernelPCA models using the `into` keyword and apply new data later using the `apply` command. See example below:

**Example**

... | apply user_feedback_model

You cannot inspect the model learned by KernelPCA with the `summary` command.

Kernel-based methods such as the scikit-learn KernelPCA tend to work best when the data is scaled, for example, using our StandardScaler algorithm: `| fit StandardScaler`. For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

### PCA

The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data. The `k` parameter specifies the number of features to be extracted from the data.

**Syntax**

fit PCA <fields> [into <model name>] [k=<int>] [variance=<float>]

**Example**

| fit PCA "SS_SMART_1_Raw", "SS_SMART_2_Raw", "SS_SMART_3_Raw", "SS_SMART_4_Raw", "SS_SMART_5_Raw" k=2 into example_hard_drives_PCA_2

The `variance` parameter is short for percentage variance ratio explained. This parameter determines the percentage of variance ratio explained in the principal components of the PCA. It computes the number of principal components dynamically by preserving the specified variance ratio.

- The `variance` parameter defaults to 1 if `k` is not provided.
- The `variance` parameter can take a value between 0 and 1.

The following example includes the `variance` parameter. In this example, `variance=0.5` tells the algorithm to choose as many principal components as are needed to explain 50% of the variance in the original dataset.

| fit PCA "SS_SMART_1_Raw", "SS_SMART_2_Raw", "SS_SMART_3_Raw", "SS_SMART_4_Raw", "SS_SMART_5_Raw" variance=0.50 into example_hard_drives_PCA_2
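How a variance ratio translates into a number of principal components can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `components_for_variance` and the eigenvalues are made up, and the exact tie-breaking rule used by scikit-learn at the boundary is not reproduced here.

```python
def components_for_variance(eigenvalues, variance):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches the requested fraction."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= variance:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a covariance matrix for five fields.
eigenvalues = [6.0, 2.0, 1.0, 0.5, 0.5]
k_half = components_for_variance(eigenvalues, 0.5)
k_three_quarters = components_for_variance(eigenvalues, 0.75)
```

With these made-up values, the first component alone explains 60% of the variance, so a target of 0.5 needs one component and a target of 0.75 needs two.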

The `variance` and `k` parameters cannot be used together. They are mutually exclusive.

You can save PCA models using the `into` keyword and apply new data later using the `apply` command. See the following example.

**Example**

...into example_hard_drives_PCA_2 | apply example_hard_drives_PCA_2

You cannot inspect the model learned by PCA with the `summary` command.

### TFIDF

The TFIDF algorithm uses the scikit-learn TfidfVectorizer to convert raw text data into a matrix making it possible to use other machine learning estimators on the data.

**Syntax**

fit TFIDF <field_to_convert> [into <model name>] [max_features=<int>] [max_df=<int>] [min_df=<int>] [ngram_range=<int>-<int>] [analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]

**Example**

... | fit TFIDF Reviews into user_feedback_model ngram_range=1-2 max_df=0.6 min_df=0.2 | ...

For descriptions of the `max_features`, `max_df`, `min_df`, `ngram_range`, `analyzer`, `norm`, and `token_pattern` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.

The default for `max_features` is 100.

To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set `stop_words` to `english`. For other languages (for example, machine language) you can ignore the common words by setting `max_df` to a value greater than or equal to 0.7 and less than 1.0.
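The effect of `max_df` on common tokens can be sketched as follows. This is an illustration rather than the TfidfVectorizer code: the function name `filter_by_max_df`, the whitespace tokenization, and the sample documents are assumptions.

```python
def filter_by_max_df(documents, max_df):
    """Return the vocabulary after dropping tokens whose document frequency
    (fraction of documents containing the token) exceeds max_df."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each token once per document, regardless of repeats.
        for token in set(doc.lower().split()):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)


# Made-up "machine language" documents: "mov" appears in 3 of 4 (0.75).
docs = ["mov eax ebx", "mov ecx edx", "mov esi edi", "add eax edx"]
vocab = filter_by_max_df(docs, max_df=0.7)
```

Here `mov` is dropped because it appears in 75% of the documents, while `eax` (50%) is kept, which mirrors how a `max_df` of 0.7 filters out overly common tokens without a language-specific stop word list.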

You can save TFIDF models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply user_feedback_model

You cannot inspect the model learned by TFIDF with the `summary` command.

## Preprocessing (Prepare Data)

Preprocessing algorithms are used for preparing data. Other algorithms can also be used for preprocessing that may not be organized under this section. For example, PCA can be used for both Feature Extraction *and* Preprocessing.

### Imputer

The Imputer algorithm is a preprocessing step wherein missing data is replaced with substitute values. The substitute values can be estimated, or based on other statistics or values in the dataset. To use Imputer, the user passes in the names of the fields to impute, along with arguments specifying the imputation strategy, and the values representing missing data. Imputer then adds new imputed versions of those fields to the data, which are copies of the original fields, except that their missing values are replaced by values computed according to the imputation strategy.

**Syntax**

... | fit Imputer <field>* [as <field prefix>] [missing_values=<"NaN"|integer>] [strategy=<mean|median|most_frequent>] [into <model name>]

- Available imputation strategies include `mean`, `median`, `most_frequent`, and `field`.
  - The default strategy is `mean`.
- All but `field` require numeric data.
  - The `field` strategy accepts categorical data.

**Example**

| inputlookup server_power.csv | eval ac_power_missing=if(random() % 3 = 0, null, ac_power) | fields - ac_power | fit Imputer ac_power_missing | eval imputed=if(isnull(ac_power_missing), 1, 0) | eval ac_power_imputed=round(Imputed_ac_power_missing, 1) | fields - ac_power_missing, Imputed_ac_power_missing
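The behavior described above, where an imputed copy of the field is added alongside the original, can be sketched in Python. This is an assumption-laden illustration, not the Imputer implementation: the `impute` helper is hypothetical, missing values are represented as `None`, and the `Imputed_` prefix is taken from the field names used in the example search.

```python
from statistics import mean, median, mode

# Numeric strategies named in the syntax; the `field` strategy is not covered.
STRATEGIES = {"mean": mean, "median": median, "most_frequent": mode}


def impute(rows, field, strategy="mean"):
    """Add an Imputed_<field> copy of `field` in which None is replaced by
    the statistic computed from the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = STRATEGIES[strategy](observed)
    return [
        {**row, f"Imputed_{field}": fill if row[field] is None else row[field]}
        for row in rows
    ]


# Made-up events with one missing ac_power value.
rows = [{"ac_power": 10.0}, {"ac_power": None}, {"ac_power": 20.0}, {"ac_power": 30.0}]
imputed = impute(rows, "ac_power")
```

The original field is left untouched, so downstream steps can still tell which values were missing, as the `eval imputed=...` step in the example search does.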

You can save Imputer models using the `into` keyword and apply new data later using the `apply` command. See the following example:

**Example**

... | apply <imputer model name>

You can inspect the value (mean, median, or mode) that was substituted for missing values by Imputer with the `summary` command. See the following example:

**Example**

... | summary <imputer model name>

### RobustScaler

The RobustScaler algorithm uses the scikit-learn RobustScaler algorithm to standardize the data fields by scaling their median and interquartile range to 0 and 1, respectively. It is very similar to the StandardScaler algorithm, in that it helps avoid dominance of one or more fields over others in subsequent machine learning algorithms, and is practically required for some algorithms, such as KernelPCA and SVM. The main difference between StandardScaler and RobustScaler is that RobustScaler is less sensitive to outliers. This algorithm does not support incremental fit.

**Syntax**

fit RobustScaler <fields> [into <model name>] [with_centering=<true|false>] [with_scaling=<true|false>]

**Example**

... | fit RobustScaler * | ...

- The `with_centering` and `with_scaling` parameters specify if the fields should be standardized with respect to their median and interquartile range.
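The median and interquartile-range scaling described for RobustScaler can be sketched with the Python standard library. The `robust_scale` helper and the sample values are hypothetical, and the quartiles are computed with linear interpolation, which is assumed to match the library's default.

```python
from statistics import median, quantiles


def robust_scale(values, with_centering=True, with_scaling=True):
    """Subtract the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values) if with_centering else 0.0
    scale = (q3 - q1) if with_scaling else 1.0
    return [(v - center) / scale for v in values]


# Made-up field values with one extreme outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(values)
```

For these values the median is 3 and the IQR is 2, so the four ordinary points map to -1, -0.5, 0, and 0.5 no matter how extreme the outlier is. A mean and standard deviation computed on the same data would be pulled toward 100, which is the sensitivity the section contrasts with StandardScaler.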

You can save RobustScaler models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply scaling_model

You can inspect the statistics extracted by RobustScaler with the `summary` command (for example, `| summary scaling_model`).

### StandardScaler

The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize data fields by scaling their mean and standard deviation to 0 and 1, respectively. This preprocessing step helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms. This step is practically required for some algorithms, such as KernelPCA and SVM. This algorithm supports incremental fit.

**Syntax**

fit StandardScaler <fields> [into <model name>] [with_mean=<true|false>] [with_std=<true|false>] [partial_fit=<true|false>]

**Example**

... | fit StandardScaler * | ...

- The `with_mean` and `with_std` parameters specify if the fields should be standardized with respect to their mean and standard deviation.
- The `partial_fit` parameter controls whether an existing model should be incrementally updated or not (default is `false`). This allows you to update an existing model using only new data without having to retrain it on the full training data set.

**Example**

The following example uses the `partial_fit` parameter. In this example, if `My_Incremental_Model` does not exist, the command saves the model data under the model name `My_Incremental_Model`. If `My_Incremental_Model` exists and was trained using StandardScaler, the command updates the existing model with the new input. If `My_Incremental_Model` exists but was not trained by StandardScaler, an error message is thrown.

| inputlookup track_day.csv | fit StandardScaler "batteryVoltage", "engineCoolantTemperature", "engineSpeed" partial_fit=true into My_Incremental_Model

Using `partial_fit=true` on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If `partial_fit=false` or `partial_fit` is not specified (default is `false`), the model specified is created and replaces the pre-trained model if one exists.
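One way to picture an incremental update is as a merge of running statistics, so that the mean and standard deviation can be refreshed from new events alone. The sketch below is an assumption about the general technique, not the toolkit's code: `update_stats` and `standardize` are hypothetical helpers, and the voltage readings are made up.

```python
from math import sqrt


def update_stats(stats, batch):
    """Merge a new batch into running (count, mean, M2) statistics, where M2
    is the sum of squared deviations from the mean."""
    n_a, mean_a, m2_a = stats
    n_b = len(batch)
    mean_b = sum(batch) / n_b
    m2_b = sum((x - mean_b) ** 2 for x in batch)
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def standardize(values, stats):
    """Scale values to mean 0 and standard deviation 1 using the statistics."""
    n, mean, m2 = stats
    std = sqrt(m2 / n)  # population standard deviation
    return [(v - mean) / std for v in values]


stats = (0, 0.0, 0.0)                              # no model yet
stats = update_stats(stats, [12.1, 12.4, 12.0])    # first fit
stats = update_stats(stats, [13.0, 12.5])          # incremental update
scaled = standardize([12.1, 12.4, 12.0, 13.0, 12.5], stats)
```

After both updates the running statistics equal those computed over all five readings at once, which is why the earlier events do not need to be replayed.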

You can save StandardScaler models using the `into` keyword and apply new data later using the `apply` command. See the following example:

**Example**

... | apply scaling_model

You can inspect the statistics extracted by StandardScaler with the `summary` command. See the following example:

**Example**

| summary scaling_model

## Regressors

Regressor algorithms predict the value of a numeric field.

The `kfold` cross-validation command is available for all Regressor algorithms. Learn more here.

### DecisionTreeRegressor

The DecisionTreeRegressor algorithm uses the scikit-learn DecisionTreeRegressor estimator to fit a model to predict the value of numeric fields.

**Syntax**

fit DecisionTreeRegressor <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<int>] [max_features=<str>] [min_samples_split=<int>] [random_state=<int>] [max_leaf_nodes=<int>] [splitter=<best|random>]

**Example**

... | fit DecisionTreeRegressor temperature from date_month date_hour into temperature_model | ...

For descriptions of the `max_depth`, `random_state`, `max_features`, `min_samples_split`, `max_leaf_nodes`, and `splitter` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html.

You can save DecisionTreeRegressor models using the `into` keyword and apply them to new data later using the `apply` command. See example below.

**Example**

... | apply model_DTR

You can inspect the decision tree learned by DecisionTreeRegressor with the `summary` command (for example, `| summary model_DTR`). Furthermore, you can get a JSON representation of the tree by giving `json=t` as an argument to the `summary` command (for example, `| summary model_DTR json=t`). To specify the maximum depth of the tree to summarize, use the `limit` argument (for example, `| summary model_DTR limit=10`). The default value for the `limit` argument is 5.

### ElasticNet

The ElasticNet algorithm uses the scikit-learn ElasticNet estimator to fit a model to predict the value of numeric fields. ElasticNet is a linear regression model that includes both L1 and L2 regularization (it is a generalization of Lasso and Ridge).

**Syntax**

fit ElasticNet <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>] [alpha=<float>] [l1_ratio=<float>]

**Example**

... | fit ElasticNet temperature from date_month date_hour normalize=true alpha=0.5 | ...

For descriptions of the `fit_intercept`, `normalize`, `alpha`, and `l1_ratio` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html.
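To make the roles of `alpha` and `l1_ratio` concrete, the following sketch computes the regularization term that scikit-learn documents for ElasticNet, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function name `elastic_net_penalty` and the coefficient values are hypothetical, and the sketch covers only the penalty, not the fitting procedure.

```python
def elastic_net_penalty(coefficients, alpha=1.0, l1_ratio=0.5):
    """Combined L1 and L2 penalty added to the squared-error loss."""
    l1 = sum(abs(w) for w in coefficients)
    l2 = sum(w * w for w in coefficients)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2


weights = [0.5, -2.0, 0.0]  # made-up model coefficients
lasso_like = elastic_net_penalty(weights, alpha=1.0, l1_ratio=1.0)  # pure L1
ridge_like = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.0)  # pure L2
mixed = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5)
```

Setting `l1_ratio=1` leaves only the L1 term, as in Lasso, and `l1_ratio=0` leaves only the L2 term, which is the sense in which ElasticNet generalizes both.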

You can save ElasticNet models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply temperature_model

You can inspect the coefficients learned by ElasticNet with the `summary` command. See example below.

**Example**

| summary temperature_model

### GradientBoostingRegressor

This algorithm uses the GradientBoostingRegressor algorithm from scikit-learn to build a regression model by fitting regression trees on the negative gradient of a loss function. For documentation on the parameters, see GradientBoostingRegressor from scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

**Syntax**

fit GradientBoostingRegressor <field_to_predict> from <explanatory_fields> [into <model_name>] [loss=<ls|lad|huber|quantile>] [max_features=<str>] [learning_rate=<float>] [min_weight_fraction_leaf=<float>] [alpha=<float>] [subsample=<float>] [n_estimators=<int>] [max_depth=<int>] [min_samples_split=<int>] [min_samples_leaf=<int>] [max_leaf_nodes=<int>] [random_state=<int>]

**Example**

The following example uses the GradientBoostingRegressor algorithm to fit a model and saves that model as `temperature_model`.

... | fit GradientBoostingRegressor temperature from date_month date_hour into temperature_model | ...

Use the `apply` command to apply the trained model to the new data.

... | apply temperature_model

To inspect the features learned by GradientBoostingRegressor, use the `summary` command.

| summary temperature_model

### KernelRidge

The KernelRidge algorithm uses the scikit-learn KernelRidge algorithm to fit a model to predict numeric fields. This algorithm uses the radial basis function (rbf) kernel by default.

**Syntax**

fit KernelRidge <field_to_predict> from <explanatory_fields> [into <model_name>] [gamma=<float>]

**Example**

... | fit KernelRidge temperature from date_month date_hour into temperature_model | ...

The `gamma` parameter controls the width of the rbf kernel (the default value is 1/*number of fields*). For details, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html.
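The role of `gamma` can be seen in the rbf kernel itself, which scores the similarity of two events as `exp(-gamma * squared distance)`. The sketch below is illustrative only: the `rbf_kernel` helper is a hypothetical stand-in written with the standard library, with the default `gamma` of 1 divided by the number of fields taken from the description above.

```python
from math import exp


def rbf_kernel(a, b, gamma=None):
    """Radial basis function similarity between two numeric vectors."""
    if gamma is None:
        gamma = 1.0 / len(a)  # default: 1 / number of fields
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])           # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 1.0])           # default gamma = 0.5
narrow = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=5.0)
```

A larger `gamma` makes the similarity fall off faster with distance, so each training event influences a narrower neighborhood.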

You can save KernelRidge models using the `into` keyword and apply new data later using the `apply` command. See example below.

... | apply temperature_model

You cannot inspect the model learned by KernelRidge with the `summary` command.

### Lasso

The Lasso algorithm uses the scikit-learn Lasso estimator to fit a model to predict the value of numeric fields. Lasso is like LinearRegression, but it uses L1 regularization to learn linear models with fewer and smaller coefficients. Lasso models are consequently more robust to noise and resilient against overfitting.

**Syntax**

fit Lasso <field_to_predict> from <explanatory_fields> [into <model name>] [alpha=<float>] [fit_intercept=<true|false>] [normalize=<true|false>]

**Example**

... | fit Lasso temperature from date_month date_hour | ...

The `alpha` parameter controls the degree of L1 regularization. The `fit_intercept` parameter specifies whether the model should include an implicit intercept term (the default value is `true`). For descriptions of the `alpha`, `fit_intercept`, and `normalize` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.
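Why L1 regularization yields fewer coefficients can be sketched for the simplest case of one explanatory field and no intercept. Under the objective `(1 / (2n)) * sum((y - w*x)^2) + alpha * |w|`, the solution is a soft-thresholded least-squares estimate. The helper names and data below are hypothetical, and the real estimator handles many fields with an iterative solver.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clamping small values to zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_single_feature(x, y, alpha):
    """Closed-form Lasso coefficient for one field without an intercept."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    scale = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, alpha) / scale


x = [1.0, 2.0, 3.0]   # made-up explanatory field
y = [2.0, 4.0, 6.0]   # made-up field to predict (y = 2x)
w_none = lasso_single_feature(x, y, alpha=0.0)
w_some = lasso_single_feature(x, y, alpha=1.0)
w_zero = lasso_single_feature(x, y, alpha=10.0)
```

With no penalty the coefficient is the least-squares value of 2, a moderate `alpha` shrinks it, and a large enough `alpha` sets it to exactly zero, which removes the field from the model.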

You can save Lasso models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply temperature_model

You can inspect the coefficients learned by Lasso with the `summary` command. See example below.

**Example**

| summary temperature_model

### LinearRegression

The LinearRegression algorithm uses the scikit-learn LinearRegression estimator to fit a model to predict the value of numeric fields.

**Syntax**

fit LinearRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>]

**Example**

... | fit LinearRegression temperature from date_month date_hour into temperature_model | ...

The `fit_intercept` parameter specifies whether the model should include an implicit intercept term (the default value is `true`).

You can save LinearRegression models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply temperature_model

You can inspect the coefficients learned by LinearRegression with the `summary` command. See example below.

**Example**

| summary temperature_model

### RandomForestRegressor

The RandomForestRegressor algorithm uses the scikit-learn RandomForestRegressor estimator to fit a model to predict the value of numeric fields.

**Syntax**

fit RandomForestRegressor <field_to_predict> from <explanatory_fields> [into <model name>] [n_estimators=<int>] [max_depth=<int>] [random_state=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]

**Example**

... | fit RandomForestRegressor temperature from date_month date_hour into temperature_model | ...

For descriptions of the `n_estimators`, `random_state`, `max_depth`, `max_features`, `min_samples_split`, and `max_leaf_nodes` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

You can save RandomForestRegressor models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply temperature_model

You can list the features that were used to fit the model, as well as their relative importance or influence, with the `summary` command. See example below.

**Example**

| summary temperature_model

### Ridge

The Ridge algorithm uses the scikit-learn Ridge estimator to fit a model to predict the value of numeric fields. Ridge is like LinearRegression, but it uses L2 regularization to learn linear models with smaller coefficients, making the algorithm more robust to collinearity.

**Syntax**

fit Ridge <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>] [alpha=<float>]

**Example**

... | fit Ridge temperature from date_month date_hour normalize=true alpha=0.5 | ...

The `alpha` parameter specifies the degree of regularization (the default value is 1.0). For descriptions of the `fit_intercept`, `normalize`, and `alpha` parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html.
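How `alpha` shrinks coefficients can be sketched for one explanatory field without an intercept. Minimizing `sum((y - w*x)^2) + alpha * w^2` gives `w = sum(x*y) / (sum(x*x) + alpha)`. The `ridge_single_feature` helper and the data are hypothetical, and the real estimator solves the multi-field version of the same problem.

```python
def ridge_single_feature(x, y, alpha=1.0):
    """Closed-form Ridge coefficient for one field without an intercept."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    return xy / (xx + alpha)


x = [1.0, 2.0, 3.0]   # made-up explanatory field
y = [2.0, 4.0, 6.0]   # made-up field to predict (y = 2x)
w_plain = ridge_single_feature(x, y, alpha=0.0)    # ordinary least squares
w_default = ridge_single_feature(x, y, alpha=1.0)
w_strong = ridge_single_feature(x, y, alpha=14.0)
```

Unlike the L1 penalty used by Lasso, the L2 penalty shrinks the coefficient smoothly toward zero as `alpha` grows but does not set it to exactly zero.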

You can save Ridge models using the `into` keyword and apply new data later using the `apply` command. See example below.

**Example**

... | apply temperature_model

You can inspect the coefficients learned by Ridge with the `summary` command. See example below.

**Example**

| summary temperature_model

### SGDRegressor

The SGDRegressor algorithm uses the scikit-learn SGDRegressor estimator to fit a model to predict the value of numeric fields. This algorithm supports incremental fit.

**Syntax**

fit SGDRegressor <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>] [fit_intercept=<true|false>] [random_state=<int>] [n_iter=<int>] [l1_ratio=<float>] [alpha=<float>] [eta0=<float>] [power_t=<float>] [penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>]

**Example**

... | fit SGDRegressor temperature from date_month date_hour into temperature_model | ...

SGDRegressor supports the following parameters:

- `partial_fit=<true|false>`: Specifies whether an existing model should be incrementally updated or not (default `false`). See example below.

**Example**

The following example uses the `partial_fit` parameter. In this example, if `My_Incremental_Model` does not exist, the command saves the model data under the model name `My_Incremental_Model`. If `My_Incremental_Model` exists and was trained using SGDRegressor, the command updates the existing model with the new input. If `My_Incremental_Model` exists but was not trained by SGDRegressor, an error message is returned.

| inputlookup server_power.csv | fit SGDRegressor "ac_power" from "total-cpu-utilization" "total-disk-accesses" partial_fit=true into My_Incremental_Model

Using `partial_fit=true` on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If `partial_fit=false` or `partial_fit` is not specified (default is `false`), the model specified is created and replaces the pre-trained model if one exists.

- `fit_intercept=<true|false>`: Whether the intercept should be estimated or not (default `true`).
- `n_iter=<int>`: The number of passes over the training data, also known as epochs (default 5). The number of iterations is set to 1 if using `partial_fit`.
- `penalty=<l2|l1|elasticnet>`: The penalty, also known as the regularization term, to be used (default l2).
- `learning_rate=<constant|optimal|invscaling>`: The learning rate. constant: eta = eta0, optimal: eta = 1.0/(alpha * t), invscaling: eta = eta0 / pow(t, power_t) (default `invscaling`).
- `l1_ratio=<float>`: The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1 (default 0.15). l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
- `alpha=<float>`: Constant that multiplies the regularization term (default 0.0001). Also used to compute learning_rate when set to `optimal`.
- `eta0=<float>`: The initial learning rate (default 0.01).
- `power_t=<float>`: The exponent for inverse scaling learning rate (default 0.25).
- `random_state=<int>`: The seed of the pseudo random number generator to use when shuffling the data.
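The three learning rate schedules listed above can be written out directly from their formulas. This sketch follows the formulas as stated in this section, with the documented defaults for `eta0`, `alpha`, and `power_t`; the `learning_rate` function is a hypothetical helper, and the library's actual `optimal` schedule may include additional terms not shown here.

```python
def learning_rate(schedule, t, eta0=0.01, alpha=0.0001, power_t=0.25):
    """Step size at update number t (t >= 1) for the given schedule."""
    if schedule == "constant":
        return eta0
    if schedule == "optimal":
        return 1.0 / (alpha * t)
    if schedule == "invscaling":
        return eta0 / pow(t, power_t)
    raise ValueError(f"unknown schedule: {schedule}")


constant_rate = learning_rate("constant", t=16)
decayed_rate = learning_rate("invscaling", t=16)   # 0.01 / 16 ** 0.25
optimal_rate = learning_rate("optimal", t=100)
```

With the default `invscaling` schedule, the step size at the sixteenth update is half of `eta0`, because 16 raised to the power 0.25 is 2.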

You can save SGDRegressor models using the `into` keyword and apply new data later using the `apply` command. See the following example:

**Example**

... | apply temperature_model

You can inspect the coefficients learned by SGDRegressor with the `summary` command. See the following example:

**Example**

| summary temperature_model

## Time Series Analysis

Forecasting algorithms, also known as time series analysis, provide methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data, and forecast its future values.

### ARIMA

The Autoregressive Integrated Moving Average (ARIMA) algorithm uses the StatsModels ARIMA algorithm to fit a model on a time series in order to better understand the series and/or forecast its future values. An ARIMA model can consist of autoregressive terms, moving average terms, and differencing operations. The autoregressive terms express the dependency of the current value of the time series on its previous values. The moving average terms model the effect of previous forecast errors (also called random shocks or white noise) on the current value. If the time series is non-stationary, differencing operations are used to make it stationary. A stationary process is a stochastic process whose probability distribution does not change over time.

**Syntax**

fit ARIMA [_time] <field_to_forecast> order=<int>-<int>-<int> [forecast_k=<int>] [conf_interval=<int>] [holdback=<int>]

**Example**

... | fit ARIMA Voltage order=4-0-1 holdback=10 forecast_k=10

ARIMA requires `order` to be specified at fitting time. `order` needs three values:

- Number of autoregressive (AR) parameters
- Number of differencing operations (D)
- Number of moving average (MA) parameters
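The differencing operations counted by the second value of `order` can be sketched as repeated subtraction of each value from the one that follows it. The `difference` helper and the sample series are hypothetical and only illustrate the idea; the ARIMA implementation performs this step internally.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to a series."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


linear_trend = [3.0, 5.0, 7.0, 9.0, 11.0]        # made-up non-stationary series
quadratic_trend = [1.0, 4.0, 9.0, 16.0, 25.0]

once = difference(linear_trend, d=1)
twice = difference(quadratic_trend, d=2)
```

One round of differencing turns the linear trend into a constant series, and two rounds do the same for the quadratic trend. Each round also shortens the series by one point.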

It also supports the following parameters:

- `forecast_k=<int>`: Tells ARIMA how many points into the future should be forecasted. If `_time` is specified during fitting along with the `<field_to_forecast>`, ARIMA will also generate the timestamps for forecasted values. By default, `forecast_k` is zero.
- `conf_interval=<1..99>`: This is the confidence interval in percentage around forecasted values. By default it is set to 95%.
- `holdback=<int>`: This is the number of data points held back from the ARIMA model. This can be useful when you want to compare the forecast against known data points. By default, holdback is zero.

**Best Practices**

- It is highly recommended to send the time series through timechart before sending it into ARIMA to avoid non-uniform sampling time. If `_time` is not specified, using timechart is not necessary.
- The time series should not have any gaps or missing data; otherwise, ARIMA returns an error. If there are missing samples in the data, using a bigger span in timechart or using streamstats to fill in the gaps with average values can resolve the issue.
- ARIMA supports one time series at a time.
- ARIMA models cannot be saved and used at a later time in the current version.
- When chaining ARIMA output to another algorithm (for example, ARIMA itself), keep in mind the length of the data is the length of the original data + `forecast_k`. If you want to maintain the `holdback` position, you need to add the number in `forecast_k` to your `holdback` value.

See the StatsModels documentation at http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.arima_model.ARIMA.html for more information.

## Utility Algorithms

These utility algorithms are not machine learning algorithms, but provide methods to calculate data characteristics. These algorithms facilitate the process of algorithm selection and parameter selection.

### ACF (autocorrelation function)

ACF (autocorrelation function) calculates the correlation between a sequence and a shifted copy of itself, as a function of `shift`. Shift is also referred to as lag or delay.

**Syntax**

fit ACF <field> [k=<int>] [fft=true|false] [conf_interval=<int>]

**Example**

... | fit ACF logins k=50 fft=true conf_interval=90

- The `k` parameter specifies the number of lags to return autocorrelation for. By default, `k` is 40.
- The `fft` parameter specifies whether ACF is computed via Fast Fourier Transform (FFT). By default, `fft` is false.
- The `conf_interval` parameter specifies the confidence interval in percentage to return. By default, `conf_interval` is set to 95.
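The quantity that ACF returns for each lag can be sketched without FFT as the covariance between the series and its shifted copy, divided by the variance. The `acf` function below is a hypothetical standard-library illustration using the direct sum; it does not compute confidence intervals, and the login counts are made up.

```python
def acf(series, k):
    """Autocorrelation of a series for lags 0 through k (direct computation)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / variance
        for lag in range(k + 1)
    ]


# Made-up series that repeats every four points.
logins = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0]
values = acf(logins, 4)
```

The autocorrelation at lag 0 is always 1. For this series the value at lag 4 is positive because the pattern repeats every four points, and the value at lag 2 is negative because the series is half a cycle out of phase with itself at that shift.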

See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html for more information.

### PACF (partial autocorrelation function)

PACF (partial autocorrelation function) gives the partial correlation between a sequence and its lagged values, controlling for the values of lags that are shorter than its own.

**Syntax**

fit PACF <field> [k=<int>] [method=<ywunbiased|ywmle|ols>] [conf_interval=<int>]

**Example**

... | fit PACF logins k=20 conf_interval=90

- The `k` parameter specifies the number of lags to return partial autocorrelation for. By default, `k` is 40.
- The `method` parameter specifies which method for the calculation to use. By default, `method` is unbiased.
- The `conf_interval` parameter specifies the confidence interval in percentage to return. By default, `conf_interval` is set to 95.

See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.pacf.html for more information.


This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 4.1.0
