Algorithms
The Splunk Machine Learning Toolkit supports the algorithms listed here. In addition to the examples included in the Splunk Machine Learning Toolkit, you can find more examples of these algorithms on the scikit-learn website.
The following algorithms use the fit and apply commands within the Splunk Machine Learning Toolkit. For information on the steps taken by these commands, review the Understanding the fit and apply commands document.
Looking for information on using the score command? See the score command documentation for details.
MLSPL Quick Reference Guide
Download the Machine Learning Toolkit Quick Reference Guide for a handy cheat sheet of MLSPL commands and machine learning algorithms used in the Splunk Machine Learning Toolkit. This document is also available in Japanese.
MLSPL Performance App
Download the MLSPL Performance App for the Machine Learning Toolkit to use performance results for guidance and benchmarking purposes in your own environment.
Extend the algorithms you can use for your models
The algorithms listed here and in the MLSPL Quick Reference Guide are available natively in the Splunk Machine Learning Toolkit. You can also base your algorithm on over 300 open source Python algorithms from the scikit-learn, pandas, statsmodels, NumPy, and SciPy libraries available through the Python for Scientific Computing add-on in Splunkbase.
For information on how to import an algorithm from the Python for Scientific Computing add-on into the Splunk Machine Learning Toolkit, see the MLSPL API Guide.
Add algorithms through GitHub
On-prem customers looking for solutions that fall outside of the 30 native algorithms can use GitHub to add more algorithms. Solve custom use cases by sharing and reusing algorithms in the Splunk Community for MLTK on GitHub. There you can also learn about new machine learning algorithms from the Splunk open source community, and help fellow users of the toolkit.
Cloud customers can also use GitHub to add more algorithms via an app. The Splunk GitHub for Machine learning app provides access to custom algorithms and is based on the Machine Learning Toolkit open source repo. Cloud customers need to create a support ticket to have this app installed.
Anomaly Detection
Anomaly detection algorithms detect anomalies and outliers in numerical or categorical fields.
DensityFunction
The DensityFunction algorithm provides a consistent and streamlined workflow to create and store density functions and utilize them for anomaly detection. DensityFunction allows for grouping of the data using a by clause, where for each group a separate density function is fitted and stored.
The DensityFunction algorithm supports the following three continuous probability density functions: Normal, Exponential, and Gaussian Kernel Density Estimation (Gaussian KDE).
Using the DensityFunction algorithm requires running version 1.4 of the Python for Scientific Computing add-on.
The accuracy of the anomaly detection for DensityFunction depends on the quality and the size of the training dataset, how accurately the fitted distribution models the underlying process that generates the data, and the value chosen for the threshold parameter.
Follow these guidelines to make your models perform more accurately:
 Aim for fitted distributions to have a cardinality (training dataset size) of at least 50. If you cannot collect more training data, create fewer groups of data using the by clause, giving you more data points per group.
 The threshold parameter has a default value, but ideally the values for threshold, lower_threshold, or upper_threshold are chosen based on experimentation as guided by domain knowledge.
 Continue tuning the threshold parameter until you are satisfied with the results.
 Always inspect the model using the summary command.
 If the distribution of the data changes through time, retrain your models frequently.
Parameters
 Valid values for the dist parameter include: norm (normal distribution), expon (exponential distribution), gaussian_kde (Gaussian KDE distribution), and auto (automatic selection).
 The dist parameter default is auto. When set to auto, norm (normal distribution), expon (exponential distribution), and gaussian_kde (Gaussian KDE distribution) all run, with the best results returned.
 The metric parameter sets how the distance between a dataset sampled from the density function and the training dataset is calculated.
 Valid metrics for the metric parameter include: kolmogorov_smirnov and wasserstein.
 The metric parameter default is wasserstein.
 The cardinality value generated by the summary command represents the number of data points used when fitting the selected density function.
 The distance value generated by the summary command represents the metric type used when calculating the distance as well as the distance between the sampled data points from the density function and the training dataset.
 The mean value is the mean of the density function.
 The value for std represents the standard deviation of the density function.
 A value under other represents any parameters other than mean and std as applicable. In the case of Gaussian KDE, other could show parameter size or bandwidth.
 The type field shows both the chosen density function as well as if the dist parameter is set to auto.
 The show_density parameter default is False. If the parameter is set to True, the density of each data point is provided as output in a new field called ProbabilityDensity.
 The output for ProbabilityDensity is the probability density of the data point according to the fitted probability density function. This output is provided when the show_density parameter is set to True.
 The fit command fits a probability density function over the data, optionally stores the resulting distribution's parameters in a model file, and outputs the outlier label in a new field called IsOutlier.
 The output for IsOutlier is a list of labels. Number 1 represents outliers, and 0 represents inliers, assigned to each data point. Outliers are detected based on the values set for the threshold parameter. Inspect the IsOutlier results column to see how well the outlier detection is performing.
 The parameters threshold, lower_threshold, and upper_threshold control the outlier detection process.
 The threshold parameter is the center of the outlier detection process. It represents the percentage of the area under the density function and has a value between 0.000000001 (refers to ~0%) and 1 (refers to 100%). The threshold parameter guides the DensityFunction algorithm to mark outlier areas on the fitted distribution. For example, if threshold=0.01, then 1% of the fitted density function will be set as the outlier area.
 The threshold parameter default value is 0.01.
 The output for BoundaryRanges is the boundary ranges of outliers on the density function which are set according to the values of the threshold parameter.
 Each boundary region has three values: boundary opening point, boundary closing point, and percentage of boundary region.
 The syntax follows the convention of [[first_boundary_region], [second_boundary_region],… ,[n_th boundary_region]].
 In cases of a single boundary region, the value for the percentage of boundary region is equal to the threshold parameter value.
 BoundaryRanges is calculated as an approximation and is empty in the following two cases: when the density function has a sharp peak from low standard deviation, and when there are a low number of data points.
 Data points that are exactly at the boundary opening or closing point are assigned as inliers. An opening or closing point is determined by the density function in use.
 Normal density function has left and right boundary regions. Data points on the left of the left boundary closing point, and data points on the right of the right boundary opening point are assigned as outliers.
 Exponential density function has one boundary region. Data points on the right of the right boundary opening point are assigned as outliers.
 Gaussian KDE density function can have one or more boundary regions, depending on the number of peaks and dips within the density function. Data points in these boundary regions are assigned as outliers. In cases of boundary regions to the left or right, guidelines from Normal density function apply. As the shape for Gaussian KDE density function can differ from dataset to dataset, you do not consistently observe left and right boundary regions.
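The relationship between the threshold value and the boundary points of a Normal density function can be sketched in Python. This is an illustration only, not the toolkit's implementation: the function names are hypothetical, and it assumes the outlier area is split evenly between the left and right tails.

```python
from statistics import NormalDist

def normal_boundary_points(mean, std, threshold=0.01):
    """Return (left_closing_point, right_opening_point) for a Normal density.

    Assumption: the outlier area given by threshold is split evenly
    between the left and right tails.
    """
    dist = NormalDist(mu=mean, sigma=std)
    half = threshold / 2.0
    return dist.inv_cdf(half), dist.inv_cdf(1.0 - half)

def is_outlier(value, left_close, right_open):
    # Strict comparisons: points exactly on a boundary point are inliers.
    return 1 if (value < left_close or value > right_open) else 0

# Standard normal with 30% of the area marked as outlier area.
left, right = normal_boundary_points(mean=0.0, std=1.0, threshold=0.3)
print(round(left, 4), round(right, 4))
print(is_outlier(0.0, left, right), is_outlier(2.5, left, right))
```

A smaller threshold pushes both boundary points further from the mean, so fewer data points are labeled as outliers.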
Syntax
| fit DensityFunction <field> [by "<field1>[,<field2>,....<field5>]"] [into <model name>] [dist=<str>] [show_density=true|false] [threshold=<float>|lower_threshold=<float>|upper_threshold=<float>] [metric=<str>]
You can apply the saved model to new data with the apply command, with the option to update the parameters for threshold, lower_threshold, upper_threshold, and show_density. Parameters for dist and metric cannot be applied at this stage, and any new values provided are ignored.
| apply <model name> [threshold=<float>|lower_threshold=<float>|upper_threshold=<float>] [show_density=true|false]
You can inspect the model learned by DensityFunction with the summary command.
| summary <model name>
Syntax constraints
 Fields within the by clause must be given in quotation marks.
 The maximum number of fields within the by clause is 5.
 The total number of groups calculated with the by clause cannot exceed 1024. In an example clause of by "DayOfWeek,HourOfDay" there are two fields: one for DayOfWeek and one for HourOfDay. As there are seven days in a week, there are seven groups for DayOfWeek. As there are twenty-four hours in a day, there are twenty-four groups for HourOfDay. The total number of groups calculated with the by clause is therefore 7*24=168.
 The limited number of groups prevents model files from growing too large. You can increase the limit by changing the value of max_groups in the DensityFunction settings. Larger limits mean larger model files and longer load times when running apply.
 Decrease max_kde_parameter_size to allow for the increase of max_groups. This change keeps model sizes small while allowing for increased groups.
 The parameters threshold, lower_threshold, and upper_threshold must be within the range of 0.00000001 to 1.
 If the parameters of lower_threshold and upper_threshold are both provided, the summation of these parameters must be less than 1 (100%).
 The threshold and lower_threshold/upper_threshold parameters cannot be specified together.
 Exponential density function and Gaussian KDE density function only support the threshold parameter.
 Exponential density function and Gaussian KDE density function do not support lower_threshold or upper_threshold.
 Normal density function supports either threshold or lower_threshold/upper_threshold.
 The parameters lower_threshold and upper_threshold can be used with only Normal density function.
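The constraints above can be checked before running a search. The following Python sketch is a hypothetical pre-flight check, not part of the toolkit. The function names are invented for illustration, and fitted_dist is assumed to be the density function actually in use (norm, expon, or gaussian_kde).

```python
from math import prod

MAX_GROUPS = 1024            # default group limit stated above
MIN_THRESHOLD = 0.00000001   # lower bound stated above

def total_groups(distinct_values_per_field):
    """Multiply the number of distinct values of each by-clause field."""
    if len(distinct_values_per_field) > 5:
        raise ValueError("the by clause accepts at most 5 fields")
    return prod(distinct_values_per_field)

def check_thresholds(fitted_dist, threshold=None, lower=None, upper=None):
    """Return a list of constraint violations; an empty list means valid."""
    problems = []
    if threshold is not None and (lower is not None or upper is not None):
        problems.append("threshold cannot be combined with lower/upper_threshold")
    if fitted_dist != "norm" and (lower is not None or upper is not None):
        problems.append("lower/upper_threshold require the Normal density function")
    named = (("threshold", threshold),
             ("lower_threshold", lower),
             ("upper_threshold", upper))
    for name, value in named:
        if value is not None and not (MIN_THRESHOLD <= value <= 1):
            problems.append(name + " is out of range")
    if lower is not None and upper is not None and lower + upper >= 1:
        problems.append("lower_threshold + upper_threshold must be less than 1")
    return problems

groups = total_groups([7, 24])   # DayOfWeek x HourOfDay
print(groups, groups <= MAX_GROUPS)
print(check_thresholds("expon", lower=0.01))
```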
Examples
The following example shows DensityFunction on a dataset with the fit command.
| inputlookup call_center.csv | eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S") | bin _time span=15m | eval HourOfDay=strftime(_time, "%H") | eval BucketMinuteOfHour=strftime(_time, "%M") | eval DayOfWeek=strftime(_time, "%A") | stats max(count) as Actual by HourOfDay,BucketMinuteOfHour,DayOfWeek,source,_time | fit DensityFunction Actual by "HourOfDay,BucketMinuteOfHour,DayOfWeek" into mymodel
The following example shows DensityFunction on a dataset with the apply command.
| inputlookup call_center.csv | eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S") | bin _time span=15m | eval HourOfDay=strftime(_time, "%H") | eval BucketMinuteOfHour=strftime(_time, "%M") | eval DayOfWeek=strftime(_time, "%A") | stats max(count) as Actual by HourOfDay,BucketMinuteOfHour,DayOfWeek,source,_time | apply mymodel threshold=0.02
The following example shows DensityFunction on a dataset with the summary command.
| summary mymodel
The following example shows BoundaryRanges on a test set. In this example the threshold is set to 30% (0.3). The first row has a left boundary range which starts at -Infinity and goes up to the number 44.6912. The area of the left boundary range is 15% of the total area under the density function. It also has a right boundary range which starts at the number 518.3088 and goes up to Infinity. The area of the right boundary range is the same as the left boundary range, 15% of the total area under the density function. The areas of the right and left boundary ranges add up to the threshold value of 30%. The third row has only one boundary range, which starts at the number 300.0943 and goes up to Infinity. The area of the boundary range is 30% of the area under the density function.
| inputlookup call_center.csv | eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S") | bin _time span=15m | eval HourOfDay=strftime(_time, "%H") | eval BucketMinuteOfHour=strftime(_time, "%M") | eval DayOfWeek=strftime(_time, "%A") | stats max(count) as Actual by HourOfDay, BucketMinuteOfHour, DayOfWeek, source, _time | fit DensityFunction Actual by "HourOfDay, BucketMinuteOfHour, DayOfWeek" threshold=0.3 into mymodel
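A BoundaryRanges value can be post-processed outside of Splunk. The Python sketch below assumes the field is exported as a nested list literal with three values per region. The literal form shown is an assumption; the numbers are taken from the first row described above, and the helper name is hypothetical.

```python
import json

# Assumed literal form of one BoundaryRanges value (first row of the example).
raw = "[[-Infinity, 44.6912, 0.15], [518.3088, Infinity, 0.15]]"

# Python's json module accepts the Infinity and -Infinity constants.
regions = json.loads(raw)

def in_outlier_region(value, regions):
    # Strict comparisons: points exactly on an opening or closing point
    # are treated as inliers.
    return any(opening < value < closing for opening, closing, _pct in regions)

# The region percentages should add up to the threshold (0.3 here).
total_area = sum(pct for _opening, _closing, pct in regions)
print(total_area)
print(in_outlier_region(40.0, regions), in_outlier_region(300.0, regions))
```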
LocalOutlierFactor
The LocalOutlierFactor algorithm uses the scikit-learn Local Outlier Factor (LOF) to measure the local deviation of density of a given sample with respect to its neighbors. LocalOutlierFactor is an unsupervised outlier detection method. The anomaly score depends on how isolated the object is with respect to its neighbors.
For descriptions of the n_neighbors, leaf_size and other parameters, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html
Using the LocalOutlierFactor algorithm requires running version 1.3 or above of the Python for Scientific Computing add-on.
Parameters
 The anomaly_score parameter default is True. Disable this default by adding the False keyword to the command.
 The n_neighbors parameter default is 20.
 The leaf_size parameter default is 30.
 The p parameter is limited to p >=1.
 The contamination parameter must be within the range of 0.0 (not included) to 0.5 (included).
 The contamination parameter default is 0.1.
 Options for the algorithm parameter include: brute, kd_tree, ball_tree, and auto. The default is auto.
 The brute, kd_tree, ball_tree, and auto algorithm options have respective valid metrics. The default metric for each algorithm is minkowski.
 Valid metrics for brute include: cityblock, euclidean, l1, l2, manhattan, chebyshev, minkowski, braycurtis, canberra, dice, hamming, jaccard, kulsinski, matching, rogerstanimoto, russellrao, sokalmichener, sokalsneath, cosine, correlation, sqeuclidean, and yule.
 Valid metrics for kd_tree include: cityblock, euclidean, l1, l2, manhattan, chebyshev, and minkowski.
 Valid metrics for ball_tree include: cityblock, euclidean, l1, l2, manhattan, chebyshev, minkowski, braycurtis, canberra, dice, hamming, jaccard, kulsinski, matching, rogerstanimoto, russellrao, sokalmichener, and sokalsneath.
 The output for LocalOutlierFactor is a list of labels titled is_outlier, assigned -1 for outliers, and 1 for inliers.
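The algorithm and metric combinations listed above can be encoded as a lookup table. The following Python sketch is a hypothetical helper for validating options before building a search; it is not part of the toolkit. Metrics for the auto option are not enumerated above, so the sketch does not check them.

```python
# Metric lists copied from the parameter descriptions above.
TREE_METRICS = {"cityblock", "euclidean", "l1", "l2", "manhattan",
                "chebyshev", "minkowski"}
BALL_TREE_EXTRA = {"braycurtis", "canberra", "dice", "hamming", "jaccard",
                   "kulsinski", "matching", "rogerstanimoto", "russellrao",
                   "sokalmichener", "sokalsneath"}
BRUTE_EXTRA = {"cosine", "correlation", "sqeuclidean", "yule"}

VALID_METRICS = {
    "kd_tree": TREE_METRICS,
    "ball_tree": TREE_METRICS | BALL_TREE_EXTRA,
    "brute": TREE_METRICS | BALL_TREE_EXTRA | BRUTE_EXTRA,
}

def check_lof_options(algorithm="auto", metric="minkowski",
                      contamination=0.1, p=None):
    """Return a list of problems; an empty list means the options look valid."""
    problems = []
    if algorithm != "auto" and algorithm not in VALID_METRICS:
        problems.append("unknown algorithm: " + algorithm)
    elif algorithm != "auto" and metric not in VALID_METRICS[algorithm]:
        problems.append("metric " + metric + " is not valid for " + algorithm)
    if not (0.0 < contamination <= 0.5):
        problems.append("contamination must be in (0.0, 0.5]")
    if p is not None and p < 1:
        problems.append("p must be >= 1")
    return problems

print(check_lof_options("kd_tree", "minkowski", contamination=0.14, p=1))
print(check_lof_options("kd_tree", "cosine"))
```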
Syntax
fit LocalOutlierFactor <fields> [n_neighbors=<int>] [leaf_size=<int>] [p=<int>] [contamination=<float>] [metric=<str>] [algorithm=<str>] [anomaly_score=<true|false>]
Syntax constraints
 You cannot save LocalOutlierFactor models using the into keyword. This algorithm does not support saving models.
 LocalOutlierFactor does not include the predict method.
Example
The following example uses LocalOutlierFactor on a test set.
| inputlookup iris.csv | fit LocalOutlierFactor petal_length petal_width n_neighbors=10 algorithm=kd_tree metric=minkowski p=1 contamination=0.14 leaf_size=10
OneClassSVM
The OneClassSVM algorithm uses the scikit-learn OneClassSVM to fit a model from a set of features or fields for detecting anomalies and outliers, where features are expected to contain numerical values. OneClassSVM is an unsupervised outlier detection method.
For further information, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/svm.html#kernel-functions
Parameters
 The kernel parameter specifies the kernel type to use in the algorithm. The default kernel is rbf. Kernel types include: linear, rbf, poly, and sigmoid.
 You can specify the upper bound on the fraction of training errors as well as the lower bound of the fraction of support vectors using the nu parameter, where the default value is 0.5.
 The degree parameter is ignored by all kernels except the polynomial kernel, where the default value is 3.
 gamma is the kernel coefficient that specifies how much influence a single data instance has, where the default value is 1/number of features.
 The independent term coef0 in the kernel function is only significant if you have a polynomial or sigmoid function.
 The term tol is the tolerance for stopping criteria.
 The shrinking parameter determines whether to use the shrinking heuristic.
Syntax
fit OneClassSVM <fields> [into <model name>] [kernel=<str>] [nu=<float>] [coef0=<float>] [gamma=<float>] [tol=<float>] [degree=<int>] [shrinking=<true|false>]
 You can save OneClassSVM models using the into keyword.
 You can apply the saved model later to new data with the apply command.
Syntax constraints
 After running the fit or apply command, a new field named isNormal is generated. This field defines whether a particular record (row) is normal (isNormal=1) or anomalous (isNormal=-1).
 You cannot inspect the model learned by OneClassSVM with the summary command.
Example
The following example uses OneClassSVM on a test set.
... | fit OneClassSVM * kernel="poly" nu=0.5 coef0=0.5 gamma=0.5 tol=1 degree=3 shrinking=f into TESTMODEL_OneClassSVM
Classifiers
Classifier algorithms predict the value of a categorical field.
The kfold cross-validation command can be used with all Classifier algorithms. Learn more here.
BernoulliNB
The BernoulliNB algorithm uses the scikit-learn BernoulliNB estimator to fit a model to predict the value of categorical fields where explanatory variables are assumed to be binary-valued. BernoulliNB is an implementation of the Naive Bayes classification algorithm. This algorithm supports incremental fit.
Parameters
 The alpha parameter controls Laplace/Lidstone smoothing. The default value is 1.0.
 The binarize parameter is a threshold that can be used for converting numeric field values to the binary values expected by BernoulliNB. The default value is 0. If binarize=0 is specified, the default, values > 0 are assumed to be 1, and values <= 0 are assumed to be 0.
 The fit_prior Boolean parameter specifies whether to learn class prior probabilities. The default value is True. If fit_prior=f is specified, classes are assumed to have uniform popularity.
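The effect of the binarize and alpha parameters can be illustrated with a short Python sketch. The function names are hypothetical, and the smoothing formula is the standard smoothed estimate for a binary feature rather than code taken from the toolkit.

```python
def binarize(values, threshold=0.0):
    """Map numeric values to 0/1: strictly greater than the threshold becomes 1."""
    return [1 if v > threshold else 0 for v in values]

def smoothed_feature_probability(ones_in_class, rows_in_class, alpha=1.0):
    """Estimate P(feature = 1 | class) with Laplace/Lidstone smoothing.

    The 2 * alpha term reflects the two possible values of a binary feature.
    """
    return (ones_in_class + alpha) / (rows_in_class + 2.0 * alpha)

print(binarize([-1, 0, 0.5, 3]))
# A feature never seen as 1 in a class still gets a non-zero probability.
print(smoothed_feature_probability(ones_in_class=0, rows_in_class=8, alpha=1.0))
```

With alpha=0 there is no smoothing, so a feature value that never occurs in a class gets probability zero.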
Syntax
fit BernoulliNB <field_to_predict> from <explanatory_fields> [into <model name>] [alpha=<float>] [binarize=<float>] [fit_prior=<true|false>] [partial_fit=<true|false>]
You can save BernoulliNB models using the into keyword and apply the saved model later to new data using the apply command.
... | apply TESTMODEL_BernoulliNB
You can inspect the model learned by BernoulliNB with the summary command as well as view the class and log probability information calculated from the dataset.
... | summary My_Incremental_Model
Syntax constraints
 The partial_fit parameter controls whether an existing model should be incrementally updated or not. The default value is False, meaning it will not be incrementally updated. Choosing partial_fit=True allows you to update an existing model using only new data without having to retrain it on the full training data set.
 Using partial_fit=True on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=False or partial_fit is not specified (default is False), the model specified is created and replaces the pre-trained model if one exists.
 If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model. If My_Incremental_Model exists and was trained using BernoulliNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by BernoulliNB, an error message displays.
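The partial_fit rules above amount to a small decision table. The Python sketch below restates them for illustration; the function name and return values are hypothetical and do not correspond to toolkit internals.

```python
def resolve_fit_action(algorithm, existing_model_algorithm, partial_fit=False):
    """Decide what happens to the named model when fit runs.

    existing_model_algorithm is None when no model with that name exists,
    otherwise the name of the algorithm that trained the existing model.
    """
    if not partial_fit:
        # A full fit always creates the model, replacing any existing one.
        return "create" if existing_model_algorithm is None else "replace"
    if existing_model_algorithm is None:
        return "create"
    if existing_model_algorithm == algorithm:
        # Newly supplied parameters are ignored; the original ones are kept.
        return "update"
    raise ValueError("existing model was not trained with " + algorithm)

print(resolve_fit_action("BernoulliNB", None, partial_fit=True))
print(resolve_fit_action("BernoulliNB", "BernoulliNB", partial_fit=True))
print(resolve_fit_action("BernoulliNB", "GaussianNB", partial_fit=False))
```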
Example
The following example uses BernoulliNB on a test set.
... | fit BernoulliNB type from * into TESTMODEL_BernoulliNB alpha=0.5 binarize=0 fit_prior=f
DecisionTreeClassifier
The DecisionTreeClassifier algorithm uses the scikit-learn DecisionTreeClassifier estimator to fit a model to predict the value of categorical fields. For further information, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
Parameters
To specify the maximum depth of the tree to summarize, use the limit argument. The default value for the limit argument is 5.
... | summary model_DTC limit=10
Syntax
fit DecisionTreeClassifier <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>] [criterion=<gini|entropy>] [splitter=<best|random>] [random_state=<int>]
You can save DecisionTreeClassifier models by using the into keyword and apply them to new data later by using the apply command.
... | apply model_DTC
You can inspect the decision tree learned by DecisionTreeClassifier with the summary command.
... | summary model_DTC
See a JSON representation of the tree by giving json=t as an argument to the summary command.
... | summary model_DTC json=t
Example
The following example uses DecisionTreeClassifier on a test set.
... | fit DecisionTreeClassifier SLA_violation from * into sla_model | ...
GaussianNB
The GaussianNB algorithm uses the scikit-learn GaussianNB estimator to fit a model to predict the value of categorical fields, where the likelihood of explanatory variables is assumed to be Gaussian. GaussianNB is an implementation of the Gaussian Naive Bayes classification algorithm. This algorithm supports incremental fit.
Parameters
 The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
 The partial_fit parameter default is False.
Syntax
fit GaussianNB <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>]
You can save GaussianNB models using the into keyword and apply the saved model later to new data using the apply command.
... | apply TESTMODEL_GaussianNB
You can inspect models learned by GaussianNB with the summary command.
... | summary My_Incremental_Model
Syntax constraints
 If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model. If My_Incremental_Model exists and was trained using GaussianNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by GaussianNB, an error message displays.
 If partial_fit=False or partial_fit is not specified, the model specified is created and replaces the pre-trained model if one exists.
Example
The following example uses GaussianNB on a test set.
... | fit GaussianNB species from * into TESTMODEL_GaussianNB
The following example includes the partial_fit parameter.
| inputlookup iris.csv | fit GaussianNB species from * partial_fit=true into My_Incremental_Model
GradientBoostingClassifier
This algorithm uses the GradientBoostingClassifier from scikit-learn to build a classification model by fitting regression trees on the negative gradient of a deviance loss function. For further information, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.
Syntax
fit GradientBoostingClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [loss=<deviance | exponential>] [max_features=<str>] [learning_rate=<float>] [min_weight_fraction_leaf=<float>] [n_estimators=<int>] [max_depth=<int>] [min_samples_split=<int>] [min_samples_leaf=<int>] [max_leaf_nodes=<int>] [random_state=<int>]
You can apply the saved model later to new data using the apply command.
... | apply TESTMODEL_GradientBoostingClassifier
You can inspect features learned by GradientBoostingClassifier with the summary command.
... | summary TESTMODEL_GradientBoostingClassifier
Example
The following example uses GradientBoostingClassifier on a test set.
... | fit GradientBoostingClassifier target from * into TESTMODEL_GradientBoostingClassifier
LogisticRegression
The LogisticRegression algorithm uses the scikit-learn LogisticRegression estimator to fit a model to predict the value of categorical fields.
Parameters
 The fit_intercept parameter specifies whether the model includes an implicit intercept term.
 The default value of the fit_intercept parameter is True.
 The probabilities parameter specifies whether probabilities for each possible field value should be returned alongside the predicted value.
 The default value of the probabilities parameter is False.
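To show what the intercept term and the returned probabilities mean, the following Python sketch evaluates a binary logistic model by hand. It is a simplified illustration: the function name and the coefficient values are hypothetical, and only the two-class case is covered.

```python
from math import exp

def predict_probability(features, coefficients, intercept=0.0):
    """Probability of the positive class for a binary logistic model.

    With fit_intercept disabled the intercept stays at 0, so the decision
    boundary is forced through the origin.
    """
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    # Numerically stable form of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

# Hypothetical coefficient for a single explanatory field.
print(predict_probability([0.0], [1.5]))
print(predict_probability([0.0], [1.5], intercept=2.0))
```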
Syntax
fit LogisticRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [probabilities=<true|false>]
You can save LogisticRegression models using the into keyword and apply the saved model later to new data using the apply command.
... | apply sla_model
You can inspect the coefficients learned by LogisticRegression with the summary command.
... | summary sla_model
Example
The following example uses LogisticRegression on a test set.
... | fit LogisticRegression SLA_violation from IO_wait_time into sla_model | ...
MLPClassifier
The MLPClassifier algorithm uses the scikit-learn Multi-layer Perceptron estimator for classification. MLPClassifier uses a feedforward artificial neural network model that trains using backpropagation. This algorithm supports incremental fit.
For descriptions of the batch_size, random_state and max_iter parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
Using the MLPClassifier algorithm requires running version 1.3 or above of the Python for Scientific Computing add-on.
Parameters
 The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
 The partial_fit parameter default is False.
 The hidden_layer_sizes parameter format (int) varies based on the number of hidden layers in the data.
Syntax
fit MLPClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [batch_size=<int>] [max_iter=<int>] [random_state=<int>] [hidden_layer_sizes=<int>-<int>-<int>] [activation=<str>] [solver=<str>] [learning_rate=<str>] [tol=<float>] [momentum=<float>]
You can save MLPClassifier models by using the into keyword and apply them to new data later by using the apply command.
You can inspect models learned by MLPClassifier with the summary command.
... | summary My_Example_Model
Syntax constraints
 If My_Example_Model does not exist, the model is saved to it.
 If My_Example_Model exists and was trained using MLPClassifier, the command updates the existing model with the new input.
 If My_Example_Model exists but was not trained using MLPClassifier, an error message displays.
Example
The following example uses MLPClassifier on a test set.
... | inputlookup diabetes.csv | fit MLPClassifier response from * into MLP_example_model hidden_layer_sizes='100-100-80' | ...
The following example includes the partial_fit parameter.
| inputlookup iris.csv | fit MLPClassifier species from * partial_fit=true into My_Example_Model
RandomForestClassifier
The RandomForestClassifier algorithm uses the scikit-learn RandomForestClassifier estimator to fit a model to predict the value of categorical fields.
For descriptions of the n_estimators, max_depth, criterion, random_state, max_features, min_samples_split, and max_leaf_nodes parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
Syntax
fit RandomForestClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [n_estimators=<int>] [max_depth=<int>] [criterion=<gini | entropy>] [random_state=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]
You can save RandomForestClassifier models using the into keyword and apply the saved model later to new data using the apply command.
... | apply sla_model
You can list the features that were used to fit the model, as well as their relative importance or influence, with the summary command.
... | summary sla_model
Example
The following example uses RandomForestClassifier on a test set.
... | fit RandomForestClassifier SLA_violation from * into sla_model | ...
SGDClassifier
The SGDClassifier algorithm uses the scikit-learn SGDClassifier estimator to fit a model to predict the value of categorical fields. This algorithm supports incremental fit.
Parameters
 The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
 The partial_fit parameter default is False.
 n_iter=<int> is the number of passes over the training data, also known as epochs. The default is 5. The number of iterations is set to 1 if using partial_fit.
 The loss=<hinge|log|modified_huber|squared_hinge|perceptron> parameter is the loss function to be used. Defaults to hinge, which gives a linear SVM. The log loss gives logistic regression, a probabilistic classifier. modified_huber is another smooth loss that brings tolerance to outliers as well as probability estimates. squared_hinge is like hinge but is quadratically penalized. perceptron is the linear loss used by the perceptron algorithm.
 The fit_intercept=<true|false> parameter specifies whether the intercept should be estimated or not. The default is True.
 penalty=<l2|l1|elasticnet> is the penalty, also known as regularization term, to be used. The default is l2.
 learning_rate=<constant|optimal|invscaling> is the learning rate. constant: eta = eta0. optimal: eta = 1.0/(alpha * t). invscaling: eta = eta0 / pow(t, power_t). The default is invscaling.
 l1_ratio=<float> is the Elastic Net mixing parameter, with 0 <= l1_ratio <= 1 (default 0.15). l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
 alpha=<float> is the constant that multiplies the regularization term (default 0.0001). Also used to compute learning_rate when set to optimal.
 eta0=<float> is the initial learning rate. The default is 0.01.
 power_t=<float> is the exponent for inverse scaling learning rate. The default is 0.25.
 random_state=<int> is the seed of the pseudo random number generator to use when shuffling the data.
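The three learning rate schedules can be written out directly from the formulas above. The following Python sketch uses the documented defaults for eta0, alpha, and power_t; the function name is hypothetical and t stands for the update step, starting at 1.

```python
def learning_rate(schedule, t, eta0=0.01, alpha=0.0001, power_t=0.25):
    """Return eta at update step t (t >= 1) for the named schedule."""
    if schedule == "constant":
        return eta0
    if schedule == "optimal":
        return 1.0 / (alpha * t)
    if schedule == "invscaling":
        return eta0 / pow(t, power_t)
    raise ValueError("unknown schedule: " + schedule)

for step in (1, 16, 256):
    print(step,
          learning_rate("constant", step),
          learning_rate("invscaling", step))
```

The constant schedule never changes, while invscaling shrinks slowly as t grows, because power_t defaults to 0.25.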
Syntax
fit SGDClassifier <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>] [loss=<hinge|log|modified_huber|squared_hinge|perceptron>] [fit_intercept=<true|false>] [random_state=<int>] [n_iter=<int>] [l1_ratio=<float>] [alpha=<float>] [eta0=<float>] [power_t=<float>] [penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>]
You can save SGDClassifier models using the into keyword and apply the saved model later to new data using the apply command.
... | apply sla_model
You can inspect the model learned by SGDClassifier with the summary command.
... | summary sla_model
Syntax constraints
 If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
 If My_Incremental_Model exists and was trained using SGDClassifier, the command updates the existing model with the new input.
 If My_Incremental_Model exists but was not trained by SGDClassifier, an error displays.
 Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
 If partial_fit=false or partial_fit is not specified, the specified model is created and replaces the pretrained model if one exists.
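The constraints above amount to a small decision table. The following sketch restates them in Python for illustration only. The function resolve_fit and its return strings are hypothetical and are not part of the toolkit.

```python
def resolve_fit(existing_algo, requested_algo, partial_fit=False):
    """Describe what a fit call does to a named model.

    existing_algo is None when no model with that name has been saved yet.
    """
    if not partial_fit:
        # Without partial_fit the model is always built from scratch.
        return "create model (replaces any existing model)"
    if existing_algo is None:
        return "create model"
    if existing_algo != requested_algo:
        raise ValueError(
            f"model was trained with {existing_algo}, not {requested_algo}")
    return "update existing model (new parameters ignored)"


print(resolve_fit(None, "SGDClassifier", partial_fit=True))
print(resolve_fit("SGDClassifier", "SGDClassifier", partial_fit=True))
```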
Example
The following example uses SGDClassifier on a test set.
... | fit SGDClassifier SLA_violation from * into sla_model
The following example includes the partial_fit=<true|false> parameter.
| inputlookup iris.csv | fit SGDClassifier species from * partial_fit=true into My_Incremental_Model
SVM
The SVM algorithm uses the scikit-learn kernel-based SVC estimator to fit a model to predict the value of categorical fields. It uses the radial basis function (rbf) kernel by default. For descriptions of the C and gamma parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.
Kernel-based methods such as the scikit-learn SVC tend to work best when the data is scaled, for example, using our StandardScaler algorithm: | fit StandardScaler. For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Parameters
 The gamma parameter controls the width of the rbf kernel. The default value is 1 / number of fields.
 The C parameter controls the degree of regularization when fitting the model. The default value is 1.0.
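To see how gamma sets the kernel width, the sketch below evaluates the standard rbf kernel, exp(-gamma * squared distance), in plain Python. The helper names default_gamma and rbf_kernel are hypothetical. The default follows the 1 / number of fields rule stated above.

```python
import math


def default_gamma(num_fields):
    """Default kernel width as described above: 1 / number of fields."""
    return 1.0 / num_fields


def rbf_kernel(x, y, gamma):
    """Similarity of two points: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


a, b = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(a, b, default_gamma(len(a))))  # wider kernel, higher similarity
print(rbf_kernel(a, b, 5.0))                    # narrower kernel, lower similarity
```

A larger gamma makes similarity fall off faster with distance, so each training point influences a smaller neighborhood.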
Syntax
fit SVM <field_to_predict> from <explanatory_fields> [into <model name>] [C=<float>] [gamma=<float>]
You can save SVM models using the into keyword and apply new data later using the apply command.
... | apply sla_model
Syntax constraints
You cannot inspect the model learned by SVM with the summary command.
Example
The following example uses SVM on a test set.
... | fit SVM SLA_violation from * into sla_model | ...
Clustering Algorithms
Clustering is the grouping of data points. Results will vary depending upon the clustering algorithm used. Clustering algorithms differ in how they determine if data points are similar and should be grouped. For example, the K-means algorithm clusters based on points in space, whereas the DBSCAN algorithm clusters based on local density.
BIRCH
The BIRCH algorithm uses the scikit-learn BIRCH clustering algorithm to divide data points into a set of distinct clusters. The cluster for each event is set in a new field named cluster. This algorithm supports incremental fit.
Parameters
 The k parameter specifies the number of clusters to divide the data into after the final clustering step, which treats the subclusters from the leaves of the CF tree as new samples.
 By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
 The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set. The default is False.
Syntax
fit BIRCH <fields> [into <model name>] [k=<int>] [partial_fit=<true|false>]
You can save BIRCH models using the into keyword and apply new data later using the apply command.
... | apply BIRCH_model
Syntax constraints
 If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
 If My_Incremental_Model exists and was trained using BIRCH, the command updates the existing model with the new input.
 If My_Incremental_Model exists but was not trained by BIRCH, an error message displays.
 Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
 If partial_fit=false or partial_fit is not specified, the specified model is created and replaces the pretrained model if one exists.
 You cannot inspect the model learned by BIRCH with the summary command.
Examples
The following example uses BIRCH on a test set.
... | fit BIRCH * k=3 | stats count by cluster
The following example includes the partial_fit parameter.
| inputlookup track_day.csv | fit BIRCH * k=6 partial_fit=true into My_Incremental_Model
DBSCAN
The DBSCAN algorithm uses the scikit-learn DBSCAN clustering algorithm to divide a result set into distinct clusters. The cluster for each event is set in a new field named cluster. DBSCAN is distinct from KMeans in that it clusters results based on local density and uncovers a variable number of clusters, whereas KMeans finds a precise number of clusters. For example, k=5 finds 5 clusters.
Parameters
 The eps parameter specifies the maximum distance between two samples for them to be considered in the same cluster.
 By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
 The min_samples parameter defines the number of samples, or the total weight, in a neighborhood for a point to be considered a core point, including the point itself. You can choose the best value for the min_samples parameter based on your preference for cluster density or noise in your dataset.
 The min_samples parameter is optional.
 The min_samples default value is 5.
 The minimum value for the min_samples parameter is 3.
 If min_samples=8, you need at least 8 data points to form a dense cluster.
If you choose the best value for the min_samples parameter based on noise in your dataset, it is recommended to have a larger data set to pull from.
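The interaction between eps and min_samples can be illustrated with a minimal core point check. This is a sketch in plain Python, assuming Euclidean distance. It is not the scikit-learn implementation, and the function name is_core_point is hypothetical.

```python
import math


def is_core_point(index, points, eps, min_samples=5):
    """Return True when at least min_samples points, the point itself
    included, lie within distance eps of points[index]."""
    center = points[index]
    neighbors = sum(1 for p in points if math.dist(center, p) <= eps)
    return neighbors >= min_samples


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_core_point(0, points, eps=0.5, min_samples=3))  # dense neighborhood
print(is_core_point(3, points, eps=0.5, min_samples=3))  # isolated point
```

Raising min_samples or lowering eps makes fewer points qualify as core points, so more events are treated as noise.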
Syntax
 fit DBSCAN <fields> [eps=<number>] [min_samples=<integer>]
Syntax constraints
You cannot save DBSCAN models using the into keyword. To predict cluster assignments for future data, combine the DBSCAN algorithm with any classifier algorithm. For example, first cluster the data using DBSCAN, then fit RandomForestClassifier to predict the cluster.
Examples
The following example uses DBSCAN without the min_samples parameter.
... | fit DBSCAN * | stats count by cluster
The following example uses DBSCAN with the min_samples parameter.
| inputlookup track_day.csv | fit DBSCAN eps=0.5 min_samples=1000 speed | table speed cluster
K-means
K-means clustering is a type of unsupervised learning. It is a clustering algorithm that groups similar data points, with the number of groups represented by the variable k. The K-means algorithm uses the scikit-learn K-means implementation. The cluster for each event is set in a new field named cluster. Use the K-means algorithm when you have unlabeled data and have at least approximate knowledge of the total number of groups into which the data can be divided.
Using the K-means algorithm has the following advantages:
 Computationally faster than most other clustering algorithms.
 Simple algorithm to explain and understand.
 Normally produces tighter clusters than hierarchical clustering.
Using the K-means algorithm has the following disadvantages:
 Difficult to determine the optimal or true value of k. See X-means.
 Sensitive to scaling. See StandardScaler.
 Each clustering may be slightly different, unless you specify the random_state parameter.
 Does not work well with clusters of different sizes and density.
For a description of the default value of k, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Parameters
The k parameter specifies the number of clusters to divide the data into. By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
Syntax
fit KMeans <fields> [into <model name>] [k=<int>] [random_state=<int>]
You can save K-means models using the into keyword when using the fit command.
You can apply the model to new data using the apply command.
... | apply cluster_model
You can inspect the model using the summary command.
... | summary cluster_model
Example
The following example uses K-means on a test set.
... | fit KMeans * k=3 | stats count by cluster
SpectralClustering
The SpectralClustering algorithm uses the scikit-learn SpectralClustering clustering algorithm to divide a result set into a set of distinct clusters. SpectralClustering first transforms the input data using the Radial Basis Function (rbf) kernel, and then performs KMeans clustering on the result. Consequently, SpectralClustering can learn clusters with a non-convex shape. The cluster for each event is set in a new field named cluster.
Parameters
The k parameter specifies the number of clusters to divide the data into after the kernel step. By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
Syntax
fit SpectralClustering <fields> [k=<int>] [gamma=<float>] [random_state=<int>]
Syntax constraints
You cannot save SpectralClustering models using the into keyword. If you want to be able to predict cluster assignments for future data, you can combine the SpectralClustering algorithm with any classifier algorithm. For example, first cluster the data using SpectralClustering, then fit a classifier to predict the cluster using RandomForestClassifier.
Example
The following example uses SpectralClustering on a test set.
... | fit SpectralClustering * k=3 | stats count by cluster
X-means
Use the X-means algorithm when you have unlabeled data and no prior knowledge of the total number of labels into which that data could be divided. The X-means clustering algorithm is an extended K-means that automatically determines the number of clusters based on Bayesian Information Criterion (BIC) scores. Starting with a single cluster, the X-means algorithm goes into action after each run of K-means, making local decisions about which subset of the current centroids should split themselves in order to fit the data better.
Using the X-means algorithm has the following advantages:
 Eliminates the requirement of having to provide the value of k.
 Normally produces tighter clusters than hierarchical clustering.
Using the X-means algorithm has the following disadvantages:
 Sensitive to scaling. See StandardScaler.
 Different initializations might result in different final clusters.
 Does not work well with clusters of different sizes and density.
Parameters
 The k value is the total number of labels or clusters in the data.
 The splitting decision is made by computing the BIC.
 The cluster for each event is set in a new field named cluster, and the total number of clusters is set in a new field named n_clusters.
 By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
Syntax
fit XMeans <fields> [into <model name>]
You can save X-means models using the into keyword. You can apply the saved X-means model to new data using the apply command.
... | apply cluster_model
You can inspect the model learned by X-means with the summary command.
... | summary cluster_model
Example
The following example uses X-means on a test set.
... | fit XMeans * | stats count by cluster
Cross-validation
Cross-validation assesses how well a statistical model generalizes on an independent dataset. Cross-validation tells you how well your machine learning model is expected to perform on data that it has not been trained on. There are many types of cross-validation, but K-fold cross-validation (kfold_cv) is one of the most common.
Cross-validation is typically used for the following machine learning scenarios:
 Comparing two or more algorithms against each other for selecting the best choice on a particular dataset.
 Comparing different choices of hyperparameters on the same algorithm for choosing the best hyperparameters for a particular dataset.
 An improved method over a train/test split for quantifying model generalization.
Cross-validation is not well suited for time-series data:
 In situations where the data is ordered, such as time series, cross-validation is not well suited because the training data is shuffled. In these situations, other methods such as Forward Chaining are more suitable.
 The most straightforward implementation is to wrap sklearn's Time Series Split. Learn more here: https://en.wikipedia.org/wiki/Forward_chaining
K-fold cross-validation
With the kfold_cv parameter, the training set is randomly partitioned into k equal-sized subsamples. Then, each subsample takes a turn at becoming the validation (test) set, predicted by a model trained on the other k-1 subsamples. Each sample is used exactly once in the validation set, and the variance of the resulting estimate is reduced as k is increased. The disadvantage of the kfold_cv parameter is that k different models have to be trained, leading to long execution times for large datasets and complex models.
The scores obtained from K-fold cross-validation are generally a less biased and less optimistic estimate of the model performance than a standard training and testing split.
You can obtain k performance metrics, one for each training and testing split. These k performance metrics can then be averaged to obtain a single estimate of how well the model generalizes on unseen data.
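The partitioning described above can be sketched in a few lines of plain Python. This is an illustration of the general K-fold idea, not the toolkit's implementation. The function name kfold_indices and the seed argument are hypothetical.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in kfold_indices(10, k=5):
    print("train:", sorted(train), "test:", sorted(test))
```

Each index appears in exactly one test fold, so fitting and scoring a model once per fold gives k metrics that can then be averaged.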
Syntax
The kfold_cv parameter is applicable to all classification and regression algorithms, and you can append the parameter to the end of an SPL search.
Here kfold_cv=<int> specifies that k=<int> folds are used. When you specify a classification algorithm, stratified k-fold is used instead of k-fold. In stratified k-fold, each fold contains approximately the same percentage of samples for each class.
... | fit <classification | regression algo> <targetVariable> from <featureVariables> [options] kfold_cv=<int>
The kfold_cv parameter cannot be used when saving a model.
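Stratification can be pictured as dealing the samples of each class across the folds in turn. The sketch below is a simplified, deterministic illustration in plain Python. The function name stratified_folds is hypothetical, and a real implementation would typically also shuffle within each class.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, dealing the samples of
    every class out round-robin so class proportions stay similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for position, idx in enumerate(by_class[label]):
            folds[position % k].append(idx)
    return folds


labels = ["a"] * 6 + ["b"] * 3
for fold in stratified_folds(labels, k=3):
    print([labels[i] for i in fold])
```

In this example every fold receives two samples of class a and one of class b, mirroring the 2:1 ratio of the full set.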
Output
The kfold_cv parameter returns performance metrics on each fold using the same model specified in the SPL, including algorithm and hyperparameters. Its only function is to give you insight into how well your model generalizes. It does not perform any model selection or hyperparameter tuning.
Examples
The first example shows the kfold_cv parameter used in classification, where the output is a set of metrics for each fold including accuracy, f1_weighted, precision_weighted, and recall_weighted.
The second example shows the kfold_cv parameter used in regression, where the output is a set of metrics for each fold including neg_mean_squared_error and r^2.
Feature Extraction
Feature extraction algorithms transform fields for better prediction accuracy.
FieldSelector
The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests. For descriptions of the mode and param parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html.
Parameters
The type parameter specifies if the field to predict is categorical or numeric.
Syntax
fit FieldSelector <field_to_predict> from <explanatory_fields> [into <model name>] [type=<categorical, numeric>] [mode=<k_best, fpr, fdr, fwe, percentile>] [param=<int>]
You can save FieldSelector models using the into keyword and apply new data later using the apply command.
... | apply sla_model
You can inspect the model learned by FieldSelector with the summary command.
| summary sla_model
Example
The following example uses FieldSelector on a test set.
... | fit FieldSelector type=categorical SLA_violation from * into sla_model | ...
HashingVectorizer
The HashingVectorizer algorithm converts text documents to a matrix of token occurrences. It uses a feature hashing strategy to allow for hash collisions when measuring the occurrence of tokens. It is a stateless transformer, meaning that it does not require building a vocabulary of the seen tokens. This reduces the memory footprint and allows for larger feature spaces.
HashingVectorizer is comparable with the TFIDF algorithm, as they share many of the same parameters. However, HashingVectorizer is a better option for building models with large text fields, provided you do not need to know term frequencies and only want outcomes.
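The stateless nature of feature hashing can be shown with a small sketch. It is written in plain Python and uses MD5 from hashlib purely as a convenient deterministic hash. That choice, the function name hash_tokens, and the tiny n_features value are assumptions for illustration and do not reflect the hash function or defaults used by scikit-learn.

```python
import hashlib


def hash_tokens(text, n_features=16):
    """Count tokens into a fixed-length vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_features
        vector[bucket] += 1  # distinct tokens may collide in the same bucket
    return vector


print(hash_tokens("login failed login ok"))
```

Because the bucket depends only on the token itself, no dictionary of seen tokens is kept, which is what keeps the memory footprint small. The cost is that the original token behind a bucket cannot be recovered.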
For descriptions of the max_features, max_df, min_df, ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at https://scikit-learn.org/0.19/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Parameters
 The reduce parameter is either True or False and determines whether or not to reduce the output to a smaller dimension using TruncatedSVD. The default is True.
 The k=<int> parameter sets the number of dimensions to reduce to when the reduce parameter is set to True. The default is 100.
 The default for the max_features parameter is 10,000.
 The n_iters parameter specifies the number of iterations to use when performing dimensionality reduction. This is only used when the reduce parameter is set to True. The default is 5.
Syntax
fit HashingVectorizer <field_to_convert> [max_features=<int>] [n_iters=<int>] [reduce=<bool>] [k=<int>] [ngram_range=<int>-<int>] [analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]
Syntax constraints
HashingVectorizer does not support saving models, incremental fit, or K-fold cross-validation.
Example
The following example uses HashingVectorizer to hash the text dataset and applies KMeans clustering (where k=3) on the hashed fields.
| inputlookup authorization.csv | fit HashingVectorizer Logs ngram_range=1-2 k=50 stop_words=english | fit KMeans Logs_hashed* k=3 | fields cluster* Logs | sample 5 by cluster | sort by cluster
ICA
ICA (Independent component analysis) separates a multivariate signal into additive subcomponents that are maximally independent. Typically, ICA is not used for reducing dimensionality, but for separating superimposed signals. Because the ICA model does not include a noise term, whitening must be applied for the model to be correct. Whitening can be done internally using the whiten argument, or manually using one of the PCA variants.
Parameters
 The n_components parameter determines the number of components ICA uses. This parameter is optional. The default is None, in which case all components are used.
 Use the algorithm parameter to apply the parallel or deflation algorithm for FastICA. The default is algorithm='parallel'.
 Use the whiten parameter to control whether whitening is applied. This parameter is optional. If the whiten parameter is False, no whitening is performed. The default is True.
 The max_iter parameter determines the maximum number of iterations during the running of the fit command. This parameter is optional. The default is 200.
 The fun parameter determines the functional form of the G function used in the approximation to negentropy. This parameter is optional. The default is logcosh. Other options for this parameter are exp or cube.
 The tol parameter sets the tolerance on update at each iteration. This parameter is optional. The default is 0.0001.
 The random_state parameter sets the seed value used by the random number generator. The default is None. If random_state=None, a random seed value is used.
Syntax
fit ICA n_components=<int>, algorithm=<"parallel"|"deflation">, whiten=<bool>, fun=<"logcosh"|"exp"|"cube">, max_iter=<int>, tol=<float>, random_state=<int> <explanatory_fields> [into <model name>]
You can save ICA models using the into keyword and apply new data later using the apply command.
Syntax constraints
You cannot inspect the model learned by ICA with the summary command.
Example
The following example shows how ICA is able to find the two original sources of data from two measurements that have mixes of both. As a comparison, PCA is used to show the difference between the two. PCA is not able to identify the original sources.
| makeresults count=2 | streamstats count as count | eval time=case(count=2,relative_time(now(),"+2d"),count=1,now()) | makecontinuous time span=15m | eval _time=time | eval s1 = sin(2*time) | eval s2 = sin(4*time) | eval m1 = 1.5*s1 + .5*s2, m2 = .1*s1 + s2 | fit ICA m1, m2 n_components=2 as IC | fit PCA m1, m2 k=2 as PC | fields _time, * | fields - count, time
KernelPCA
The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of data. The difference between KernelPCA and PCA is the use of kernels in the former, which helps with finding nonlinear dependencies among the fields. Currently, KernelPCA only supports the Radial Basis Function (rbf) kernel.
For descriptions of the gamma, degree, tolerance, and max_iteration parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html.
Kernel-based methods such as KernelPCA tend to work best when the data is scaled, for example, using our StandardScaler algorithm: | fit StandardScaler. For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Parameters
The k parameter specifies the number of features to be extracted from the data. The other parameters are for fine-tuning of the kernel.
Syntax
fit KernelPCA <fields> [into <model name>] [degree=<int>] [k=<int>] [gamma=<int>] [tolerance=<int>] [max_iteration=<int>]
You can save KernelPCA models using the into keyword and apply new data later using the apply command.
... | apply user_feedback_model
Syntax constraints
You cannot inspect the model learned by KernelPCA with the summary command.
Example
The following example uses KernelPCA on a test set.
... | fit KernelPCA * k=3 gamma=0.001 | ...
PCA
The Principal Component Analysis (PCA) algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new, uncorrelated features out of the data.
Parameters
 The k parameter specifies the number of features to be extracted from the data.
 The variance parameter is short for percentage variance ratio explained. This parameter determines the percentage of variance ratio explained in the principal components of the PCA. It computes the number of principal components dynamically by preserving the specified variance ratio.
 The variance parameter defaults to 1 if k is not provided.
 The variance parameter can take a value between 0 and 1.
 The component name parameter represents the name of the selected components from the value specified in n_components.
 The explained_variance parameter measures the proportion to which the principal component accounts for dispersion of a given dataset. A higher value denotes a higher variation.
 The explained_variance_ratio parameter is the percentage of variance explained by each of the selected components.
 The singular_values parameter represents the singular values corresponding to each of the selected components. Singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
Syntax
fit PCA <fields> [into <model name>] [k=<int>] [variance=<float>]
You can save PCA models using the into keyword and apply new data later using the apply command.
... into example_hard_drives_PCA_2 | apply example_hard_drives_PCA_2
You can inspect the model learned by PCA with the summary command.
| summary example_hard_drives_PCA_2
Syntax constraints
The variance parameter and k parameter cannot be used together. They are mutually exclusive.
Examples
The following example uses PCA on a test set.
... | fit PCA "SS_SMART_1_Raw", "SS_SMART_2_Raw", "SS_SMART_3_Raw", "SS_SMART_4_Raw", "SS_SMART_5_Raw" k=2 into example_hard_drives_PCA_2
The following example includes the variance parameter. The value variance=0.5 tells the algorithm to choose as many principal components as are needed to explain 50% of the variance in the original dataset.
... | fit PCA "SS_SMART_1_Raw", "SS_SMART_2_Raw", "SS_SMART_3_Raw", "SS_SMART_4_Raw", "SS_SMART_5_Raw" variance=0.50 into example_hard_drives_PCA_2
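How a variance target translates into a number of components can be sketched as a running sum over the per-component explained variance ratios. The function name components_for_variance and the ratio values below are made up for illustration. This is a reading of the behavior described above, not the toolkit's code.

```python
def components_for_variance(explained_variance_ratio, variance):
    """Return the smallest number of leading components whose explained
    variance ratios sum to at least the requested variance."""
    total = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        total += ratio
        if total >= variance:
            return count
    return len(explained_variance_ratio)


# Hypothetical ratios for five components, sorted from largest to smallest.
ratios = [0.45, 0.30, 0.15, 0.07, 0.03]
print(components_for_variance(ratios, variance=0.5))
```

With these hypothetical ratios, the first component alone explains 45%, so two components are needed to reach the 50% target.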
TFIDF
The TFIDF algorithm uses the scikit-learn TfidfVectorizer to convert raw text data into a matrix, making it possible to use other machine learning estimators on the data. For descriptions of the max_features, max_df, min_df, ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
TFIDF uses memory to create a dictionary of all terms including n-grams and words, and expands the Splunk search events with additional fields per event. If you are concerned with memory limits, consider using the HashingVectorizer algorithm.
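The dictionary of terms mentioned above is visible in a minimal TF-IDF sketch. This version uses the textbook weighting tf * ln(N / df) in plain Python. It is an assumption for illustration only, since scikit-learn's TfidfVectorizer applies a smoothed idf and normalization by default. The function name tfidf is hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # The document-frequency dictionary grows with every distinct term seen.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights


for row in tfidf(["login failed", "login ok"]):
    print(row)
```

A term that appears in every document, such as login here, gets a weight of zero, while terms unique to one document are weighted up.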
Parameters
The default for max_features is 100.
To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set stop_words to english. For other languages (for example, machine language) you can ignore the common words by setting max_df to a value greater than or equal to 0.7 and less than 1.0.
Syntax
fit TFIDF <field_to_convert> [into <model name>] [max_features=<int>] [max_df=<int>] [min_df=<int>] [ngram_range=<int>-<int>] [analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]
You can save TFIDF models using the into keyword and apply new data later using the apply command.
... | apply user_feedback_model
Syntax constraints
You cannot inspect the model learned by TFIDF with the summary command.
Example
The following example uses TFIDF to convert the text dataset to a matrix of TF-IDF features and then applies KMeans clustering (where k=3) on the matrix.
| inputlookup authorization.csv | fit TFIDF Logs ngram_range=1-2 max_df=0.6 min_df=0.2 stop_words=english | fit KMeans Logs_tfidf* k=3 | fields cluster* Logs | sample 6 by cluster | sort by cluster
Preprocessing (Prepare Data)
Preprocessing algorithms are used for preparing data. Other algorithms can also be used for preprocessing that may not be organized under this section. For example, PCA can be used for both Feature Extraction and Preprocessing.
Imputer
The Imputer algorithm is a preprocessing step wherein missing data is replaced with substitute values. The substitute values can be estimated, or based on other statistics or values in the dataset. To use Imputer, the user passes in the names of the fields to impute, along with arguments specifying the imputation strategy, and the values representing missing data. Imputer then adds new imputed versions of those fields to the data, which are copies of the original fields, except that their missing values are replaced by values computed according to the imputation strategy.
Parameters
 Available imputation strategies include mean, median, most frequent, and field. The default strategy is mean.
 All but the field strategy require numeric data. The field strategy accepts categorical data.
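The default mean strategy can be illustrated with a few lines of plain Python. The function name impute_mean is hypothetical, and None stands in here for a missing value.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


print(impute_mean([1.0, None, 3.0]))
```

The median and most frequent strategies follow the same pattern, with a different statistic computed from the observed values.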
Syntax
... | fit Imputer <field>* [as <field prefix>] [missing_values=<"NaN"|integer>] [strategy=<mean|median|most_frequent>] [into <model name>]
You can inspect the value (mean, median, or mode) that was substituted for missing values by Imputer with the summary command.
... | summary <imputer model name>
You can save Imputer models using the into keyword and apply new data later using the apply command.
... | apply <imputer model name>
Example
The following example uses Imputer on a test set.
| inputlookup server_power.csv | eval ac_power_missing=if(random() % 3 = 0, null, ac_power) | fields - ac_power | fit Imputer ac_power_missing | eval imputed=if(isnull(ac_power_missing), 1, 0) | eval ac_power_imputed=round(Imputed_ac_power_missing, 1) | fields - ac_power_missing, Imputed_ac_power_missing
RobustScaler
The RobustScaler algorithm uses the scikit-learn RobustScaler algorithm to standardize data fields by scaling their median and interquartile range to 0 and 1, respectively. It is very similar to the StandardScaler algorithm, in that it helps avoid dominance of one or more fields over others in subsequent machine learning algorithms, and is practically required for some algorithms, such as KernelPCA and SVM. The main difference between StandardScaler and RobustScaler is that RobustScaler is less sensitive to outliers.
Parameters
The with_centering and with_scaling parameters specify if the fields should be standardized with respect to their median and interquartile range.
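The effect of the two parameters can be sketched in plain Python. The function name robust_scale is hypothetical. The quartiles are computed with linear interpolation, which is an assumption about how the interquartile range is obtained.

```python
from statistics import median, quantiles


def robust_scale(values, with_centering=True, with_scaling=True):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values) if with_centering else 0.0
    scale = (q3 - q1) if with_scaling else 1.0
    return [(v - center) / scale for v in values]


# The outlier 100 does not distort the scaling of the other values.
print(robust_scale([1, 2, 3, 4, 100]))
```

Because the median and interquartile range ignore extreme values, the four ordinary points land in a small range around zero while the outlier stays far away.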
Syntax
fit RobustScaler <fields> [into <model name>] [with_centering=<true|false>] [with_scaling=<true|false>]
You can save RobustScaler models using the into keyword and apply new data later using the apply command.
... | apply scaling_model
You can inspect the statistics extracted by RobustScaler with the summary command.
... | summary scaling_model
Syntax constraints
RobustScaler does not support incremental fit.
Example
The following example uses RobustScaler on a test set.
... | fit RobustScaler * | ...
StandardScaler
The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize data fields by scaling their mean and standard deviation to 0 and 1, respectively. This preprocessing step helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms. This step is practically required for some algorithms, such as KernelPCA and SVM. This algorithm supports incremental fit.
Parameters
 The with_mean and with_std parameters specify if the fields should be standardized with respect to their mean and standard deviation.
 The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set. The default is False.
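Incremental fitting works for standardization because the mean and variance can be updated from new data alone. The sketch below uses Welford's online update in plain Python. The class name RunningScaler is hypothetical, and this illustrates the idea rather than the toolkit's implementation.

```python
import math


class RunningScaler:
    """Tracks count, mean, and sum of squared deviations so that new
    batches can be folded in without revisiting earlier data."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def partial_fit(self, batch):
        for x in batch:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return self

    def transform(self, values):
        std = math.sqrt(self.m2 / self.n)  # population standard deviation
        return [(v - self.mean) / std for v in values]


scaler = RunningScaler().partial_fit([1, 2, 3]).partial_fit([4, 5])
print(scaler.mean, scaler.transform([1, 3, 5]))
```

After the second batch, the statistics equal those of a single fit on all five values, which is what makes retraining on the full data set unnecessary.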
Syntax
fit StandardScaler <fields> [into <model name>] [with_mean=<true|false>] [with_std=<true|false>] [partial_fit=<true|false>]
You can save StandardScaler models using the into keyword and apply new data later using the apply command.
... | apply scaling_model
You can inspect the statistics extracted by StandardScaler with the summary command.
... | summary scaling_model
Syntax constraints
 Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
 If partial_fit=false or partial_fit is not specified (default is false), the specified model is created and replaces the pretrained model if one exists.
 If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
 If My_Incremental_Model exists and was trained using StandardScaler, the command updates the existing model with the new input.
 If My_Incremental_Model exists but was not trained by StandardScaler, an error message is thrown.
Examples
The following example uses StandardScaler on a test set.
... | fit StandardScaler * | ...
The following example includes the partial_fit parameter.
| inputlookup track_day.csv | fit StandardScaler "batteryVoltage", "engineCoolantTemperature", "engineSpeed" partial_fit=true into My_Incremental_Model
Regressors
Regressor algorithms predict the value of a numeric field.
K-fold cross-validation, through the kfold_cv parameter, can be used with all Regressor algorithms. For details, see the K-fold cross-validation section.
DecisionTreeRegressor
The DecisionTreeRegressor algorithm uses the scikit-learn DecisionTreeRegressor estimator to fit a model to predict the value of numeric fields. For descriptions of the max_depth, random_state, max_features, min_samples_split, max_leaf_nodes, and splitter parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html.
Parameters
To specify the maximum depth of the tree to summarize, use the limit argument. The default value for the limit argument is 5.
| summary model_DTR limit=10
Syntax
fit DecisionTreeRegressor <field_to_predict> from <explanatory_fields> [into <model_name>] [max_depth=<int>] [max_features=<str>] [min_samples_split=<int>] [random_state=<int>] [max_leaf_nodes=<int>] [splitter=<best|random>]
You can save DecisionTreeRegressor models using the into keyword and apply them to new data later using the apply command.
... | apply model_DTR
You can inspect the decision tree learned by DecisionTreeRegressor with the summary command.
... | summary model_DTR
You can get a JSON representation of the tree by giving json=t as an argument to the summary command.
... | summary model_DTR json=t
Example
The following example uses DecisionTreeRegressor on a test set.
... | fit DecisionTreeRegressor temperature from date_month date_hour into temperature_model | ...
ElasticNet
The ElasticNet algorithm uses the scikit-learn ElasticNet estimator to fit a model to predict the value of numeric fields. ElasticNet is a linear regression model that includes both L1 and L2 regularization and is a generalization of Lasso and Ridge.
For descriptions of the fit_intercept, normalize, alpha, and l1_ratio parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html.
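To make the roles of alpha and l1_ratio concrete, the sketch below computes the regularization term that scikit-learn documents for its ElasticNet estimator: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The function name elastic_net_penalty is made up for illustration and is not part of MLTK or scikit-learn.

```python
def elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5):
    """Regularization term added to the squared-error loss (illustrative helper)."""
    l1 = sum(abs(w) for w in coefs)     # L1 norm, the Lasso-style part
    l2_sq = sum(w * w for w in coefs)   # squared L2 norm, the Ridge-style part
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_sq


coefs = [3.0, -4.0]
print(elastic_net_penalty(coefs, alpha=0.5, l1_ratio=1.0))  # L1 term only
print(elastic_net_penalty(coefs, alpha=0.5, l1_ratio=0.0))  # L2 term only
```

Setting l1_ratio to 1 leaves only the L1 term, as in Lasso, and setting it to 0 leaves only the L2 term, as in Ridge. Values in between mix the two.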
Syntax
fit ElasticNet <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>] [alpha=<float>] [l1_ratio=<float>]
You can save ElasticNet models using the into keyword and apply them to new data later using the apply command.
... | apply temperature_model
You can inspect the coefficients learned by ElasticNet with the summary command.
... | summary temperature_model
Example
The following example uses ElasticNet on a test set.
... | fit ElasticNet temperature from date_month date_hour normalize=true alpha=0.5 | ...
GradientBoostingRegressor
The GradientBoostingRegressor algorithm uses the scikit-learn GradientBoostingRegressor estimator to build a regression model by fitting regression trees on the negative gradient of a loss function. For further information, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html.
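The phrase "fitting regression trees on the negative gradient of a loss function" is easier to follow with a toy version. For squared-error loss the negative gradient is simply the residual, so each round fits a small tree to what the current model still gets wrong. The sketch below is a simplified illustration, not the scikit-learn implementation: it uses a single explanatory field, one-split trees, and hypothetical helper names fit_stump and boost, and it assumes the field has at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Gradient boosting for squared-error loss with one-split trees."""
    base = sum(ys) / len(ys)            # initial prediction: the mean of y
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


model = boost([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0], n_estimators=100)
print(model(1), model(6))
```

The n_estimators and learning_rate arguments play the same role as the parameters of the same name in the syntax: more rounds and a larger step both let the model fit the training data more closely.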
Syntax
fit GradientBoostingRegressor <field_to_predict> from <explanatory_fields> [into <model_name>] [loss=<ls|lad|huber|quantile>] [max_features=<str>] [learning_rate=<float>] [min_weight_fraction_leaf=<float>] [alpha=<float>] [subsample=<float>] [n_estimators=<int>] [max_depth=<int>] [min_samples_split=<int>] [min_samples_leaf=<int>] [max_leaf_nodes=<int>] [random_state=<int>]
You can use the apply command to apply the trained model to new data.
... | apply temperature_model
You can inspect the features learned by GradientBoostingRegressor with the summary command.
... | summary temperature_model
Example
The following example uses the GradientBoostingRegressor algorithm to fit a model and saves that model as temperature_model.
... | fit GradientBoostingRegressor temperature from date_month date_hour into temperature_model | ...
KernelRidge
The KernelRidge algorithm uses the scikit-learn KernelRidge algorithm to fit a model to predict numeric fields. This algorithm uses the radial basis function (rbf) kernel by default. For details, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html.
Parameters
The gamma parameter controls the width of the rbf kernel. The default value is 1 / number of fields.
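As a rough illustration of what gamma does, the sketch below evaluates the rbf kernel exp(-gamma * ||x - y||^2) for two rows of explanatory fields. The function rbf_kernel here is a hand-written stand-in for illustration, not a call into MLTK or scikit-learn, and it falls back to 1 / number of fields when gamma is omitted, matching the default described above.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """RBF similarity exp(-gamma * ||x - y||^2) between two equal-length vectors."""
    if gamma is None:
        gamma = 1.0 / len(x)   # default described above: 1 / number of fields
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


a, b = [0.0, 0.0], [1.0, 1.0]
for g in (0.1, 0.5, 5.0):
    print(g, rbf_kernel(a, b, gamma=g))
```

A larger gamma makes the similarity fall off faster with distance, so the kernel becomes narrower and the fitted model follows nearby training points more closely.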
Syntax
fit KernelRidge <field_to_predict> from <explanatory_fields> [into <model_name>] [gamma=<float>]
You can save KernelRidge models using the into keyword and apply them to new data later using the apply command.
... | apply sla_model
Syntax constraints
You cannot inspect the model learned by KernelRidge with the summary command.
Example
The following example uses KernelRidge on a test set.
... | fit KernelRidge temperature from date_month date_hour into temperature_model | ...
Lasso
The Lasso algorithm uses the scikit-learn Lasso estimator to fit a model to predict the value of numeric fields. Lasso is like LinearRegression, but it uses L1 regularization to learn a linear model with fewer and smaller coefficients. Lasso models are consequently more robust to noise and resilient against overfitting.
For descriptions of the alpha, fit_intercept, and normalize parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.
Parameters
The alpha parameter controls the degree of L1 regularization.
The fit_intercept parameter specifies whether the model should include an implicit intercept term. The default value is True.
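One way to see why L1 regularization produces fewer coefficients is the soft-thresholding step used by coordinate descent solvers for the Lasso problem. The sketch below shows only that step, with a hypothetical helper name soft_threshold. The threshold grows with alpha, so a stronger penalty pushes more coefficients to exactly zero.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


# Coefficients with a small magnitude are removed, larger ones are shrunk.
raw = [3.0, -0.4, 0.9, -2.5]
print([soft_threshold(w, 1.0) for w in raw])
```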
Syntax
fit Lasso <field_to_predict> from <explanatory_fields> [into <model name>] [alpha=<float>] [fit_intercept=<true|false>] [normalize=<true|false>]
You can save Lasso models using the into keyword and apply them to new data later using the apply command.
... | apply temperature_model
You can inspect the coefficients learned by Lasso with the summary command.
... | summary temperature_model
Example
The following example uses Lasso on a test set.
... | fit Lasso temperature from date_month date_hour | ...
LinearRegression
The LinearRegression algorithm uses the scikit-learn LinearRegression estimator to fit a model to predict the value of numeric fields.
Parameters
The fit_intercept parameter specifies whether the model should include an implicit intercept term. The default value is True.
Syntax
fit LinearRegression <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>]
You can save LinearRegression models using the into keyword and apply them to new data later using the apply command.
... | apply temperature_model
You can inspect the coefficients learned by LinearRegression with the summary command.
... | summary temperature_model
Example
The following example uses LinearRegression on a test set.
... | fit LinearRegression temperature from date_month date_hour into temperature_model | ...
RandomForestRegressor
The RandomForestRegressor algorithm uses the scikit-learn RandomForestRegressor estimator to fit a model to predict the value of numeric fields. For descriptions of the n_estimators, random_state, max_depth, max_features, min_samples_split, and max_leaf_nodes parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
Syntax
fit RandomForestRegressor <field_to_predict> from <explanatory_fields> [into <model name>] [n_estimators=<int>] [max_depth=<int>] [random_state=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]
You can save RandomForestRegressor models using the into keyword and apply them to new data later using the apply command.
... | apply temperature_model
You can list the features that were used to fit the model, as well as their relative importance or influence, with the summary command.
... | summary temperature_model
Example
The following example uses RandomForestRegressor on a test set.
... | fit RandomForestRegressor temperature from date_month date_hour into temperature_model | ...
Ridge
The Ridge algorithm uses the scikit-learn Ridge estimator to fit a model to predict the value of numeric fields. Ridge is like LinearRegression, but it uses L2 regularization to learn a linear model with smaller coefficients, making the algorithm more robust to collinearity. For descriptions of the fit_intercept, normalize, and alpha parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html.
Parameters
The alpha parameter specifies the degree of regularization. The default value is 1.0.
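The effect of alpha can be seen in the simplest possible case. For a single explanatory field and no intercept, minimizing the squared error plus alpha times the squared coefficient has the closed-form solution sum(x*y) / (sum(x*x) + alpha). The sketch below evaluates it for a few alpha values. The helper name ridge_slope is illustrative only and is not an MLTK or scikit-learn function.

```python
def ridge_slope(xs, ys, alpha=1.0):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x
for alpha in (0.0, 1.0, 10.0):
    print(alpha, ridge_slope(xs, ys, alpha))
```

With alpha=0 the result is the ordinary least squares slope. As alpha grows, the coefficient shrinks toward zero.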
Syntax
fit Ridge <field_to_predict> from <explanatory_fields> [into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>] [alpha=<float>]
You can save Ridge models using the into keyword and apply them to new data later using the apply command.
... | apply temperature_model
You can inspect the coefficients learned by Ridge with the summary command.
... | summary temperature_model
Example
The following example uses Ridge on a test set.
... | fit Ridge temperature from date_month date_hour normalize=true alpha=0.5 | ...
SGDRegressor
The SGDRegressor algorithm uses the scikit-learn SGDRegressor estimator to fit a model to predict the value of numeric fields. This algorithm supports incremental fit.
Parameters
The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set. The default is False.
The fit_intercept=<true|false> parameter determines whether the intercept should be estimated or not. The default is True.
The n_iter=<int> parameter is the number of passes over the training data, also known as epochs. The default is 5. The number of iterations is set to 1 if using partial_fit.
The penalty=<l2|l1|elasticnet> parameter sets the penalty or regularization term to be used. The default is l2.
The learning_rate=<constant|optimal|invscaling> parameter sets the learning rate schedule. The default is invscaling.
  constant: eta = eta0
  optimal: eta = 1.0/(alpha * t)
  invscaling: eta = eta0 / pow(t, power_t)
The l1_ratio=<float> parameter is the Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. The default is 0.15.
  l1_ratio=0 corresponds to the L2 penalty
  l1_ratio=1 corresponds to the L1 penalty
The alpha=<float> parameter is the constant that multiplies the regularization term. The default is 0.0001. It is also used to compute learning_rate when set to optimal.
The eta0=<float> parameter is the initial learning rate. The default is 0.01.
The power_t=<float> parameter is the exponent for the inverse scaling learning rate. The default is 0.25.
The random_state=<int> parameter is the seed of the pseudo random number generator to use when shuffling the data.
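The three learning_rate schedules can be compared directly by evaluating the formulas given in the parameter list. The sketch below does only that, using the documented defaults for eta0, alpha, and power_t. The function name learning_rate is a stand-in for illustration, and the actual scikit-learn implementation may differ in details such as how the iteration counter t is offset.

```python
def learning_rate(schedule, t, eta0=0.01, alpha=0.0001, power_t=0.25):
    """Step size at iteration t (t >= 1) for the three schedules listed above."""
    if schedule == "constant":
        return eta0
    if schedule == "optimal":
        return 1.0 / (alpha * t)
    if schedule == "invscaling":
        return eta0 / pow(t, power_t)
    raise ValueError(f"unknown schedule: {schedule}")


for t in (1, 16, 10000):
    print(t, learning_rate("constant", t), learning_rate("invscaling", t))
```

The constant schedule never changes, while invscaling starts at eta0 and decays slowly as t grows, at a rate set by power_t.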
Syntax
fit SGDRegressor <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>] [fit_intercept=<true|false>] [random_state=<int>] [n_iter=<int>] [l1_ratio=<float>] [alpha=<float>] [eta0=<float>] [power_t=<float>] [penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>]
You can save SGDRegressor models using the into keyword and apply them to new data later using the apply command.
... | apply temperature_model
You can inspect the coefficients learned by SGDRegressor with the summary command.
... | summary temperature_model
Syntax constraints
If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
If My_Incremental_Model exists and was trained using SGDRegressor, the command updates the existing model with the new input.
If My_Incremental_Model exists but was not trained by SGDRegressor, an error message displays.
Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
If partial_fit=false or partial_fit is not specified, the specified model is created and replaces the pre-trained model if one exists.
Examples
The following example uses SGDRegressor on a test set.
... | fit SGDRegressor temperature from date_month date_hour into temperature_model | ...
The following example includes the partial_fit parameter.
| inputlookup server_power.csv | fit SGDRegressor "ac_power" from "totalcpuutilization" "totaldiskaccesses" partial_fit=true into My_Incremental_Model
Time Series Analysis
Forecasting algorithms, also known as time series analysis, provide methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data, and forecast its future values.
ARIMA
The Autoregressive Integrated Moving Average (ARIMA) algorithm uses the StatsModels ARIMA algorithm to fit a model on a time series for better understanding and/or forecasting its future values. An ARIMA model can consist of autoregressive terms, moving average terms, and differencing operations. The autoregressive terms express the dependency of the current value of the time series on its previous values.
The moving average terms, also called random shocks or white noise, model the effect of previous forecast errors on the current value. If the time series is non-stationary, differencing operations are used to make it stationary. A stationary process is a stochastic process whose probability distribution does not change over time.
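A differencing operation replaces each value with its change from the previous value. The short sketch below shows the idea in plain Python. The helper name difference is an illustration and not part of MLTK or StatsModels. Applying it twice to a series with a curved trend leaves a constant series, which is the kind of transformation the D term of an ARIMA order requests.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: y[t] = x[t] - x[t-1]."""
    for _ in range(d):
        series = [curr - prev for prev, curr in zip(series, series[1:])]
    return series


trend = [1, 3, 6, 10, 15]
print(difference(trend))        # one round of differencing
print(difference(trend, d=2))   # two rounds of differencing
```

Each round shortens the series by one point, because the first value has no predecessor to subtract.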
See the StatsModels documentation at http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.arima_model.ARIMA.html for more information.
It is highly recommended to send the time series through timechart before sending it into ARIMA to avoid non-uniform sampling time. If _time is not to be specified, using timechart is not necessary.
Parameters
The time series should not have any gaps or missing data; otherwise ARIMA returns an error. If there are missing samples in the data, using a bigger span in timechart or using streamstats to fill in the gaps with average values can resolve the problem.
When chaining ARIMA output to another algorithm (for example, ARIMA itself), keep in mind that the length of the data is the length of the original data + forecast_k. If you want to maintain the holdback position, you need to add the number in forecast_k to your holdback value.
ARIMA requires the order parameter to be specified at fitting time. The order parameter needs three values:
  Number of autoregressive (AR) parameters
  Number of differencing operations (D)
  Number of moving average (MA) parameters
The forecast_k=<int> parameter tells ARIMA how many points into the future should be forecasted. If _time is specified during fitting along with the field_to_forecast, ARIMA also generates the timestamps for forecasted values. By default, forecast_k is zero.
The conf_interval=<1..99> parameter is the confidence interval in percentage around forecasted values. By default it is set to 95%.
The holdback=<int> parameter is the number of data points held back from the ARIMA model. This is useful for comparing the forecast against known data points. By default, holdback is zero.
Syntax
fit ARIMA [_time] <field_to_forecast> order=<int>-<int>-<int> [forecast_k=<int>] [conf_interval=<int>] [holdback=<int>]
Syntax constraints
 ARIMA supports one time series at a time.
 ARIMA models cannot be saved and used at a later time in the current version.
Example
The following example uses ARIMA on a test set.
... | fit ARIMA Voltage order=4-0-1 holdback=10 forecast_k=10
StateSpaceForecast
StateSpaceForecast is a forecasting algorithm for time series in MLTK. It is based on Kalman filters. The algorithm supports incremental fit.
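For readers unfamiliar with Kalman filters, the sketch below shows the predict and update cycle on the simplest state space model, a scalar "local level" that is assumed to drift slowly. It is a generic textbook illustration and not the model MLTK uses, which also handles periodicity, special days, and multiple fields. The function name and the process_var and obs_var noise settings are assumptions chosen for the example.

```python
def kalman_local_level(observations, process_var=1e-3, obs_var=1.0):
    """Filter a scalar series with a random-walk (local level) state model."""
    estimate = observations[0]   # initial state guess: the first observation
    variance = 1.0               # initial uncertainty about that guess
    filtered = []
    for z in observations:
        # Predict: the level is assumed to stay put, but uncertainty grows.
        variance += process_var
        # Update: blend the prediction with the new observation.
        gain = variance / (variance + obs_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        filtered.append(estimate)
    return filtered


noisy = [10.0, 12.0, 8.0, 11.0, 9.0, 10.0]
print(kalman_local_level(noisy))
```

Because the state is updated one observation at a time, a filter of this kind can absorb new data without revisiting old data, which is what makes incremental fitting natural for state space models. For this simple model, the forecast for any future point is the last filtered estimate.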
Advantages of StateSpaceForecast over ARIMA include:
Persists models created using the fit command that can then be used with apply.
A specialdays field allows you to account for the effects of a specified list of special days.
It is automatic in that you do not need to choose parameters or mode.
Supports multivariate forecasting.
Parameters
By default, the historical data results from running the fit command are not shown. To modify this behavior, set output_fit=True.
Use the target field to specify fields from which to forecast using historical data and other values.
The target field is a comma-separated list of fields that can be univariate or multivariate. These fields must be specified during the fit process.
Optionally use the target field to fit multiple fields during the fit process but apply only a selection of those target fields during the apply process.
If the target field is not specified, then all fields will be forecast together using historical data.
The specialdays field specifies the field that indicates effects due to special days such as holidays.
The specialdays field values must be numeric and are typically 0 and 1, with 1 indicating the existence of a special day effect. Null values are treated as 0.
The majority of use cases have no specialdays. Events that occur regularly and frequently, such as weekends, should not be treated as specialdays. As best practice, use a minimum of two and a maximum of five instances of the specialdays to capture events such as holiday sales.
Use specialdays in the apply step if it has been specified during fit. The same field(s) must be assigned.
Use the period parameter to specify if your data has a known periodicity.
If the period parameter is not specified, it is computed automatically.
Set period=1 to treat the time series as non-periodic.
As with other MLTK algorithms, the partial_fit parameter controls whether a model should be incrementally updated or not. This allows you to update a model using only new data without having to retrain the model on the full dataset.
The default for partial_fit is False.
Use update_last to modify the behavior of partial_fit.
The default for update_last is False.
If partial_fit=True, StateSpaceForecast will first update the model parameters and then predict.
If partial_fit=True and update_last=True, StateSpaceForecast will first predict and then update the model parameters. This allows you to review the forecast before running new data through.
The holdback=<int> parameter is the number of data points held back from training. This is useful for comparing the forecast against known data points. The default holdback value is 0.
If you want to maintain the holdback position, add the position number in forecast_k to your holdback value.
The forecast_k=<int> parameter tells StateSpaceForecast how many points into the future should be forecasted. If _time is specified during fitting along with the field_to_forecast, StateSpaceForecast also generates the timestamps for forecasted values. The default forecast_k value is 0.
The conf_interval=<1..99> parameter is the confidence interval in percentage around forecasted values. Input an integer between 1 and 99 where a larger number means a greater tolerance for forecast uncertainty. The default integer is 95.
The as field gives aliases for forecasted fields.
In univariate cases the as field fieldlist is a single field name.
In multivariate cases, the as field adheres to the following conventions:
  The list must be in double quotes, separated by either spaces or commas.
  The aliases correspond to the original fields in the given order.
  The number of aliases can be smaller than the number of original fields.
The summary command lists the names of the fields used in the fit command step, the name of the specialdays field, and the period.
Syntax
| fit StateSpaceForecast <fields> [from *] [specialdays=<field name>] [holdback=<int>] [forecast_k=<int>] [conf_interval=<float>] [period=<int>] [partial_fit=<true|false>] [update_last=<true|false>] [output_fit=<true|false>] [into <model name>] [as <fieldlist>]
You can apply the saved model to new data with the apply command.
| apply <model name> [specialdays=<field name>] [target=<fields>] [holdback=<int | timerange>] [forecast_k=<int>] [conf_interval=<float>]
You can inspect the model learned by StateSpaceForecast with the summary command.
| summary <model name>
Syntax constraints
For univariate analysis the fields parameter is a single field, but for multivariate analysis it is a list of fields.
For multivariate analysis, only one specialdays field can be specified and it applies to all the fields.
The specialdays field values must be numeric.
Null values in the specialdays field are treated as 0.
Double quotes are required around field lists.
Examples
The following is a univariate example of StateSpaceForecast on a test set. The example is considered univariate as there is only a single field following fit StateSpaceForecast.
| inputlookup milk2.csv | fit StateSpaceForecast milk_production from * specialdays=holiday into milk_model | apply milk_model specialdays=holiday forecast_k=30
The following is a multivariate example of StateSpaceForecast on a test set. The syntax is the same as that in the univariate example, except that this case has a list of fields (CRM, ERP, and Expenses) following fit StateSpaceForecast, making it multivariate.
| inputlookup app_usage.csv | fields CRM ERP Expenses | fit StateSpaceForecast CRM ERP Expenses holdback=12 into app_usage_model as "crm, erp"
The following example is also multivariate and includes the target field. In this example the fields CRM and ERP are forecast using historical data and the Expenses field. The apply command is used against app_usage_model, the model created in the fit command step.
Double quotes are required around any field list.
| inputlookup app_usage.csv | fields CRM ERP Expenses | apply app_usage_model target="CRM, ERP" forecast_k=36 holdback=36
The following example is again multivariate but without the target field. This example forecasts the fields CRM, ERP, and Expenses using historical data.
| inputlookup app_usage.csv | fields CRM ERP Expenses | apply app_usage_model forecast_k=36 holdback=36
The following example shows how to improve your output with StateSpaceForecast.
| inputlookup cyclical_business_process_with_external_anomalies.csv | eval holiday=if(random()%100<98,0,1) | fit StateSpaceForecast logons from logons into My_Model forecast_k=3000
Adding period=2016 to the SPL could improve the output, but would not account for the period being seven days rather than twenty-four hours.
| inputlookup cyclical_business_process_with_external_anomalies.csv | table _time,logons | eval holiday=if(random()%100<98,0,1) | eval dayOfWeek=strftime(_time,"%a") | eval holidayWeekend=case(in(dayOfWeek,"Sat","Sun"),1,true(),0) | apply MyBadModel specialdays=holidayWeekend forecast_k=3000 | eval old_predict='predicted(logons)' | eval dayOfWeek=strftime(_time,"%a") | eval holidayWeekend=case(in(dayOfWeek,"Sat","Sun"),1,true(),0) | apply My_Model specialdays=holidayWeekend holdback=3000 forecast_k=3000
Utility Algorithms
These utility algorithms are not machine learning algorithms, but provide methods to calculate data characteristics. These algorithms facilitate the process of algorithm selection and parameter selection. See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html for more information.
ACF (autocorrelation function)
ACF (autocorrelation function) calculates the correlation between a sequence and a shifted copy of itself, as a function of shift. Shift is also referred to as lag or delay.
Parameters
The k parameter specifies the number of lags to return autocorrelation for. By default k is 40.
The fft parameter specifies whether ACF is computed via Fast Fourier Transform (FFT). By default fft is False.
The conf_interval parameter specifies the confidence interval in percentage to return. By default conf_interval is set to 95.
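The quantity that ACF reports for each lag can be written out directly. The sketch below is a plain Python illustration of the standard (biased) sample autocorrelation, computed without FFT and without confidence intervals. The function acf is hand-written for this example and is not the MLTK or StatsModels implementation. It assumes the series is not constant, since a constant series has zero variance.

```python
def acf(series, k=40):
    """Autocorrelation of series at lags 0..k (biased estimator, no FFT)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)     # zero for a constant series
    max_lag = min(k, n - 1)             # cannot shift by more than n - 1
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


print(acf([1, 2, 3, 4, 5], k=2))
```

The value at lag 0 is always 1, because a sequence is perfectly correlated with an unshifted copy of itself.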
Syntax
fit ACF <field> [k=<int>] [fft=<true|false>] [conf_interval=<int>]
Example
The following example uses ACF (autocorrelation function) on a test set.
... | fit ACF logins k=50 fft=true conf_interval=90
PACF (partial autocorrelation function)
PACF (partial autocorrelation function) gives the partial correlation between a sequence and its lagged values, controlling for the values of lags that are shorter than its own. See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.pacf.html for more information.
Parameters
The k parameter specifies the number of lags to return partial autocorrelation for. By default k is 40.
The method parameter specifies which method for the calculation to use. By default method is unbiased.
The conf_interval parameter specifies the confidence interval in percentage to return. By default conf_interval is set to 95.
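To show what "controlling for the values of shorter lags" means in practice, the sketch below derives partial autocorrelations from a list of autocorrelations using the Durbin-Levinson recursion. This is one of several possible methods and is not necessarily the one selected by the method parameter. The function name pacf_from_acf is hypothetical.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations for lags 1..len(rho)-1 via Durbin-Levinson.

    rho[0] must be 1.0 and rho[lag] is the autocorrelation at that lag.
    """
    pacf = []
    prev = []   # coefficients of the autoregression fitted at the previous order
    for k in range(1, len(rho)):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        curr = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        curr.append(phi_kk)
        pacf.append(phi_kk)
        prev = curr
    return pacf


# Autocorrelations that halve at every lag, as in a first-order autoregression.
print(pacf_from_acf([1.0, 0.5, 0.25, 0.125]))
```

For this input only the first partial autocorrelation is non-zero: once lag 1 is accounted for, the longer lags add no further information. That pattern is what makes PACF useful for choosing the number of autoregressive terms.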
Syntax
fit PACF <field> [k=<int>] [method=<ywunbiased|ywmle|ols>] [conf_interval=<int>]
Example
The following example uses PACF (partial autocorrelation function) on a test set.
... | fit PACF logins k=20 conf_interval=90
This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 4.2.0