Splunk® Machine Learning Toolkit

User Guide


This documentation does not apply to the most recent version of Splunk® Machine Learning Toolkit. For documentation on the most recent version, go to the latest release.

Algorithms in the Machine Learning Toolkit

The Splunk Machine Learning Toolkit (MLTK) supports all of the algorithms listed here. Details for each algorithm are grouped by algorithm type including Anomaly Detection, Classifiers, Clustering Algorithms, Cross-validation, Feature Extraction, Preprocessing, Regressors, Time Series Analysis, and Utility Algorithms. You can find more examples for these algorithms on the scikit-learn website.

The MLTK supported algorithms use the fit and apply commands. For information on the steps taken by these commands, see Understanding the fit and apply commands.

For information on using the score command, see Scoring metrics in the Machine Learning Toolkit.

ML-SPL Quick Reference Guide

Download the Machine Learning Toolkit Quick Reference Guide for a handy cheat sheet of current ML-SPL commands and machine learning algorithms available in the Splunk Machine Learning Toolkit. This document is also offered in Japanese.

ML-SPL Performance App

Download the ML-SPL Performance App for the Machine Learning Toolkit to use performance results for guidance and benchmarking purposes in your own environment.

Extend the algorithms you can use for your models

The algorithms listed here and in the ML-SPL Quick Reference Guide are available natively in the Splunk Machine Learning Toolkit. You can also base your algorithm on over 300 open source Python algorithms from the scikit-learn, pandas, statsmodels, numpy, and scipy libraries available through the Python for Scientific Computing add-on in Splunkbase.

For information on how to import an algorithm from the Python for Scientific Computing add-on into the Splunk Machine Learning Toolkit, see the ML-SPL API Guide.

Add algorithms through GitHub

On-prem customers looking for solutions that fall outside of the 30 native algorithms can use GitHub to add more algorithms. Join the Splunk Community for MLTK on GitHub to learn about new machine learning algorithms, solve custom use cases through sharing and reusing algorithms, and help fellow users of the MLTK.

Splunk Cloud Platform customers can also use GitHub to add more algorithms via an app. The Splunk GitHub for Machine learning app provides access to custom algorithms and is based on the Machine Learning Toolkit open source repo. Splunk Cloud Platform customers need to create a support ticket to have this app installed.

Anomaly Detection

Anomaly detection algorithms detect anomalies and outliers in numerical or categorical fields.

DensityFunction

The DensityFunction algorithm provides a consistent and streamlined workflow to create and store density functions and utilize them for anomaly detection. DensityFunction allows for grouping of the data using the by clause, where for each group a separate density function is fitted and stored. This algorithm supports incremental fit.

The DensityFunction algorithm supports the following continuous probability density functions: Normal, Exponential, Gaussian Kernel Density Estimation (Gaussian KDE), and Beta distribution.

Using the DensityFunction algorithm requires running version 1.4 or above of the Python for Scientific Computing add-on.

The accuracy of the anomaly detection for DensityFunction depends on the quality and the size of the training dataset, how accurately the fitted distribution models the underlying process that generates the data, and the value chosen for the threshold parameter.

Follow these guidelines to make your models perform more accurately:

  • Aim for fitted distributions to have a cardinality (training dataset size) of at least 50. If you cannot collect more training data, create fewer groups of data using the by clause, giving you more data points per group.
  • The threshold parameter has a default value, but ideally the value for threshold, lower_threshold, or upper_threshold is chosen based on experimentation as guided by domain knowledge.
  • Continue tuning the threshold parameter until you are satisfied with the results.
  • Inspect the model using the summary command.
    • The values reported for the mean and standard deviation are either the statistics of the fitted distribution, or of the data, depending on the type of the distribution.
    • In the case of parametric distributions (Normal, Beta, and Exponential) the mean and standard deviation are calculated from the fitted distribution. When the parametric distribution is not a good fit for the data, the reported mean and std might not be close to those of the data.
    • In the case of non-parametric distributions (Gaussian KDE) the mean and standard deviation are calculated from the data passed in during fit.
  • If the distribution of the data changes through time, re-train your models frequently.

Parameters

  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
    • The partial_fit parameter default is False.
    • If partial_fit is not specified, the model specified is created and replaces the pre-trained model if one exists.
  • Using partial_fit=True on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
  • Valid values for the dist parameter include: norm (normal distribution), expon (exponential distribution), gaussian_kde (Gaussian KDE distribution), beta (beta distribution), and auto (automatic selection).
    • The dist parameter default is auto.
    • When set to auto, norm (normal distribution), expon (exponential distribution), gaussian_kde (Gaussian KDE distribution), and beta (beta distribution) all run, with the best results returned.
  • Beta distribution was added in version 5.2.0 of the Machine Learning Toolkit.
    • If the data distribution takes a U shape, outlier detection will not be accurate.
  • The metric parameter calculates the distance between the sampled dataset from the density function and the training dataset.
  • Valid metrics for the metric parameter include: kolmogorov_smirnov and wasserstein.
  • The metric parameter default is wasserstein.
  • The sample parameter can be used during fit or apply stages.
  • The sample parameter default is False.
  • If the sample parameter is set to True during the fit stage, the size of the samples will be equal to the training dataset.
  • If the sample parameter is set to True during the apply stage, the size of the samples will be equal to the testing dataset.
  • If the sample parameter is set to True:
    • Samples are taken from the fitted density function.
    • Results output in a new column called SampledValue.
    • Sampled values only come from the inlier region of the distribution.
  • The full_sample parameter can be used during fit or apply stages.
  • The full_sample parameter default is False.
  • If the full_sample parameter is set to True during the fit stage, the size of the samples will be equal to the training dataset.
  • If the full_sample parameter is set to True during the apply stage, the size of the samples will be equal to the testing dataset.
  • If the full_sample parameter is set to True:
    • Samples are taken from the fitted density function.
    • Results output in a new column called FullSampledValue.
    • Sampled values come from the whole distribution (both inlier and outlier regions).
  • Version 4.4.0 of the MLTK and above support min and max values in summary.
    • The min value is the minimum value of the dataset on which the density function is fitted.
    • The max value is the maximum value of the dataset on which the density function is fitted.
  • The cardinality value generated by the summary command represents the number of data points used when fitting the selected density function.
  • The distance value generated by the summary command represents the metric type used when calculating the distance as well as the distance between the sampled data points from the density function and the training dataset.
  • The mean value generated by the summary command is the mean of the density function.
  • The value for std generated by the summary command represents the standard deviation of the density function.
  • A value under other represents any parameters other than mean and std as applicable. In the case of Gaussian KDE, other could show parameter size or bandwidth.
  • The type field generated by the summary command shows the chosen density function and whether the dist parameter is set to auto.
  • The show_density parameter default is False. If the parameter is set to True, the density of each data point will be provided as output in a new field called ProbabilityDensity.
  • The output for ProbabilityDensity is the probability density of the data point according to the fitted probability density. This output is provided when the show_density parameter is set to True.
  • The fit command will fit a probability density function over the data, optionally store the resulting distribution's parameters in a model file, and output the outlier in a new field called IsOutlier.
  • The output for IsOutlier is a list of labels. Number 1 represents outliers, and 0 represents inliers, assigned to each data point. Outliers are detected based on the values set for the threshold parameter. Inspect the IsOutlier results column to see how well the outlier detection is performing.
  • The parameters threshold, lower_threshold, and upper_threshold control the outlier detection process.
  • The threshold parameter is the center of the outlier detection process. It represents the percentage of the area under the density function and has a value between 0.000000001 (refers to ~0%) and 1 (refers to 100%). The threshold parameter guides the DensityFunction algorithm to mark outlier areas on the fitted distribution. For example, if threshold=0.01, then 1% of the fitted density function will be set as the outlier area.
  • The threshold parameter default value is 0.01.
  • The threshold, lower_threshold, and upper_threshold parameters can take multiple values.
    • Multiple values must be in quotation marks and separated by commas.
    • In cases of multiple values for threshold, the default maximum is 5. Users with access permissions can change this default maximum under the Settings tab.
    • In cases of multiple values, you are limited to one type of threshold (threshold, lower_threshold, or upper_threshold).
  • The output for BoundaryRanges is the boundary ranges of outliers on the density function which are set according to the values of the threshold parameter.
  • Each boundary region has three values: boundary opening point, boundary closing point, and percentage of boundary region.
  • The boundary region syntax follows the convention of a multi-value field where each boundary region appears in a new line:
first_boundary_region
second_boundary_region
n_th_boundary_region
  • When multiple thresholds are provided, the BoundaryRanges for each threshold appear in a separate column whose name ends with the suffix _th= followed by the threshold value:
BoundaryRanges_th=threshold_val_1
first_boundary_region_of_th1
second_boundary_region_of_th1
n_th_boundary_region_of_th1
BoundaryRanges_th=threshold_val_2
first_boundary_region_of_th2
second_boundary_region_of_th2
n_th_boundary_region_of_th2
  • In cases of a single boundary region, the value for the percentage of boundary region is equal to the threshold parameter value.
  • In some distributions (for example Gaussian KDE), the sum of outlier areas might not add up to the exact value of threshold parameter value, but will be a close approximation.
  • BoundaryRanges is calculated as an approximation and will be empty in the following two cases:
    • Where the density function has a sharp peak from low standard deviation.
    • When there are a low number of data points.
  • Data points that are exactly at the boundary opening or closing point are assigned as inliers. An opening or closing point is determined by the density function in use.
  • Normal density function has left and right boundary regions. Data points on the left of the left boundary closing point, and data points on the right of the right boundary opening point are assigned as outliers.
  • Exponential density function has one boundary region. Data points on the right of the right boundary opening point are assigned as outliers.
  • Beta density function has one boundary region. Data points on the left of the left boundary closing point are assigned as outliers.
  • Gaussian KDE density function can have one or more boundary regions, depending on the number of peaks and dips within the density function. Data points in these boundary regions are assigned as outliers. In cases of boundary regions to the left or right, guidelines from Normal density function apply. As the shape for Gaussian KDE density function can differ from dataset to dataset, you do not consistently observe left and right boundary regions.
  • The random_state parameter is the seed of the pseudo random number generator to use when creating the model. This parameter is optional but the value must be an integer.

The random_state parameter is available in MLTK version 5.0.0 and above. This parameter is not supported in version 4.5.0 of the MLTK.
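
The relationship between threshold and the boundary regions of a Normal density function can be sketched in Python. This is an illustration only, not the MLTK implementation. The mean of 100 and standard deviation of 15 are made-up values, the helper names are hypothetical, and the sketch assumes that threshold splits the outlier area evenly between the two tails, as in the BoundaryRanges example later in this topic.

```python
from statistics import NormalDist


def normal_boundaries(mean, std, threshold=0.01):
    """Return (left_closing_point, right_opening_point) for a Normal fit.

    Assumption: the outlier area is split evenly between both tails.
    """
    dist = NormalDist(mu=mean, sigma=std)
    left = dist.inv_cdf(threshold / 2)        # area to the left equals threshold / 2
    right = dist.inv_cdf(1 - threshold / 2)   # area to the right equals threshold / 2
    return left, right


def is_outlier(value, left, right):
    # Points exactly on a boundary point are inliers, so use strict comparisons.
    return 1 if (value < left or value > right) else 0


# Hypothetical fitted parameters, chosen only for illustration.
left, right = normal_boundaries(mean=100.0, std=15.0, threshold=0.01)
labels = [is_outlier(v, left, right) for v in (55.0, 100.0, left, 150.0)]
```

With these assumed values, 1% of the area under the curve lies outside the two boundary points, and a value that sits exactly on the left closing point is labeled 0 (inlier).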

Syntax

| fit DensityFunction <field> [by "<field1>[,<field2>,....<field5>]"] [into <model name>] [dist=<str>] [show_density=true|false] [sample=true|false] [full_sample=true|false] [threshold=<float>|lower_threshold=<float>|upper_threshold=<float>] [metric=<str>] [random_state=<int>] [partial_fit=<true|false>]

You can apply the saved model to new data with the apply command, with the option to update the parameters for threshold, lower_threshold, upper_threshold, and show_density. Parameters for dist and metric cannot be applied at this stage, and any new values provided will be ignored.

apply <model name> [threshold=<float>|lower_threshold=<float>|upper_threshold=<float>] [show_density=true|false] [sample=true|false] [full_sample=true|false]

You can inspect the model learned by DensityFunction with the summary command. Version 4.4.0 of the MLTK or above supports min and max values in the summary command.

| summary <model name>

Syntax constraints

  • Fields within the by clause must be given in quotation marks.
  • The maximum number of fields within the by clause is 5.
  • The total number of groups calculated with the by clause cannot exceed 1024. In an example clause of by "DayOfWeek,HourOfDay" there are two fields: one for DayOfWeek and one for HourOfDay. As there are seven days in a week, there are seven groups for DayOfWeek. As there are twenty-four hours in a day, there are twenty-four groups for HourOfDay. The total number of groups calculated with the by clause is therefore 7*24 = 168.
    • The limited number of groups prevents model files from growing too large. You can increase the limit by changing the value of max_groups in the DensityFunction settings. Larger limits mean larger model files and longer load times when running apply.
    • Decrease max_kde_parameter_size to allow for the increase of max_groups. This change keeps model sizes small while allowing for increased groups.
  • A field name used within the by clause that matches any of the reserved summary field names produces an error. To fix the error, rename the fields used within the by clause. Reserved summary field names include: type, min, max, mean, std, cardinality, distance, and other.
  • The parameters threshold, lower_threshold, and upper_threshold must be within the range of 0.00000001 to 1.
  • If the parameters of lower_threshold and upper_threshold are both provided, the summation of these parameters must be less than 1 (100%).
  • The threshold parameter cannot be specified together with lower_threshold or upper_threshold.
  • The threshold, lower_threshold, and upper_threshold parameters can take multiple values but in these cases you are limited to one type of threshold (threshold, lower_threshold, or upper_threshold).
  • Exponential density function only supports threshold and upper_threshold. The lower_threshold parameter is accepted, but results in empty boundary regions and 0 outliers.
  • Normal density function supports either threshold or lower_threshold / upper_threshold.
  • Gaussian KDE density function supports either threshold or lower_threshold / upper_threshold.
  • The parameters lower_threshold and upper_threshold can be used with any density function including auto.
  • If you use the summary command to inspect a model created in version 4.3.0 of the MLTK or earlier (prior to the support of min and max), approximate values for min and max are used.
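
The group-count arithmetic in the constraints above can be checked before running fit. The following Python sketch is illustrative only: the function name is hypothetical, and it computes an upper bound, because the actual number of groups is the number of field-value combinations that occur in the data.

```python
from math import prod

MAX_GROUPS = 1024  # default limit stated in the syntax constraints


def max_total_groups(distinct_values_per_field):
    """Upper bound on groups: product of distinct values per by-clause field."""
    return prod(distinct_values_per_field.values())


# The two-field example from the constraints: 7 days * 24 hours.
two_fields = max_total_groups({"DayOfWeek": 7, "HourOfDay": 24})

# The three-field by clause used in the examples in this topic, with 15-minute buckets.
three_fields = max_total_groups(
    {"HourOfDay": 24, "BucketMinuteOfHour": 4, "DayOfWeek": 7}
)

within_limit = three_fields <= MAX_GROUPS
```

Both by clauses stay under the default limit, so neither would require raising max_groups.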

Examples

The following example shows DensityFunction on a dataset with the fit command.

| inputlookup call_center.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats max(count) as Actual by HourOfDay,BucketMinuteOfHour,DayOfWeek,source,_time
| fit DensityFunction Actual by "HourOfDay,BucketMinuteOfHour,DayOfWeek" into mymodel

This image of the MLTK shows the Statistics tab with many results listed. The fit command is included in the SPL written in the search string. Both numeric and categorical values are listed under columns including hour of day, day of week, source, and time.

The following example shows DensityFunction on a dataset with the apply command.

| inputlookup call_center.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats max(count) as Actual by HourOfDay,BucketMinuteOfHour,DayOfWeek,source,_time
| apply mymodel show_density=True sample=True

This image of the toolkit shows the Statistics tab with many results listed. The apply command as well as the sample parameter are included in the SPL written in the search string. Both numeric and categorical values are listed under columns including hour of day, day of week, source, time, and sampled value.
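
The difference between sample=True (SampledValue, inlier region only) and full_sample=True (FullSampledValue, whole distribution) can be sketched in Python. This is a hedged illustration, not the MLTK implementation: it assumes a Normal fit with made-up parameters, a two-sided threshold split evenly between the tails, and rejection sampling as one possible way to restrict samples to the inlier region. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def sample_inliers(dist, n, threshold=0.01, seed=0):
    """Draw n values from dist, keeping only values inside the inlier region."""
    rng = random.Random(seed)
    left = dist.inv_cdf(threshold / 2)
    right = dist.inv_cdf(1 - threshold / 2)
    samples = []
    while len(samples) < n:
        value = rng.gauss(dist.mean, dist.stdev)
        if left <= value <= right:  # reject draws that land in an outlier region
            samples.append(value)
    return samples


def sample_full(dist, n, seed=0):
    """Draw n values from the whole distribution, outlier regions included."""
    rng = random.Random(seed)
    return [rng.gauss(dist.mean, dist.stdev) for _ in range(n)]


fitted = NormalDist(mu=100.0, sigma=15.0)   # hypothetical fitted distribution
n_rows = 50                                 # one sample per input row
sampled_values = sample_inliers(fitted, n_rows, threshold=0.3)
full_sampled_values = sample_full(fitted, n_rows)
```

In both cases the number of samples equals the number of rows passed in, which matches the documented behavior for the fit and apply stages.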

The following example shows DensityFunction on a dataset with the summary command. This example includes min and max values, which are supported in version 4.4.0 and above of the MLTK.

| summary mymodel

This image of the toolkit shows the Statistics tab with many results listed. The summary command is included in the SPL written in the search string. Both numeric and categorical values are listed under columns including bucket minute of hour, cardinality, mean, std, type, min, and max.
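
The distance value in the summary output compares data sampled from the fitted density function with the training data, using either the wasserstein or kolmogorov_smirnov metric. The following Python sketch shows the textbook one-dimensional definitions of both metrics. It is an assumption that these definitions match what the MLTK computes internally, and the function names are hypothetical.

```python
from bisect import bisect_right


def wasserstein_1d(a, b):
    """1-D Wasserstein distance for two samples of equal size."""
    if len(a) != len(b):
        raise ValueError("this simplified version needs equal-size samples")
    # For equal-size samples this is the mean absolute gap between sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


training = [0.0, 1.0, 2.0]   # made-up training values
sampled = [1.0, 2.0, 3.0]    # made-up values sampled from a fitted density
w_distance = wasserstein_1d(training, sampled)
ks_distance = ks_statistic(training, sampled)
```

For both metrics, a smaller distance indicates that the sampled data resembles the training data more closely.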

The following example shows BoundaryRanges on a test set. In this example the threshold is set to 30% (0.3). The first row has a left boundary range that starts at -Infinity and ends at 44.6912. The area of the left boundary range is 15% of the total area under the density function. The first row also has a right boundary range that starts at 518.3088 and extends to Infinity. The area of the right boundary range is also 15% of the total area under the density function, so the areas of the left and right boundary ranges add up to the threshold value of 30%. The third row has only one boundary range, which starts at 300.0943 and extends to Infinity. The area of this boundary range is 30% of the area under the density function.

| inputlookup call_center.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats max(count) as Actual by HourOfDay, BucketMinuteOfHour, DayOfWeek, source, _time
| fit DensityFunction Actual by "HourOfDay, BucketMinuteOfHour, DayOfWeek" threshold=0.3 into mymodel

This image of the toolkit shows the Statistics tab with results listed in columns that include hour of day, source, time, Actual, Boundary Ranges, and Is Outlier.

LocalOutlierFactor

The LocalOutlierFactor algorithm uses the scikit-learn Local Outlier Factor (LOF) to measure the local deviation of density of a given sample with respect to its neighbors. LocalOutlierFactor performs one-shot learning and is limited to fitting on training data and returning outliers. LocalOutlierFactor is an unsupervised outlier detection method. The anomaly score depends on how isolated the object is with respect to its neighbors.

For descriptions of the n_neighbors, leaf_size, and other parameters, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

Using the LocalOutlierFactor algorithm requires running version 1.3 or above of the Python for Scientific Computing add-on.
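
The idea of measuring how isolated a sample is relative to its neighbors can be sketched with a minimal brute-force Local Outlier Factor in Python. This is a simplified illustration of the published LOF definition, not the scikit-learn or MLTK code: it uses Euclidean distance only, ignores ties at the k-th neighbor, and assumes all points are distinct. The function name and the sample points are hypothetical.

```python
import math


def local_outlier_factor(points, k):
    """Return one LOF score per point. Scores well above 1 suggest outliers."""
    n = len(points)
    # k nearest neighbors of each point as (distance, index), excluding the point itself.
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])
    # Distance from each point to its k-th nearest neighbor.
    k_distance = [nb[-1][0] for nb in neighbors]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Five clustered points and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = local_outlier_factor(points, k=3)
```

The isolated point receives a score far above 1, while the clustered points score close to 1. The contamination parameter then determines what fraction of the highest-scoring points is labeled as outliers.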

Parameters

  • The anomaly_score parameter default is True. To disable it, set anomaly_score=False in the command.
  • The n_neighbors parameter default is 20.
  • The leaf_size parameter default is 30.
  • The p parameter is limited to p >= 1.
  • The contamination parameter must be within the range of 0.0 (not included) to 0.5 (included).
  • The contamination parameter default is 0.1.
  • Options for the algorithm parameter include: brute, kd_tree, ball_tree, and auto. The default is auto.
  • The brute, kd_tree, ball_tree, and auto algorithm options have respective valid metrics. The default metric for each is minkowski.
    • Valid metrics for brute include: cityblock, euclidean, l1, l2, manhattan, chebyshev, minkowski, braycurtis, canberra, dice, hamming, jaccard, kulsinski, matching, rogerstanimoto, russellrao, sokalmichener, sokalsneath, cosine, correlation, sqeuclidean, and yule.
    • Valid metrics for kd_tree include: cityblock, euclidean, l1, l2, manhattan, chebyshev, and minkowski.
    • Valid metrics for ball_tree include: cityblock, euclidean, l1, l2, manhattan, chebyshev, minkowski, braycurtis, canberra, dice, hamming, jaccard, kulsinski, matching, rogerstanimoto, russellrao, sokalmichener, and sokalsneath.
  • The output for LocalOutlierFactor is a list of labels titled is_outlier, assigned 1 for outliers, and -1 for inliers.

Syntax

fit LocalOutlierFactor <fields>
[n_neighbors=<int>] [leaf_size=<int>] [p=<int>]
[contamination=<float>]
[metric=<str>] [algorithm=<str>] [anomaly_score=<true|false>] 

Syntax constraints

  • You cannot save LocalOutlierFactor models using the into keyword. This algorithm does not support saving models and you cannot apply a saved model to new data.
  • LOF does not include the predict method.

Example

The following example uses LocalOutlierFactor on a test set.

| inputlookup iris.csv | fit LocalOutlierFactor petal_length petal_width n_neighbors=10 algorithm=kd_tree metric=minkowski p=1 contamination=0.14 leaf_size=10

OneClassSVM

The OneClassSVM algorithm uses the scikit-learn OneClassSVM to fit a model from a set of features or fields for detecting anomalies and outliers, where features are expected to contain numerical values. OneClassSVM is an unsupervised outlier detection method.

For further information, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/svm.html#kernel-functions

Parameters

  • The kernel parameter specifies the kernel type to use in the algorithm. The default value is rbf.
    • Kernel types include: linear, rbf, poly, and sigmoid.
  • You can specify the upper bound on the fraction of training error as well as the lower bound of the fraction of support vectors using the nu parameter, where the default value is 0.5.
  • The degree parameter is ignored by all kernels except the polynomial kernel, where the default value is 3.
  • The gamma parameter is the kernel coefficient that specifies how much influence a single data instance has. The default value is 1 / number of features.
  • The coef0 parameter is the independent term in the kernel function. It is only significant for the polynomial and sigmoid kernels.
  • The term tol is the tolerance for stopping criteria.
  • The shrinking parameter determines whether to use the shrinking heuristic.
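
How gamma, coef0, and degree enter each kernel type can be sketched in Python, following the kernel formulas given in the scikit-learn documentation linked above. The function names are hypothetical and the inputs are made-up vectors. This shows only the kernel functions, not the OneClassSVM training procedure.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # No gamma, coef0, or degree involved.
    return dot(x, y)


def poly_kernel(x, y, gamma, coef0=0.0, degree=3):
    # degree is used only by this kernel.
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


x, y = [1.0, 2.0], [3.0, 4.0]
default_gamma = 1 / len(x)  # the documented default: 1 / number of features
poly_value = poly_kernel(x, y, gamma=0.5, coef0=0.5, degree=3)
rbf_value = rbf_kernel(x, y, gamma=default_gamma)
```

The sketch makes it visible why coef0 matters only for the polynomial and sigmoid kernels: it does not appear in the linear or rbf formulas.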

Syntax

fit OneClassSVM <fields> [into <model name>]
[kernel=<str>] [nu=<float>] [coef0=<float>]
[gamma=<float>] [tol=<float>] [degree=<int>] [shrinking=<true|false>]
  • You can save OneClassSVM models using the into keyword.
  • You can apply the saved model later to new data with the apply command.

Syntax constraints

  • After running the fit or apply command, a new field named isNormal is generated. This field defines whether a particular record (row) is normal (isNormal=1) or anomalous (isNormal=-1).
  • You cannot inspect the model learned by OneClassSVM with the summary command.

Example

The following example uses OneClassSVM on a test set.

... | fit OneClassSVM * kernel="poly" nu=0.5 coef0=0.5 gamma=0.5 tol=1 degree=3 shrinking=f into TESTMODEL_OneClassSVM

Classifiers

Classifier algorithms predict the value of a categorical field.

The kfold cross-validation command can be used with all Classifier algorithms. For details, see K-fold cross-validation.

AutoPrediction

AutoPrediction automatically determines the data type as categorical or numeric. AutoPrediction then invokes the RandomForestClassifier algorithm to carry out the prediction. For further details, see RandomForestClassifier. AutoPrediction also executes the data split for training and testing during the fit process, eliminating the need for a separate command or macro. AutoPrediction uses particular cases to determine the data type, and uses the train_test_split function from sklearn to perform the data split.

Parameters

  • Use the target_type parameter to specify the target field as numeric or categorical.
  • The target_type parameter default is auto. When auto is used, AutoPrediction automatically determines the target field type.
  • AutoPrediction uses the following data types to determine the target_type field as categorical:
    • Data of type bool, str, or numpy.object
    • Data of type int and the criterion option is specified
  • AutoPrediction determines the target_type field as numeric for all other cases.
  • The test_split_ratio specifies the splitting of data for model training and model validation. Value must be a float between 0 (inclusive) and 1 (exclusive).
  • The test_split_ratio default is 0. A value of 0 means all data points get used to train the model.
    • A test_split_ratio value of 0.3, for example, means 30% of the data points get used for testing and 70% are used for training.
  • Use n_estimators to optionally specify the number of trees.
  • Use max_depth to optionally set the maximum depth of the tree.
  • Specify the criterion value for classification (categorical) scenarios.
  • Ignore the criterion value for regression (numeric) scenarios.
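
The target type rules and the effect of test_split_ratio described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the AutoPrediction code: the function names are hypothetical, Python types stand in for the pandas and numpy types, and a seeded shuffle with rounding stands in for the train_test_split function from sklearn.

```python
import random


def infer_target_type(values, criterion_specified=False):
    """Apply the documented rules for target_type=auto to a list of values."""
    # bool and str data is categorical.
    if all(isinstance(v, (bool, str)) for v in values):
        return "categorical"
    # int data is categorical only when the criterion option is specified.
    if criterion_specified and all(isinstance(v, int) for v in values):
        return "categorical"
    # Everything else is numeric.
    return "numeric"


def split_rows(rows, test_split_ratio=0.0, seed=42):
    """Return (train, test). A ratio of 0 puts every row in the training set."""
    if not 0 <= test_split_ratio < 1:
        raise ValueError("test_split_ratio must be in the range [0, 1)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_split_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(20), test_split_ratio=0.25)
species_type = infer_target_type(["setosa", "virginica"])
```

With 20 rows and a ratio of 0.25, 5 rows are held out for testing and 15 are used for training.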

Syntax

fit AutoPrediction Target from Predictors* into PredictorModel target_type=<auto|numeric|categorical> test_split_ratio=<[0-1]> [n_estimators=<int>] [max_depth=<int>]
[criterion=<gini | entropy>] [random_state=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]

You can save AutoPrediction models using the into keyword and apply the saved model later to new data using the apply command.

 ... | apply PredictorModel

You can inspect the model learned by AutoPrediction with the summary command.

 .... | summary PredictorModel

Syntax constraints

  • AutoPrediction does not support partial_fit.
  • Classification performance output columns for accuracy, f1, precision, and recall only appear if the target_type is categorical.
  • Regression performance output columns for RMSE and rSquared only appear if the target_type is numeric.

Example

The following example uses AutoPrediction on a test set.


| fit AutoPrediction species from * max_features=0.1 into auto_classify_model test_split_ratio=0.3 random_state=42

BernoulliNB

The BernoulliNB algorithm uses the scikit-learn BernoulliNB estimator to fit a model to predict the value of categorical fields where explanatory variables are assumed to be binary-valued. BernoulliNB is an implementation of the Naive Bayes classification algorithm. This algorithm supports incremental fit.

Parameters

  • The alpha parameter controls Laplace/Lidstone smoothing. The default value is 1.0.
  • The binarize parameter is a threshold that can be used for converting numeric field values to the binary values expected by BernoulliNB. The default value is 0.
    • With the default of binarize=0, values > 0 are treated as 1, and values <= 0 are treated as 0.
  • The fit_prior Boolean parameter specifies whether to learn class prior probabilities. The default value is True. If fit_prior=f is specified, classes are assumed to have uniform popularity.
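
The effect of the binarize and alpha parameters can be sketched in Python. This is an illustration of the general technique, not the scikit-learn source: the function names are hypothetical, and the smoothing formula shown is the standard additive (Laplace/Lidstone) estimate for a binary feature.

```python
def binarize(values, threshold=0.0):
    """Map values above the threshold to 1 and all other values to 0."""
    return [1 if v > threshold else 0 for v in values]


def smoothed_feature_probability(feature_count, class_count, alpha=1.0):
    """Estimate P(feature = 1 | class) with additive smoothing.

    feature_count: rows of this class where the binarized feature is 1.
    class_count: all rows of this class.
    The factor 2 is the number of values a binary feature can take.
    """
    return (feature_count + alpha) / (class_count + 2 * alpha)


binary_values = binarize([-1.0, 0.0, 0.5, 2.0])        # default binarize=0
seen_often = smoothed_feature_probability(3, 4)        # alpha=1.0 default
never_seen = smoothed_feature_probability(0, 4)        # stays above zero
```

Smoothing keeps the estimate above zero for a feature value that never occurred with a class in the training data, which prevents a single unseen value from ruling out a class entirely.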

Syntax

fit BernoulliNB <field_to_predict> from <explanatory_fields> [into <model name>]
[alpha=<float>] [binarize=<float>] [fit_prior=<true|false>] [partial_fit=<true|false>]

You can save BernoulliNB models using the into keyword and apply the saved model later to new data using the apply command.

 ... | apply TESTMODEL_BernoulliNB

You can inspect the model learned by BernoulliNB with the summary command, and view the class and log probability information calculated from the dataset.

 .... | summary My_Incremental_Model

Syntax constraints

  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. The default value is False, meaning it will not be incrementally updated. Choosing partial_fit=True allows you to update an existing model using only new data without having to retrain it on the full training data set.
  • Using partial_fit=True on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=False or partial_fit is not specified (default is False), the model specified is created and replaces the pre-trained model if one exists.
  • If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model. If My_Incremental_Model exists and was trained using BernoulliNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by BernoulliNB, an error message displays.

Example

The following example uses BernoulliNB on a test set.

... | fit BernoulliNB type from * into TESTMODEL_BernoulliNB alpha=0.5 binarize=0 fit_prior=f

DecisionTreeClassifier

The DecisionTreeClassifier algorithm uses the scikit-learn DecisionTreeClassifier estimator to fit a model to predict the value of categorical fields. For further information, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.

Parameters

To specify the maximum depth of the tree to summarize, use the limit argument. The default value for the limit argument is 5.

 ... | summary model_DTC limit=10

Syntax

fit DecisionTreeClassifier <field_to_predict> from <explanatory_fields> [into <model_name>] 
[max_depth=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>] 
[criterion=<gini|entropy>] [splitter=<best|random>] [random_state=<int>]
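The criterion option accepts gini or entropy. As a rough illustration of what these two impurity measures compute for the labels that reach a tree node, the following Python sketch uses the standard textbook definitions. The function names and sample labels are hypothetical, and the code is independent of the scikit-learn implementation.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class proportions."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


mixed = ["yes", "yes", "no", "no"]   # evenly mixed node
pure = ["yes", "yes", "yes", "yes"]  # node with a single class

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed node, so a split that lowers them separates the classes better.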

You can save DecisionTreeClassifier models by using the into keyword and apply them to new data later by using the apply command.

... | apply model_DTC

You can inspect the decision tree learned by DecisionTreeClassifier with the summary command.

 ... | summary model_DTC

See a JSON representation of the tree by giving json=t as an argument to the summary command.

 ... | summary model_DTC json=t

Example

The following example uses DecisionTreeClassifier on a test set.

... | fit DecisionTreeClassifier SLA_violation from * into sla_model | ...

GaussianNB

The GaussianNB algorithm uses the scikit-learn GaussianNB estimator to fit a model to predict the value of categorical fields, where the likelihood of explanatory variables is assumed to be Gaussian. GaussianNB is an implementation of Gaussian Naive Bayes classification algorithm. This algorithm supports incremental fit.
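To make the Gaussian likelihood assumption concrete, the following Python sketch scores one event against per-class means and variances and picks the most likely class. The class statistics, priors, and function names are hypothetical values chosen for illustration, not output from the toolkit.

```python
from math import log, pi


def gaussian_log_likelihood(x, mean, var):
    """Log density of x under a normal distribution with the given mean and variance."""
    return -0.5 * (log(2 * pi * var) + (x - mean) ** 2 / var)


def predict(values, class_stats, class_priors):
    """Return the class with the highest log prior plus summed log likelihoods.

    class_stats maps each class to a list of (mean, variance) pairs,
    one pair per explanatory field.
    """
    scores = {}
    for label, stats in class_stats.items():
        score = log(class_priors[label])
        for value, (mean, var) in zip(values, stats):
            # Naive Bayes treats fields as independent, so log likelihoods add up.
            score += gaussian_log_likelihood(value, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical statistics for a single field, for example petal length.
stats = {"setosa": [(1.5, 0.1)], "virginica": [(5.5, 0.3)]}
priors = {"setosa": 0.5, "virginica": 0.5}
print(predict([1.4], stats, priors))
print(predict([5.0], stats, priors))
```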

Parameters

  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
  • The partial_fit parameter default is False.

Syntax

fit GaussianNB <field_to_predict> from <explanatory_fields> [into <model name>] [partial_fit=<true|false>]

You can save GaussianNB models using the into keyword and apply the saved model later to new data using the apply command.

... | apply TESTMODEL_GaussianNB

You can inspect models learned by GaussianNB with the summary command.

 ... | summary My_Incremental_Model

Syntax constraints

  • If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model. If My_Incremental_Model exists and was trained using GaussianNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by GaussianNB, an error message is thrown.
  • If partial_fit=False or partial_fit is not specified, the specified model is created and replaces the pre-trained model if one exists.

Example

The following example uses GaussianNB on a test set.

... | fit GaussianNB species from * into TESTMODEL_GaussianNB

The following example includes the partial_fit parameter.

| inputlookup iris.csv | fit GaussianNB species from * partial_fit=true into My_Incremental_Model

GradientBoostingClassifier

This algorithm uses the GradientBoostingClassifier from scikit-learn to build a classification model by fitting regression trees on the negative gradient of a deviance loss function. For further information, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.

Syntax

fit GradientBoostingClassifier <field_to_predict> from <explanatory_fields> [into <model name>]
[loss=<deviance|exponential>] [max_features=<str>]
[learning_rate=<float>] [min_weight_fraction_leaf=<float>] [n_estimators=<int>]
[max_depth=<int>] [min_samples_split=<int>] [min_samples_leaf=<int>]
[max_leaf_nodes=<int>] [random_state=<int>]

You can apply the saved model later to new data using the apply command.

... | apply TESTMODEL_GradientBoostingClassifier

You can inspect features learned by GradientBoostingClassifier with the summary command.

 ... | summary TESTMODEL_GradientBoostingClassifier 

Example

The following example uses GradientBoostingClassifier on a test set.

... | fit GradientBoostingClassifier target from * into TESTMODEL_GradientBoostingClassifier  

LogisticRegression

The LogisticRegression algorithm uses the scikit-learn LogisticRegression estimator to fit a model to predict the value of categorical fields.

Parameters

  • The fit_intercept parameter specifies whether the model includes an implicit intercept term.
  • The default value of the fit_intercept parameter is True.
  • The probabilities parameter specifies whether probabilities for each possible field value should be returned alongside the predicted value.
  • The default value of the probabilities parameter is False.
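As a minimal sketch of how fit_intercept and probabilities relate to the fitted model, the following Python function computes the probability of the positive class for the binary case. The function name, coefficients, and input values are hypothetical. Setting the intercept to 0 corresponds to a model fitted with fit_intercept=false.

```python
from math import exp


def predict_probability(features, coefficients, intercept=0.0):
    """Binary logistic regression: sigmoid of the linear combination."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficient for a single field such as IO_wait_time.
coefficients = [0.8]

p = predict_probability([2.0], coefficients, intercept=-1.0)
predicted = "violation" if p >= 0.5 else "no_violation"
print(predicted, round(p, 3))
```

With probabilities=true the value p would be reported alongside the predicted value, and with the default only the predicted value is returned.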

Syntax

fit LogisticRegression <field_to_predict> from <explanatory_fields> [into <model name>]
[fit_intercept=<true|false>] [probabilities=<true|false>]

You can save LogisticRegression models using the into keyword and apply the saved model to new data later using the apply command.

... | apply sla_model

You can inspect the coefficients learned by LogisticRegression with the summary command.

 ... | summary sla_model

Example

The following example uses LogisticRegression on a test set.

... | fit LogisticRegression SLA_violation from IO_wait_time into sla_model | ...

MLPClassifier

The MLPClassifier algorithm uses the scikit-learn Multi-layer Perceptron estimator for classification. MLPClassifier uses a feedforward artificial neural network model that trains using backpropagation. This algorithm supports incremental fit.

For descriptions of the batch_size, random_state, and max_iter parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

Using the MLPClassifier algorithm requires running version 1.3 or above of the Python for Scientific Computing add-on.

Parameters

  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
  • The partial_fit parameter default is False.
  • The hidden_layer_sizes parameter takes hyphen-separated integers, one for each hidden layer. The number of integers varies with the number of hidden layers in the model.
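The hyphen-separated form of hidden_layer_sizes, such as '100-100-80', lists one integer per hidden layer. The following Python sketch shows one way such a string could be turned into a tuple of layer sizes. The function name is hypothetical, and this is an assumption about the format, not the toolkit's parser.

```python
def parse_hidden_layer_sizes(value):
    """Convert a string such as '100-100-80' into a tuple of layer sizes."""
    # Surrounding quotes are stripped so both 100-100-80 and '100-100-80' work.
    sizes = tuple(int(part) for part in value.strip("'\"").split("-"))
    if any(size <= 0 for size in sizes):
        raise ValueError("each hidden layer needs a positive size")
    return sizes


print(parse_hidden_layer_sizes("'100-100-80'"))  # three hidden layers
print(parse_hidden_layer_sizes("50"))            # one hidden layer
```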

Syntax

fit MLPClassifier <field_to_predict> from <explanatory_fields> [into <model name>]
[batch_size=<int>] [max_iter=<int>] [random_state=<int>] [hidden_layer_sizes=<int>-<int>-<int>]
[activation=<str>] [solver=<str>] [learning_rate=<str>]
[tol=<float>] [momentum=<float>]

You can save MLPClassifier models by using the into keyword and apply them to new data later by using the apply command.

You can inspect models learned by MLPClassifier with the summary command.

 ... | summary My_Example_Model

Syntax constraints

  • If My_Example_Model does not exist, the model is saved to it.
  • If My_Example_Model exists and was trained using MLPClassifier, the command updates the existing model with the new input.
  • If My_Example_Model exists but was not trained using MLPClassifier, an error message displays.

Example

The following example uses MLPClassifier on a test set.

| inputlookup diabetes.csv | fit MLPClassifier response from * into MLP_example_model hidden_layer_sizes='100-100-80' | ...

The following example includes the partial_fit parameter.

| inputlookup iris.csv | fit MLPClassifier species from * partial_fit=true into My_Example_Model

RandomForestClassifier

The RandomForestClassifier algorithm uses the scikit-learn RandomForestClassifier estimator to fit a model to predict the value of categorical fields.

For descriptions of the n_estimators, max_depth, criterion, random_state, max_features, min_samples_split, and max_leaf_nodes parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

Syntax

fit RandomForestClassifier <field_to_predict> from <explanatory_fields> [into <model name>]
[n_estimators=<int>] [max_depth=<int>] [criterion=<gini | entropy>] [random_state=<int>]
[max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]

You can save RandomForestClassifier models using the into keyword and apply the saved model to new data later using the apply command.

... | apply sla_model

You can list the features that were used to fit the model, as well as their relative importance or influence with the summary command.

 ... | summary sla_model

Example

The following example uses RandomForestClassifier on a test set.

... | fit RandomForestClassifier SLA_violation from * into sla_model | ...

SGDClassifier

The SGDClassifier algorithm uses the scikit-learn SGDClassifier estimator to fit a model to predict the value of categorical fields. This algorithm supports incremental fit.

Parameters

  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
  • The partial_fit parameter default is False.
  • n_iter=<int> is the number of passes over the training data also known as epochs. The default is 5. The number of iterations is set to 1 if using partial_fit.
  • The loss=<hinge|log|modified_huber|squared_hinge|perceptron> parameter is the loss function to be used.
    • Defaults to hinge, which gives a linear SVM.
    • The log loss gives logistic regression, a probabilistic classifier.
    • modified_huber is another smooth loss that brings tolerance to outliers as well as probability estimates.
    • squared_hinge is like hinge but is quadratically penalized.
    • perceptron is the linear loss used by the perceptron algorithm.
  • The fit_intercept=<true|false> parameter specifies whether the intercept should be estimated or not. The default is True.
  • penalty=<l2|l1|elasticnet> is the penalty, also known as regularization term, to be used. The default is l2.
  • learning_rate=<constant|optimal|invscaling> is the learning rate.
    • constant: eta = eta0
    • optimal: eta = 1.0/(alpha * t)
    • invscaling: eta = eta0 / pow(t, power_t)
    • The default is invscaling
  • l1_ratio=<float> is the Elastic Net mixing parameter, with 0 <= l1_ratio <= 1 (default 0.15).
    • l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
  • alpha=<float> is the constant that multiplies the regularization term (default 0.0001). Also used to compute learning_rate when set to optimal.
  • eta0=<float> is the initial learning rate. The default is 0.01.
  • power_t=<float> is the exponent for inverse scaling learning rate. The default is 0.25.
  • random_state=<int> is the seed of the pseudo random number generator to use when shuffling the data.
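The three learning_rate schedules listed above can be written out directly. The following Python sketch evaluates each formula as stated in the list, using the documented defaults for eta0, alpha, and power_t. The function name and the step counter t are illustrative, and the scikit-learn implementation may differ in detail.

```python
def learning_rate(schedule, t, eta0=0.01, alpha=0.0001, power_t=0.25):
    """Evaluate the learning rate eta at update step t (t >= 1)."""
    if schedule == "constant":
        return eta0
    if schedule == "optimal":
        return 1.0 / (alpha * t)
    if schedule == "invscaling":
        return eta0 / pow(t, power_t)
    raise ValueError(f"unknown schedule: {schedule}")


for step in (1, 16, 256):
    print(step,
          learning_rate("constant", step),
          learning_rate("optimal", step),
          learning_rate("invscaling", step))
```

The constant schedule never changes, while optimal and invscaling both shrink as t grows, with invscaling decaying more slowly at the default power_t.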

Syntax

fit SGDClassifier <field_to_predict> from <explanatory_fields>
[into <model name>] [partial_fit=<true|false>]
[loss=<hinge|log|modified_huber|squared_hinge|perceptron>]
[fit_intercept=<true|false>]
[random_state=<int>] [n_iter=<int>] [l1_ratio=<float>]
[alpha=<float>] [eta0=<float>] [power_t=<float>]
[penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>] 

You can save SGDClassifier models using the into keyword and apply the saved model later to new data using the apply command.

... | apply sla_model

You can inspect the model learned by SGDClassifier with the summary command.

 ... | summary sla_model

Syntax constraints

  • If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
  • If My_Incremental_Model exists and was trained using SGDClassifier, the command updates the existing model with the new input.
  • If My_Incremental_Model exists but was not trained by SGDClassifier, an error displays.
  • Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
  • If partial_fit=false or partial_fit is not specified, the specified model is created and replaces the pre-trained model if one exists.

Example

The following example uses SGDClassifier on a test set.

... | fit SGDClassifier SLA_violation from * into sla_model 

The following example includes the partial_fit parameter.

| inputlookup iris.csv | fit SGDClassifier species from * partial_fit=true into My_Incremental_Model

SVM

The SVM algorithm uses the scikit-learn kernel-based SVC estimator to fit a model to predict the value of categorical fields. It uses the radial basis function (rbf) kernel by default. For descriptions of the C and gamma parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.

Kernel-based methods such as the scikit-learn SVC tend to work best when the data is scaled, for example, using our StandardScaler algorithm: | fit StandardScaler <fields> into scaling_model | fit SVM <field_to_predict> from <explanatory_fields> into svm_model. For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Parameters

  • The gamma parameter controls the width of the rbf kernel. The default value is 1 / number of fields.
  • The C parameter controls the degree of regularization when fitting the model. The default value is 1.0.
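To illustrate how gamma sets the width of the rbf kernel, the following Python sketch computes the standard kernel value exp(-gamma * squared distance) for two points, defaulting gamma to 1 / number of fields as described above. The function name and sample points are hypothetical.

```python
from math import exp


def rbf_kernel(x, y, gamma=None):
    """Similarity of two points under the radial basis function kernel."""
    if gamma is None:
        gamma = 1.0 / len(x)  # default: 1 / number of fields
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


a, b = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(a, a))             # identical points have similarity 1.0
print(rbf_kernel(a, b))             # default gamma of 0.5 for two fields
print(rbf_kernel(a, b, gamma=5.0))  # larger gamma gives a narrower kernel
```

Because the kernel depends on raw distances, a field with a much larger numeric range than the others dominates the similarity, which is one reason scaling the data first tends to help.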

Syntax

fit SVM <field_to_predict> from <explanatory_fields> [into <model name>] [C=<float>] [gamma=<float>]

You can save SVM models using the into keyword and apply the saved model to new data later using the apply command.

... | apply sla_model

Syntax constraints

You cannot inspect the model learned by SVM with the summary command.

Example

The following example uses SVM on a test set.

... | fit SVM SLA_violation from * into sla_model | ...

Clustering Algorithms

Clustering is the grouping of data points. Results will vary depending upon the clustering algorithm used. Clustering algorithms differ in how they determine if data points are similar and should be grouped. For example, the K-means algorithm clusters based on points in space, whereas the DBSCAN algorithm clusters based on local density.

Birch

The Birch algorithm uses the scikit-learn Birch clustering algorithm to divide data points into a set of distinct clusters. The cluster for each event is set in a new field named cluster. This algorithm supports incremental fit.

Parameters

  • The k parameter specifies the number of clusters to divide the data into after the final clustering step, which treats the sub-clusters from the leaves of the CF tree as new samples.
    • By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.
  • The partial_fit parameter default is False.

Syntax

fit Birch <fields> [into <model name>] [k=<int>] [partial_fit=<true|false>]

You can save Birch models using the into keyword and apply the saved model to new data later using the apply command.

... | apply Birch_model

Syntax constraints

  • If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
  • If My_Incremental_Model exists and was trained using Birch, the command updates the existing model with the new input.
  • If My_Incremental_Model exists but was not trained by Birch, an error message displays.
  • Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
  • If partial_fit=false or partial_fit is not specified, the specified model is created and replaces the pre-trained model if one exists.
  • You cannot inspect the model learned by Birch with the summary command.

Examples

The following example uses Birch on a test set.

... | fit Birch * k=3 | stats count by cluster

The following example includes the partial_fit parameter.

| inputlookup track_day.csv | fit Birch * k=6 partial_fit=true into My_Incremental_Model

DBSCAN

The DBSCAN algorithm uses the scikit-learn DBSCAN clustering algorithm to divide a result set into distinct clusters. The cluster for each event is set in a new field named cluster. DBSCAN is distinct from K-Means in that it clusters results based on local density, and uncovers a variable number of clusters, whereas K-Means finds a precise number of clusters. For example, k=5 finds 5 clusters.

Parameters

  • The eps parameter specifies the maximum distance between two samples for them to be considered in the same cluster.
    • By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
  • The min_samples parameter defines the number of samples, or the total weight, in a neighborhood for a point to be considered as a core point - including the point itself. You can choose the min_samples parameter's best value based on preference for cluster density or noise in your dataset.
  • The min_samples parameter is optional.
  • The min_samples default value is 5.
  • The minimum value for the min_samples parameter is 3.
  • If min_samples=8 you need at least 8 data points to form a dense cluster.

If you choose the min_samples parameter's best value based on noise in your dataset, it's recommended to have a larger data set to pull from.
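The interaction between eps and min_samples can be sketched in a few lines. The following Python function marks a point as a core point when at least min_samples points, including the point itself, lie within eps of it. The function name and sample points are hypothetical, and the brute-force search is only for illustration; it is not the scikit-learn implementation.

```python
from math import dist  # available in Python 3.8 and later


def core_points(points, eps, min_samples=5):
    """Return the points that have at least min_samples neighbors within eps."""
    cores = []
    for p in points:
        # The neighborhood count includes the point itself (distance 0).
        neighbors = sum(1 for q in points if dist(p, q) <= eps)
        if neighbors >= min_samples:
            cores.append(p)
    return cores


points = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(core_points(points, eps=1, min_samples=3))
print(core_points(points, eps=1, min_samples=4))
```

Raising min_samples demands denser neighborhoods, so fewer points qualify as core points and more of the data is treated as noise.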

Syntax

| fit DBSCAN <fields> [eps=<number>] [min_samples=<integer>]

Syntax constraints

You cannot save DBSCAN models using the into keyword. To predict cluster assignments for future data, combine the DBSCAN algorithm with any classifier algorithm. For example, first cluster the data using DBSCAN, then fit RandomForestClassifier to predict the cluster.

Examples

The following example uses DBSCAN without the min_samples parameter.

... | fit DBSCAN * | stats count by cluster

The following example uses DBSCAN with the min_samples parameter.

| inputlookup track_day.csv | fit DBSCAN eps=0.5 min_samples=1000 speed | table speed cluster

G-means

G-means is a clustering algorithm based on K-means. The G-means algorithm is similar in purpose to the X-means algorithm. G-means uses the Anderson-Darling statistical test to determine when to split a cluster.

Using the G-means algorithm has the following advantages:

  • The parameter k is computed automatically
  • G-means can produce more accurate clusters than X-means in some real-world scenarios

Parameters

  • The cluster splitting decision is done using the Anderson-Darling statistical test.
  • The cluster for each event is set in a new field named cluster, and the total number of clusters is set in a new field named n_clusters.
  • By default, the cluster label field name is cluster.
    • You can change the default behavior by using the as keyword to specify a different field name.
  • Optionally use the random_state parameter to set a seed value.
    • random_state must be an integer.

Syntax

| fit GMeans <fields> [into <cluster_model>] [random_state=<int>]

You can apply the saved G-means model to new data using the apply command.

 ... | apply cluster_model 

You can save G-means models using the into keyword. You can inspect the model learned by G-means with the summary command.

... | summary cluster_model

Example

The following example uses G-means on a test set.

| inputlookup housing.csv
| fields median_house_value distance_to_employment_center crime_rate
| fit GMeans * random_state=42 into cluster_model

K-means

K-means clustering is a type of unsupervised learning. It is a clustering algorithm that groups similar data points, with the number of groups represented by the variable k. The K-means algorithm uses the scikit-learn K-means implementation. The cluster for each event is set in a new field named cluster. Use the K-means algorithm when you have unlabeled data and have at least approximate knowledge of the total number of groups into which the data can be divided.

Using the K-means algorithm has the following advantages:

  • Computationally faster than most other clustering algorithms.
  • Simple algorithm to explain and understand.
  • Normally produces tighter clusters than hierarchical clustering.

Using the K-means algorithm has the following disadvantages:

  • Difficult to determine optimal or true value of k. See X-means.
  • Sensitive to scaling. See StandardScaler.
  • Each clustering may be slightly different, unless you specify the random_state parameter.
  • Does not work well with clusters of different sizes and density.

For a description of the default value of k, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Parameters

The k parameter specifies the number of clusters to divide the data into. By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.
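As a rough picture of how K-means groups points into k clusters, the following Python sketch performs a single iteration of the classic procedure: assign each point to its nearest centroid, then move each centroid to the mean of its members. The function name, starting centroids, and points are hypothetical, and the scikit-learn implementation adds initialization and convergence logic not shown here.

```python
from math import dist  # available in Python 3.8 and later


def kmeans_step(points, centroids):
    """Run one assignment step and one update step of K-means."""
    # Assignment: label each point with the index of its nearest centroid.
    labels = [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]
    # Update: move each centroid to the mean of the points assigned to it.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid in place
    return labels, new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels, centroids = kmeans_step(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

Here k is the number of starting centroids, and the labels play the role of the cluster field added to each event.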

Syntax

fit KMeans <fields> [into <model name>]  [k=<int>]  [random_state=<int>] 

You can save K-means models using the into keyword when using the fit command.

You can apply the model to new data using the apply command.

... | apply cluster_model

You can inspect the model using the summary command.

... | summary cluster_model

Example

The following example uses K-means on a test set.

... | fit KMeans * k=3 | stats count by cluster

SpectralClustering

The SpectralClustering algorithm uses the scikit-learn SpectralClustering clustering algorithm to divide a result set into a set of distinct clusters. SpectralClustering first transforms the input data using the Radial Basis Function (rbf) kernel, and then performs K-Means clustering on the result. Consequently, SpectralClustering can learn clusters with a non-convex shape. The cluster for each event is set in a new field named cluster.

Parameters

The k parameter specifies the number of clusters to divide the data into after kernel step. By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.

Syntax

fit SpectralClustering <fields> [k=<int>] [gamma=<float>] [random_state=<int>]

Syntax constraints

You cannot save SpectralClustering models using the into keyword. If you want to be able to predict cluster assignments for future data, you can combine the SpectralClustering algorithm with any classifier algorithm. For example, first cluster the data using SpectralClustering, then fit a classifier to predict the cluster using RandomForestClassifier.

Example

The following example uses SpectralClustering on a test set.

... | fit SpectralClustering * k=3 | stats count by cluster

X-means

Use the X-means algorithm when you have unlabeled data and no prior knowledge of the total number of labels into which that data could be divided. The X-means clustering algorithm is an extended K-means that automatically determines the number of clusters based on Bayesian Information Criterion (BIC) scores. Starting with a single cluster, the X-means algorithm goes into action after each run of K-means, making local decisions about which subset of the current centroids should split themselves in order to fit the data better.

Using the X-means algorithm has the following advantages:

  • Eliminates the requirement of having to provide the value of k.
  • Normally produces tighter clusters than hierarchical clustering.

Using the X-means algorithm has the following disadvantages:

  • Sensitive to scaling. See StandardScaler.
  • Different initializations might result in different final clusters.
  • Does not work well with clusters of different sizes and density.

Parameters

  • The splitting decision is done by computing the BIC.
  • The cluster for each event is set in a new field named cluster, and the total number of clusters is set in a new field named n_clusters.
  • By default, the cluster label field name is cluster.
    • You can change the default behavior by using the as keyword to specify a different field name.

Syntax

fit XMeans <fields> [into <model name>]

You can apply the saved X-means model to new data using the apply command.

 ... | apply cluster_model 

You can save X-means models using the into keyword. You can inspect the model learned by X-means with the summary command.

... | summary cluster_model

Example

The following example uses X-means on a test set.

... | fit XMeans * | stats count by cluster

Cross-validation

Cross-validation assesses how well a statistical model generalizes on an independent dataset. Cross-validation tells you how well your machine learning model is expected to perform on data that it has not been trained on. There are many types of cross-validation, but K-fold cross-validation (kfold_cv) is one of the most common.

Cross-validation is typically used for the following machine learning scenarios:

  • Comparing two or more algorithms against each other for selecting the best choice on a particular dataset.
  • Comparing different choices of hyper-parameters on the same algorithm for choosing the best hyper-parameters for a particular dataset.
  • An improved method over a train/test split for quantifying model generalization.

Cross-validation is not well suited for time-series data:

  • In situations where the data is ordered such as time-series, cross-validation is not well suited because the training data is shuffled. In these situations, other methods such as Forward Chaining are more suitable.
  • The most straightforward implementation is to wrap sklearn's Time Series Split. Learn more here: https://en.wikipedia.org/wiki/Forward_chaining
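For ordered data, forward chaining always trains on earlier samples and tests on the samples that immediately follow. The following Python sketch generates such splits by index. The function name is hypothetical and the split sizes are one reasonable choice, not a reproduction of any library's exact behavior.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Return (train, test) index lists where train always precedes test."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    splits = []
    for start in range(first_test_start, n_samples, test_size):
        train = list(range(0, start))                  # everything before the test window
        test = list(range(start, start + test_size))   # the next block in time order
        splits.append((train, test))
    return splits


for train, test in forward_chaining_splits(n_samples=6, n_splits=2):
    print("train:", train, "test:", test)
```

Unlike shuffled folds, no split ever trains on samples that come after the ones it is tested on.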

K-fold cross-validation

With the kfold_cv parameter, the training set is randomly partitioned into k equal-sized subsamples. Each subsample then takes a turn as the validation (test) set, predicted by a model trained on the other k-1 subsamples. Each sample is used exactly once in the validation set, and the variance of the resulting estimate is reduced as k is increased. The disadvantage of the kfold_cv parameter is that k different models have to be trained, leading to long execution times for large datasets and complex models.

The scores obtained from K-fold cross-validation are generally a less biased and less optimistic estimate of the model performance than a standard training and testing split.

The image is a representative diagram of how the K-fold parameter works. There are 5 rows representing iterations or folds.  Each fold contains equal subsamples that each take a turn as testing and training data.

You can obtain k performance metrics, one for each training and testing split. These k performance metrics can then be averaged to obtain a single estimate of how well the model generalizes on unseen data.
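The partitioning and averaging described above can be sketched in plain Python. In the following example the function names, the seed, and the per-fold scores are hypothetical. It only illustrates how each sample lands in exactly one validation fold and how k scores reduce to a single estimate.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Return k (train, test) index pairs from a shuffled partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random partitioning
    folds = [indices[i::k] for i in range(k)]  # k near-equal subsamples
    splits = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def average(scores):
    """Combine the k per-fold metrics into one estimate."""
    return sum(scores) / len(scores)


splits = kfold_indices(n_samples=10, k=5)
for train, test in splits:
    print("test fold:", sorted(test))

# Hypothetical accuracy values, one per fold.
print(average([0.90, 0.85, 0.95, 0.80, 0.90]))
```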

Syntax

The kfold_cv parameter is applicable to all classification and regression algorithms, and you can append the parameter to the end of an SPL search.

Here kfold_cv=<int> specifies that k=<int> folds is used. When you specify a classification algorithm, stratified k-fold is used instead of k-fold. In stratified k-fold, each fold contains approximately the same percentage of samples for each class.

... | fit <classification | regression algo> <targetVariable> from <featureVariables> [options] kfold_cv=<int>

The kfold_cv parameter cannot be used when saving a model.
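For classification algorithms, stratified k-fold keeps the class proportions roughly equal across folds. The following Python sketch shows one simple way to achieve that by dealing the indices of each class round-robin into the folds. The function name and labels are hypothetical, shuffling is omitted for brevity, and this is not the implementation the toolkit uses.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, balancing each class across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples out one fold at a time.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = ["yes"] * 6 + ["no"] * 3
for fold in stratified_folds(labels, k=3):
    print(fold, [labels[i] for i in fold])
```

Each of the three folds receives two yes samples and one no sample, matching the 2:1 ratio of the full dataset.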

Output

The kfold_cv parameter returns performance metrics on each fold using the same model specified in the SPL, including algorithm and hyperparameters. Its only function is to give you insight into how well your model generalizes. It does not perform any model selection or hyperparameter tuning.

Examples

The first example shows the kfold_cv parameter used in classification. The output is a set of metrics for each fold, including accuracy, f1_weighted, precision_weighted, and recall_weighted.

This is a screen capture of the Statistics tab of the toolkit. There are three rows of results displayed.

The second example shows the kfold_cv parameter used in regression. The output is a set of metrics for each fold, including neg_mean_squared_error and r^2.

This is a screen capture of the Statistics tab of the toolkit. There are three rows of results displayed.

Feature Extraction

Feature extraction algorithms transform fields for better prediction accuracy.

FieldSelector

The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests. For descriptions of the mode and param parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html.

Parameters

The type parameter specifies if the field to predict is categorical or numeric.

Syntax

fit FieldSelector <field_to_predict> from <explanatory_fields>
[into <model name>] [type=<categorical, numeric>]
[mode=<k_best, fpr, fdr, fwe, percentile>] [param=<int>] 

You can save FieldSelector models using the into keyword and apply the saved model to new data later using the apply command.

... | apply sla_model

You can inspect the model learned by FieldSelector with the summary command.

 | summary sla_model

Example

The following example uses FieldSelector on a test set.

... | fit FieldSelector type=categorical SLA_violation from * into sla_model | ...

HashingVectorizer

The HashingVectorizer algorithm converts text documents to a matrix of token occurrences. It uses a feature hashing strategy to allow for hash collisions when measuring the occurrence of tokens. It is a stateless transformer, meaning that it does not require building a vocabulary of the seen tokens. This reduces the memory footprint and allows for larger feature spaces.
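The feature hashing idea can be shown with a short Python sketch. It maps each token to a column by hashing, so no vocabulary has to be stored, and two different tokens may collide in the same column. The function name, the tokenizer, and the use of zlib.crc32 as the hash function are assumptions made for illustration. The scikit-learn HashingVectorizer uses a different hash function and additional options.

```python
import re
import zlib


def hash_vectorize(text, n_features=16):
    """Count token occurrences in a fixed-size vector using the hashing trick."""
    vector = [0] * n_features
    for token in re.findall(r"\w+", text.lower()):
        # The column is derived from the token itself, so no vocabulary is kept.
        column = zlib.crc32(token.encode("utf-8")) % n_features
        vector[column] += 1
    return vector


first = hash_vectorize("login failed login", n_features=8)
second = hash_vectorize("login failed login", n_features=8)
print(first)
print(first == second)  # stateless: the same text always maps to the same vector
```

Because the vector length is fixed in advance, memory use does not grow with the number of distinct tokens. The trade-off is that the original tokens cannot be recovered from the columns.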

HashingVectorizer is comparable with the TFIDF algorithm, as they share many of the same parameters. However, HashingVectorizer is a better option for building models with large text fields, provided you do not need to know term frequencies and only want outcomes.

For descriptions of the ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at https://scikit-learn.org/0.19/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

Parameters

  • The reduce parameter is either True or False and determines whether or not to reduce the output to a smaller dimension using TruncatedSVD.
  • The reduce parameter default is True.
  • The k=<int> parameter sets the number of dimensions to reduce when the reduce parameter is set to true. Default is 100.
  • The default for the max_features parameter is 10,000.
  • The n_iters parameter specifies the number of iterations to use when performing dimensionality reduction. This is only used when the reduce parameter is set to True. Default is 5.

Syntax

fit HashingVectorizer <field_to_convert> [max_features=<int>] [n_iters=<int>]
[reduce=<bool>] [k=<int>] [ngram_range=<int>-<int>] [analyzer=<str>] 
[norm=<str>] [token_pattern=<str>] [stop_words=english]

Syntax constraints

HashingVectorizer does not support saving models, incremental fit, or K-fold cross validation.

Example

The following example uses HashingVectorizer to hash the text dataset and applies KMeans clustering (where k=3) on the hashed fields.

| inputlookup authorization.csv | fit HashingVectorizer Logs ngram_range=1-2 k=50 stop_words=english | fit KMeans Logs_hashed* k=3 | fields cluster* Logs | sample 5 by cluster | sort by cluster

ICA

ICA (Independent component analysis) separates a multivariate signal into additive sub-components that are maximally independent. Typically, ICA is not used for reducing dimensionality, but for separating superimposed signals. Because the ICA model does not include a noise term, whitening must be applied for the model to be correct. Whitening can be done internally using the whiten argument, or manually using one of the PCA variants.

Parameters

  • The n_components parameter determines the number of components ICA uses.
  • The n_components parameter is optional.
  • The n_components parameter default is None. If None is selected, all components are used.
  • Use the algorithm parameter to apply parallel or deflation algorithm for FastICA.
  • The algorithm parameter default is algorithm='parallel'.
  • Use the whiten parameter to specify whether whitening is performed.
  • The whiten parameter is optional.
  • If the whiten parameter is False no whitening is performed.
  • The whiten parameter default is True.
  • The max_iter parameter determines the maximum number of iterations during the running of the fit command.
  • The max_iter parameter is optional.
  • The max_iter parameter default is 200.
  • The fun parameter determines the functional form of the G function used in the approximation to neg-entropy.
  • The fun parameter is optional.
  • The fun parameter default is logcosh. Other options for this parameter are exp or cube.
  • The tol parameter sets the tolerance on update at each iteration.
  • The tol parameter is optional.
  • The tol parameter default is 0.0001.
  • The random_state parameter sets the seed value used by the random number generator.
  • The random_state parameter default is None.
  • If random_state=None then a random seed value is used.

Syntax

fit ICA n_components=<int>, algorithm=<"parallel"|"deflation">, whiten=<bool>, fun=<"logcosh"|"exp"|"cube">, max_iter=<int>, tol=<float>, random_state=<int> <explanatory_fields> [into <model name>]

You can save ICA models using the into keyword and apply new data later using the apply command.

Syntax constraints

You cannot inspect the model learned by ICA with the summary command.

Example

The following example shows how ICA is able to find the two original sources of data from two measurements that have mixes of both. As a comparison, PCA is used to show the difference between the two – PCA is not able to identify the original sources.

| makeresults count=2
| streamstats count as count
| eval time=case(count=2,relative_time(now(),"+2d"),count=1,now())
| makecontinuous time span=15m
| eval _time=time
| eval s1 = sin(2*time)
| eval s2 = sin(4*time)
| eval m1 = 1.5*s1 + .5*s2, m2 = .1*s1 + s2
| fit ICA m1, m2 n_components=2 as IC
| fit PCA m1, m2 k=2 as PC
| fields _time, *
| fields - count, time
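
The search above builds two measurements, m1 and m2, as linear mixes of two sources. The following Python sketch shows that mixing model and recovers the sources by inverting the mixing matrix. It is not ICA: it uses the known mixing weights, whereas ICA has to estimate an unmixing from the measurements alone, and typically recovers the sources only up to order and scale. The mix and unmix names are hypothetical.

```python
import math

# Mixing weights from the example: m1 = 1.5*s1 + 0.5*s2, m2 = 0.1*s1 + s2.
A = [[1.5, 0.5],
     [0.1, 1.0]]


def mix(s1, s2):
    """Combine two source values into two measurements."""
    return (A[0][0] * s1 + A[0][1] * s2, A[1][0] * s1 + A[1][1] * s2)


def unmix(m1, m2):
    """Invert the known 2x2 mixing matrix to recover the sources."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    s1 = (A[1][1] * m1 - A[0][1] * m2) / det
    s2 = (-A[1][0] * m1 + A[0][0] * m2) / det
    return (s1, s2)


t = 0.7
sources = (math.sin(2 * t), math.sin(4 * t))
recovered = unmix(*mix(*sources))
```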

KernelPCA

The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of data. The difference between KernelPCA and PCA is the use of kernels in the former, which helps with finding nonlinear dependencies among the fields. Currently, KernelPCA only supports the Radial Basis Function (rbf) kernel.

For descriptions of the gamma, degree, tolerance, and max_iteration parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html.

Kernel-based methods such as KernelPCA tend to work best when the data is scaled, for example, using our StandardScaler algorithm: | fit StandardScaler into scaling_model | fit KernelPCA into kpca_model. For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Parameters

The k parameter specifies the number of features to be extracted from the data. The other parameters are for fine tuning of the kernel.

Syntax

fit KernelPCA <fields> [into <model name>]
[degree=<int>] [k=<int>] [gamma=<float>]
[tolerance=<int>] [max_iteration=<int>]

You can save KernelPCA models using the into keyword and apply new data later using the apply command.

... | apply user_feedback_model

Syntax constraints

You cannot inspect the model learned by KernelPCA with the summary command.

Example

The following example uses KernelPCA on a test set.

... | fit KernelPCA * k=3 gamma=0.001 | ...

NPR

The Normalized Perlich Ratio (NPR) algorithm converts high cardinality categorical field values into numeric field entries while intelligently handling space optimization. NPR offers low computational costs to perform feature extraction on variables with high cardinalities such as ZIP codes or IP addresses.

Unlike other algorithms that leverage the fit and apply commands, NPR does not perform one-hot encoding.

Parameters

  • Use the summary command to inspect the variance information of the saved model.
  • After running NPR the transformed dataset has calculated ratios for all feature variables (feature_field). Based on the training data NPR calculates a variable of X_unobserved which can be used as a replacement value in the following two scenarios:
    • In conjunction with the fit command NPR initially replaces missing values in the dataset for feature_field with the keyword unobserved which is then replaced by the calculated NPR value of X_unobserved.
    • In conjunction with the apply command, NPR uses X_unobserved for any new value for target_field that was not visible during model training but is encountered in the test dataset.
  • The number of transformed columns created after running NPR is equal to the number of distinct values for feature_field within the search string.
  • From the saved model, use the variance output field to examine the contribution of a particular feature towards the accuracy of the prediction. Higher variance indicates highly important categorical values whereas low variance indicates the value being of lower importance towards the target prediction. Variance may assist in the process of discarding irrelevant feature variables.

Syntax

fit NPR <target_field> from <feature_field> [into <model name>]

You can couple NPR with existing MLTK algorithms to feed the transformed results to the model as a means to enhance predictions.

| fit NPR <target_field> from <feature_field> | fit SGDClassifier <target_field> from NPR

You can save NPR models using the into keyword and apply new data later using the apply command.

| inputlookup disk_failures.csv | tail 1000 | apply npr_disk

You can inspect the model learned by NPR with the summary command.

 | summary npr_disk

Syntax constraints

  • The wildcard (*) character is not supported.
  • The maximum matrix size, calculated from |X| * |Y| where X is the feature_field and Y is the target_field, is 10000000. For example, if the number of distinct categorical feature values is 1000 and the number of distinct categorical target values is 100, then the matrix size is 100000.
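
The size constraint above can be checked before running fit. The following Python sketch computes |X| * |Y| from the distinct values of each field. The function names are hypothetical, and treating the limit as inclusive is an assumption.

```python
MAX_MATRIX_SIZE = 10_000_000  # limit stated in the syntax constraints


def npr_matrix_size(feature_values, target_values):
    """Return |X| * |Y| for the distinct feature and target values."""
    return len(set(feature_values)) * len(set(target_values))


def fits_npr_limit(feature_values, target_values):
    """Return True when the matrix size is within the limit (assumed inclusive)."""
    return npr_matrix_size(feature_values, target_values) <= MAX_MATRIX_SIZE


# 1000 distinct feature values and 100 distinct target values, as in the text.
size = npr_matrix_size(range(1000), range(100))
```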

Examples

The following example uses NPR on a test set.

| inputlookup disk_failures.csv | head 5000 | fit NPR DiskFailure from Model into npr_disk

The following example couples NPR with another MLTK algorithm on a test set.

| inputlookup disk_failures.csv | head 5000 | fit NPR DiskFailure from Model | fit SGDClassifier DiskFailure from NPR_* random_state=42 n_iter=2 | score accuracy_score DiskFailure against predicted*

The following example uses NPR over multiple fields with additional uses of the fit command.

| inputlookup disk_failures.csv | head 5000 
| fit NPR DiskFailure from Model into npr_disk_1 
| fit NPR DiskFailure from SerialNumber into npr_disk_2

PCA

The Principal Component Analysis (PCA) algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new, uncorrelated features out of the data.

Parameters

  • The k parameter specifies the number of features to be extracted from the data.
  • The variance parameter is short for percentage variance ratio explained. This parameter determines the percentage of variance ratio explained in the principal components of the PCA. It computes the number of principal components dynamically by preserving the specified variance ratio.
  • The variance parameter defaults to 1 if k is not provided.
  • The variance parameter can take a value between 0 and 1. 
  • The explained_variance parameter measures the proportion to which the principal component accounts for dispersion of a given dataset. A higher value denotes a higher variation.
  • The explained_variance_ratio parameter is the percentage of variance explained by each of the selected components.
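
The variance parameter selects the number of components from the cumulative explained variance ratio. The following Python sketch shows one way to make that selection from a list of per-component explained variances sorted in descending order. The function name is hypothetical, and whether a cumulative ratio exactly equal to the requested value counts as sufficient is an assumption.

```python
def components_for_variance(explained_variance, variance):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches the requested fraction."""
    total = sum(explained_variance)
    cumulative = 0.0
    for count, value in enumerate(explained_variance, start=1):
        cumulative += value / total
        if cumulative >= variance:
            return count
    return len(explained_variance)


# Ratios are 0.5, 0.25, 0.125, 0.125 for these example variances.
k = components_for_variance([4.0, 2.0, 1.0, 1.0], 0.6)
```

Here the first component explains 50% of the variance and the first two explain 75%, so a request for 60% keeps two components.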

Syntax

fit PCA <fields> [into <model name>] [k=<int>] [variance=<float>]

You can save PCA models using the into keyword and apply new data later using the apply command.

...into example_hard_drives_PCA_2 | apply example_hard_drives_PCA_2

You can inspect the model learned by PCA with the summary command.

| summary example_hard_drives_PCA_2

Syntax constraints

The variance parameter and k parameter cannot be used together. They are mutually exclusive.

Examples

The following example uses PCA on a test set.

 | fit PCA "SS_SMART_1_Raw", "SS_SMART_2_Raw", "SS_SMART_3_Raw", "SS_SMART_4_Raw", "SS_SMART_5_Raw" k=2 into example_hard_drives_PCA_2

The following example includes the variance parameter. The value variance=0.5 tells the algorithm to choose as many principal components as are needed to explain 50% of the variance in the original dataset.

| fit PCA "SS_SMART_1_Raw", "SS_SMART_2_Raw", "SS_SMART_3_Raw", "SS_SMART_4_Raw", "SS_SMART_5_Raw" variance=0.50 into example_hard_drives_PCA_2

TFIDF

The TFIDF algorithm uses the scikit-learn TfidfVectorizer to convert raw text data into a matrix making it possible to use other machine learning estimators on the data. For descriptions of max_df, min_df, ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.

TFIDF uses memory to create a dictionary of all terms including ngrams and words, and expands the Splunk search events with additional fields per event. If you are concerned with memory limits, consider using the HashingVectorizer algorithm.

Parameters

To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set stop_words to english. For other languages (for example, machine language) you can ignore the common words by setting max_df to a value greater than or equal to 0.7 and less than 1.0.
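
The following Python sketch illustrates how a document-frequency cutoff such as max_df removes very common terms without a stop-word list. It is a simplified illustration, not the TfidfVectorizer implementation: the filter_terms name and the whitespace tokenizer are assumptions, and max_df and min_df are treated as fractions of documents.

```python
def filter_terms(documents, max_df=1.0, min_df=0.0):
    """Keep terms whose document-frequency fraction lies in [min_df, max_df]."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    )


docs = [
    "the disk failed",
    "the user logged in",
    "the disk is full",
    "the network is down",
]
kept = filter_terms(docs, max_df=0.7, min_df=0.5)
```

In this toy corpus "the" appears in every document, so a max_df of 0.7 drops it, while min_df of 0.5 drops terms that appear in only one document.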

Syntax

fit TFIDF <field_to_convert> [into <model name>] [max_df=<int>] [min_df=<int>] [ngram_range=<int>-<int>]
[analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]

You can save TFIDF models using the into keyword and apply new data later using the apply command.

... | apply user_feedback_model

Syntax constraints

You cannot inspect the model learned by TFIDF with the summary command.

Example

The following example uses TFIDF to convert the text dataset to a matrix of TF-IDF features and then applies KMeans clustering (where k=3) on the matrix.

| inputlookup authorization.csv | fit TFIDF Logs ngram_range=1-2 max_df=0.6 min_df=0.2 stop_words=english | fit KMeans Logs_tfidf* k=3 | fields cluster Logs | sample 6 by cluster | sort by cluster

Preprocessing (Prepare Data)

Preprocessing algorithms are used for preparing data. Other algorithms can also be used for preprocessing that may not be organized under this section. For example, PCA can be used for both Feature Extraction and Preprocessing.

Imputer

The Imputer algorithm is a preprocessing step wherein missing data is replaced with substitute values. The substitute values can be estimated, or based on other statistics or values in the dataset. To use Imputer, the user passes in the names of the fields to impute, along with arguments specifying the imputation strategy, and the values representing missing data. Imputer then adds new imputed versions of those fields to the data, which are copies of the original fields, except that their missing values are replaced by values computed according to the imputation strategy.

Parameters

  • Available imputation strategies include mean, median, most frequent, and field. The default strategy is mean.
  • All strategies except field require numeric data. The field strategy accepts categorical data.
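
The following Python sketch illustrates the mean, median, and most_frequent strategies on a single numeric field, with None standing in for a missing value. The impute name is hypothetical, and the field strategy is not covered here.

```python
import statistics


def impute(values, strategy="mean"):
    """Return a copy of values with None entries replaced per the strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [fill if v is None else v for v in values]


data = [1.0, None, 2.0, 2.0, 7.0]
by_mean = impute(data, "mean")
by_median = impute(data, "median")
```

The original list is left unchanged, which mirrors how Imputer adds new imputed fields rather than overwriting the source fields.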

Syntax

... | fit Imputer <field>* [as <field prefix>] [missing_values=<"NaN"|integer>] [strategy=<mean|median|most_frequent>] [into <model name>]

You can inspect the value (mean, median, or mode) that was substituted for missing values by Imputer with the summary command.

... | summary <imputer model name>

You can save Imputer models using the into keyword and apply new data later using the apply command.

... | apply <imputer model name>

Example

The following example uses Imputer on a test set.

| inputlookup server_power.csv
| eval ac_power_missing=if(random() % 3 = 0, null, ac_power)
| fields - ac_power
| fit Imputer ac_power_missing
| eval imputed=if(isnull(ac_power_missing), 1, 0)
| eval ac_power_imputed=round(Imputed_ac_power_missing, 1)
| fields - ac_power_missing, Imputed_ac_power_missing

RobustScaler

The RobustScaler algorithm uses the scikit-learn RobustScaler algorithm to standardize data fields by scaling their median and interquartile range to 0 and 1, respectively. It is very similar to the StandardScaler algorithm, in that it helps avoid dominance of one or more fields over others in subsequent machine learning algorithms, and is practically required for some algorithms, such as KernelPCA and SVM. The main difference between StandardScaler and RobustScaler is that RobustScaler is less sensitive to outliers.

Parameters

The with_centering and with_scaling parameters specify if the fields should be standardized with respect to their median and interquartile range.
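
The following Python sketch shows the scaling described above for one field: subtract the median and divide by the interquartile range. The robust_scale name is hypothetical, and the quartiles are computed with linear interpolation between data points, which is assumed to match the scikit-learn default.

```python
import statistics


def robust_scale(values, with_centering=True, with_scaling=True):
    """Center on the median and scale by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = median if with_centering else 0.0
    # Fall back to a scale of 1 when scaling is off or the IQR is zero.
    scale = iqr if with_scaling and iqr != 0 else 1.0
    return [(v - center) / scale for v in values]


scaled = robust_scale([1, 2, 3, 4, 100])
```

The outlier value 100 does not affect the median or the IQR in this example, so the other four values keep a sensible spread after scaling.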

Syntax

fit RobustScaler <fields> [into <model name>] [with_centering=<true|false>] [with_scaling=<true|false>] 

You can save RobustScaler models using the into keyword and apply new data later using the apply command.

... | apply scaling_model

You can inspect the statistics extracted by RobustScaler with the summary command.

... | summary scaling_model

Syntax constraints

RobustScaler does not support incremental fit.

Example

The following example uses RobustScaler on a test set.

... | fit RobustScaler *  | ...

StandardScaler

The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize data fields by scaling their mean and standard deviation to 0 and 1, respectively. This preprocessing step helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms. This step is practically required for some algorithms, such as KernelPCA and SVM. This algorithm supports incremental fit.

Parameters

  • The with_mean and with_std parameters specify if the fields should be standardized with respect to their mean and standard deviation.
  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set. The default is False.
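
Incremental fitting works for a scaler because the mean and variance can be updated from new data without revisiting old data. The following Python sketch uses Welford's online update to show this. The RunningScaler class is hypothetical and is not how MLTK or scikit-learn implements partial_fit.

```python
import math


class RunningScaler:
    """Track count, mean, and variance so batches can be added incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def partial_fit(self, batch):
        # Welford's online update, applied one value at a time.
        for x in batch:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
        return self

    def transform(self, batch):
        # Population standard deviation; a zero deviation falls back to 1.
        std = math.sqrt(self.m2 / self.count) or 1.0
        return [(x - self.mean) / std for x in batch]


scaler = RunningScaler().partial_fit([1.0, 2.0]).partial_fit([3.0, 4.0])
```

After the two calls, the scaler holds the same mean and variance it would have after a single fit on all four values.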

Syntax

fit StandardScaler <fields> [into <model name>] [with_mean=<true|false>] [with_std=<true|false>] [partial_fit=<true|false>]

You can save StandardScaler models using the into keyword and apply new data later using the apply command.

... | apply scaling_model

You can inspect the statistics extracted by StandardScaler with the summary command.

... | summary scaling_model

Syntax constraints

  • Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=false or partial_fit is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.
  • If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
  • If My_Incremental_Model exists and was trained using StandardScaler, the command updates the existing model with the new input.
  • If My_Incremental_Model exists but was not trained by StandardScaler, an error message is thrown.

Examples

The following example uses StandardScaler on a test set.

... | fit StandardScaler *  | ...

The following example includes the partial_fit parameter.

| inputlookup track_day.csv | fit StandardScaler "batteryVoltage", "engineCoolantTemperature", "engineSpeed" partial_fit=true into My_Incremental_Model

Regressors

Regressor algorithms predict the value of a numeric field.

AutoPrediction

AutoPrediction automatically determines the data type as categorical or numeric. AutoPrediction then invokes the RandomForestRegressor algorithm to carry out the prediction. For further details, see RandomForestRegressor. AutoPrediction also executes the data split for training and testing during the fit process, eliminating the need for a separate command or macro. AutoPrediction uses particular cases to determine the data type, and uses the train_test_split function from sklearn to perform the data split. The kfold cross-validation command can be used with AutoPrediction. See K-fold cross-validation.

Parameters

  • Use the target_type parameter to specify the target field as numeric or categorical.
  • The target_type parameter default is auto. When auto is used, AutoPrediction automatically determines the target field type.
  • AutoPrediction uses the following data types to determine the target_type field as categorical:
    • Data of type bool, str, or numpy.object
    • Data of type int and the criterion option is specified
  • AutoPrediction determines the target_type field as numeric for all other cases.
  • The test_split_ratio specifies the splitting of data for model training and model validation. Value must be a float between 0 (inclusive) and 1 (exclusive).
  • The test_split_ratio default is 0. A value of 0 means all data points get used to train the model.
    • A test_split_ratio value of 0.3, for example, means 30% of the data points get used for testing and 70% are used for training.
  • Use n_estimators to optionally specify the number of trees.
  • Use max_depth to optionally set the maximum depth of the tree.
  • Specify the criterion value for classification (categorical) scenarios.
  • Ignore the criterion value for regression (numeric) scenarios.
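
The following Python sketch illustrates what test_split_ratio means for a list of rows. It is a simplified stand-in for the train_test_split function, not a copy of it: the split_rows name is hypothetical, and the shuffling and rounding rules are assumptions.

```python
import random


def split_rows(rows, test_split_ratio=0.0, random_state=None):
    """Shuffle rows and hold out a test_split_ratio fraction for testing."""
    if not 0 <= test_split_ratio < 1:
        raise ValueError("test_split_ratio must be in [0, 1)")
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)
    # Rounding to the nearest whole row is an illustrative choice.
    n_test = round(len(shuffled) * test_split_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


train, test = split_rows(range(10), test_split_ratio=0.3, random_state=42)
```

With the default ratio of 0, the test list is empty and every row is used for training.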

Syntax

fit AutoPrediction Target from Predictors* into PredictorModel target_type=<auto|numeric|categorical> test_split_ratio=<[0-1]> [n_estimators=<int>] [max_depth=<int>]
[criterion=<gini | entropy>] [random_state=<int>] [max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]

You can save AutoPrediction models using the into keyword and apply the saved model later to new data using the apply command.

 ... | apply PredictorModel

You can inspect the model learned by AutoPrediction with the summary command.

 ... | summary PredictorModel

Syntax constraints

  • AutoPrediction does not support partial_fit.
  • Regression performance output columns for RMSE and rSquared only appear if the target_type is numeric.
  • Classification performance output columns for accuracy, f1, precision, and recall only appear if the target_type is categorical.

Example

The following example uses AutoPrediction on a test set.

| fit AutoPrediction sepal_length from * into auto_regress_model test_split_ratio=0.3 random_state=42

DecisionTreeRegressor

The DecisionTreeRegressor algorithm uses the scikit-learn DecisionTreeRegressor estimator to fit a model to predict the value of numeric fields. The kfold cross-validation command can be used with DecisionTreeRegressor. See K-fold cross-validation.

For descriptions of the max_depth, random_state, max_features, min_samples_split, max_leaf_nodes, and splitter parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html.

Parameters

To specify the maximum depth of the tree to summarize, use the limit argument. The default value for the limit argument is 5.

| summary model_DTR limit=10

Syntax

fit DecisionTreeRegressor <field_to_predict> from <explanatory_fields> [into <model_name>]
[max_depth=<int>] [max_features=<str>] [min_samples_split=<int>] [random_state=<int>]
[max_leaf_nodes=<int>] [splitter=<best|random>]

You can save DecisionTreeRegressor models using the into keyword and apply them to new data later using the apply command.

... | apply model_DTR

You can inspect the decision tree learned by DecisionTreeRegressor with the summary command.

... | summary model_DTR

You can get a JSON representation of the tree by giving json=t as an argument to the summary command.

 ... | summary model_DTR json=t

Example

The following example uses DecisionTreeRegressor on a test set.

... | fit DecisionTreeRegressor temperature from date_month date_hour into temperature_model | ...

ElasticNet

The ElasticNet algorithm uses the scikit-learn ElasticNet estimator to fit a model to predict the value of numeric fields. ElasticNet is a linear regression model that includes both L1 and L2 regularization and is a generalization of Lasso and Ridge. The kfold cross-validation command can be used with ElasticNet. See K-fold cross-validation.

For descriptions of the fit_intercept, normalize, alpha, and l1_ratio parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html.
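
The following Python sketch shows how alpha and l1_ratio combine the L1 and L2 terms in the penalty that scikit-learn's ElasticNet adds to the squared-error loss. The function name is hypothetical, and the sketch computes the penalty only, not the fit.

```python
def elastic_net_penalty(coefficients, alpha=1.0, l1_ratio=0.5):
    """Return alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2."""
    l1 = sum(abs(w) for w in coefficients)
    l2 = sum(w * w for w in coefficients)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2


weights = [1.0, -2.0]
lasso_like = elastic_net_penalty(weights, alpha=0.5, l1_ratio=1.0)  # L1 only
ridge_like = elastic_net_penalty(weights, alpha=0.5, l1_ratio=0.0)  # L2 only
mixed = elastic_net_penalty(weights, alpha=0.5, l1_ratio=0.5)
```

Setting l1_ratio to 1 leaves only the L1 term, as in Lasso, and setting it to 0 leaves only the L2 term, as in Ridge.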

Syntax

fit ElasticNet <field_to_predict> from <explanatory_fields>
[into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>] 
[alpha=<float>] [l1_ratio=<float>]

You can save ElasticNet models using the into keyword and apply new data later using the apply command.

... | apply temperature_model

You can inspect the coefficients learned by ElasticNet with the summary command.

 ... | summary temperature_model

Example

The following example uses ElasticNet on a test set.

... | fit ElasticNet temperature from date_month date_hour normalize=true alpha=0.5 | ...

GradientBoostingRegressor

This algorithm uses the GradientBoostingRegressor algorithm from scikit-learn to build a regression model by fitting regression trees on the negative gradient of a loss function. The kfold cross-validation command can be used with GradientBoostingRegressor. See K-fold cross-validation.

For further information see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

Syntax

fit GradientBoostingRegressor <field_to_predict> from <explanatory_fields>
[into <model_name>] [loss=<ls|lad|huber|quantile>] 
[max_features=<str>] [learning_rate=<float>] [min_weight_fraction_leaf=<float>] 
[alpha=<float>] [subsample=<float>] [n_estimators=<int>] [max_depth=<int>] 
[min_samples_split=<int>] [min_samples_leaf=<int>] [max_leaf_nodes=<int>]
[random_state=<int>]

You can use the apply command to apply the trained model to new data.

... | apply temperature_model

You can inspect the features learned by GradientBoostingRegressor with the summary command.

 ... | summary temperature_model

Example

The following example uses the GradientBoostingRegressor algorithm to fit a model and saves that model as temperature_model.

... | fit GradientBoostingRegressor temperature from date_month date_hour into temperature_model | ...

KernelRidge

The KernelRidge algorithm uses the scikit-learn KernelRidge algorithm to fit a model to predict numeric fields. This algorithm uses the radial basis function (rbf) kernel by default. The kfold cross-validation command can be used with KernelRidge. See K-fold cross-validation.

For details, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html.

Parameters

The gamma parameter controls the width of the rbf kernel. The default value is 1/ number of fields.
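
The following Python sketch shows the rbf kernel value between two rows and the effect of gamma on it. The rbf_kernel function here is a hypothetical helper written for illustration; it applies the default of 1 / number of fields when gamma is not given.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """RBF kernel exp(-gamma * ||x - y||^2) between two equal-length vectors."""
    if gamma is None:
        gamma = 1.0 / len(x)  # default described above: 1 / number of fields
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([0.0, 0.0], [0.0, 0.0])
default_gamma = rbf_kernel([0.0, 0.0], [1.0, 1.0])
narrow = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=5.0)
```

A larger gamma makes the kernel narrower, so the similarity between two rows falls off faster as they move apart.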

Syntax

fit KernelRidge <field_to_predict> from <explanatory_fields> [into <model_name>] [gamma=<float>]

You can save KernelRidge models using the into keyword and apply new data later using the apply command.

... | apply sla_model

Syntax constraints

You cannot inspect the model learned by KernelRidge with the summary command.

Example

The following example uses KernelRidge on a test set.

... | fit KernelRidge temperature from date_month date_hour into temperature_model

Lasso

The Lasso algorithm uses the scikit-learn Lasso estimator to fit a model to predict the value of numeric fields. Lasso is like LinearRegression, but it uses L1 regularization to learn linear models with fewer and smaller coefficients. Lasso models are consequently more robust to noise and resilient against overfitting. The kfold cross-validation command can be used with Lasso. See K-fold cross-validation.

For descriptions of the alpha, fit_intercept, and normalize parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.

Parameters

  • The alpha parameter controls the degree of L1 regularization.
  • The fit_intercept parameter specifies whether the model should include an implicit intercept term. The default value is True.

Syntax

fit Lasso <field_to_predict> from <explanatory_fields>
[into <model name>] [alpha=<float>] [fit_intercept=<true|false>] [normalize=<true|false>]

You can save Lasso models using the into keyword and apply new data later using the apply command.

... | apply temperature_model

You can inspect the coefficients learned by Lasso with the summary command.

 ... | summary temperature_model

Example

The following example uses Lasso on a test set.

... | fit Lasso temperature from date_month date_hour  | ...

LinearRegression

The LinearRegression algorithm uses the scikit-learn LinearRegression estimator to fit a model to predict the value of numeric fields. The kfold cross-validation command can be used with LinearRegression. See K-fold cross-validation.

Parameters

The fit_intercept parameter specifies whether the model should include an implicit intercept term. The default value is True.
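
The following Python sketch shows the effect of fit_intercept for a single explanatory field using ordinary least squares. The fit_line name is hypothetical, and the sketch covers only the one-field case.

```python
def fit_line(xs, ys, fit_intercept=True):
    """Least-squares fit of y = slope * x + intercept for one explanatory field."""
    if fit_intercept:
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
    else:
        mean_x = mean_y = 0.0  # force the line through the origin
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
with_intercept = fit_line(xs, ys)
through_origin = fit_line(xs, ys, fit_intercept=False)
```

Without an intercept term the fitted slope has to absorb the offset in the data, so it no longer matches the slope that generated the values.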

Syntax

fit LinearRegression <field_to_predict> from <explanatory_fields> [into <model name>]
[fit_intercept=<true|false>] [normalize=<true|false>]

You can save LinearRegression models using the into keyword and apply new data later using the apply command.

... | apply temperature_model

You can inspect the coefficients learned by LinearRegression with the summary command.

 ... | summary temperature_model

Example

The following example uses LinearRegression on a test set.

... | fit LinearRegression temperature from date_month date_hour into temperature_model | ...

RandomForestRegressor

The RandomForestRegressor algorithm uses the scikit-learn RandomForestRegressor estimator to fit a model to predict the value of numeric fields. The kfold cross-validation command can be used with RandomForestRegressor. See K-fold cross-validation.

For descriptions of the n_estimators, random_state, max_depth, max_features, min_samples_split, and max_leaf_nodes parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

Syntax

fit RandomForestRegressor <field_to_predict> from <explanatory_fields>
[into <model name>] [n_estimators=<int>] [max_depth=<int>] [random_state=<int>]
[max_features=<str>] [min_samples_split=<int>] [max_leaf_nodes=<int>]

You can save RandomForestRegressor models using the into keyword and apply new data later using the apply command.

... | apply temperature_model

You can list the features that were used to fit the model, as well as their relative importance or influence with the summary command.

 ... | summary temperature_model

Example

The following example uses RandomForestRegressor on a test set.

... | fit RandomForestRegressor temperature from date_month date_hour into temperature_model | ...

Ridge

The Ridge algorithm uses the scikit-learn Ridge estimator to fit a model to predict the value of numeric fields. Ridge is like LinearRegression, but it uses L2 regularization to learn linear models with smaller coefficients, making the algorithm more robust to collinearity. The kfold cross-validation command can be used with Ridge. See K-fold cross-validation.

For descriptions of the fit_intercept, normalize, and alpha parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html.

Parameters

The alpha parameter specifies the degree of regularization. The default value is 1.0.

Syntax

fit Ridge <field_to_predict> from <explanatory_fields>
[into <model name>] [fit_intercept=<true|false>] [normalize=<true|false>]
[alpha=<float>]

You can save Ridge models using the into keyword and apply new data later using the apply command.

... | apply temperature_model

You can inspect the coefficients learned by Ridge with the summary command.

 ... | summary temperature_model

Example

The following example uses Ridge on a test set.

... | fit Ridge temperature from date_month date_hour normalize=true alpha=0.5 | ...

SGDRegressor

The SGDRegressor algorithm uses the scikit-learn SGDRegressor estimator to fit a model to predict the value of numeric fields. The kfold cross-validation command can be used with SGDRegressor. See K-fold cross-validation. This algorithm supports incremental fit.

Parameters

  • The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set. The default is False.
  • The fit_intercept=<true|false> parameter determines whether the intercept should be estimated or not.
  • The fit_intercept=<true|false> parameter default is True.
  • The n_iter=<int> parameter is the number of passes over the training data also known as epochs. The default is 5.
    • The number of iterations is set to 1 if using partial_fit.
  • The penalty=<l2|l1|elasticnet> parameter sets the penalty or regularization term to be used. The default is l2.
  • The learning_rate=<constant|optimal|invscaling> parameter is the learning rate.
    • constant: eta = eta0
    • optimal: eta = 1.0/(alpha * t)
    • invscaling: eta = eta0 / pow(t, power_t)
    • default is invscaling.
  • The l1_ratio=<float> parameter is the Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. Default is 0.15.
    • l1_ratio=0 corresponds to L2 penalty
    • l1_ratio=1 to L1
  • The alpha=<float> parameter is the constant that multiplies the regularization term. Default is 0.0001.
    • Also used to compute learning_rate when set to optimal.
  • The eta0=<float> parameter is the initial learning rate. Default is 0.01.
  • The power_t=<float> parameter is the exponent for inverse scaling learning rate. Default is 0.25.
  • The random_state=<int> parameter is the seed of the pseudo random number generator to use when shuffling the data.
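
The following Python sketch evaluates the three learning_rate schedules exactly as they are written in the list above, using the listed defaults for eta0, alpha, and power_t. The learning_rate function name is hypothetical, and t is assumed to be the update counter starting at 1.

```python
def learning_rate(schedule, t, eta0=0.01, alpha=0.0001, power_t=0.25):
    """Step size eta at update t (t >= 1) for the schedules listed above."""
    if schedule == "constant":
        return eta0
    if schedule == "optimal":
        return 1.0 / (alpha * t)
    if schedule == "invscaling":
        return eta0 / pow(t, power_t)
    raise ValueError(f"unknown schedule: {schedule}")


constant_eta = learning_rate("constant", t=16)
invscaling_eta = learning_rate("invscaling", t=16)  # 0.01 / 16**0.25
optimal_eta = learning_rate("optimal", t=100)
```

The constant schedule keeps the same step size for every update, while the other two shrink the step as t grows.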

Syntax

fit SGDRegressor <field_to_predict> from <explanatory_fields>
[into <model name>] [partial_fit=<true|false>] [fit_intercept=<true|false>]
[random_state=<int>] [n_iter=<int>] [l1_ratio=<float>]
[alpha=<float>] [eta0=<float>] [power_t=<float>]
[penalty=<l1|l2|elasticnet>] [learning_rate=<constant|optimal|invscaling>]

You can save SGDRegressor models using the into keyword and apply new data later using the apply command.

... | apply temperature_model

You can inspect the coefficients learned by SGDRegressor with the summary command.

 ... | summary temperature_model

Syntax constraints

  • If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.
  • If My_Incremental_Model exists and was trained using SGDRegressor, the command updates the existing model with the new input.
  • If My_Incremental_Model exists but was not trained by SGDRegressor, an error message displays.
  • Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead.
  • If partial_fit=false or partial_fit is not specified the model specified is created and replaces the pre-trained model if one exists.

Examples

The following example uses SGDRegressor on a test set.

... | fit SGDRegressor temperature from date_month date_hour into temperature_model | ...

The following example includes the partial_fit parameter.

| inputlookup server_power.csv | fit SGDRegressor "ac_power" from "total-cpu-utilization" "total-disk-accesses" partial_fit=true into My_Incremental_Model

System Identification

Use the System Identification algorithm to model both non-linear and linear relationships. In a typical use case you predict a number of target fields from their past values as well as from the past and current values of other feature fields. The System Identification algorithm is powered by a multi-layered, fully-connected neural network. System Identification supports incremental fit.

The System Identification algorithm only works with numeric field values.

Parameters

  • The wildcard character is supported within target and feature fields.
  • Use the dynamics parameter to specify the amount of lag to be used for each variable.
    • The dynamics parameter is required.
    • Must be a list of non-negative integers separated by hyphens.
    • The number of non-negative integers listed must equal the number of target variables plus the number of feature variables.
    • The non-negative integers align with target and feature variables based on the order in which they are written.
    • One dynamic value is matched with each wildcard and the same amount applies to all fields matched by that wildcard.
  • The conf_interval parameter specifies the confidence interval percentage for the prediction.
    • Value must be between 1 and 99.
    • A larger number means a greater tolerance for prediction uncertainty.
    • Default value is 95.
    • The conf_interval number used with the fit command does not need to be the same number used with the apply command.
  • The layers option specifies the number of hidden layers and their sizes in the neural network.
    • Must be a list of positive integers separated by hyphens.
    • Option defaults to 64-64 for two layers, each of a size of 64.
  • Use the epochs option to specify the number of iterations during training.
    • Must be a positive integer.
    • Default value for epochs is 500.
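The pairing rules for the dynamics parameter can be sketched in a few lines of Python. This is an illustration of the rules listed above, not the MLTK implementation. The function name assign_lags and the list of available fields are assumptions, and wildcard matching is approximated with the standard library's fnmatch module.

```python
from fnmatch import fnmatchcase


def assign_lags(dynamics, targets, features, available_fields):
    """Map each matched field to its lag, following the dynamics rules.

    targets and features are lists of field names or wildcard patterns,
    in the order they are written in the search.
    """
    # int() rejects empty or non-numeric parts; splitting on hyphens
    # means every parsed value is non-negative.
    lags = [int(part) for part in dynamics.split("-")]
    patterns = list(targets) + list(features)
    if len(lags) != len(patterns):
        raise ValueError(
            "dynamics has %d values but %d fields or wildcards were given"
            % (len(lags), len(patterns))
        )
    assignment = {}
    for pattern, lag in zip(patterns, lags):
        # One value per wildcard; it applies to every field it matches.
        for field in available_fields:
            if fnmatchcase(field, pattern):
                assignment[field] = lag
    return assignment


if __name__ == "__main__":
    fields = ["Expenses", "HR1", "HR2", "ERP", "CRM"]  # hypothetical
    print(assign_lags("3-2-3", ["Expenses"], ["HR*", "ERP"], fields))
```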

Syntax

| fit SystemIdentification <target-fields> from <feature-fields> dynamics=<int-int-...> [conf_interval=<int>] [layers=<int-int-...>] [epochs=<int>] [into <model-name>]

You can apply the saved model to new data with the apply command.

| apply <model-name> [conf_interval=<int>]

You can inspect the model learned by System Identification with the summary command.

| summary <model-name>

Syntax constraints

System Identification cannot be used with k-fold cross-validation.

Examples

The following example uses three lags of Expenses, two lags of HR1, two lags of HR2, and three lags of ERP.

| inputlookup app_usage.csv | fit SystemIdentification Expenses from HR1 HR2 ERP dynamics=3-2-2-3

The following example uses three lags of Expenses, two lags of all fields that start with HR, and three lags of ERP.

| inputlookup app_usage.csv | fit SystemIdentification Expenses from HR* ERP dynamics=3-2-3

The following example uses a fully-connected neural network with three hidden layers, each with a layer size of 64. The total number of layers in the neural network is five: one input layer, three hidden layers, and one output layer.

| inputlookup app_usage.csv | fit SystemIdentification Expenses from HR1 HR2 ERP dynamics=3-1-2-3 layers=64-64-64
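The layer arithmetic in the example above can be checked with a short sketch. The helper name describe_network is hypothetical. It only parses a layers value and counts layers the way the description does, with one input layer and one output layer added to the hidden layers.

```python
def describe_network(layers="64-64"):
    """Parse a layers value and count the layers it implies."""
    hidden_sizes = [int(size) for size in layers.split("-")]
    if any(size <= 0 for size in hidden_sizes):
        raise ValueError("layer sizes must be positive integers")
    return {
        "hidden_sizes": hidden_sizes,
        # One input layer plus one output layer surround the hidden layers.
        "total_layers": len(hidden_sizes) + 2,
    }


if __name__ == "__main__":
    print(describe_network("64-64-64"))
    print(describe_network())  # the documented default, 64-64
```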

The following example uses System Identification on a test set.

[Image: results from running the System Identification algorithm on a test set, displayed on the Visualizations tab of the Machine Learning Toolkit.]

Time Series Analysis

Forecasting algorithms, also known as time series analysis, provide methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data, and forecast its future values.

ARIMA

The Autoregressive Integrated Moving Average (ARIMA) algorithm uses the StatsModels ARIMA algorithm to fit a model on a time series, either to better understand the series or to forecast its future values. An ARIMA model can consist of autoregressive terms, moving average terms, and differencing operations. The autoregressive terms express the dependency of the current value of the time series on its previous values.

The moving average terms, also called random shocks or white noise, model the effect of previous forecast errors on the current value. If the time series is non-stationary, differencing operations are used to make it stationary. A stationary process is a stochastic process whose probability distribution does not change over time.
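As a rough illustration of a differencing operation, the following sketch replaces each value with its change from the previous value and repeats that d times. The function name difference is an assumption for this example. StatsModels performs differencing internally when the middle value of the order parameter is greater than zero.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to a sequence."""
    values = list(series)
    for _ in range(d):
        # Each round shortens the series by one point.
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return values


if __name__ == "__main__":
    trend = [1, 3, 5, 7, 9]          # linear trend: non-stationary mean
    print(difference(trend))          # constant after one round
    curve = [1, 4, 9, 16, 25]         # quadratic trend
    print(difference(curve, d=2))     # constant after two rounds
```

A linear trend becomes constant after one round and a quadratic trend after two, which is why a higher differencing order is used for series with stronger trends.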

See the StatsModels documentation at http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.arima_model.ARIMA.html for more information.

Send the time series through timechart before passing it to ARIMA to avoid non-uniform sampling intervals. If you do not specify _time, using timechart is not necessary.

Parameters

  • The time series must not have any gaps or missing data; otherwise ARIMA reports an error. If samples are missing, use a larger span in timechart or use streamstats to fill in the gaps with average values.
  • When chaining ARIMA output to another algorithm (for example, ARIMA itself), keep in mind that the length of the data is the length of the original data plus forecast_k. If you want to maintain the holdback position, add the number in forecast_k to your holdback value.
  • ARIMA requires the order parameter to be specified at fitting time. The order parameter needs three values:
    • Number of autoregressive (AR) parameters
    • Number of differencing operations (D)
    • Number of moving average (MA) parameters
  • The forecast_k=<int> parameter tells ARIMA how many points into the future should be forecasted. If _time is specified during fitting along with the field_to_forecast, ARIMA will also generate the timestamps for forecasted values. By default, forecast_k is zero.
  • The conf_interval=<1..99> parameter is the confidence interval in percentage around forecasted values. By default it is set to 95%.
  • The holdback=<int> parameter is the number of data points held back from the ARIMA model. This is useful for comparing the forecast against known data points. By default, holdback is zero.
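The bookkeeping for chained ARIMA calls described above is simple arithmetic. The following sketch spells it out. The function name chained_settings and the returned keys are hypothetical.

```python
def chained_settings(n_points, holdback, forecast_k):
    """Compute the data length and holdback to use for a chained call.

    n_points is the length of the original series passed to the first
    ARIMA call.
    """
    return {
        # The output contains the original points plus the forecast.
        "output_length": n_points + forecast_k,
        # Adding forecast_k keeps the holdback at the same position
        # in the original data.
        "next_holdback": holdback + forecast_k,
    }


if __name__ == "__main__":
    print(chained_settings(n_points=100, holdback=10, forecast_k=10))
```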

Syntax

fit ARIMA [_time] <field_to_forecast> order=<int>-<int>-<int> [forecast_k=<int>] [conf_interval=<int>] [holdback=<int>]

Syntax constraints

  • ARIMA supports one time series at a time.
  • ARIMA models cannot be saved and used at a later time in the current version.
  • Scoring metric values are based on all data and not on the holdback period data.

Example

The following example uses ARIMA on a test set.

... | fit ARIMA Voltage order=4-0-1 holdback=10 forecast_k=10

StateSpaceForecast

StateSpaceForecast is a forecasting algorithm for time series data in the MLTK. It is based on Kalman filters. The algorithm supports incremental fit.
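To give a feel for what "based on Kalman filters" means, the following is a minimal sketch of a one-dimensional Kalman filter for a local level model, in which the series is treated as a slowly drifting level plus noise. It is a textbook illustration under stated assumptions, not the model StateSpaceForecast fits. The function name, the variance defaults, and the initial state are all hypothetical.

```python
def local_level_filter(observations, process_var=1.0, obs_var=1.0,
                       init_mean=0.0, init_var=1e6):
    """Run a 1-D Kalman filter and return one-step forecasts and the final level."""
    mean, var = init_mean, init_var
    forecasts = []
    for y in observations:
        # Predict: the level is assumed to drift, so uncertainty grows.
        var = var + process_var
        forecasts.append(mean)
        # Update: blend the prediction with the new observation.
        gain = var / (var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
    return forecasts, mean


if __name__ == "__main__":
    data = [5.0, 5.2, 4.9, 5.1, 5.0, 5.3]
    one_step, level = local_level_filter(data)
    print(one_step)
    print(level)
```

Each step first predicts and then updates with the new observation. Incremental fitting is possible with this kind of model because only the current state has to be kept between batches of data.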

Advantages of StateSpaceForecast over ARIMA include:

  • Persists models created using the fit command that can then be used with apply.
  • A specialdays field allows you to account for the effects of a specified list of special days.
  • It is automatic in that you do not need to choose parameters or mode.
  • Supports multivariate forecasting.

Parameters

  • By default the historical data results from running the fit command are not shown. To modify this behavior set output_fit=True.
  • The fields segment of the search supports the wildcard (*) character.
  • Use the target field to specify fields from which to forecast using historical data and other values.
  • The target field is a comma-separated list of fields that can be univariate or multivariate. These fields must be specified during the fit process.
    • Optionally use the target field to fit multiple fields during the fit process but apply only a selection of those target fields during the apply process.
  • If the target field is not specified, then all fields will be forecast together using historical data.
  • The specialdays field specifies the field that indicates effects due to special days such as holidays.
  • The specialdays field values must be numeric and are typically 0 and 1, with 1 indicating the existence of a special day effect. Null values are treated as 0.
  • The majority of use cases have no specialdays. Events that occur regularly and frequently such as weekends should not be treated as specialdays. Use specialdays to capture events such as holiday sales.
  • Use specialdays in the apply step if it has been specified during fit. The same field(s) must be assigned.
  • Use the period parameter to specify if your data has a known periodicity.
  • If the period parameter is not specified it is computed automatically.
  • Set period=1 to treat the time series as non-periodic.
  • As with other MLTK algorithms, the partial_fit parameter controls whether a model should be incrementally updated or not. This allows you to update a model using only new data without having to retrain the model on the full dataset.
  • The default for partial_fit is False.
  • Use update_last to modify the behavior of partial_fit.
  • The default for update_last is False.
  • If partial_fit=True StateSpaceForecast first updates the model parameters and then predicts.
  • If partial_fit=True and update_last=True StateSpaceForecast first predicts and then updates the model parameters. This allows you to review the forecast before running new data through.
  • The conf_interval=<1..99> parameter is the confidence interval in percentage around forecasted values. Input an integer between 1 and 99 where a larger number means a greater tolerance for forecast uncertainty. The default integer is 95.
  • Use the as field to assign aliases to forecasted fields.
  • In univariate cases, the field list that follows as is a single field name.
  • In multivariate cases, the as field adheres to the following conventions:
    • The list must be in double quotes, separated by either spaces or commas.
    • The aliases correspond to the original fields in the given order.
    • The number of aliases can be smaller than the number of original fields.
  • The summary command lists the names of the fields used in the fit command step, the name of the specialdays field, and the period.
  • The holdback parameter is the number of data points held back from training. This is useful for comparing the forecast against known data points. Default holdback value is 0.
  • If you want to maintain the holdback position, add the number in forecast_k to your holdback value.
  • The forecast_k parameter tells StateSpaceForecast how many points into the future to forecast. If _time is specified during fitting along with the field_to_forecast, StateSpaceForecast also generates the timestamps for forecasted values. The default forecast_k value is 0.
  • The holdback and forecast_k values can be of two types: an integer or a time range.
    • An integer specifies a number of events. An example of forecast_k=10 forecasts 10 events into the future. An example of holdback=10 withholds the last 10 events from training.
    • A time range takes the form XY where X is a non-negative integer and Y is either empty or adheres to one of the formats in the time range table. If Y is empty, then the value is instead interpreted as an integer number of events. An example of holdback=3day forecast_k=1week withholds 3 days of events and forecasts 1 week's worth of events.

The actual number of events withheld and forecasted using the time range option depends on the time interval between consecutive events.

Time range   Acceptable formats for Y value
seconds      s, sec, secs, second, seconds
minutes      m, min, minute, minutes
hours        h, hr, hrs, hour, hours
days         d, day, days
weeks        w, week, weeks
months       mon, month, months
quarters     q, qtr, qtrs, quarter, quarters
years        y, yr, yrs, year, years
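The XY form and the table above can be sketched as a small parser. This is an illustration only. The names parse_time_range and events_in_range are hypothetical, matching is assumed to be case-sensitive, and the event-count estimate assumes evenly spaced events and covers only the units that have a fixed length in seconds.

```python
import re

UNIT_ALIASES = {
    "seconds": ("s", "sec", "secs", "second", "seconds"),
    "minutes": ("m", "min", "minute", "minutes"),
    "hours": ("h", "hr", "hrs", "hour", "hours"),
    "days": ("d", "day", "days"),
    "weeks": ("w", "week", "weeks"),
    "months": ("mon", "month", "months"),
    "quarters": ("q", "qtr", "qtrs", "quarter", "quarters"),
    "years": ("y", "yr", "yrs", "year", "years"),
}
ALIAS_TO_UNIT = {
    alias: unit for unit, aliases in UNIT_ALIASES.items() for alias in aliases
}
# Only units with a fixed length; months and longer vary.
SECONDS_PER_UNIT = {
    "seconds": 1, "minutes": 60, "hours": 3600,
    "days": 86400, "weeks": 604800,
}


def parse_time_range(value):
    """Split a holdback or forecast_k value into (count, unit)."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", value.strip())
    if match is None:
        raise ValueError("not of the form XY: " + value)
    count, suffix = int(match.group(1)), match.group(2)
    if suffix == "":
        # An empty Y means a plain number of events.
        return count, "events"
    if suffix not in ALIAS_TO_UNIT:
        raise ValueError("unknown time unit: " + suffix)
    return count, ALIAS_TO_UNIT[suffix]


def events_in_range(count, unit, interval_seconds):
    """Estimate how many evenly spaced events fall within a time range."""
    if unit == "events":
        return count
    return (count * SECONDS_PER_UNIT[unit]) // interval_seconds


if __name__ == "__main__":
    print(parse_time_range("3day"))
    print(events_in_range(*parse_time_range("3day"), interval_seconds=3600))
```

For hourly data, holdback=3day corresponds to 72 events under these assumptions, which shows why the number of events withheld depends on the interval between consecutive events.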

Syntax

| fit StateSpaceForecast <fields> [from *] [specialdays=<field name>] [holdback=<int | time-range>] [forecast_k=<int | time-range>] [conf_interval=<float>] [period=<int>]
[partial_fit=<true|false>] [update_last=<true|false>] [output_fit=<true|false>] [into <model name>] [as <field-list>]

You can apply the saved model to new data with the apply command.

| apply <model name> [specialdays=<field name>] [target=<fields>] [holdback=<int | time-range>] [forecast_k=<int | time-range>] [conf_interval=<float>]

You can inspect the model learned by StateSpaceForecast with the summary command.

| summary <model name>

Syntax constraints

  • For univariate analysis the fields parameter is a single field, but for multivariate analysis it is a list of fields.
  • For multivariate analysis, only one specialdays field can be specified and it applies to all the fields.
  • The specialdays field values must be numeric.
  • Null values in the specialdays field are treated as 0.
  • Double quotes are required around field lists.
  • Scoring metric values are based on all data and not on the holdback period data.

Examples

The following is a univariate example of StateSpaceForecast on a test set. The example is considered univariate as there is only a single field following | fit StateSpaceForecast. The example dataset, milk2.csv, is derived from the milk.csv dataset that ships with the MLTK. The milk2.csv dataset has a new column named holiday. This column has two values, 0 and 1. The 0 value represents no holiday and the 1 value represents a holiday for the associated date. The 1 values were set randomly.

| inputlookup milk2.csv
| fit StateSpaceForecast milk_production from * specialdays=holiday into milk_model
| apply milk_model specialdays=holiday forecast_k=30

The following is a multivariate example of StateSpaceForecast on a test set. The syntax is the same as that in the univariate example, except that this case has a list of fields (CRM, ERP, and Expenses) following | fit StateSpaceForecast, making it multivariate.

| inputlookup app_usage.csv
| fields CRM ERP Expenses
| fit StateSpaceForecast CRM ERP Expenses holdback=12 into app_usage_model as "crm, erp"
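The alias list in the example above has two entries for three fields. The following sketch shows how such a list pairs up with the original fields under the conventions described for the as field. The function name map_aliases is hypothetical, and the sketch does not model what name an unaliased field receives.

```python
def map_aliases(fields, alias_list):
    """Pair aliases with fields in order; later fields may have no alias."""
    # The list may be separated by spaces or by commas.
    aliases = alias_list.replace(",", " ").split()
    if len(aliases) > len(fields):
        raise ValueError("more aliases than fields")
    # zip stops at the shorter list, so extra fields stay unaliased.
    return dict(zip(fields, aliases))


if __name__ == "__main__":
    print(map_aliases(["CRM", "ERP", "Expenses"], "crm, erp"))
```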

The following example is also multivariate and includes the target field. In this example the fields of CRM and ERP are forecast using historical data and the Expenses field. The apply command runs against app_usage_model, the model created in the fit command step.

Double quotes are required around any field list.

| inputlookup app_usage.csv
| fields CRM ERP Expenses
| apply app_usage_model target="CRM, ERP" forecast_k=36 holdback=36

The following example is again multivariate but without the target field. This example forecasts the fields CRM, ERP, and Expenses using historical data.

| inputlookup app_usage.csv
| fields CRM ERP Expenses
| apply app_usage_model forecast_k=36 holdback=36

The following example uses the wildcard (*) character to specify the three fields of total_accidents, front_accidents, and rear_accidents.

| inputlookup UKfrontrearseatKSI.csv
| eval total_accidents='British drivers KSI'
| eval front_accidents='front seat KSI'
| eval rear_accidents='rear seat KSI'
| fit StateSpaceForecast *accidents holdback=30 from * forecast_k=10

The following example shows how to improve your output with StateSpaceForecast.

| inputlookup cyclical_business_process_with_external_anomalies.csv
| eval holiday=if(random()%100<98,0,1)
| fit StateSpaceForecast logons from logons into My_Model forecast_k=3000

Adding period=2016 to the search could improve the output, but would not account for the period being seven days rather than twenty-four hours.

| inputlookup cyclical_business_process_with_external_anomalies.csv
| table _time,logons
| eval holiday=if(random()%100<98,0,1)
| eval dayOfWeek=strftime(_time,"%a")
| eval holidayWeekend=case(in(dayOfWeek,"Sat","Sun"),1,true(),0)
| apply MyBadModel specialdays=holidayWeekend forecast_k=3000
| eval old_predict='predicted(logons)'
| eval dayOfWeek=strftime(_time,"%a")
| eval holidayWeekend=case(in(dayOfWeek,"Sat","Sun"),1,true(),0)
| apply My_Model specialdays=holidayWeekend holdback=3000 forecast_k=3000

Utility Algorithms

These utility algorithms are not machine learning algorithms, but provide methods to calculate data characteristics. These algorithms facilitate the process of algorithm selection and parameter selection. See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html for more information.

ACF (autocorrelation function)

ACF (autocorrelation function) calculates the correlation between a sequence and a shifted copy of itself, as a function of shift. Shift is also referred to as lag or delay.
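As a sketch of the definition above, the following pure-Python function computes the sample autocorrelation for lags 0 through k using the standard estimator, in which products of mean-centered values at a given shift are divided by the lag-0 sum of squares. The function name acf is chosen for this example. The MLTK relies on StatsModels, whose implementation also offers the FFT and confidence interval options that this sketch omits.

```python
def acf(series, k):
    """Return sample autocorrelations for lags 0..k."""
    n = len(series)
    if k >= n:
        raise ValueError("k must be smaller than the series length")
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    result = []
    for lag in range(k + 1):
        # Correlate the series with a copy of itself shifted by lag.
        shifted_sum = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        result.append(shifted_sum / denom)
    return result


if __name__ == "__main__":
    print(acf([1, 2, 3, 4, 5], 2))
```

The value at lag 0 is always 1, because a sequence is perfectly correlated with an unshifted copy of itself.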

Parameters

  • The k parameter specifies the number of lags to return autocorrelation for. By default k is 40.
  • The fft parameter specifies whether ACF is computed via Fast Fourier Transform (FFT). By default fft is False.
  • The conf_interval parameter specifies the confidence interval in percentage to return. By default conf_interval is set to 95.

Syntax

fit ACF <field> [k=<int>] [fft=<true|false>] [conf_interval=<int>]

Example

The following example uses ACF (autocorrelation function) on a test set.

... | fit ACF logins k=50 fft=true conf_interval=90

PACF (partial autocorrelation function)

PACF (partial autocorrelation function) gives the partial correlation between a sequence and its lagged values, controlling for the values of lags that are shorter than its own. See the StatsModels documentation at http://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.pacf.html for more information.
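One common way to obtain partial autocorrelations is the Durbin-Levinson recursion, which converts autocorrelation values into partial autocorrelations one lag at a time. The following sketch implements that recursion for illustration. It is not necessarily the method the MLTK uses, since the method parameter selects among Yule-Walker and OLS variants in StatsModels. The function name pacf_from_acf is an assumption for this example.

```python
def pacf_from_acf(r):
    """Convert autocorrelations r[0..K] (with r[0] == 1) to partial autocorrelations."""
    pacf = [1.0]
    prev = []  # AR coefficients of the model fitted at the previous lag
    for k in range(1, len(r)):
        # Part of r[k] not already explained by the shorter lags.
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


if __name__ == "__main__":
    # Theoretical ACF of an AR(1) process with coefficient 0.5.
    print(pacf_from_acf([1.0, 0.5, 0.25, 0.125]))
```

For an AR(1) process the partial autocorrelation is non-zero only at lag 1, which is the pattern commonly used to choose the number of autoregressive terms for ARIMA.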

Parameters

  • The k parameter specifies the number of lags to return partial autocorrelation for. By default k is 40.
  • The method parameter specifies which method to use for the calculation. By default method is ywunbiased.
  • The conf_interval parameter specifies the confidence interval in percentage to return. By default conf_interval is set to 95.

Syntax

fit PACF <field> [k=<int>] [method=<ywunbiased|ywmle|ols>] [conf_interval=<int>]

Example

The following example uses PACF (partial autocorrelation function) on a test set.

... | fit PACF logins k=20 conf_interval=90
Last modified on 17 August, 2023

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.2.0, 5.2.1, 5.2.2, 5.3.0

