Splunk® Machine Learning Toolkit

User Guide

This documentation does not apply to the most recent version of Splunk® Machine Learning Toolkit. For documentation on the most recent version, go to the latest release.

Search commands for machine learning

The Machine Learning Toolkit contains several custom search commands that implement classic machine learning and statistical learning tasks:

  • fit: Fit and apply a machine learning model to search results.
  • apply: Apply a machine learning model that was learned using the fit command.
  • summary: Return a summary of a machine learning model that was learned using the fit command.
  • listmodels: Return a list of machine learning models that were learned using the fit command.
  • deletemodel: Delete a machine learning model that was learned using the fit command.
  • sample: Randomly sample or partition events.


You can use these custom search commands on any Splunk platform instance on which the Machine Learning Toolkit is installed.

Download the ML-SPL Quick Reference Guide (also available in Japanese) for a handy cheat sheet of ML-SPL commands and machine learning algorithms used in the Machine Learning Toolkit.

fit

Use the fit command to fit and apply a machine learning model to search results.

Syntax

   fit <algorithm> (<option_name>=<option_value>)* (<algorithm-arg>)+ (into <model_name>)? (as <output_field>)?

The first argument, which is required, is the algorithm to use. The following algorithms are available by default:

  • LinearRegression
  • LogisticRegression
  • SVM
  • PCA
  • KMeans
  • DBSCAN

All algorithms require a list of fields to use when learning a model. For classification and regression algorithms, the first field is the field to predict (the response field), and subsequent fields are the fields to use when making predictions (the explanatory fields). You can optionally insert the from keyword after the response field; it is syntactic sugar and is discarded. For unsupervised learning algorithms, the listed fields are the fields to learn the model on (for example, for clustering, the fields to cluster over).

Use the as keyword to rename the field added to search results by the model.

Use the into keyword to store the learned model in an artifact that can later be applied to new search results with the apply command. Not all algorithms support saved models.

Some algorithms support options that can be given as name = value arguments. For example, KMeans and PCA both support a k option that specifies how many clusters or how many principal components to learn.
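For example, KMeans can be asked for three clusters over two numeric fields (the field names downloads and errors here are illustrative, not required names):

   ... | fit KMeans downloads errors k=3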

You can also configure the fit command. See the Configure the fit and apply commands topic below.

Examples

Fit a LinearRegression model to predict errors using _time:

   ... | fit LinearRegression errors from _time

Fit a LinearRegression model to predict errors using _time and save it into a model named errors_over_time:

   ... | fit LinearRegression errors from _time into errors_over_time

Fit a LogisticRegression model to predict a categorical response from numerical measurements:

   ... | fit LogisticRegression species from petal_length petal_width sepal_length sepal_width
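The as keyword from the fit syntax renames the field that the model adds to the results. As a sketch using the same fields as above:

   ... | fit LogisticRegression species from petal_length petal_width sepal_length sepal_width as predicted_species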

apply

Use the apply command to compute predictions for the current search results based on a model that was learned by the fit command. The apply command can be used on different search results from those used to fit the model, but those results must contain the same fields that were used during fitting.

You can also configure the apply command. See the Configure the fit and apply commands topic below.

Syntax

   apply <model_name> (as <output_field>)?

Use the as keyword to rename the field added to search results by the model.

Examples

Apply a learned LinearRegression model, "errors_over_time":

   ... | apply errors_over_time

Rename the output of the model to "predicted_errors":

   ... | apply errors_over_time as predicted_errors

summary

Use the summary command to return a summary of a machine learning model that was learned using the fit command. The summary is algorithm specific. For example, the summary for the LinearRegression algorithm is a list of coefficients. The summary for the LogisticRegression algorithm is a list of coefficients for each class.

Syntax

   summary <model_name>

Examples

Inspect a learned LinearRegression model "errors_over_time":

   | summary errors_over_time

listmodels

Use the listmodels command to return a list of machine learning models that were learned using the fit command. The algorithm and arguments given when fit was invoked are displayed for each model.

Syntax

   listmodels

Examples

List all models:

   | listmodels


deletemodel

Use the deletemodel command to delete a machine learning model learned using the fit command.

Syntax

   deletemodel <model_name>

Examples

Delete the "errors_over_time" model:

   | deletemodel errors_over_time

sample

Use the sample command to randomly sample or partition events.

Sampling modes:

  • ratio: A float between 0 and 1 indicating the probability that each event has of being included in the result set. For example, a ratio of 0.01 means that each event has a 1% probability of being included in the results. Use ratio when an approximate sample size is acceptable. You can omit the ratio keyword: | sample ratio=0.01 and | sample 0.01 are equivalent.

  • count: A number that indicates the exact number of randomly chosen events to return. If the sample count exceeds the total number of events in the search, all events are returned. You can omit the count keyword: | sample count=10 and | sample 10 are equivalent.

  • proportional: The name of a numeric field to use to determine the sampling probability of each event, which yields a biased sample. Each event is sampled with the probability given by its value of this field.


Partitioning mode:

  • partitions: The number of partitions into which events are randomly divided, in approximately equal shares. Use partitions when you want to divide your results into groups for different purposes, such as splitting results into testing and training sets.


Additional options:

  • seed: A number that specifies a random seed. Using seed ensures reproducible results. If unspecified, a pseudorandom value is used.
  • by <field>: Used with count. Specifies a field by which to split events, returning the count number of events for each value of the specified field. If there are more events than count, all events are included in the results.
  • inverse: Used with proportional. Inverts the probability, sampling each event with one minus the probability specified in the proportional field.
  • fieldname: The name of the field in which to store the partition number. Defaults to "partition_number".


The sample command is not identical to the sampling options on the Event Sampling menu on the Search page in Splunk Web:

  • Options from the Event Sampling menu perform sampling before the data is collected from indexes, at the beginning of the search pipeline.
  • The sample command is applied after data is collected, accessing everything in the search pipeline.


Using the Event Sampling menu options is faster, but the sample command can be used anywhere in the search pipeline and provides several modes that are not available from the Event Sampling feature. For example, the sample command supports partitioning, biased sampling, and retrieving an exact number of results.

Syntax

   sample (ratio=<float between 0 and 1>)? (count=<positive integer>)? (proportional=<name of numeric field> (inverse)?)? (partitions=<natural number greater than 1> (fieldname=<string>)?)? (seed=<number>)? (by <split_by_field>)?

Examples

Retrieve approximately 1% of events at random:

   ... | sample ratio=0.01
   ... | sample 0.01

Retrieve exactly 20 events at random:

   ... | sample count=20
   ... | sample 20
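To make a random sample reproducible across runs, add the seed option (the seed value shown is arbitrary):

   ... | sample count=20 seed=42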
   

Retrieve exactly 20 events at random from each host:

   ... | sample count=20 by host

Return each event with a probability determined by the value of "some_field":

   ... | sample proportional="some_field"
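The inverse option flips this bias: an event whose field value is p is sampled with probability 1 - p. As a sketch, assuming "some_field" holds values between 0 and 1:

   ... | sample proportional="some_field" inverse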

Partition events into 7 groups, with the chosen group returned in a field called "partition_number":

   ... | sample partitions=7 fieldname="partition_number"
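A common follow-on is to reserve some partitions for training and the rest for testing. As a sketch, assuming partitions are numbered starting from 0, this keeps partitions 0 through 5:

   ... | sample partitions=7 | where partition_number < 6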

Configure the fit and apply commands

You can configure the fit and apply commands by setting properties in the mlspl.conf configuration file (located under $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/default/). You can specify default settings for all algorithms, or for individual algorithms.

Note: To avoid losing your changes to the configuration file when you upgrade the app, create a copy of the mlspl.conf file with only the modified stanzas and settings, then save it to $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/. For details, see How to copy and edit a configuration file.

  • max_inputs (default: 100000): The maximum number of events an algorithm considers when fitting a model. If this limit is exceeded and use_sampling is true, the fit command downsamples its input using the Reservoir Sampling algorithm before fitting a model. If use_sampling is false and this limit is exceeded, the fit command throws an error.
  • use_sampling (default: true): Indicates whether to use Reservoir Sampling for data sets that exceed max_inputs, or to throw an error instead.
  • max_fit_time (default: 600): The maximum time, in seconds, to spend in the "fit" phase of an algorithm. This setting does not apply to the other phases of a search (for example, retrieving events from an index).
  • max_memory_usage_mb (default: 1000): The maximum memory, in megabytes, that the fit command is allowed to use while fitting a model.
  • max_model_size_mb (default: 15): The maximum allowed size, in megabytes, of a model created by the fit command. Some algorithms (for example, SVM and RandomForest) might create unusually large models, which can lead to performance problems with bundle replication.
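As a minimal sketch, a local override that raises the event limit for all algorithms might look like the following (the value shown is illustrative), saved as $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/mlspl.conf:

   [default]
   max_inputs = 200000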
Last modified on 07 April, 2017

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 2.0.1, 2.1.0

