Splunk® Machine Learning Toolkit

User Guide

Using the fit and apply commands

The Machine Learning Toolkit contains several custom search commands, referred to as ML-SPL commands, that implement classic machine learning and statistical learning tasks. You can use these custom search commands on any Splunk platform instance on which the Machine Learning Toolkit is installed. The fit and apply search commands train and apply a machine learning model, also known as a learned model, based on the chosen algorithm.

At the highest level the fit and apply commands operate as follows:

  • Use the fit command to produce a machine learning model based on the behavior of a set of events.
  • The fit command applies the machine learning model to the current set of search results in the search pipeline.
  • Use the apply command to apply the machine learning model that was learned using the fit command.
  • The apply command repeats a selection of the fit command steps.

Before training your model, your data may require preprocessing. To learn your data preprocessing options, see Preparing your data for machine learning and Preprocessing machine data using MLTK Assistants.

The examples in this document are based on a fictional shop and use a synthetic dataset from various source types. This example dataset does not ship with the Machine Learning Toolkit. The goal of this example is to predict the value of field_A based on the available data in the dataset. A prediction output is just one example of a machine learning outcome using the fit and apply commands.

Steps for the fit command

The Machine Learning Toolkit performs these steps when running the fit command:

  1. Search results pull into memory.
  2. Transform search results using data preparation actions:
    1. Discard any fields that are null throughout all the events.
    2. Discard non-numeric fields with more than (>) 100 distinct values.
    3. Discard events with any null fields.
    4. Convert non-numeric fields into dummy variables using one-hot encoding.
    5. Convert the prepared data into a numeric matrix and run the specified algorithm to create a model.
  3. Apply the model to the prepared data and produce new columns that display the prediction.
  4. The learned machine learning model is encoded and saved as a knowledge object.

1. Search results pull into memory

When you run a search, the fit command pulls the search results into memory, creates a copy of the search results, and parses the search results into Pandas DataFrame format. The originally ingested data is not changed.

2. Transform search results using data preparation actions

The data must be properly prepared to be suitable for machine learning and for running through the selected algorithm. The following actions all take place on the search results copy.

a) Discard any fields that are null throughout all the events

The fit command discards fields that contain no values.

The following example demonstrates how the fit command looks for incidents of fraud within a dataset. The example shows a simplified visual representation of the search results. In this example field_C is highlighted for removal because there are no values in this field.

This image shows a table with five columns and six rows. The first row contains the field labels from A to E. The remaining rows show search results. The column for field C is highlighted to emphasize that there are no values.

If you do not want null fields to be removed from the search results you must change your search. For example, to replace the null values with 0 in the results for field_C, use the SPL fillnull command. You must specify the fillnull command before the fit command, as shown in the following search example:

... | fillnull field_C | fit LogisticRegression field_A from field_*
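The following Python sketch illustrates this step outside of Splunk. It assumes search results are held as a list of dictionaries with None standing in for null. MLTK itself works on a Pandas DataFrame, and the function names drop_all_null_fields and fillnull are hypothetical helpers written for this illustration, not MLTK code.

```python
def drop_all_null_fields(rows):
    """Remove fields whose value is None (null) in every row."""
    fields = {name for row in rows for name in row}
    keep = {f for f in fields if any(row.get(f) is not None for row in rows)}
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


def fillnull(rows, field, value=0):
    """Rough stand-in for `fillnull <field>`: replace nulls in one field."""
    return [
        {**row, field: value if row.get(field) is None else row[field]}
        for row in rows
    ]


rows = [
    {"field_A": "ok", "field_B": 12, "field_C": None},
    {"field_A": "FRAUD", "field_B": 40, "field_C": None},
]

# field_C is null everywhere, so it is discarded.
without_c = drop_all_null_fields(rows)

# Filling the nulls first means field_C survives the same step.
with_c = drop_all_null_fields(fillnull(rows, "field_C"))
```

In this sketch, without_c no longer contains field_C, while with_c keeps it with the value 0 in every row.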

b) Discard non-numeric fields with more than (>) 100 distinct values

The fit command discards non-numeric fields if the fields have more than 100 distinct values. In machine learning, many algorithms do not perform well with high-cardinality fields, because every unique, non-numeric entry in a field becomes an independent feature. A high-cardinality field can lead to an explosion in feature space very quickly.

In the MLTK, Internet Protocol (IP) addresses are interpreted as non-numeric or string values. In this example, none of the non-numeric fields has more than 100 distinct values, so no action is taken. Had the search results included more than 100 distinct IP addresses in field_E, that field would qualify as high-cardinality.

This image shows the same table. Column C is removed. The column for field E is highlighted to emphasize that it is a high-cardinality field.

An alternative to discarding fields is to use the values to generate a usable feature set. For example, by using SPL commands such as streamstats or eventstats, you can calculate the number of times an IP address occurs in your search results. You must generate these calculations in your search before the fit command. In this scenario the high-cardinality field is removed by the fit command, but the field that contains the generated calculations remains.
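The idea of deriving a usable feature from a high-cardinality field can be sketched in Python as follows. This is an illustration only, assuming results are a list of dictionaries. The helper add_occurrence_count and the field name ip_count are hypothetical, and rows with a null value receive a count of 0 here, which is a choice made for this sketch rather than documented SPL behavior.

```python
from collections import Counter


def add_occurrence_count(rows, field, new_field):
    """Add a numeric field holding how often each row's value occurs."""
    counts = Counter(row[field] for row in rows if row.get(field) is not None)
    # Rows with a null value get 0 (an assumption of this sketch).
    return [{**row, new_field: counts.get(row.get(field), 0)} for row in rows]


rows = [
    {"field_E": "10.0.0.1"},
    {"field_E": "10.0.0.2"},
    {"field_E": "10.0.0.1"},
]

counted = add_occurrence_count(rows, "field_E", "ip_count")
```

The new numeric field carries information about the IP address even if the original string field is later discarded.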

The limit for distinct values is set to 100 by default. You can change the limit by changing the max_distinct_cat_values attribute in your local copy of the mlspl.conf file. See Configure the fit and apply commands for details on updating the mlspl.conf file attributes.

  • Only users with file system access, such as system administrators, can make changes to the mlspl.conf file.
  • Refer to the Splunk Admin Manual to review the steps for How to edit a configuration file.

Do not change or copy the configuration files in the default directory. The files in the default directory must remain intact and in their original location. Make the changes in the local directory.
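As a rough Python sketch of the high-cardinality rule and its configurable limit, consider the following. It assumes results are a list of dictionaries and treats any field containing a non-numeric value as non-numeric. The function names are hypothetical, and the max_distinct parameter only mirrors the role of the max_distinct_cat_values attribute; it does not read mlspl.conf.

```python
def is_numeric(value):
    """Treat ints and floats (but not booleans) as numeric."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def drop_high_cardinality_fields(rows, max_distinct=100):
    """Discard non-numeric fields with more than max_distinct distinct values."""
    fields = {name for row in rows for name in row}
    drop = set()
    for f in fields:
        values = [row[f] for row in rows if row.get(f) is not None]
        non_numeric = any(not is_numeric(v) for v in values)
        if non_numeric and len(set(values)) > max_distinct:
            drop.add(f)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# 150 events, each with a distinct IP address string in field_E.
rows = [{"field_B": i, "field_E": f"10.0.0.{i}"} for i in range(150)]

kept_default = drop_high_cardinality_fields(rows)
kept_raised = drop_high_cardinality_fields(rows, max_distinct=200)
```

With the default limit, field_E is discarded and the numeric field_B is kept. Raising the limit above the number of distinct values keeps field_E.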

c) Discard events with any null fields

To train a model, the machine learning algorithm requires all of the search results to have a value. Any null value means the entire event will not contribute towards the learned model. In step (a), the fit command example dropped every column that is entirely null, and now it drops every event (row) that has one or more null fields.

This screen image shows the same table. The last two rows are highlighted to indicate that these rows will be discarded because there are one or more null field values in those rows.

As an alternative to dropping every row with one or more null fields, you can specify that any search results with null values be included in the learned model. Choose to replace null values if you want the algorithm to learn from an example with a null value, whether the expected outcome for that example is to return an empty collection or to throw an exception.

To include the results with null values in the model, you must replace the null values before using the fit command in your search. You can replace null values by using SPL commands such as fillnull, filldown, or eval.
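The effect of this step, and of replacing nulls beforehand, can be sketched in Python. The sketch assumes results are a list of dictionaries that all share the same fields, with None standing in for null. The helper drop_rows_with_nulls is hypothetical.

```python
def drop_rows_with_nulls(rows):
    """Keep only rows in which every field has a value (no None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


rows = [
    {"field_A": "ok", "field_B": 12, "field_D": "red"},
    {"field_A": "FRAUD", "field_B": None, "field_D": "blue"},
    {"field_A": None, "field_B": 7, "field_D": "green"},
]

# Without any replacement, only the first row is complete.
complete = drop_rows_with_nulls(rows)

# Replacing nulls in field_B first (similar in spirit to fillnull)
# lets the second row contribute to training as well.
filled = drop_rows_with_nulls(
    [{**r, "field_B": 0 if r["field_B"] is None else r["field_B"]} for r in rows]
)
```

The third row is still dropped in both cases because its target field, field_A, is null and was not replaced.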

d) Convert non-numeric fields into dummy variables using one-hot encoding

The fit command converts fields that contain strings or characters into numbers. Algorithms perform best with numeric data, not categorical data. The fit command converts non-numeric fields to binary indicator variables (1 or 0) using one-hot encoding.

This screen image shows the same table. The column for field C with null values and the column for field E with high-cardinality results have been removed. The column for field A has the values of ok or FRAUD. The column for field B has numeric values. The column for field D has color names.

One-hot encoding encodes categorical values as binary values (1 or 0). In this example, the strings and characters in field_D get converted to three fields: field_D=red, field_D=green, field_D=blue. The following example shows the results of one-hot encoding. The values for these new fields are either 1 or 0. The value of 1 appears where the color name appeared previously.

This screen shows the same table, with the column for field D now converted into three columns. Each of the new columns represents a color as a binary value. The names of the new fields are: field D equals red, field D equals green, and field D equals blue.

If you want more than 100 values per field, you can use one-hot encoding with SPL commands before using the fit command. In the following example, SPL is used to encode search results without limiting values to 100 values per field:

| eval {field_D}=1 | fillnull value=0
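One-hot encoding itself can be sketched in a few lines of Python. This is a minimal illustration, assuming results are a list of dictionaries with no null values in the encoded field. The function one_hot_encode is hypothetical, and the new field names follow the field_D=red pattern described above.

```python
def one_hot_encode(rows, field):
    """Replace a categorical field with one 0/1 field per distinct value."""
    categories = sorted({row[field] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != field}
        for category in categories:
            new_row[f"{field}={category}"] = 1 if row[field] == category else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"field_B": 1, "field_D": "red"},
    {"field_B": 2, "field_D": "green"},
    {"field_B": 3, "field_D": "blue"},
]

encoded = one_hot_encode(rows, "field_D")
```

Each result ends up with exactly one of the three new fields set to 1, and the original field_D is no longer present.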

e) Convert the prepared data into a numeric matrix and run the specified algorithm to create a model

The data is now in a clean, numeric matrix that's ready to be processed by the selected algorithm and trained to become the machine learning model. A temporary model is created in memory.
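To make "numeric matrix" concrete, the following Python sketch lays prepared results out as rows of floats in a fixed column order. It assumes results are a list of dictionaries that already contain only numeric values. The helper to_matrix and the alphabetical column order are choices made for this illustration, not a description of MLTK internals.

```python
def to_matrix(rows, columns):
    """Lay rows out as lists of floats, one inner list per event."""
    return [[float(row[c]) for c in columns] for row in rows]


prepared = [
    {"field_B": 12, "field_D=red": 1, "field_D=green": 0},
    {"field_B": 40, "field_D=red": 0, "field_D=green": 1},
]

# A fixed column order keeps every row aligned with the same features.
columns = sorted(prepared[0])
matrix = to_matrix(prepared, columns)
```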

3. Apply the model to the prepared data and produce new columns that display the prediction

The fit command applies the temporary model to the prepared data. In this example, the model is applied to each search result to predict values, including the search results with null values. The fit command appends one or more columns to the results. The appended search results are then returned to the search pipeline.

The following image shows the original search results with the appended column. The name of the appended column is Predicted (field_A). This field contains predicted values for all of the results. In this example, although there is an empty value in the target column, a prediction is still generated for that result. This works because the predicted value is computed from the other available fields, not from the target field value.

This image shows a table with the original search results and the appended column called Predicted field A. The first 5 columns are the original search results. The last field is the appended field with the predictions.

4. The learned machine learning model is encoded and saved as a knowledge object

If the chosen algorithm supports saved models, and the into clause is included in the fit command, the learned model is saved as a knowledge object.

When the temporary model file is saved, it becomes a permanent model file. These permanent model files are sometimes referred to as learned models or encoded lookups. The learned model is saved on disk. The model follows all of the Splunk knowledge object rules, including permissions and bundle replication.

If the algorithm does not support saved models, or the into clause is not included, the temporary model is deleted.

Steps for the apply command

The apply command repeats a selection of the fit command's data preparation steps so that new data takes the same form as the data the model was trained on. The apply command generally runs on a small slice of data that is different from the data used to train the model with the fit command. The apply command generates new insight columns.

Coefficients created through the fit command and the resulting model artifact are already computed and saved, making the apply command fast to run. You can think of apply like a streaming command that's applied to data.

The Machine Learning Toolkit performs these steps when running the apply command:

  1. Load the learned model.
  2. Transform search results using data preparation actions:
    1. Discard any fields that are null throughout all the events.
    2. Discard non-numeric fields with more than (>) 100 distinct values.
    3. Convert non-numeric fields into dummy variables using one-hot encoding.
    4. Discard dummy variables that are not present in the learned model.
    5. Replace missing dummy variables with zeros.
    6. Convert the prepared data into a numeric matrix.
  3. Apply the model to the prepared data and produce new columns that display the prediction.

1. Load the learned model

The learned model specified by the search command is loaded in memory. Normal knowledge object permission parameters apply. The following examples show the apply command loading the learned model:

...| apply temp_model
...| apply user_behavior_clusters

2. Transform search results using data preparation actions

The data must be properly prepared to be suitable for machine learning and for running through the selected algorithm. The following actions all take place on the search results copy.

a) Discard any fields that are null throughout all the events

The apply command discards fields that contain no values.

b) Discard non-numeric fields with more than (>) 100 distinct values

The apply command discards non-numeric fields if the fields have more than 100 distinct values.

The limit for distinct values is set to 100 by default. You can change the limit by changing the max_distinct_cat_values attribute in your local copy of the mlspl.conf file. See Configure the fit and apply commands for details on updating the mlspl.conf file attributes.

  • Only users with file system access, such as system administrators, can make changes to the mlspl.conf file.
  • Refer to the Splunk Admin Manual to review the steps for How to edit a configuration file.

Do not change or copy the configuration files in the default directory. The files in the default directory must remain intact and in their original location. Make the changes in the local directory.

c) Convert non-numeric fields into dummy variables using one-hot encoding

The apply command converts fields that contain strings or characters into numbers. Algorithms perform best with numeric data, not categorical data. The apply command converts non-numeric fields to binary indicator variables (1 or 0) using one-hot encoding.

When converting categorical variables, a new value might come up in the data. In this example, there is new color data for yellow. This new data goes through the one-hot encoding step, which converts field_D into binary indicator fields (1 or 0). In the graphic below, there is a new column for field_D=yellow.

This image shows the same table as described in the dummy coding step with the fit command. An additional column is now present for field_D=yellow.

d) Discard dummy variables that are not present in the learned model

The apply command removes data that is not part of the learned (saved) model. The data for the color yellow did not appear during the fit process. As such, the column created in the convert non-numeric fields step is discarded.

e) Replace missing dummy variables with zeros

Any dummy variables that are present in the learned model but missing from a result are automatically filled with the value 0 at this step. Replacing missing fields with 0 is a standard machine learning practice that's required in order for the algorithm to be applied.
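Steps (d) and (e) together amount to aligning the new data with the columns the model saw during fit. The following Python sketch shows that alignment, assuming results are a list of dictionaries that have already been one-hot encoded. The helper align_to_model and the model_columns list are hypothetical stand-ins for what a saved model records.

```python
def align_to_model(rows, model_columns):
    """Keep only the model's columns and fill any missing ones with 0."""
    return [{c: row.get(c, 0) for c in model_columns} for row in rows]


# Columns the model was trained on (assumed for this illustration).
model_columns = ["field_B", "field_D=red", "field_D=green", "field_D=blue"]

# New data contains yellow, which was never seen during fit,
# and has no green or blue columns at all.
apply_rows = [
    {"field_B": 7, "field_D=red": 0, "field_D=yellow": 1},
    {"field_B": 3, "field_D=red": 1, "field_D=yellow": 0},
]

aligned = align_to_model(apply_rows, model_columns)
```

In the aligned results, field_D=yellow is gone, and field_D=green and field_D=blue are present with the value 0.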

f) Convert the prepared data into a numeric matrix

The data is now in a clean, numeric matrix. The model file is applied to this matrix and the results are calculated.

3. Apply the model to the prepared data and produce new columns that display the prediction

The apply command applies the model to the prepared data, adds the resulting columns to the search results, and returns them to the search pipeline. In this example, although there is an empty value in the target column, a prediction is still generated for that result. This works because the predicted value is computed from the other available fields, not from the target field value.

This image shows a table with the original search results and the appended field, named Predicted field A. There are 6 fields. The first 5 fields are the original search results. The last field is the appended field with the predictions.

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 4.4.0, 4.4.1, 4.4.2, 4.5.0, 5.0.0

