Splunk® Machine Learning Toolkit

User Guide

About the fit and apply commands

The fit and apply commands are two of the custom machine learning (ML) Splunk Search Processing Language (SPL) commands included in the Splunk Machine Learning Toolkit (MLTK). These ML-SPL commands implement classic machine learning and statistical learning tasks.

The fit command trains and saves a model from search results. The apply command applies that model to new search results to provide predictions or other insights.

The fit and apply commands work on searches with relative time ranges, but do not complete on real-time searches.

Before you begin

Before training your own models, your data might benefit from preprocessing. Data preprocessing can help address missing values, remedy unit disagreements, convert categorical data to numeric, and more. Running the fit command does include some lightweight data preprocessing, but a clean, numeric matrix of data is best for building your machine learning models.

To learn about data preprocessing options, see Preparing your data for machine learning and Preprocessing your data using the Splunk Machine Learning Toolkit.

Actions taken by the fit command

The following actions take place when the fit command is run:

[Image: a simple diagram of the actions taken when you run the fit command on a dataset.] Each of these actions is described in the following sections.

The fit command modifies a model. The command is considered risky because running it can cause performance issues. As a result, this command triggers SPL safeguards. To learn more, see Search commands for machine learning safeguards.

Search results pull into memory

When you run the fit command, the search results are pulled into memory, a copy of the search results is created, and the copy is parsed into a pandas DataFrame. The originally ingested data is not changed.

Search results copy gets prepared for machine learning

Your data must be prepared to make it suitable for machine learning. The following actions all take place on the search results copy.

Action Description
Fields with null values in every event are discarded: The fit command discards fields that contain no values, that is, fields that are null throughout all of the events. The fit command also drops every event that has one or more null fields. To train a model, the machine learning algorithm requires every search result to have a value, so an event with any null value does not contribute to the learned model.

If you do not want null fields to be removed from the search results you must change your search.

You can specify that search results with null values be included in the learned model. Alternatively, you can replace null values if you want the algorithm to learn from an example that has a null value, whether you want it to return an empty collection or to throw an exception.

To include the results with null values in the model, you must replace the null values before using the fit command in your search. You can replace null values by using SPL commands such as fillnull, filldown, or eval. To learn more about data preprocessing options, see Preparing your data for machine learning.

Non-numeric fields with more than 100 distinct values are discarded: The fit command discards non-numeric fields if the fields have more than 100 distinct values. Many machine learning algorithms do not perform well with high-cardinality fields, because every unique, non-numeric entry in a field becomes an independent feature.

The limit for distinct values is set to 100 by default. You can change the limit by changing the max_distinct_cat_values attribute in your local copy of the mlspl.conf file. For details on updating the mlspl.conf file attribute, see Configure algorithm performance costs.

Only users with admin permissions can make changes to the mlspl.conf file. To learn more about changing .conf files, see How to edit a configuration file in the Splunk Admin Manual.

An alternative to discarding fields is to use the values to generate a usable feature set by using SPL commands such as streamstats or eventstats. You must generate these calculations in your search before running the fit command.

To learn more about data preprocessing options see Preparing your data for machine learning.

Non-numeric fields are converted into dummy variables using one-hot encoding: Algorithms perform best with numeric data rather than categorical data. The fit command uses one-hot encoding to convert fields that contain strings or characters into numbers. One-hot encoding creates one dummy variable for each distinct categorical value, set to 1 where the event has that value and to 0 otherwise.
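The three preparation actions in the table above can be sketched in plain Python. This is an illustration only and not the toolkit's implementation, which works on a pandas DataFrame. The function name prepare_for_fit, the list-of-dictionaries row format, the max_distinct parameter (standing in for max_distinct_cat_values), the order in which the steps run, and the "field=value" naming of dummy columns are all assumptions made for this example.

```python
def prepare_for_fit(rows, max_distinct=100):
    """Sketch of the preparation steps, run on a copy of the search results.

    rows: list of dicts mapping field name -> value (None means null).
    Returns a new list of dicts that holds only numeric values.
    """
    rows = [dict(r) for r in rows]  # work on a copy; the input is untouched
    fields = sorted({f for r in rows for f in r})

    # Step 1a: discard fields that are null in every event.
    fields = [f for f in fields
              if any(r.get(f) is not None for r in rows)]
    # Step 1b: drop every event that has a null in any remaining field.
    rows = [r for r in rows if all(r.get(f) is not None for f in fields)]

    def is_numeric(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    prepared = [{} for _ in rows]
    for f in fields:
        values = [r[f] for r in rows]
        if all(is_numeric(v) for v in values):
            for out, v in zip(prepared, values):
                out[f] = v  # numeric fields pass through unchanged
            continue
        distinct = sorted({str(v) for v in values})
        # Step 2: discard non-numeric fields with too many distinct values.
        if len(distinct) > max_distinct:
            continue
        # Step 3: one-hot encode, one 0/1 column per distinct value.
        # The "field=value" column name is a hypothetical convention.
        for out, v in zip(prepared, values):
            for d in distinct:
                out[f"{f}={d}"] = 1 if str(v) == d else 0
    return prepared
```

Calling prepare_for_fit on a handful of events with a small max_distinct value makes it easy to see which fields survive and which dummy columns appear.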

Prepared data converted into numeric matrix and specified algorithm runs

The prepared data is converted into a numeric matrix that the selected algorithm can process. The algorithm trains on this matrix to produce the machine learning model. A temporary model is created in memory.

Model applied to the prepared data

The fit command applies the temporary model to the prepared data. The fit command appends one or more columns to display the results. The appended search results are then returned to the search pipeline.
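As a rough illustration of training a temporary model and then applying it to the same prepared data, the following sketch fits a one-variable least-squares line and appends a results column to each row. It is not the toolkit's code: the function names, the dictionary used as the in-memory model, and the "predicted(...)" column name are assumptions chosen for the example, and simple linear regression stands in for whichever algorithm you select.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Assumes xs contains at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def fit_and_append(rows, feature, target):
    """Train a temporary in-memory model, then apply it to the same rows."""
    model = fit_simple_linear([r[feature] for r in rows],
                              [r[target] for r in rows])
    out_field = f"predicted({target})"  # hypothetical output column name
    results = [
        {**r, out_field: model["slope"] * r[feature] + model["intercept"]}
        for r in rows
    ]
    return model, results
```

The returned rows are new dictionaries with one extra column each, which mirrors how the appended results flow back into the search pipeline while the input rows stay as they were.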

Learned model encoded and saved as knowledge object

When the temporary model file is saved, it becomes a permanent model file. These permanent model files are sometimes referred to as learned models or encoded lookups. The learned model is saved on disk. The model follows all of the Splunk knowledge object rules, including permissions and bundle replication.

When the chosen algorithm supports saved models and the into clause is included in the fit command, the learned model is saved as a knowledge object. If the algorithm does not support saved models, or the into clause is not included, the temporary model is deleted.

Actions taken by the apply command

The apply command repeats several of the data preparation actions that the fit command performs, so that new data matches what the model learned. The apply command generally runs on a small slice of data that differs from the data used to train the model with the fit command. The apply command writes its results into new columns.

Coefficients created through the fit command and the resulting model artifact are already computed and saved, making the apply command fast to run. The apply command operates like a streaming command applied to data.

The following actions take place when the apply command is run:

[Image: a simple diagram of the actions taken when you run the apply command.] Each of these actions is described in the following sections.

Load the learned model

The learned model specified by the search command is loaded in memory. Standard knowledge object permissions are used. The following are examples of the apply command loading the learned model:

...| apply temp_model
...| apply user_behavior_clusters

Loaded model data gets prepared

The data must be properly prepared to be suitable for machine learning and for running through the selected algorithm. The following actions all take place on the search results copy.

Action Description
Fields that are null throughout all the events are discarded: The apply command discards fields that contain no values.
Non-numeric fields with more than 100 distinct values are discarded: The apply command discards non-numeric fields if the fields have more than 100 distinct values.

The limit for distinct values is set to 100 by default. You can change the limit by changing the max_distinct_cat_values attribute in your local copy of the mlspl.conf file. For details on updating the mlspl.conf file attribute, see Configure algorithm performance costs.

Only users with admin permissions can make changes to the mlspl.conf file. To learn more about changing .conf files, see How to edit a configuration file in the Splunk Admin Manual.

Non-numeric fields are converted into dummy variables using one-hot encoding: Algorithms perform better with numeric data than with categorical data. The apply command uses one-hot encoding to convert fields that contain strings or characters into binary dummy variables with a value of 1 or 0.
Dummy variables not present in the learned model are discarded: The apply command removes data that is not part of the learned model.
Missing dummy variables are replaced with 0: Replacing missing fields with 0 is a standard machine learning practice that is required for the algorithm to be applied. Any result with missing dummy variables is automatically filled with the value of 0.
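The last two actions in the table above amount to aligning the new data with the columns the model was trained on. The following minimal sketch shows that idea in plain Python. The function name align_to_model, the model_columns list, and the "field=value" column names are hypothetical and are not part of the toolkit.

```python
def align_to_model(encoded_rows, model_columns):
    """Align one-hot encoded rows with the columns the model learned.

    Columns that the model never saw are dropped, and columns the model
    expects but the row lacks are filled with 0.
    """
    return [
        {col: row.get(col, 0) for col in model_columns}
        for row in encoded_rows
    ]
```

For example, if the model learned the columns bytes, host=a, and host=b, a new event carrying host=c loses that column and receives 0 for both host=a and host=b.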

Prepared data converted into numeric matrix

The prepared data is converted into a numeric matrix. The model file is applied to this matrix and the results are calculated.
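To illustrate why applying a saved model is fast, the sketch below computes predictions from coefficients that already exist, one matrix row at a time. It assumes a linear model purely for the sake of the example. The dictionary layout with the keys columns, coefficients, and intercept is hypothetical and does not describe how the toolkit encodes its model files.

```python
def apply_linear_model(model, aligned_rows):
    """Compute one prediction per row from previously saved coefficients."""
    columns = model["columns"]
    coefs = model["coefficients"]
    predictions = []
    for row in aligned_rows:
        # Build one row of the numeric matrix in the model's column order.
        vector = [float(row[c]) for c in columns]
        predictions.append(
            model["intercept"] + sum(w * x for w, x in zip(coefs, vector))
        )
    return predictions
```

No training happens in this step: each prediction is a weighted sum over one row, so the work grows only with the number of rows and columns being scored.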

Model applied to prepared data and results displayed

The apply command adds the results columns to the prepared data and returns the results to the search pipeline.

See also

See the following resources for additional information on ML-SPL commands in MLTK:

Last modified on 19 February, 2025

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.5.0

