About the fit and apply commands

The fit and apply commands are 2 of the custom machine learning (ML) Splunk Search Processing Language (SPL) commands included in the Splunk Machine Learning Toolkit (MLTK). These ML-SPL commands implement classic machine learning and statistical learning tasks.

The fit command trains and saves a model from search results. The apply command applies that model to new search results to provide predictions or other insights.

The fit and apply commands work on relative searches with relative time ranges, but will not complete on real-time searches.

Before you begin

Before training your own models, your data might benefit from preprocessing. Data preprocessing can help address missing values, remedy unit disagreements, convert categorical data to numeric, and more. Running the fit command does include some lightweight data preprocessing, but a clean, numeric matrix of data is best for building your machine learning models.

To learn about data preprocessing options see Preparing your data for machine learning and Preprocessing your data using the Splunk Machine Learning Toolkit.

Actions taken by the fit command

The following actions take place when the fit command is run:

The fit command modifies a model. The command is considered risky because running it can cause performance issues. As a result, this command triggers SPL safeguards. To learn more, see Search commands for machine learning safeguards.

Search results pull into memory

When you run the fit command the search results pull into memory, a copy is created of the search results, and the search results are parsed into a Pandas DataFrame format. The originally ingested data is not changed.

Search results copy gets prepared for machine learning

Your data must be prepared to make it suitable for machine learning. The following actions all take place on the search results copy.

Action Description

Fields with any null values are discarded

The fit command discards fields that contain no values. To train a model, the machine learning algorithm requires all search results have a value. Any null value means the entire event will not contribute towards the learned model. The fit command discards any fields that are null throughout all of the events, and drops every event that has one or more null fields.

If you do not want null fields to be removed from the search results you must change your search.

You can specify that any search results with null values be included in the learned model. Or choose to replace null values if you want the algorithm to learn from an example with a null value and to return an empty collection. Or choose to replace null values if you want the algorithm to learn from an example with a null value and to throw an exception.

To include the results with null values in the model, you must replace the null values before using the fit command in your search. You can replace null values by using SPL commands such as fillnull, filldown, or eval. To learn more about data preprocessing options see Preparing your data for machine learning.

Non-numeric fields with more than 100 distinct values are discarded

The fit command discards non-numeric fields if the fields have more than 100 distinct values. In machine learning, many algorithms do not perform well with high-cardinality fields, because every unique, non-numeric entry in a field becomes an independent feature.

The limit for distinct values is set to 100 by default. You can change the limit by changing the max_distinct_cat_values attribute in your local copy of the mlspl.conf file. For details on updating the mlspl.conf file attribute, see Configure algorithm performance costs.

Only users with admin permissions can make changes to the mlspl.conf file. To learn more about changing .conf files, see How to edit a configuration file in the Splunk Admin Manual.

An alternative to discarding fields is to use the values to generate a usable feature set by using SPL commands such as streamstats or eventstats. You must generate these calculations in your search before running the fit command.

To learn more about data preprocessing options see Preparing your data for machine learning.

Non-numeric fields are converted into dummy variables using one-hot encoding Algorithms perform best with numeric data rather than categorical data. The fit command converts fields that contain strings or characters into numbers. The fit command makes this conversion using one-hot encoding. One-hot encoding converts categorical values to the binary values of 1 or 0.

Prepared data converted into numeric matrix and specified algorithm runs

The prepared data is converted into a numeric matrix that's ready to be processed by the selected algorithm, and trained to become the machine learning model. A temporary model is created in memory.

Model applied to the prepared data

The fit command applies the temporary model to the prepared data. The fit command appends one or more columns to display the results. The appended search results are then returned to the search pipeline.

Learned model encoded and saved as knowledge object

When the temporary model file is saved, it becomes a permanent model file. These permanent model files are sometimes referred to as learned models or encoded lookups. The learned model is saved on disk. The model follows all of the Splunk knowledge object rules, including permissions and bundle replication.

When the chosen algorithm supports saved models, and the into clause is included in the fit command, and the learned model is saved as a knowledge object. If the algorithm does not support saved models, or the into clause is not included, the temporary model is deleted.

Actions taken by the apply command

The apply command completes several actions to re-convert data learned when running the fit command. The apply command generally runs on a small slice of data that is different data than that used for training the model with the fit command. The apply command generates insights into new columns.

Coefficients created through the fit command and the resulting model artifact are already computed and saved, making the apply command fast to run. The apply command operates like a streaming command applied to data.

The following actions take place when the apply command is run:

Load the learned model

The learned model specified by the search command is loaded in memory. Standard knowledge object permissions are used. The following are examples of the apply command loading the learned model:

...| apply temp_model

...| apply user_behavior_clusters

Loaded model data gets prepared

The data must be properly prepared to be suitable for machine learning and running though the selected algorithm. The following actions all take place on the search results copy.

Action	Description
Fields with any null values throughout all the events are discarded	The `apply` command discards fields that contain no values.
Non-numeric fields with more than 100 distinct values are discarded	The `apply` command discards non-numeric fields if the fields have more than 100 distinct values. The limit for distinct values is set to 100 by default. You can change the limit by changing the `max_distinct_cat_values` attribute in your local copy of the mlspl.conf file. For details on updating the mlspl.conf file attribute, see Configure algorithm performance costs. Only users with admin permissions can make changes to the mlspl.conf file. To learn more about changing .conf files, see How to edit a configuration file in the Splunk Admin Manual.
Non-numeric fields are converted into dummy variables using one-hot encoding	Algorithms perform better with numeric data, versus categorical data. The `apply` command converts fields that contain strings or characters into numbers using one-hot encoding. The `apply` command converts non-numeric fields to the binary variables of 1 or 0.
Dummy values not present in the learned model are discarded	The `apply` command removes data that is not part of the learned model.
Missing dummy variable replaced with 0	Replacing missing fields with 0 is a standard machine learning practice that's required for the algorithm to be applied. Any result with missing dummy variables are automatically filled with the value of 0.

Prepared data converted into numeric matrix

The prepared data is converted into a numeric matrix. The model file is applied to this matrix and the results are calculated.

Model applied to prepared data and results displayed

The apply command returns to the prepared data and adds the results columns to the search pipeline.

Related answers from Splunk Community

About the fit and apply commands

Before you begin

Actions taken by the fit command

Search results pull into memory

Search results copy gets prepared for machine learning

Prepared data converted into numeric matrix and specified algorithm runs

Model applied to the prepared data

Learned model encoded and saved as knowledge object

Actions taken by the apply command

Load the learned model

Loaded model data gets prepared

Prepared data converted into numeric matrix

Model applied to prepared data and results displayed

See also

Comments

About the fit and apply commands

Was this topic useful?