Splunk® Machine Learning Toolkit

User Guide

Configure algorithm performance costs

Running machine learning algorithms affects the performance of your Splunk platform. Machine learning requires compute resources, run time, memory, and disk space. Each machine learning algorithm has a different cost, which also depends on the number of input fields you select and the total number of events being processed.

The Machine Learning Toolkit's mlspl.conf file controls the resources that machine learning uses. The file ships with conservative default settings to prevent overloading a search head. By default, when more than 100,000 events are passed to the fit command, the toolkit downsamples the input to 100,000 events before fitting a model.

Users of Splunk on-prem with Admin access can configure the settings of the mlspl.conf file through the Settings tab or the command line. Changes can be made across all algorithms, or by individual algorithm. Users of Splunk Cloud, whether one instance or a cluster of instances, need to create a support ticket to make changes to the mlspl.conf file.

Before you change algorithm settings, make sure you understand the impact of those changes by adding the ML-SPL Performance App for the Machine Learning Toolkit to your setup through Splunkbase. Cloud users need to create a support ticket to get this app installed. The ML-SPL Performance App provides access to performance results for guidance and benchmarking purposes.

Make changes using the Settings tab

Users of the on-prem Splunk platform with Admin access can configure the settings of the mlspl.conf file through the Settings tab of the Machine Learning Toolkit. Follow these steps to make your changes:

  1. In your Splunk instance, choose the Machine Learning Toolkit app.
    [Image: The Splunk platform main page. The Machine Learning Toolkit app is highlighted on the left menu bar.]

  2. Within the Machine Learning Toolkit app, choose the Settings tab from the nav bar.
    [Image: The landing page within the MLTK app. The tab for Settings is highlighted on the top navigation bar for the MLTK app.]

  3. Choose Edit Default Settings to change the mlspl.conf file configurations across all algorithms, or click the name of an individual algorithm to change the settings for a particular algorithm.
    [Image: The view within the Settings tab of the MLTK app. In the top right corner is a button labeled Edit Default Settings. The body of the page shows a list of all the algorithms within the MLTK app. The algorithm for Birch is highlighted to show that you can pick an individual algorithm to adjust its settings from this page.]

  4. Within the settings page, hover over any setting name for additional information. Make changes to one or more setting fields as desired.
    [Image: The view from selecting one algorithm from the list of algorithms under the Settings tab of the MLTK app. Several editable fields are listed including max fit time and use sampling. A green button labeled Save is highlighted.]

  5. Click Save when done. Repeat these steps as needed.

You must restart the Splunk platform for any mlspl.conf setting changes to take effect.

Setting descriptions

| Setting | Default | Description |
| --- | --- | --- |
| handle_new_cat | default | The action to perform when partial_fit encounters new values for a categorical (explanatory) variable. The default option sets all values of the column that corresponds to the new categorical value to zeroes. The skip option skips rows that contain the new values and raises a warning. The stop option stops the operation by raising an error. |
| max_distinct_cat_values | 100 | The maximum number of distinct values in a categorical feature field, or input field, that will be used in one-hot encoding. One-hot encoding is when you convert categorical values to numeric values. If the number of distinct values exceeds this limit, the field will be dropped, or excluded from analysis, and a warning appears. |
| max_distinct_cat_values_for_classifiers | 100 | The maximum number of distinct values in a categorical field that is the target, or output, variable in a classifier algorithm. |
| max_distinct_cat_values_for_scoring | 100 | The maximum number of distinct values in a categorical field that is the target, or response, variable in a scoring method. If the number of distinct values exceeds this limit, the field will be dropped with an appropriate error message. |
| max_fit_time | 600 | The maximum time, in seconds, to spend in the "fit" phase of an algorithm. This setting does not relate to the other phases of a search such as retrieving events from an index. |
| max_inputs | 100000 | The maximum number of events an algorithm considers when fitting a model. If this limit is exceeded and use_sampling is true, the fit command downsamples its input using the Reservoir Sampling algorithm before fitting a model. If use_sampling is false and this limit is exceeded, the fit command throws an error. |
| max_memory_usage_mb | 1000 | The maximum amount of memory, in megabytes, that the fit command can use when computing the model. |
| max_model_size_mb | 15 | The maximum amount of disk space, in megabytes, that the final model created by the fit command is allowed to take up. Some algorithms, such as SVM and RandomForest, might create unusually large models, which can lead to performance problems with bundle replication. |
| max_score_time | 600 | The maximum time, in seconds, to spend in the "score" phase of an algorithm. |
| streaming_apply | false | Setting streaming_apply to true allows the apply command to run at the indexer level. Otherwise, apply runs on the search head. |
| use_sampling | true | Indicates whether to use Reservoir Sampling for data sets that exceed max_inputs or to instead throw an error. |
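
The interaction between max_inputs and use_sampling can be illustrated with a minimal Python sketch of reservoir sampling (Algorithm R). This is not the toolkit's implementation; the function name prepare_fit_input and its seed parameter are hypothetical and exist only for illustration.

```python
import random


def prepare_fit_input(events, max_inputs=100000, use_sampling=True, seed=None):
    """Return at most max_inputs events, following the documented limits.

    Hypothetical helper: a sketch of the behavior described for
    max_inputs and use_sampling, not code from the toolkit.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, event in enumerate(events):
        if i < max_inputs:
            # The first max_inputs events always fill the reservoir.
            reservoir.append(event)
            continue
        if not use_sampling:
            # Limit exceeded and sampling disabled: fail instead of sampling.
            raise ValueError(
                f"input exceeds max_inputs={max_inputs} and use_sampling is false"
            )
        # Algorithm R: event i replaces a random slot with
        # probability max_inputs / (i + 1).
        j = rng.randint(0, i)
        if j < max_inputs:
            reservoir[j] = event
    return reservoir
```

With the limits lowered for demonstration, prepare_fit_input(range(1000), max_inputs=100) returns 100 events drawn uniformly from the 1,000 inputs, while the same call with use_sampling=False raises an error.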

Make changes using the command line interface

Users of the on-prem Splunk platform with Admin access can configure the settings of the mlspl.conf file through the command line interface. You can configure the fit and apply commands through the settings in the mlspl.conf configuration file, which is located in the default directory.

To avoid losing your configuration file changes when you upgrade the app, create a copy of the mlspl.conf file with only the modified stanzas and settings, then save it to $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/

Follow these steps to make your changes:

  1. Open the configuration file at the following path:
    $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/default/mlspl.conf
  2. In this file, you can specify default settings for all algorithms, or for an individual algorithm.
    To apply global settings, use the [default] stanza. To apply algorithm-specific settings, use a stanza named for the algorithm, for example, [LinearRegression] for the LinearRegression algorithm.

Not all global settings can be set or overwritten in an algorithm-specific section. For details, see How to copy and edit a configuration file (http://docs.splunk.com/Documentation/Splunk/8.0.6/Admin/Howtoeditaconfigurationfile).
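
As a rough illustration of how a [default] stanza and an algorithm-specific stanza combine, the following Python sketch parses an example file with the standard configparser module and layers the algorithm stanza over the global one. The stanza contents are example values only (the max_fit_time override of 1200 is hypothetical), the helper effective_settings is not part of Splunk, and the sketch does not model the Splunk platform's own configuration handling or the restriction that some settings cannot be overridden per algorithm.

```python
import configparser

# Example mlspl.conf contents. The [LinearRegression] override is a
# hypothetical value chosen for illustration.
EXAMPLE_CONF = """
[default]
max_inputs = 100000
use_sampling = true
max_fit_time = 600

[LinearRegression]
max_fit_time = 1200
"""


def effective_settings(conf_text, algorithm):
    """Layer an algorithm stanza over [default] (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep setting names exactly as written
    parser.read_string(conf_text)
    settings = {}
    if parser.has_section("default"):
        settings.update(parser["default"])
    if parser.has_section(algorithm):
        # Algorithm-specific values take priority over global ones.
        settings.update(parser[algorithm])
    return settings


print(effective_settings(EXAMPLE_CONF, "LinearRegression"))
```

In this sketch, LinearRegression picks up the overridden max_fit_time and inherits the remaining values from [default], while an algorithm with no stanza of its own gets the [default] values unchanged. All values are returned as strings.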

Setting descriptions

| Setting | Default | Description |
| --- | --- | --- |
| max_inputs | 100000 | The maximum number of events an algorithm considers when fitting a model. If this limit is exceeded and use_sampling is true, the fit command downsamples its input using the Reservoir Sampling algorithm before fitting a model. If use_sampling is false and this limit is exceeded, the fit command throws an error. |
| use_sampling | true | Indicates whether to use Reservoir Sampling for data sets that exceed max_inputs or to instead throw an error. |
| max_fit_time | 600 | The maximum time, in seconds, to spend in the "fit" phase of an algorithm. This setting does not relate to the other phases of a search such as retrieving events from an index. |
| max_memory_usage_mb | 1000 | The maximum allowed memory usage, in megabytes, by the fit command while fitting a model. |
| max_model_size_mb | 15 | The maximum allowed size of a model, in megabytes, created by the fit command. Some algorithms such as SVM and RandomForest might create unusually large models, which can lead to performance problems with bundle replication. |
| max_distinct_cat_values | 100 | The maximum number of distinct values in a categorical feature field, or input field, that will be used in one-hot encoding. One-hot encoding is when you convert categorical values to numeric values. If the number of distinct values exceeds this limit, the field will be dropped, or excluded from analysis, and a warning appears. |
| max_distinct_cat_values_for_classifiers | 100 | The maximum number of distinct values in a categorical field that is the target, or output, variable in a classifier algorithm. |
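
To make the max_distinct_cat_values description concrete, here is a small Python sketch of one-hot encoding with a cap on distinct values. It is an illustration under assumptions, not the toolkit's code: the function one_hot_encode and the field=value column naming are hypothetical.

```python
import warnings


def one_hot_encode(rows, field, max_distinct_cat_values=100):
    """One-hot encode a categorical field, or drop it if it has too many values.

    Hypothetical helper sketching the documented behavior of
    max_distinct_cat_values.
    """
    values = sorted({row[field] for row in rows})
    if len(values) > max_distinct_cat_values:
        # Too many distinct values: exclude the field and warn.
        warnings.warn(
            f"dropping field {field!r}: {len(values)} distinct values "
            f"exceed the limit of {max_distinct_cat_values}"
        )
        return [{k: v for k, v in row.items() if k != field} for row in rows]
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != field}
        for value in values:
            # One numeric 0/1 column per distinct categorical value.
            new_row[f"{field}={value}"] = 1 if row[field] == value else 0
        encoded.append(new_row)
    return encoded
```

For example, encoding a host field with the values a and b produces the numeric columns host=a and host=b. If the field has more distinct values than the limit, the sketch removes the field from every row and issues a warning.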

You must restart the Splunk platform for any mlspl.conf setting changes to take effect.

Last modified on 24 September, 2020

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.0.0, 5.1.0, 5.2.0

