
Configure algorithm performance costs
Running machine learning algorithms affects the performance and cost of your Splunk platform. Machine learning consumes compute resources, run time, memory, and disk space. Each machine learning algorithm has a different cost, which grows with the number of input fields you select and the total number of events being processed.
The Machine Learning Toolkit's mlspl.conf file controls the resources that machine learning can use, and it ships with conservative default settings to prevent overloading a search head. By default, the mlspl.conf file intelligently samples input down to 100,000 events.
Users of Splunk on-prem with Admin access can configure the settings of the mlspl.conf file through the Settings tab or the command line. Changes can be made across all algorithms, or for an individual algorithm. Users of Splunk Cloud, whether on a single instance or a cluster of instances, must create a support ticket to make changes to the mlspl.conf file.
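Before editing anything, it can help to confirm which mlspl.conf values are currently in effect. On an on-prem search head, one way to do this is with Splunk's btool utility, sketched here under the assumption of a default install path:

```
# List the effective mlspl.conf settings; --debug shows which file each value comes from
$SPLUNK_HOME/bin/splunk btool mlspl list --debug
```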
Before making algorithm setting changes, make sure you understand their impact by adding the ML-SPL Performance App for the Machine Learning Toolkit to your setup through Splunkbase. Cloud users must create a support ticket to get this app installed. The ML-SPL Performance App provides performance results for guidance and benchmarking purposes.
Make changes using the Settings tab
Users of the on-prem Splunk platform with Admin access can configure the settings of the mlspl.conf file through the Settings tab of the Machine Learning Toolkit. Follow these steps to make your changes:
- In your Splunk instance, choose the Machine Learning Toolkit app.
- Within the Machine Learning Toolkit app, choose the Settings tab from the nav bar.
- Choose Edit Default Settings to change the mlspl.conf file configurations across all algorithms, or click the name of an individual algorithm to change the settings for that algorithm only.
- Within the settings page, hover over any setting name for additional information. Make changes to one or more setting fields as desired.
- Click Save when done. Repeat these steps as needed.
You must close and relaunch the Splunk platform for any mlspl.conf setting changes to take effect.
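On an on-prem deployment, relaunching typically means restarting the Splunk service, for example with the following command (again assuming a default install path):

```
# Restart Splunk so that mlspl.conf changes are picked up
$SPLUNK_HOME/bin/splunk restart
```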
Setting descriptions
Setting | Default | Description |
---|---|---|
handle_new_cat | default | Action to perform when new values for a categorical (explanatory) variable are encountered in partial_fit. Setting this to default sets all values of the column that correspond to the new categorical value to zeroes. skip skips over rows that contain the new values and raises a warning. stop stops the operation by raising an error. |
max_distinct_cat_values | 100 | The maximum number of distinct values in a categorical feature field, or input field, that will be used in one-hot encoding, the conversion of categorical values to numeric values. If the number of distinct values exceeds this limit, the field is dropped, or excluded from analysis, and a warning appears. |
max_distinct_cat_values_for_classifiers | 100 | The maximum number of distinct values in a categorical field that is the target, or output, variable in a classifier algorithm. |
max_distinct_cat_values_for_scoring | 100 | The maximum number of distinct values in a categorical field that is the target (or response) variable in a scoring method. If the number of distinct values exceeds this limit, the field is dropped with an appropriate error message. |
max_fit_time | 600 | The maximum time, in seconds, to spend in the "fit" phase of an algorithm. This setting does not apply to the other phases of a search, such as retrieving events from an index. |
max_inputs | 100000 | The maximum number of events an algorithm considers when fitting a model. If this limit is exceeded and use_sampling is true, the fit command downsamples its input using the reservoir sampling algorithm before fitting a model. If use_sampling is false and this limit is exceeded, the fit command throws an error. |
max_memory_usage_mb | 1000 | The maximum amount of memory, in megabytes, that the fit command can use when computing a model. |
max_model_size_mb | 15 | The maximum amount of disk space, in megabytes, that the final model created by the fit command is allowed to occupy. Some algorithms, such as SVM and RandomForest, might create unusually large models, which can lead to performance problems with bundle replication. |
max_score_time | 600 | The maximum time, in seconds, to spend in the "score" phase of an algorithm. |
streaming_apply | false | When set to true, the apply command runs at the indexer level. Otherwise, apply runs on the search head. |
use_sampling | true | Indicates whether to use reservoir sampling for data sets that exceed max_inputs, or to instead throw an error. |
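As an illustration of how max_inputs and use_sampling interact, consider a fit search like the following (the lookup, field, and model names are hypothetical). With the defaults above, if the lookup returned more than 100,000 events, the fit command would reservoir-sample the input down to 100,000 events before training. With use_sampling set to false, the same search would instead throw an error.

```
| inputlookup server_metrics.csv
| fit LinearRegression power_draw from cpu_utilization disk_utilization into power_model
```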
Make changes using the command line interface
Users of the on-prem Splunk platform with Admin access can configure the settings of the mlspl.conf file through the command line interface. You can configure the fit and apply commands through the setting properties in the mlspl.conf configuration file, which is located in the default directory.
To avoid losing your configuration file changes when you upgrade the app, create a copy of the mlspl.conf file with only the modified stanzas and settings, then save it to $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/
Follow these steps to make your changes:
- Open the mlspl.conf file in the default directory at the following path:
$SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/default/mlspl.conf
- In this file, you can specify default settings for all algorithms, or for an individual algorithm, as shown in the sketch after this list.
To apply settings globally, use the [default] stanza. To apply settings to a single algorithm, use a stanza named for that algorithm, for example, [LinearRegression] for the LinearRegression algorithm.
Not all global settings can be set or overwritten in an algorithm-specific stanza. For details, see [How to copy and edit a configuration file](http://docs.splunk.com/Documentation/Splunk/8.1.3/Admin/Howtoeditaconfigurationfile).
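A minimal sketch of such an override file, saved to the local directory noted above; the values shown are illustrative, not recommendations:

```
# $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/mlspl.conf

# Global settings, applied to all algorithms
[default]
max_inputs = 200000
max_fit_time = 900

# Settings for the LinearRegression algorithm only
[LinearRegression]
max_memory_usage_mb = 2000
```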
The settings you can configure from the command line, and their defaults, are the same as those described in the Setting descriptions table earlier on this page.
You must close and relaunch the Splunk platform for any mlspl.conf setting changes to take effect.
This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.0.0, 5.1.0, 5.2.0, 5.2.1