Machine Learning Toolkit Troubleshooting in Splunk Enterprise Security

This topic describes known issues with the Machine Learning Toolkit (MLTK) in Splunk Enterprise Security and their potential workarounds.

Error messages

MLTK writes errors to the mlspl.log file. The error messages alone are not always enough to troubleshoot an issue. Use the Machine Learning Audit dashboard to correlate MLTK errors with the corresponding failed searches. See Machine Learning Audit Dashboard.
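As a starting point, you can also search the log directly. The following is a minimal sketch that assumes mlspl.log is written under $SPLUNK_HOME/var/log/splunk and is therefore indexed in the _internal index, which is the default for Splunk internal logs. The keyword filter and the table fields are illustrative choices, not part of the product.

```
index=_internal source=*mlspl.log* (ERROR OR WARNING) | sort - _time | table _time host source _raw
```

Narrow the time range to the window in which the model generating search failed so that the log entries line up with the failed search.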

Testing and training models overwrites them

Unless you specify partial_fit=true, MLTK replaces a model every time the fit search runs. With partial_fit=true, MLTK updates the existing model instead of replacing it, which you might not want while testing either. To test without overwriting or updating the original model, save the test model in your user namespace or give it a different name.

MLTK model names with the app: prefix are saved in the shared application namespace, for example: ./apps/SA-AccessProtection/lookups/failures_by_src_count_1d.csv. If you are the admin user and you remove the app: prefix from the search, the model is saved in the admin user namespace, such as ./users/admin/SplunkEnterpriseSecuritySuite/lookups/recipients_by_src_1h.csv, and does not overwrite the original. The user and app namespaces depend on which user is logged in and which app is currently running. You can also change the model name to avoid overwriting the original while testing.

Original model name:
| tstats `summariesonly` count as failure from datamodel=Authentication.Authentication where Authentication.action="failure" by Authentication.src,_time span=1h | fit DensityFunction failure dist=norm into app:failures_by_src_count_1h

Model name revised to save in non-app space:
| tstats `summariesonly` count as failure from datamodel=Authentication.Authentication where Authentication.action="failure" by Authentication.src,_time span=1h | fit DensityFunction failure dist=norm into failures_by_src_count_1h

Model name revised to include testing:
| tstats `summariesonly` count as failure from datamodel=Authentication.Authentication where Authentication.action="failure" by Authentication.src,_time span=1h | fit DensityFunction failure dist=norm into app:testing_failures_by_src_count_1h
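After a test run, you might want to inspect the test model and then remove it. The following sketch uses the MLTK summary and deletemodel commands with the testing model name from the example above. It assumes that both commands accept the app: namespace prefix in your MLTK version and that your role has permission to delete models in the app namespace.

Inspect the test model:

```
| summary app:testing_failures_by_src_count_1h
```

Delete the test model when you are done:

```
| deletemodel app:testing_failures_by_src_count_1h
```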

Maximum group limit

The MLTK DensityFunction algorithm can create a maximum of 1024 groups when you use it with a by clause. If you convert custom searches to MLTK and the field that you split by produces more groups than the limit allows, the search returns an error instead of results. To change the limit, change the value of the max_groups setting in the DensityFunction stanza of the mlspl.conf file in the Machine Learning Toolkit app.
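A minimal sketch of the override follows. It assumes that you place the change in a local copy of the file, such as $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/mlspl.conf, rather than editing the default file. The value 2048 is an arbitrary illustration, not a recommendation.

```
[DensityFunction]
max_groups = 2048
```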

Example search

| tstats `summariesonly` count as dest_port_traffic_count from datamodel=Network_Traffic.All_Traffic by All_Traffic.dest_port,_time span=1d | `drop_dm_object_name("All_Traffic")` | fit DensityFunction dest_port_traffic_count by dest_port dist=norm into app:count_by_dest_port_1d

Example error message
Error in 'fit' command: Error while fitting "DensityFunction model: The number of groups cannot exceed <abc>; the current number of groups is <xyz>."

See the syntax constraints of the Density Function in the Splunk Machine Learning Toolkit User Guide.
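To check whether a by clause will exceed the limit before you fit the model, you can count the distinct values of the split field. The following sketch reuses the example search above and replaces the fit command with a distinct count. The group_count field name is a hypothetical label chosen for this example.

```
| tstats `summariesonly` count as dest_port_traffic_count from datamodel=Network_Traffic.All_Traffic by All_Traffic.dest_port,_time span=1d | `drop_dm_object_name("All_Traffic")` | stats dc(dest_port) as group_count
```

If group_count is greater than the configured max_groups value, either raise the limit or split by a field with fewer distinct values.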

CSV required

The qualitative_id thresholds require the lookup table file at $SPLUNK_HOME/etc/apps/SA-Utils/lookups/qualitative_thresholds.csv. If the CSV file is missing, you can't use the qualitative_id thresholds for extreme, high, medium, low, and minimal.
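One way to confirm that the file is present without shell access is to query the lookup table files REST endpoint from the search bar. This is a sketch that assumes your role can run the rest command and can see objects in the SA-Utils app. The fields in the table command are the ones this endpoint typically returns.

```
| rest /servicesNS/-/-/data/lookup-table-files splunk_server=local | search title="qualitative_thresholds.csv" | table title eai:acl.app eai:acl.sharing eai:data
```

If the search returns no rows, the file is either missing or not visible to your user.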

MLTK-backed key performance indicator errors

For MLTK-backed key indicator searches, some key indicator panels might fail to load results and instead display the default error message "Model not generated yet". This happens when the models that correspond to the key indicator searches were not updated to MLTK and the corresponding MLTK models of these qualitative key indicators have not been generated yet.

To load the results and display the key indicator panels as expected, run the model generating search for each affected key indicator search.

Run the model gen search

1. Navigate to the Enterprise Security UI page with the key indicator panels. You might see that a few of the key indicator panels do not display as expected.
2. Click the Run Related Search link that appears below the error message in the key indicator panel. This opens a dialog box that prompts you to run the corresponding model generating search.
3. Click Run to run the model generating search.
4. After the model generating search completes, reload the UI page with the key indicators to display the panels as expected.
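To confirm that a model now exists after the search completes, you can list the models that are visible in your current app and user context. This sketch uses the MLTK listmodels command. Which models appear depends on the app you run the search from and on your permissions.

```
| listmodels
```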

Python3 and MLTK 5.x

When MLTK moves from Python 2 to Python 3, as it does in MLTK 5.x, models that were generated with MLTK 4.x are no longer compatible and must be regenerated. This might not be an issue because the model generating searches run daily. However, if you want to use MLTK searches immediately after upgrading to MLTK 5.x, you must re-run the model generating searches yourself.

See Update Splunk MLTK models for Python 3 in the Splunk Enterprise Python 3 Migration guide.
