Machine Learning Toolkit Troubleshooting in Splunk Enterprise Security

This topic covers known issues with the Machine Learning Toolkit (MLTK) in Splunk Enterprise Security and their potential workarounds.

Error messages

MLTK writes its errors to the mlspl.log file. The error messages alone are not always enough to troubleshoot an issue. Use the Machine Learning Audit dashboard to correlate MLTK errors with the corresponding failed searches. See Machine Learning Audit Dashboard.
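As a starting point for reviewing the log directly, a search along the following lines can surface recent MLTK errors. This is a sketch under assumptions: it assumes that mlspl.log is indexed in the _internal index like other Splunk logs, that your role can search that index, and that error events contain the string ERROR.

```spl
index=_internal source=*mlspl.log* ERROR
| sort - _time
| table _time, host, source, _raw
```

Narrow the time range to the window in which the MLTK search failed, then compare the results with the Machine Learning Audit dashboard.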

Testing and training models overwrites them

MLTK replaces a model on every run of the fit command unless you use partial_fit=true. Even with partial_fit=true, MLTK updates the original model, which you might not want while testing. To test without overwriting or updating the original model, save the test model in your user namespace or under a different name.

Models whose names have the app: prefix are saved in the shared application namespace, for example: ./apps/SA-AccessProtection/lookups/failures_by_src_count_1d.csv. If you remove the app: prefix from the search, the model is saved in the namespace of the logged-in user instead. For the admin user, for example, the model is saved in a path such as ./users/admin/SplunkEnterpriseSecuritySuite/lookups/recipients_by_src_1h.csv, and the original is not overwritten. The user and app namespaces depend on the user that is logged in and the app that is currently running.

You can also change the name of the model to avoid overwriting the original while testing.

Original model name:
| tstats `summariesonly` count as failure from datamodel=Authentication.Authentication where Authentication.action="failure" by Authentication.src,_time span=1h | fit DensityFunction failure dist=norm into app:failures_by_src_count_1h

Model name revised to save in the user namespace:
| tstats `summariesonly` count as failure from datamodel=Authentication.Authentication where Authentication.action="failure" by Authentication.src,_time span=1h | fit DensityFunction failure dist=norm into failures_by_src_count_1h

Model name revised to include a testing_ prefix:
| tstats `summariesonly` count as failure from datamodel=Authentication.Authentication where Authentication.action="failure" by Authentication.src,_time span=1h | fit DensityFunction failure dist=norm into app:testing_failures_by_src_count_1h
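To check a test model without touching the original, you can run the MLTK apply command against it. The apply command reads a saved model rather than retraining it. The following sketch reuses the search above and assumes that the testing_failures_by_src_count_1h model already exists in the app namespace.

```spl
| tstats `summariesonly` count as failure from datamodel=Authentication.Authentication where Authentication.action="failure" by Authentication.src,_time span=1h
| apply app:testing_failures_by_src_count_1h
```

For a DensityFunction model, the results should include an outlier flag for the failure field. The exact output field names depend on your MLTK version.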

Maximum group limit

By default, the MLTK DensityFunction algorithm can create a maximum of 1024 groups when you use it with a by clause. If you convert custom searches to MLTK and the field in the by clause splits the data into more groups than the limit allows, the fit command returns an error and no results display. To change the limit, change the value of the max_groups setting in the DensityFunction stanza of the mlspl.conf file in the Machine Learning Toolkit app.
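A minimal sketch of such an override follows. The app directory name Splunk_ML_Toolkit and the use of the local directory are assumptions based on common Splunk conventions, and the value 2048 is only an illustration. Choose a value that fits your data, because more groups generally mean more memory and longer fit times.

```ini
# Assumed location: $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/mlspl.conf
# Illustrative value only; the limit described above is 1024.
[DensityFunction]
max_groups = 2048
```

Depending on your deployment, you might need to restart Splunk or reload the configuration before the new limit takes effect.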

Example search

| tstats `summariesonly` count as dest_port_traffic_count from datamodel=Network_Traffic.All_Traffic by All_Traffic.dest_port,_time span=1d | `drop_dm_object_name("All_Traffic")` | fit DensityFunction dest_port_traffic_count by dest_port dist=norm into app:count_by_dest_port_1d

Example error message
Error in 'fit' command: Error while fitting "DensityFunction model: The number of groups cannot exceed <abc>; the current number of groups is <xyz>."
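Before you raise the limit, it can help to check how many groups the by clause actually produces. The following sketch counts the distinct dest_port values that the example search would split on. Run it over the same time range as the fit search. The field name group_count is a label chosen for this example.

```spl
| tstats `summariesonly` dc(All_Traffic.dest_port) as group_count from datamodel=Network_Traffic.All_Traffic
```

If group_count exceeds the max_groups value, either raise the limit or split on a field with fewer distinct values.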

See the syntax constraints of the Density Function in the Splunk Machine Learning Toolkit User Guide.

CSV required

There's a lookup table file at $SPLUNK_HOME/etc/apps/SA-Utils/lookups/qualitative_thresholds.csv that's required for using the qualitative_id thresholds. If the CSV file is missing, then you can't use the qualitative_id thresholds for extreme, high, medium, low, and minimal.
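One way to confirm that the file is present and readable is to load it with the inputlookup command. This sketch assumes that the lookup file is visible from the app context you are searching in. If the file is missing or not shared, the search returns an error instead of rows.

```spl
| inputlookup qualitative_thresholds.csv
```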

MLTK-backed key performance indicator errors

On the Risk Analysis page, risk scores can show "unable to load results" for up to one day after a risk modifier is created.

This occurs because the key security indicator searches now use MLTK, and the corresponding MLTK models for these qualitative key indicators haven't been generated yet.

To load these results, manually run the following searches from Configure > Content > Content Management:

Risk - Median Object Risk Per Day - Model Gen
Risk - Total Risk By Risk Object Type Per Day - Model Gen

Python3 and MLTK 5.x

MLTK 5.x moves from Python 2 to Python 3. Models generated with MLTK 4.x are not compatible with MLTK 5.x and must be regenerated. This might not be an issue, because the model generation searches run daily anyway. However, if you want to use MLTK searches immediately after upgrading to MLTK 5.x, rerun the model generation searches right after the upgrade.

See Update Splunk MLTK models for Python 3 in the Splunk Enterprise Python 3 Migration guide.
