Splunk® Machine Learning Toolkit

User Guide

Frequently Asked Questions

Please see the following list of the most frequently asked questions about the Machine Learning Toolkit. Don't see the information you need? Ask your question and get answers through community support at Splunk Answers.

Getting started

How do I get started with machine learning in Splunk?

If you have not done so already, we recommend reviewing these introductory documents:

We are active bloggers, and encourage you to read our wonderful tutorials on machine learning and other related topics:

  • Interested in Statistical Forecasts and Anomalies? Check out Part 1, Part 2 and Part 3 of this blog
  • If you have ITSI or are interested in predictive analytics, check out this blog on ITSI and Sophisticated Machine Learning
  • If you have ITSI or are interested in custom anomaly detection, check out Part 1 and Part 2 of this blog
  • If you are hungry for ice cream or anomalies in three flavors, check out Part 1 and Part 2 of this blog

Visit our YouTube channel to see the MLTK in action.

We also have an active GitHub Community where you can not only connect with other MLTK users but also share and reuse custom algorithms.

Splunk also offers a course we encourage you to take: Splunk for Data Science and Analytics

Do I have to use .csv files to load data into the MLTK?

The Splunk platform accepts any type of data. In particular, it works with all IT streaming and historical data. The source of the data can be event logs, web logs, live application logs, network feeds, system metrics, change monitoring, message queues, archive files, data from indexes, third-party data sources, and so on. Basically any data that can be retrieved by a Splunk search can be used by the MLTK.

In general, data sources are grouped into the following categories.

Data source | Description
Files and directories | Most data that you might be interested in comes directly from files and directories.
Network events | The Splunk software can index remote data from any network port and SNMP events from remote devices.
Windows sources | The Windows version of Splunk software accepts a wide range of Windows-specific inputs, including Windows Event Log, Windows Registry, WMI, Active Directory, and Performance monitoring.
Other sources | Other input sources are supported, such as FIFO queues and scripted inputs for getting data from APIs, and other remote data interfaces.

For many types of data, you can add the data directly to your Splunk deployment. If the data that you want to use is not automatically recognized by the Splunk software, you need to provide information about the data before you can add it.

What are the most common use cases for machine learning in Splunk?

The most common use cases are as follows:

  • Anomaly detection
  • Predictive analytics (forecasting)
  • Clustering

This graphic lists common examples of when anomaly detection, predictive analytics and clustering might be used.

What is a Machine Learning Model and how is it different from a Splunk Data Model?

A machine learning model is an encoded lookup file created by a fit search command that uses the into clause. The file persists the learned behaviors to disk so that later searches can use the apply command to apply them to net new data.

Splunk Data Models are knowledge objects for organizing and accelerating your data in the Splunk platform.
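
As a minimal sketch of how a machine learning model is created and reused, the following search trains a model and saves it with the into clause. The lookup my_training_data.csv, the fields churned, tenure, and monthly_charges, and the model name my_churn_model are hypothetical placeholders, and LogisticRegression is only one possible choice of algorithm.

```spl
| inputlookup my_training_data.csv
| fit LogisticRegression churned from tenure monthly_charges into my_churn_model
```

A later search can then load the saved model and apply it to new data. The lookup my_new_data.csv and the output field name predicted_churned are likewise hypothetical.

```spl
| inputlookup my_new_data.csv
| apply my_churn_model as predicted_churned
```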

Why do I need a dedicated search head for the MLTK app?

You need a dedicated search head for the MLTK if you are freely experimenting and creating large numbers of machine learning models of substantial size. The search load and the machine learning workload can get large and impact your production search environment. For applying machine learning models in production (generally extremely light on resource use), or periodically retraining production models, you should be able to use your normal Splunk infrastructure. Please work with your Splunk admin for your specific Splunk deployment.

Why am I seeing the error Error in `SearchOperator: loadjob`?

You need to configure sticky sessions on your load balancer. For further information, see Use a load balancer with search head clustering.

MLTK associated apps and add-ons

How do I know if I'm using the correct version of MLTK and the PSC add-on?

You can refer to the Machine Learning Toolkit version dependencies document to ensure you are running the correct version of the MLTK, Python, Python for Scientific Computing add-on, and Splunk Enterprise.

How can I assess the performance costs of MLTK searches?

Machine learning requires compute resources and disk space. Each algorithm has a different cost, complicated by the number of input fields you select and the total number of events processed. Model files are lookups and will increase bundle replication costs.

For each algorithm implemented in ML-SPL, we measure run time, CPU utilization, memory utilization, and disk activity when fitting models on up to 1,000,000 search results, and applying models on up to 10,000,000 search results, each with up to 50 fields.

Ensure you know the impact of making changes to the algorithm settings by adding the ML-SPL Performance App for the Machine Learning Toolkit to your setup via Splunkbase.

Why am I seeing an error when installing the PSC add-on?

On some Windows installations, installing the PSC add-on through the Splunk Manage Apps user interface might result in an error. This error can be ignored. In some cases it may be necessary to manually unpack the package in the apps directory to get past the error.

Can I use the MLTK in other apps? How do I do that?

Yes, you can use ML-SPL commands in other apps. You need to make the MLTK global if you want to use the ML-SPL commands across all the apps. Remember that model files follow all the same rules as Splunk lookup files: permissions, access control, and replication.

Please follow these steps:

  1. From the top navigation bar, choose Apps ⇒ Manage Apps.
  2. Find the Splunk Machine Learning Toolkit in the list, and click the Permissions link in the Sharing column.
  3. Change the Sharing setting to All apps and (optionally) change any role-based permissions as well. Click Save when done.

We're using version 5.0.0 of the MLTK. Can we add Python 3 libraries?

For details on the steps to add Python 3 libraries, please see Adding Python 3 libraries in the ML-SPL API Guide.

Support is not offered on the use of or upgrade of any Python 3 libraries added to your Splunk platform instance. Any upgrade to the MLTK or the PSC add-on will overwrite any Python library changes.

MLTK know-how

What is partial fit and how does it work?

If an algorithm supports partial_fit, you can incrementally learn on net new data without loading the entire training history in a single search.

We recommend watching this brief video for details on the ways you can have your machine learning workflows update and learn through time: How Does the Splunk Machine Learning Toolkit Learn?

As with the fit command, you want a lightweight search. Please refer back to this question for more information.

Several of the MLTK algorithms offer the partial_fit option. To find out which algorithms support this option, see Algorithms that support partial_fit. For a detailed list of available algorithms, see Algorithms in the MLTK.
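
As a minimal sketch of incremental learning, the following search updates a model with only the newest batch of data by setting partial_fit=true. The lookup my_new_data.csv, the fields churned, tenure, and monthly_charges, and the model name my_incremental_model are hypothetical placeholders. SGDClassifier is used here because it is one of the algorithms that supports partial_fit.

```spl
| inputlookup my_new_data.csv
| fit SGDClassifier churned from tenure monthly_charges partial_fit=true into my_incremental_model
```

Running a search like this on a schedule, over a short time window, lets the model keep learning without reloading the full training history each time.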

Why am I seeing a security warning when I run the fit command?

There are a handful of scenarios that throw a security warning dialog box when the fit command is called.

The warning dialog box prompts you to Cancel, Run, or Investigate the search string.

The scenarios under which this warning appears include:

  • When the fit command is run for the first time after logging into the system with a URL.
  • When the user refreshes the page or logs back in with the URL.
  • When the user chooses the Open in Search option from within the MLTK.
  • When viewing certain Showcase examples.

Users with Write permission can edit the web.conf file to turn off the warning dialog box. For directions, see Turning off the warning dialog box.

What are the automatic sampling and performance settings for the MLTK, and why would I change them?

By default, reservoir sampling is enabled and starts once the number of events passed to the fit command exceeds 100,000.

If you do not want reservoir sampling and have resources available on your Splunk machine, you can disable it and raise the maximum number of inputs (the max_inputs setting) to a preferred number. In an environment set aside for machine learning workloads to avoid impacting production searches, it is not uncommon to increase the max_inputs setting into the millions.

Please ensure you have enough compute and memory resources available before making these changes. You will likely need to change max_memory_usage_mb and other options in Settings as you increase the number of events you want to process.

Is there a way to increase the number of fields to forecast when using the Smart Forecasting Assistant?

In the Smart Forecasting Assistant, you can select at most five fields to forecast. This limit keeps the resulting charts easy to read. To forecast more than five fields, work outside of the Assistant by using SPL in the search bar.
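
As a rough sketch of forecasting more than five fields directly in SPL, the following search passes six fields to the StateSpaceForecast algorithm. The index name my_metrics_index and all six field names are hypothetical placeholders. The choice of StateSpaceForecast and the forecast_k value (assumed here to set how many future points to forecast) are assumptions to check against the algorithm documentation for your MLTK version.

```spl
index=my_metrics_index
| timechart span=1h avg(cpu_pct) AS cpu_pct avg(mem_pct) AS mem_pct avg(disk_io) AS disk_io avg(net_in) AS net_in avg(net_out) AS net_out avg(latency_ms) AS latency_ms
| fit StateSpaceForecast cpu_pct mem_pct disk_io net_in net_out latency_ms forecast_k=24
```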

Algorithms and ML commands

Do I have options outside of the 30 native algorithms in the MLTK?

Yes! On-prem customers looking for solutions that fall outside of the 30 native algorithms can use GitHub to add more algorithms. Solve custom use cases by sharing and reusing algorithms in the Splunk Community for MLTK on GitHub. Here you can also learn about new machine learning algorithms from the Splunk open source community, and connect with fellow users of the MLTK.

Splunk Cloud Platform customers can also use GitHub to add more algorithms via an app. The Splunk GitHub for Machine learning app provides access to custom algorithms and is based on the Machine Learning Toolkit open source repo. Splunk Cloud Platform customers need to create a support ticket to have this app installed.

To access the Machine Learning Toolkit open source repo, see the MLTK GitHub repo.

The Machine Learning Toolkit and the Python for Scientific Computing add-on must be installed in order for GitHub to work in your Splunk environment.

Which MLTK algorithms support partial_fit?

The BernoulliNB, Birch, GaussianNB, MLPClassifier, StandardScaler, SGDClassifier, SGDRegressor, and StateSpaceForecast algorithms all support partial_fit, also called incremental fit.

To view this information in a table view, along with information on which algorithms support the fit, apply, and summary commands, see Algorithm support of key ML-SPL commands quick reference.

What are the side effects of the fit and apply commands on my data?

Machine learning commands from the MLTK are very powerful and have a number of automation steps built into them. The fit and apply commands have a number of caveats and features to accelerate your success with machine learning in Splunk. See Using the fit and apply commands.

At the highest level:

  • The fit command produces a learned model based on the behavior of a set of events.
  • The fit command then applies that model to the current set of search results in the search pipeline.
  • The apply command repeats the field selection steps of the fit command.

What the fit command does

  1. Search results are pulled into memory.
  2. The fit command transforms the search results in memory through these data preparation actions:
    1. Discard fields that are null throughout all the events.
    2. Discard non-numeric fields with more than (>) 100 distinct values.
    3. Discard events with any null fields.
    4. Convert non-numeric fields into "dummy variables" by using one-hot encoding.
    5. Convert the prepared data into a numeric matrix representation and run the specified machine learning algorithm to create a model.
  3. Apply the model to the prepared data and produce new (predicted) columns.
  4. The learned model is encoded and saved as a knowledge object.
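
Because the fit command discards events that contain null fields, it can help to handle nulls explicitly before training so that you control what is kept. The following is a minimal sketch, assuming a hypothetical lookup my_training_data.csv with hypothetical numeric fields tenure and monthly_charges, a hypothetical categorical field plan_type, and a hypothetical target field churned. The fill values are illustrative only and should be chosen to suit your data.

```spl
| inputlookup my_training_data.csv
| fillnull value=0 tenure monthly_charges
| fillnull value="unknown" plan_type
| fields churned tenure monthly_charges plan_type
| fit LogisticRegression churned from tenure monthly_charges plan_type into my_churn_model
```

In this sketch, plan_type is a non-numeric field, so the fit command would convert it to dummy variables through one-hot encoding, provided it has no more than 100 distinct values.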

What the apply command does

  1. Load the learned model.
  2. The apply command transforms the search results in memory through these data preparation actions:
    1. Discard fields that are null throughout all the events.
    2. Discard non-numeric fields with more than (>) 100 distinct values.
    3. Convert non-numeric fields into "dummy variables" by using one-hot encoding.
    4. Discard dummy variables that are not present in the learned model.
    5. Fill missing dummy variables with zeros.
    6. Convert the prepared data into a numeric matrix representation.
  3. Apply the model to the prepared data and produce new (predicted) columns.

How do you nest multiple uses of the score command?

For the time being you will need to nest your score commands. Follow a pattern such as in this example with your own data.

| inputlookup track_day.csv 
| sample partitions=100 seed=1234 
| search partition_number > 70 
| apply example_vehicle_type as DT_prediction probabilities=true
| multireport 
[| score confusion_matrix vehicleType against DT_prediction] 
[| score roc_auc_score vehicleType against "probability(vehicleType=2013 Audi RS5)" pos_label="2013 Audi RS5"]

Model management

How often should I run fit to retrain models?

In general, you are unlikely to need to run a fit search to update a specific set of models in production more often than once a day. When you are exploring and experimenting, you may be running fit more frequently to iteratively create your production machine learning solutions.

Consider the following factors:

  • How often does the overall behavior of your data change significantly?
  • How resource-intensive is your base search before the machine learning commands? For example, are you loading 3 billion events over a 30-day window, with the search taking 45 minutes to load before the first fit command is called?
  • How computationally intensive are your selected algorithms? Remember to check out the ML SPL performance app!

Consider accelerating your base search, perhaps using Data Models or Summary Indexes, to speed up the base search!
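
As a rough sketch of retraining on top of a faster base search, the following search uses tstats to read hourly event counts from a data model and then fits a model on them. The data model name My_Data_Model and the model name my_hourly_count_model are hypothetical placeholders, and the choice of DensityFunction assumes that algorithm is available in your MLTK version. The speed benefit assumes the data model is accelerated.

```spl
| tstats count FROM datamodel=My_Data_Model BY _time span=1h
| fit DensityFunction count into my_hourly_count_model
```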

How do I manage version control for my model files?

The MLTK does not currently keep model history, but it does store Experiment history, which lets you reload any saved change made to your Experiment.

If you want to version your models, remember they are just Splunk lookup objects and follow all the rules of Splunk knowledge objects. You can rename your models just like any other lookup object.

Renaming in this context applies only to models created outside of the Experiment Management Framework, not to models created within an Experiment.

How do I move my model files from one Splunk instance to another?

Model files are lookups in Splunk and follow all the rules for lookups. You can find the files on disk and, with command line access, move them to another Splunk instance. For more information, see the documentation on namespacing and permissions of lookups.

How do I access Classic Assistant history?

In versions of the MLTK including version 3.2.0, you could access data using a Load Existing Settings UI. That UI is not present in more recent versions of the MLTK, but the data is still available. You can try these steps to retrieve and access your older settings.

  1. In the Search bar, enter | kvstorelookup collection_name=<collection_name>.
  2. Replace <collection_name> with the correct name for the Assistant you used.
  3. Valid collection names are linear_regression_history, classification_history, categorical_outlier_detection_history, clustering_history, and forecast_history.

How do you feed data from an existing Splunk data model into the Machine Learning Toolkit?

This is done in the same way you search for data in a Data Model anywhere else in Splunk.

For example: | datamodel network_traffic search | search tag=destination

Remember that any data that can be retrieved by a Splunk search can be used with the Machine Learning Toolkit, including data from indexes or third-party data sources. You simply append the applicable | fit ... or | apply ... command to that search.

Last modified on 16 September, 2021

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.3.0
