Splunk® Machine Learning Toolkit

User Guide


Welcome to the Machine Learning Toolkit

The Machine Learning Toolkit (MLTK) is an app available for both Splunk Enterprise and Splunk Cloud users through Splunkbase. The Machine Learning Toolkit acts like an extension to the Splunk platform and includes new Search Processing Language (SPL) search commands, macros, and visualizations. On top of the platform extensions meant for machine learning, the MLTK has guided modeling dashboards called Assistants. Assistants walk you through the process of performing particular analytics.

The Machine Learning Toolkit is not an out-of-the-box solution, but a way to create custom machine learning. You must have domain knowledge, SPL knowledge, Splunk experience, and data science skills or experience to use the toolkit.

Types of machine learning

Machine learning is a process for generalizing from examples. You can use these generalizations, typically called models, to perform a variety of tasks, such as predicting the value of a field, forecasting future values, identifying patterns in data, and detecting anomalies from new (never-before-seen) data. Without data and correct examples, it is difficult for machine learning to work at all.

The machine learning process starts with a question. You might ask one of these questions:

  • Am I being hacked?
  • How hot are the servers?
  • How many visits to my site do I expect in the next hour?
  • What is the price range of houses in a particular neighborhood?

There are different types of machine learning, including:

Regression

Regression modeling predicts a number from a variety of contributing factors. Regression is a predictive analytic. For example, you might have utilization metrics on a machine, such as CPU percentage and the volume of disk reads and writes. You can use regression modeling to predict the amount of power that the machine is likely to draw both now and in the future.

Example

... | fit DecisionTreeRegressor temperature from date_month date_hour into temperature_model

Classification

Classification modeling predicts a category or a class from a number of contributing factors. Classification is a predictive analytic. For example, you might have data on user behavior on a website or within a software product. You can use classification modeling to predict whether that customer is going to churn.

Example

... | fit RandomForestClassifier SLA_violation from * into sla_model

Forecasting

Forecasting is a predictive analytic that predicts the future of a single value moving through time. Forecasting looks at past measurements for a single value, like profit per day or CPU usage per minute, to predict future values. For example, you might have sales results by quarter for the past 5 years. Use forecast modeling to predict sales for the upcoming quarter.

Example

... | fit ARIMA Voltage order=4-0-1 holdback=10 forecast_k=10

Clustering

Clustering groups similar points of data together. For example, you might want to group customers based on their buying behaviors, such as how much they tend to spend or how many items they buy at one time. Use cluster modeling to group events based on the features you specify.

Example

... | fit XMeans * | stats count by cluster

Anomaly detection

Anomaly detection finds outliers in your data by computing an expectation based on one of the machine learning types, comparing that expectation with reality, and triggering an alert when the discrepancy between the two is large.

Example

... | fit OneClassSVM * kernel="poly" nu=0.5 coef0=0.5 gamma=0.5 tol=1 degree=3 shrinking=f into Model_Name

How the toolkit fits into the Splunk platform

Different machine learning options exist in the Splunk platform.

This image shows a graphic representing the three parts of the Splunk platform that include machine learning: Core Platform, Packaged Solutions, and the MLTK.

  • The Splunk platform has machine learning capabilities integrated in the SPL.
  • The Machine Learning Toolkit is an app in the Splunkbase ecosystem that allows you to build custom machine learning solutions for any use case.

The machine learning process

Machine learning is a process for generalizing from examples. Ideally, the machine learning process follows a series of steps, beginning with collecting data and ending with deploying your machine learning model.

This image shows a graphic representing six steps in the machine learning process: collect data, clean/transform, explore/visualize, model, evaluate, and deploy. One arrow leads from step to step in order.

  1. Collect available data like CPU percentages, memory utilization, server temperatures, disk space, or sales values.
  2. Clean and transform that data. All machine learning expects a matrix of numbers as input. If you're collecting data that is missing values, then you need to clean and transform that data until it's in the form machine learning requires.
  3. Explore and visualize the data to make sure it is encoding what you expect it to encode.
  4. Build a model on training data.
  5. Evaluate the model on test data.
  6. Deploy the model on unseen data.
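
As a rough sketch of steps 2 through 5 in SPL, the following pair of searches cleans a dataset, fits a regression model on a random training sample, and then evaluates the predictions. The server_power.csv lookup, the field names, and the power_model model name are illustrative assumptions rather than required defaults.

| inputlookup server_power.csv | fillnull value=0 | sample ratio=0.7 | fit LinearRegression ac_power from total_cpu_utilization into power_model

| inputlookup server_power.csv | apply power_model | rename "predicted(ac_power)" as predicted_power | score r2_score ac_power against predicted_power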

The machine learning process follows those steps in theory, but in practice it's rarely linear:

This image shows the graphic of the six steps in the machine learning process but now arrows are criss-crossed between all steps. There is no longer a clear, linear path from a first step to a final step.

You might evaluate your model, discover that it is not generating the results you expect, and go back to further clean and transform the data. Maybe the data has too many missing values, lacks completeness, is improperly weighted, or has unit disagreement. Iterate around the machine learning process until the model provides the desired outcomes.

The machine learning process can be time consuming and might require different tools, different team members, and context switching. But with the Machine Learning Toolkit, this entire process, from ingesting the data to building reports, can occur inside the Splunk platform.

Layers that make up the Machine Learning Toolkit

The Splunk Machine Learning Toolkit is made up of several layers, each of which aids in the process of building a generalization from your data.

Python for Scientific Computing Library

Access to over 300 open source algorithms.

The MLTK is built on top of the Python for Scientific Computing Library. This ecosystem includes scikit-learn, the most popular machine learning library, as well as other supporting libraries such as NumPy and statsmodels.

ML-SPL API

Extensibility to easily support any algorithm, whether proprietary or open source.

The toolkit includes an extensibility API that lets you expose not only the 300+ algorithms from the Python for Scientific Computing app, but also custom algorithms that you write yourself. You can share and reuse custom algorithms through the Splunk Community for MLTK on GitHub.

ML Commands

New SPL commands to fit, test, score, and operationalize models.

The toolkit uses the Python for Scientific Computing (PSC) app to expose new SPL commands that let you do machine learning. These include commands for building (fit), testing (apply), and validating (score) models, among others.

Algorithms

Over 30 standard algorithms to accommodate both supervised and unsupervised learning.

These SPL commands expose different algorithms. The toolkit ships with more than 30 standard algorithms, covering the most commonly used machine learning techniques.

Showcases

Interactive examples for typical IT, security, business, and IoT use cases.

The toolkit includes many interactive examples from different domains like IoT and business analytics. This is called the Showcase. The Showcase is the initial page you land on when you open the MLTK, and it lets you see different examples from end to end.

Experiments and Assistants

Guided model building, testing, and deployment for common objectives.

On top of the tools for machine learning are guided modeling dashboards called Assistants and a management framework called Experiments. Assistants provide a guided walk-through of the process for performing particular analytics with your own data. The Experiment Management Framework (EMF) brings all aspects of a monitored machine learning pipeline together in one interface, with automated model versioning and lineage.

What is included in the toolkit

The MLTK includes several platform extensions, including a set of macros for computing validation statistics and a set of custom visualizations. The Machine Learning Toolkit provides the key commands fit and apply:

  • fit builds the generalization or model
  • apply lets you use that generalization later
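
For example, one search might build a model and save it under a name, and a later search might apply that saved model to new events. This sketch reuses the sla_model name from the classification example earlier on this page; the events that each search reads from are placeholders.

... | fit RandomForestClassifier SLA_violation from * into sla_model

... | apply sla_model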

Commands

  • fit
  • apply
  • summary
  • listmodels
  • deletemodel
  • sample
  • score

Macros

  • regressionstatistics
  • classificationstatistics
  • classificationreport
  • confusionmatrix
  • forecastviz
  • histogram
  • modvizpredict
  • splitby (1-5)

Visualizations

  • Outliers Chart
  • Forecast Chart
  • Scatter Line Chart
  • Histogram Chart
  • Downsampled Line Chart
  • Scatterplot Matrix

Other commands, including summary, listmodels, and deletemodel, manage or inspect the models you've created. The toolkit also offers the sample command that lets you take random partitions of your data. The score command is available to run statistical tests to validate model outcomes.
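
As a brief sketch, assuming the temperature_model and sla_model names from the earlier examples on this page, those commands might look like the following. The 10 percent sampling ratio is an arbitrary illustration.

| summary temperature_model

... | sample ratio=0.1

| deletemodel sla_model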

This graphic shows the fit, apply, summary, and score commands with their basic syntax as well as an example.

To learn more about the fit and apply commands, see Understanding the fit and apply commands.
To learn more about the score command, see Using the score command.

Introduction to the Assistants

The Assistants guide you through the process of performing an analytic in the toolkit. Assistants use all tools and extensions, including both SPL and ML-SPL. You can click Show SPL at any time within an Assistant to see the SPL.

Image of the Predict Numeric Fields Assistant run on server power data. An arrow points to the action button labeled Show SPL.

Image is a drilled in view of the resulting SPL window generated after clicking the Show SPL button.

The Assistants walk you through the entire machine learning process:

  • Preparing your data. Examples include replacing or computing missing values, and re-scaling fields with unit disagreement.
  • Building the model.
  • Validating the model. For example, when you're building a regression model, the preferred way to validate that kind of model is to look at the R² (R-squared) statistic.
  • Deploying the model. If you're happy with the model and want to ensure it continues to make accurate predictions, you might fit it on a schedule, and then inspect that schedule later.

Each type of machine learning has an accompanying Assistant:

  • Regression: Predict Numeric Fields Assistant
  • Classification: Predict Categorical Fields Assistant
  • Anomaly detection: Detect Numeric Outliers Assistant and Detect Categorical Outliers Assistant
  • Forecasting: Forecast Time Series Assistant and Smart Forecasting Assistant
  • Clustering: Cluster Numeric Events Assistant

Learn more

Continue learning about the MLTK by working through the rest of this user guide.
