Splunk® Machine Learning Toolkit

User Guide


Smart Clustering Assistant

The Smart Clustering Assistant enables machine learning outcomes for users with little to no SPL knowledge. Compared with the Clustering Numeric Events Assistant, the Smart Clustering Assistant shows more clustering outcomes and makes those outcomes easier to understand.

Introduced in version 5.1.0 of the Machine Learning Toolkit, this new Assistant is built on the backbone of the Experiment Management Framework (EMF), offering enhanced clustering abilities. The Smart Clustering Assistant offers a segmented, guided workflow with an updated user interface. Move through the stages of Define, Learn, Review, and Operationalize to load data, build your model, and put that model into production. Each stage offers a data preview and visualization panel.

This Assistant uses the K-means algorithm. The fit command persists a K-means model that you can then use with the apply command. K-means is computationally faster than most other clustering algorithms. To learn more about the Smart Clustering Assistant algorithm, see K-means algorithm.
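
As a rough illustration of the fit-then-apply pattern described above, the following Python sketch trains a small K-means model and then applies it to new points. It is not the MLTK implementation. The function names fit_kmeans and apply_kmeans, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for this example.

```python
import math


def nearest(centroids, point):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def fit_kmeans(points, k, max_iterations=100):
    """Learn k centroids with Lloyd's algorithm and return them as the model."""
    # Simplification for this sketch: seed with the first k points.
    # Production libraries typically use a smarter initialization.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # Update step: move each centroid to the mean of its group.
        updated = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            for group, old in zip(groups, centroids)
        ]
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return centroids


def apply_kmeans(model, points):
    """Assign each point to a cluster using a previously fitted model."""
    return [nearest(model, p) for p in points]


# Hypothetical two-field sample data with two obvious groups.
training = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
model = fit_kmeans(training, k=2)                       # comparable to the fit step
print(apply_kmeans(model, [(0.5, 0.5), (9.0, 9.0)]))    # comparable to the apply step
```

The fitted model here is only the list of centroids, which is why it can be saved once and reused on new data without retraining.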

Smart Clustering Assistant Showcase

You can become familiar with this new Assistant through the MLTK Showcase, accessed under its own tab. The Smart Clustering Showcase examples include:

  • Cluster Houses by Property Descriptions
  • Cluster Mortgage Loans

This image shows the landing page for the Machine Learning Toolkit Showcase page. The Cluster Events option is highlighted and pointing to the available examples for the Smart Clustering Assistant.

Smart Clustering Assistant Showcases require you to click through to continue the demonstration. Showcases do not include the final stage of the Assistant workflow to Operationalize the model.

Smart Clustering Assistant workflow

Move through the stages of Define, Learn, Review, and Operationalize to draw in data, build your model, and put that model into production.

This example workflow uses the housing.csv dataset that ships with the MLTK. You can use this dataset or another of your choice to explore the Smart Clustering Assistant and its features before building a model with your own data.

To begin, select Smart Clustering from the Experiments landing page, then click the Create New Experiment button in the top right.

This image shows the Machine Learning Toolkit and the view under the Experiments tab. The Experiment types are displayed from which a user can create a new Experiment of that type. The new Experiment type of Smart Clustering Assistant is highlighted.

Enter an Experiment Title, and optionally add a Description. Click Create to move into the Assistant interface.

This image shows the modal window that opens after you click the Create New Experiment button. Fields are filled in for Experiment Title and Description and a button labeled Create is highlighted in the bottom right corner of the modal window.

Define

Use the Define stage to select and preview the data you want to cluster. You can pull in data from anywhere in the Splunk platform. You can use the Search bar to modify your dataset before you use that data in the Learn stage.

This image shows the Define stage of the Assistant. The Search bar is highlighted and contains an inputlookup for the housing dataset.

As an alternative to accessing data via Search, you can choose the Datasets option. Under Datasets, you can find any data you have ingested into Splunk, as well as any datasets that ship with Splunk Enterprise and the Machine Learning Toolkit. You can filter by type to find your preferred data faster.

This image shows the Define stage of the Assistant. This view shows Datasets, the alternate option for getting data into the Assistant. The View Datasets menu is open with the housing dataset selected from the list.

As with other Experiment Assistants, the Smart Clustering Assistant includes a time-range picker to narrow down the data time-frame to a particular date or date range. The default setting of All time can be changed to suit your needs. Once data is selected, the Data Preview and Visualization tabs populate.

This image shows the Define stage of the Assistant. The menu option to change the default time range for the data from All time to another preset time frame or a custom time frame is open.

When you are finished selecting your data, click Next in the top right, or Learn from the left hand menu to move on to the next stage of the Assistant.

This image shows the Define stage of the Assistant. The left hand side menu option of Learn is highlighted. The green button labeled Next in the top right corner of the page is also highlighted.

Learn

Use the Learn stage to build your clustering model. The Learn stage includes sections from which you can see the ingested data, add one or more data preprocessing steps, and pick the fields to cluster and number of clusters to generate.

You can use the +Add preprocessing section to select one of three preprocessing algorithms. For more detailed information on data preprocessing, see Getting your data ready for machine learning.

This image shows the Learn stage of the Assistant. The section for Add preprocessing step is selected showing the three options of PCA, Kernel PCA, and Standard Scaler.

Refer to the following table for information on the preprocessing algorithm options and the available fields for each of those algorithms.

Preprocessing algorithm | Field name | Field description
PCA | Fields to preprocess | Select the fields to preprocess.
PCA | Number of new fields to create | Optional field. Specify the number of principal components. K new fields will be created with the prefix "PC_".
KernelPCA | Fields to preprocess | Select the fields to preprocess.
KernelPCA | Number of new fields to create | Optional field. Specify the number of principal components. K new fields will be created with the prefix "PC_".
KernelPCA | Gamma | Optional field. Kernel coefficient for the rbf kernel.
KernelPCA | Tolerance | Optional field. Convergence tolerance. If 0, an optimal value is chosen using arpack.
KernelPCA | Max number of iterations | Optional field. If not specified, an optimal value is chosen using arpack.
StandardScaler | Fields to preprocess | Select the fields to preprocess. For each selected field a new field will be created with the prefix "SS_".
StandardScaler | Standardize Fields | Select whether to center values with respect to the mean, scale them with respect to the standard deviation, or both.
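
To make the StandardScaler options more concrete, here is a minimal Python sketch of centering and scaling a field and writing the result to a new field with the "SS_" prefix. It is an illustration only, not the MLTK code. The function name standard_scale, the with_mean and with_std parameters, the "rooms" field, and the use of the population standard deviation are assumptions made for this example.

```python
import statistics


def standard_scale(rows, fields, with_mean=True, with_std=True):
    """Return copies of the rows with an added "SS_<field>" value per field."""
    scaled = [dict(row) for row in rows]
    for field in fields:
        values = [row[field] for row in rows]
        # Center with respect to the mean only when requested.
        mean = statistics.fmean(values) if with_mean else 0.0
        # Scale by the population standard deviation only when requested.
        # A constant field has a deviation of 0, so fall back to 1.0.
        std = (statistics.pstdev(values) or 1.0) if with_std else 1.0
        for row in scaled:
            row["SS_" + field] = (row[field] - mean) / std
    return scaled


# Hypothetical rows with a single numeric field.
rows = [{"rooms": 4.0}, {"rooms": 6.0}, {"rooms": 8.0}]
for row in standard_scale(rows, ["rooms"]):
    print(row)
```

Standardizing matters for clustering because K-means relies on distances, so a field with a large numeric range would otherwise dominate fields with small ranges.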

You can add multiple preprocessing steps depending on your machine learning needs.

Select which fields to cluster from the drop-down list. There is no limit to the number of fields you select.

This image shows the Learn stage of the Smart Clustering Assistant. The drop-down list populated with available fields to cluster is highlighted.

Input the number of clusters to generate and optionally use the Notes field to track parameter adjustments you make to your Smart Clustering Experiment. Refer back to your notes to review which parameter combinations yield the best results. Click Find Clusters when ready.

This image shows the Learn stage of the Smart Clustering Assistant. Input fields for number of clusters and notes are highlighted. The button labeled Find Clusters is also highlighted.

A summary of your selected settings appears at the top of the page, stating how many clusters were generated and from which fields. The Experiment is now in a Draft state, and the View History option is available. View History allows you to track any changes you make in the Learn stage.

This image shows the Learn stage of the Smart Clustering Assistant. A row at the top of the screen is highlighted displaying a plain English summary of the fields selected for clustering and the number of clusters chosen. A button labeled View History is also highlighted.

Click the SPL button to review the Splunk Search Processing Language (SPL) that the Assistant generates for you in the background as you work.

This image shows the resulting modal window from clicking the button labeled SPL. The window displays the Splunk Search Processing Language generated for you as you work through the Assistant.

On the resulting Evaluate tab, view the clustering results in a 2D or 3D scatter plot. Use the X, Y, and Z axis drop-down lists to populate the scatter plots.

This image shows the 3 dimensional scatter plot on the Evaluate tab of the Learn stage. A drop-down list showing the ability to select values for the x, y, and z axis is highlighted.

Examine these scatter plots by hovering over any point, or show and hide clusters by hovering over the cluster numbers in the left-hand legend.

This image shows the Evaluate tab of the Learn stage. The left hand side legend listing the resulting clusters by number is highlighted. A single data point showing specific values is also highlighted.

Clicking a specific data point opens a New Search screen for further data examination.

This image shows the resulting New Search page that's generated when users click on a specific data point within a scatter plot.

The Evaluate tab also generates and displays the Silhouette score for the model. The Silhouette score compares how close each point is to the other points in its own cluster with how close it is to the points in the nearest neighboring cluster. The score ranges from -1 to +1, with a score closer to +1 indicating a better clustering configuration. A negative score could indicate that you have selected the wrong fields to cluster by.

This image shows the Evaluate tab of the Learn stage. The value and helper icon information for the Silhouette score is highlighted.
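
The following Python sketch shows one common way to compute a Silhouette score, using Euclidean distance and the standard per-point formula (b - a) / max(a, b). It is meant to show why the score falls between -1 and +1 and why poor cluster assignments push it negative. It is not taken from the Assistant itself; the function name silhouette_score and the sample points are assumptions made for this example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_distance(point, members):
        return sum(math.dist(point, m) for m in members) / len(members)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a point alone in its cluster contributes 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(point, m) for m in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean_distance(point, members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


# Hypothetical points: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # labels match the groups
print(silhouette_score(points, [0, 1, 0, 1]))  # labels cut across the groups
```

When the labels match the natural groups, each point is much closer to its own cluster than to the other one, so the score approaches +1. When the labels cut across the groups, the score drops below zero.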

When you are happy with your results, click Next in the top right, or Review from the left hand menu to move on to the next stage of the Assistant.

This image shows the Evaluate tab of the Learn stage. The button labeled Next is highlighted.

Review

Use the Review stage to explore the resulting model based on the fields selected at the Learn stage. The Review panels give you the opportunity to assess your clustering results prior to putting the model into production. There are three panels in this stage:

  • Number of Clusters
  • Intercluster Distance Matrix
  • Intracluster Distance Distribution

Use the Number of Clusters panel to review the details of cluster points across all clusters or within an individual cluster. This panel offers two tables of data, one for All Cluster Details and the other for Points in All Clusters. Use the panel filter to change the view from All Clusters to that of any of the generated clusters.

This image shows the Review stage of the Smart Clustering Assistant. The panel labeled Number of Clusters is selected. A drop-down menu from which users can choose to filter the resulting tables of data by all clusters or a specific cluster is highlighted.

Use the Intercluster Distance Matrix panel to examine the relationship between the found clusters. The panel displays the average, maximum, and minimum distances. Scroll down to view and optionally filter a bar graph display of distances between found clusters. Hover over any bar within the graph to see the specific distance value. Use the panel filter to change the view from All Clusters to that of any of the generated clusters.

This image shows the Review stage of the Smart Clustering Assistant. The panel labeled Intercluster Distance Matrix is selected.  One segment of the resulting bar graph is highlighted, showing how data specific to that segment displays when hovered over by a user.
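
To give a sense of what an intercluster distance summary can contain, the following Python sketch computes the distance between every pair of cluster centroids and then reports the average, maximum, and minimum. Measuring from centroid to centroid with Euclidean distance is an assumption made for this example, as are the function names and the sample clusters. The panel itself may calculate its values differently.

```python
import math
from itertools import combinations


def centroid(points):
    """Return the mean position of a list of numeric tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def intercluster_distances(clusters):
    """Map each pair of cluster labels to the distance between their centroids."""
    centers = {label: centroid(members) for label, members in clusters.items()}
    return {
        (a, b): math.dist(centers[a], centers[b])
        for a, b in combinations(sorted(centers), 2)
    }


# Hypothetical clusters keyed by cluster number.
clusters = {
    0: [(0.0, 0.0), (0.0, 2.0)],
    1: [(3.0, 4.0), (3.0, 6.0)],
    2: [(6.0, 0.0), (6.0, 2.0)],
}
distances = intercluster_distances(clusters)
values = list(distances.values())
print("average:", sum(values) / len(values))
print("maximum:", max(values))
print("minimum:", min(values))
```

Clusters whose centroids sit very close together may be candidates for merging, which is one reason to revisit the number of clusters in the Learn stage.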

Use the Intracluster Distance Distribution panel to set a distance from the centroid within a cluster to find outliers. A value must be selected in order to generate a visualization. Enter a value into the open field or use the available slider. Click Set Distance once a value is chosen.

This image shows the Review stage of the Smart Clustering Assistant. The panel labeled Intracluster Distance Distribution is selected. This panel has no value by default until the user sets the distance from centroid using the highlighted slider bar. A visualization is generated once the distance from centroid value is set.

Once a distance from centroid value is set, you can use the panel filter to change the view from All Clusters to that of any of the generated clusters. Within the resulting visualization, hover over the cluster legend to limit the visualization to that cluster, or hover over values within the visualization to view the specific distance value.

This image shows the Review stage of the Smart Clustering Assistant. The panel labeled Intracluster Distance Distribution is selected. A drop-down menu from which users can choose to filter the resulting visualization by all clusters or a specific cluster is highlighted.

Scroll down to see the Outlier Table for All Clusters. Use the icon beside the table name to open this view in a new tab.

This image shows the Review stage of the Smart Clustering Assistant. The panel labeled Intracluster Distance Distribution is selected. An arrow icon beside the resulting table name labeled Outlier Table for All Clusters is highlighted. Clicking this arrow opens this table in a new tab for further analysis.
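
The idea of using a distance from the centroid to find outliers can be sketched in a few lines of Python. The example below flags every point that lies farther from its own cluster centroid than a chosen threshold. It illustrates the concept only and is not the Assistant's implementation; the function name find_outliers, the max_distance parameter, the Euclidean distance measure, and the sample clusters are assumptions made for this example.

```python
import math


def find_outliers(clusters, max_distance):
    """Return (label, point, distance) for points beyond max_distance from their centroid."""
    outliers = []
    for label, members in clusters.items():
        # The centroid is the mean position of the cluster's members.
        center = tuple(sum(axis) / len(members) for axis in zip(*members))
        for point in members:
            distance = math.dist(point, center)
            if distance > max_distance:
                outliers.append((label, point, distance))
    return outliers


# Hypothetical clusters keyed by cluster number.
clusters = {
    0: [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (11.0, 11.0)],
    1: [(20.0, 20.0), (20.0, 22.0)],
}
for label, point, distance in find_outliers(clusters, max_distance=5.0):
    print(label, point, round(distance, 2))
```

Lowering the threshold flags more points as outliers, and raising it flags fewer, which mirrors the effect of moving the slider before clicking Set Distance.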

Navigate back to the Learn stage to make clustering adjustments, or click Save and Next to continue. Clicking Save and Next opens a modal window where you can update the Experiment name or description. When ready, click Save.

This image shows the Review stage of the Smart Clustering Assistant. The Save and Next button is selected and a modal window from which users can save this Smart Clustering Experiment is displayed.

Operationalize

The Operationalize stage provides publishing, alerting, and scheduled training in one place. Click Done to move to the Experiments listings page.

This image shows the Operationalize stage of the Assistant. Options on this page include Publish Outlier Models, Create Alert, Manage Alerts, Schedule Model Training, and View Scheduled Training Jobs. A green button labeled Done in the top right of the page is highlighted.

The Experiments listing page provides a place to publish, set up alerts, and schedule training for any of your saved Experiments across all Assistant types including Smart Clustering.

This image shows the Experiment list view page. The Experiment created in this document is listed, and options to Manage and Publish the Experiment are highlighted.

Learn more

To learn about implementing analytics and data science projects using Splunk's statistics, machine learning, built-in and custom visualization capabilities, see the Splunk for Analytics and Data Science course.

Last modified on 09 April, 2020

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.1.0

