Splunk® Machine Learning Toolkit

User Guide

Cluster Numeric Events

The Cluster Numeric Events assistant partitions events with multiple numeric fields into groups of events based on the values of those fields. The groupings aren't known in advance (that is, the learning is unsupervised).

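Algorithms

The Cluster Numeric Events assistant uses the following algorithms: K-means, DBSCAN, Birch, and Spectral Clustering.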

Workflow

To cluster numeric events, you enter a search to retrieve your data, optionally preprocess selected fields, and then select the clustering algorithm and set its parameters as necessary. The basic steps to cluster events are as follows:

Search

  1. Enter a search to retrieve your data, then click the search button to run it.
    A preview of the data is displayed.

Preprocessing

Preprocessing is optional. It is useful if the data contains a large number of fields or if the fields have very different scales. It is also recommended if clustering the events takes an extremely long time.

To perform preprocessing before clustering, do the following:

  1. Click the Preprocess check box and specify the fields to preprocess before clustering.
  2. Select the preprocessing method to use: StandardScaler, PCA or KernelPCA, or StandardScaler together with PCA or KernelPCA.
    StandardScaler is useful when the fields have very different scales. StandardScaler rescales each field so that its mean is 0 and its standard deviation is 1.
    If you have too many fields, the performance of some algorithms can drop drastically. In this case, use PCA or KernelPCA to reduce the number of dimensions. Specify the number of fields by which to reduce dimensionality.
  3. Click the Preprocess button to perform the specified preprocessing.
    The fields resulting from the preprocessing are displayed in a visualization. You can add or remove fields and click the Visualize button to see how the change affects the visualization. The fields you visualize here need not be the same as those you use for clustering. However, if you'd like to cluster on different fields, you must also change the field selection above.
    Note: Fields processed using StandardScaler are prefixed with "SS_". For example, if the crime_rate field is selected for preprocessing and the Apply StandardScaler box is checked, the standardized field is called SS_crime_rate. If you select PCA or KernelPCA, the processed fields are named "PC_<n>", for example, PC_1 and PC_2.
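
The standardization described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the toolkit's implementation: the function name standard_scale and the sample events are hypothetical, and the sketch assumes the population standard deviation is used. The SS_ prefix follows the naming described in the note.

```python
from statistics import fmean, pstdev


def standard_scale(events, fields):
    """Return copies of the events with an SS_<field> value added per field."""
    scaled = [dict(event) for event in events]
    for field in fields:
        values = [float(event[field]) for event in events]
        mean = fmean(values)
        std = pstdev(values)  # assumption: population standard deviation
        for row, value in zip(scaled, values):
            # A constant field has std 0; map it to 0.0 instead of dividing by zero.
            row["SS_" + field] = (value - mean) / std if std else 0.0
    return scaled


# Hypothetical sample events with two fields on very different scales.
events = [
    {"crime_rate": 2.0, "median_value": 100.0},
    {"crime_rate": 4.0, "median_value": 300.0},
    {"crime_rate": 6.0, "median_value": 500.0},
]
scaled = standard_scale(events, ["crime_rate", "median_value"])
ss_crime = [row["SS_crime_rate"] for row in scaled]
print(ss_crime)
```

After scaling, both SS_ fields have a mean of 0 and a standard deviation of 1, so neither field dominates distance calculations simply because its raw values are larger.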

Clustering

In the Cluster section, do the following:

  1. Select the algorithm to use.
  2. Specify the fields to use. If your data has been preprocessed, you should choose from the preprocessed fields.
  3. For K-means, Birch, and Spectral Clustering, specify the number of clusters to use. For DBSCAN, specify a value between 0 and 1 for eps (the size of the neighborhood). Smaller values of eps result in more clusters.
  4. Name the model if you would like to save it. You must specify a name for the model in order to schedule clustering or schedule an alert. This name and the settings you select are saved in the history in the Load Existing Settings tab.
    Note: You cannot save a model if you use the DBSCAN or Spectral Clustering algorithm.
  5. Click Cluster.
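
The effect of eps mentioned in step 3 can be illustrated with a minimal, pure-Python sketch of the DBSCAN idea. It is not the toolkit's implementation: the function names dbscan and count_clusters, the min_samples parameter, and the sample points are assumptions made for illustration.

```python
from math import dist


def dbscan(points, eps, min_samples=2):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    # Neighborhood of each point: indices within eps, the point itself included.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins it
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


def count_clusters(labels):
    """Count distinct cluster ids, ignoring the noise label -1."""
    return len(set(labels) - {-1})


# Hypothetical points already scaled into the 0-1 range.
points = [(0.10, 0.10), (0.15, 0.10), (0.40, 0.10),
          (0.45, 0.10), (0.90, 0.90), (0.95, 0.90)]
for eps in (0.3, 0.1):
    print(eps, count_clusters(dbscan(points, eps)))
```

On this data, eps=0.3 yields two clusters and eps=0.1 yields three. If eps is made very small, points no longer have any neighbors and are labeled as noise instead of forming additional clusters.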

Interpret and validate

After the numeric events are clustered, review the cluster visualization. The fields included in the visualization are listed. You can add or remove fields, then click Visualize to change the visualization.

You can drag a selection rectangle around some of the points in a plot to see the corresponding points on the other plots.

Note: The visualization displays a maximum of 1000 points, 20 series, and 6 fields (1 label and 5 variables).

Deploy clustering

Once you have created the clusters, you can take the following actions:

  • Click the icon on the right side of the Cluster button to run the clustering on a schedule.
    You can set up a regular interval to run clustering, such as every week. After saving the schedule, you can access it from the Scheduled Jobs > Scheduled Training menu.
    Note: You cannot schedule clustering if you use the DBSCAN or Spectral Clustering algorithms or if you do not specify a name for the model.
  • Click the Open in Search button next to the Cluster button to open a new Search tab, filled out with the search query that was used to fit the model.
  • Click the Show SPL button next to the Cluster button to see the search query that was used for the clustering, annotated with explanatory comments. For example, you could use this same query on a different data set.
  • Click the Schedule Alert button beneath the cluster visualization to set up an alert that triggers when the number of events in a cluster exceeds a threshold you specify. After you save the alert, you can access it from the Scheduled Jobs > Alerts menu. For more information about alerts, see Getting started with alerts in the Splunk Enterprise Alerting Manual.
    Note: Alerts cannot be scheduled if you use the DBSCAN or Spectral Clustering algorithms or if you do not specify a name for the model.
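
The alert condition described above (the number of events in a cluster exceeds a threshold) can be sketched as follows. This is an illustration only: the function name clusters_over_threshold, the sample labels, and the threshold value are hypothetical, and the alert itself is configured through the Schedule Alert button, not through code like this.

```python
from collections import Counter


def clusters_over_threshold(cluster_labels, threshold):
    """Return {cluster: event_count} for clusters with more than threshold events."""
    counts = Counter(cluster_labels)
    return {cluster: n for cluster, n in counts.items() if n > threshold}


# Hypothetical cluster assignment for eight events.
labels = [0, 0, 0, 1, 1, 2, 0, 0]
triggered = clusters_over_threshold(labels, threshold=4)
print(triggered)
```

Here only cluster 0, with five events, exceeds the threshold of 4, so it is the only cluster that would trigger the alert.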
Last modified on 04 April, 2017

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 2.0.1, 2.1.0

