Splunk® Machine Learning Toolkit

User Guide


Preprocessing your data using MLTK Assistants

Guided modeling Assistants bring all aspects of a monitored machine learning pipeline into one interface. Preprocessing data can be an important part of a machine learning workflow as these steps can transform your data into fields that are more suitable for modeling or visualization.

This image shows the view under the Experiments tab of the MLTK. Each of the available MLTK guided modeling Assistants is displayed along with any models saved using that Assistant.

For information on all the available preprocessing algorithm options you can use outside of the Assistant frameworks, see Algorithms in the Machine Learning Toolkit.

Preprocessing options by MLTK Assistant

All MLTK Smart Assistants and a selection of Experiment Assistants include a data preprocessing option within the guided workflow. As with other aspects of MLTK guided modeling Assistants, any preprocessing steps you take also generate Splunk Search Processing Language (SPL) that you can view using the SPL button within each Assistant.

As you apply one or more preprocessing steps, review the results against your data. You can modify a preprocessing step as long as no other steps have been added. Once you add another preprocessing step, previous steps cannot be modified. Removing a preprocessing step also removes all subsequent preprocessing steps.

Smart Forecasting Assistant

The Smart Forecasting Assistant offers the preprocessing option to join special time entries using a Lookup file at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Forecasting Assistant.

Smart Outlier Detection Assistant

The Smart Outlier Detection Assistant offers the preprocessing option to extract time from data at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Outlier Detection Assistant.

Smart Clustering Assistant

The Smart Clustering Assistant offers the preprocessing algorithm options of PCA, Kernel PCA, and StandardScaler at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Clustering Assistant.

Smart Prediction Assistant

The Smart Prediction Assistant offers the preprocessing algorithm options of PCA, Kernel PCA, and FieldSelector at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Prediction Assistant.

Predict Numeric Fields Experiment Assistant

The Predict Numeric Fields Experiment Assistant offers the preprocessing algorithm options of StandardScaler, FieldSelector, PCA, Kernel PCA, and TFIDF in the workflow. The default option is StandardScaler.

To learn more about this Assistant, see Predict Numeric Fields Experiment Assistant.

Predict Categorical Fields Experiment Assistant

The Predict Categorical Fields Experiment Assistant offers the preprocessing algorithm options of StandardScaler, FieldSelector, PCA, Kernel PCA, and TFIDF in the workflow. The default option is StandardScaler.

To learn more about this Assistant, see Predict Categorical Fields Experiment Assistant.

Cluster Numeric Events Experiment Assistant

The Cluster Numeric Events Experiment Assistant offers the preprocessing algorithm options of StandardScaler, FieldSelector, PCA, Kernel PCA, and TFIDF in the workflow. The default option is StandardScaler.

To learn more about this Assistant, see Cluster Numeric Events Experiment Assistant.

About the preprocessing algorithms within MLTK Assistants

Preprocessing algorithms include options that can reduce the number of fields in the data, produce numeric fields from unstructured text, and re-scale numeric fields. Choose the algorithm to best suit the preprocessing needs of your data. Use this section to help guide your decision:

FieldSelector

The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.

To learn more about this algorithm, see FieldSelector.

Field name Field description
Field to predict/ Target variable Make a single selection from the list.
Predictor fields/ Feature variables Select one or more fields from the list.
Type Select whether the data is categorical or numeric.
Mode Select a field selection mode: Percentile, K-best, False positive rate, False discovery rate, or Family-wise error rate.
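The selections in this table map onto arguments of the `fit FieldSelector` command that the Assistant generates. A minimal SPL sketch, assuming a hypothetical lookup file churn.csv with a categorical target field churned and candidate predictor fields calls, minutes, and plan_type, keeping the two best predictors:

```spl
| inputlookup churn.csv
| fit FieldSelector type=categorical mode=k_best param=2 churned from calls minutes plan_type
```

Here `param` supplies the value for the chosen mode, for example the K in K-best or the percentile in Percentile mode.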

KernelPCA

The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of the data. Standardizing fields with StandardScaler before applying KernelPCA is strongly recommended.

Use KernelPCA or PCA to reduce the number of dimensions, which can improve performance. KernelPCA and PCA can also reduce the number of dimensions for visualization purposes, for example projecting into 2D in order to display a scatterplot chart.

To learn more about this algorithm, see KernelPCA.

Field name Field description
Fields to preprocess Select one or more fields.
K # of components/ # new fields to create Specify the number of principal components. K new fields are created with the prefix of "PC_".
Gamma Enter the kernel coefficient for the rbf kernel.
Tolerance Enter the convergence tolerance. If 0, an optimal value is chosen using arpack.
Max iteration Enter the maximum number of iterations. If not specified, an optimal value is chosen using arpack.
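These settings correspond to arguments of the `fit KernelPCA` command. A sketch under assumed names, using hypothetical numeric fields voltage and temperature from a hypothetical lookup sensors.csv, standardized first with StandardScaler (which produces the SS_ prefixed fields):

```spl
| inputlookup sensors.csv
| fit StandardScaler voltage temperature
| fit KernelPCA SS_voltage SS_temperature k=1 gamma=0.001 tolerance=0.000001 max_iteration=1000
```

The result adds a new field with the PC_ prefix that holds the extracted component.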

PCA

The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data. Use PCA or KernelPCA to reduce the number of dimensions, which can improve performance.

PCA and KernelPCA can also reduce the number of dimensions for visualization purposes, for example projecting into 2D in order to display a scatterplot chart.

To learn more about this algorithm, see PCA.

Field name Field description
Fields to preprocess Select one or more fields.
K # of components/ # new fields to create Specify the number of principal components. K new fields are created with the prefix of "PC_".
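As an illustration of the 2D projection use case, the table above maps to a `fit PCA` search such as the following sketch, assuming hypothetical numeric fields voltage, temperature, and pressure in a hypothetical lookup sensors.csv:

```spl
| inputlookup sensors.csv
| fit PCA voltage temperature pressure k=2
| table PC_1 PC_2
```

With k=2, the two new PC_ fields can be charted directly as a scatterplot.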

StandardScaler

The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize the data fields by scaling their mean and standard deviation to 0 and 1, respectively. This standardization helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms.

StandardScaler is useful when the fields have very different scales. StandardScaler standardizes numeric fields by centering about the mean, rescaling to have a standard deviation of one, or both.

To learn more about this algorithm, see StandardScaler.

Field name Field description
Fields to preprocess Select one or more fields. Any new fields are created with the prefix of "SS_".
Standardize fields Specify whether to center values with respect to mean, to scale values with respect to the standard deviation, or both.
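The two options in this table correspond to the with_mean and with_std arguments of the `fit StandardScaler` command. A sketch, again assuming hypothetical fields voltage and temperature in a hypothetical lookup sensors.csv:

```spl
| inputlookup sensors.csv
| fit StandardScaler voltage temperature with_mean=true with_std=true
```

This produces the new fields SS_voltage and SS_temperature, each centered about the mean and rescaled to unit standard deviation.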

TFIDF

The TFIDF algorithm converts raw text into numeric fields, making it possible to use that data with other machine learning algorithms. The TFIDF algorithm selects N-grams, which are groups of N consecutive terms, from fields containing free-form text and converts them into numeric fields amenable to machine learning. For example, running TFIDF on a field containing email Subjects might select the bi-gram 'project proposal' and create a field indicating the weighted frequency of that bi-gram in each Subject.

To learn more about this algorithm, see TFIDF.

Field name Field description
Fields to preprocess Select one or more fields.
Max features Build a vocabulary that only considers the top K features ordered by term frequency.
Max document frequency Ignore terms that have a document frequency strictly higher than the given threshold. Field supports a value type of integer or float.
Min document frequency (cut-off) Ignore terms that have a document frequency strictly lower than the given threshold. Field supports a value type of integer or float.
N-gram range The lower and upper boundary of the range of N-values for different N-grams to be extracted.
Analyzer Select whether the feature is made of word or character N-grams. This field defaults to word. Choose "char" to treat each letter like a word, resulting in sub-strings of N consecutive characters, including spaces.
Norm Norm used to normalize term vectors.
Token pattern Regular expression denoting what constitutes a "token".
Stop words Enter any words you want to omit from the analysis. Stop words typically include common words such as "the" or "an".
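These options map onto arguments of the `fit TFIDF` command. A sketch for the email Subject example above, assuming a hypothetical lookup emails.csv with a text field named subject, extracting unigrams and bi-grams:

```spl
| inputlookup emails.csv
| fit TFIDF subject max_features=100 ngram_range=1-2 analyzer=word norm=l2 stop_words=english
```

Each selected N-gram becomes a new numeric field holding its weighted frequency per event, which downstream algorithms can then consume.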
Last modified on 10 August, 2020

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.2.0

