Splunk® Machine Learning Toolkit

User Guide


Preprocessing your data using MLTK Assistants

Guided modeling Assistants bring all aspects of a monitored machine learning pipeline into one interface. Preprocessing data can be an important part of a machine learning workflow as these steps can transform your data into fields that are more suitable for modeling or visualization.

This image shows the view under the Experiments tab of the MLTK. Each of the available MLTK guided modeling Assistants is displayed along with any models saved using that Assistant.

For information on all the available preprocessing algorithm options you can use outside of the Assistant frameworks, see Algorithms in the Machine Learning Toolkit.

Preprocessing options by MLTK Assistant

All MLTK Smart Assistants and a selection of Experiment Assistants include a data preprocessing option within the guided workflow. As with other aspects of MLTK guided modeling Assistants, any preprocessing steps you take also generate Splunk Search Processing Language (SPL) that you can view using the SPL button within each Assistant.

As you apply one or more preprocessing steps, review the results against your data. You can modify a preprocessing step as long as no other steps have been added. Once you add another preprocessing step, previous steps cannot be modified. Removing a preprocessing step also removes all subsequent preprocessing steps.

Smart Forecasting Assistant

The Smart Forecasting Assistant offers the preprocessing option to join special time entries using a Lookup file at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Forecasting Assistant.

Smart Outlier Detection Assistant

The Smart Outlier Detection Assistant offers the preprocessing option to extract time from data at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Outlier Detection Assistant.

Smart Clustering Assistant

The Smart Clustering Assistant offers the preprocessing algorithm options of PCA, Kernel PCA, and StandardScaler at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Clustering Assistant.

Smart Prediction Assistant

The Smart Prediction Assistant offers the preprocessing algorithm options of PCA, Kernel PCA, and FieldSelector at the Learn stage of the Assistant workflow.

To learn more about this Assistant, see Smart Prediction Assistant.

Predict Numeric Fields Experiment Assistant

The Predict Numeric Fields Experiment Assistant offers the preprocessing algorithm options of StandardScaler, FieldSelector, PCA, Kernel PCA, and TFIDF in the workflow. The default option is StandardScaler.

To learn more about this Assistant, see Predict Numeric Fields Experiment Assistant.

Predict Categorical Fields Experiment Assistant

The Predict Categorical Fields Experiment Assistant offers the preprocessing algorithm options of StandardScaler, FieldSelector, PCA, Kernel PCA, and TFIDF in the workflow. The default option is StandardScaler.

To learn more about this Assistant, see Predict Categorical Fields Experiment Assistant.

Cluster Numeric Events Experiment Assistant

The Cluster Numeric Events Experiment Assistant offers the preprocessing algorithm options of StandardScaler, FieldSelector, PCA, Kernel PCA, and TFIDF in the workflow. The default option is StandardScaler.

To learn more about this Assistant, see Cluster Numeric Events Experiment Assistant.

About the preprocessing algorithms within MLTK Assistants

Preprocessing algorithms include options that can reduce the number of fields in the data, produce numeric fields from unstructured text, and re-scale numeric fields. Choose the algorithm to best suit the preprocessing needs of your data. Use this section to help guide your decision:

FieldSelector

The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.

To learn more about this algorithm, see FieldSelector.

Field name Field description
Field to predict/ Target variable Make a single selection from the list.
Predictor fields/ Feature variables Select one or more fields from the list.
Type Select whether the data is categorical or numeric.
Mode Select a field selection mode: Percentile, K-best, False positive rate, False discovery rate, or Family-wise error rate.
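The selections in this table map onto arguments of the `fit FieldSelector` command that the Assistant generates. A minimal SPL sketch, assuming a hypothetical lookup file churn.csv with a categorical target field churned and candidate predictor fields calls, minutes, and plan_type, keeping the two best predictors:

```spl
| inputlookup churn.csv
| fit FieldSelector type=categorical mode=k_best param=2 churned from calls minutes plan_type
```

Here `param` supplies the value for the chosen mode, for example the K in K-best or the percentile in Percentile mode.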

KernelPCA

The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of the data. Standardizing fields with StandardScaler before applying KernelPCA is strongly recommended.

Use KernelPCA or PCA to reduce the number of dimensions, which can improve performance. KernelPCA and PCA can also reduce the number of dimensions for visualization purposes, for example projecting into 2D in order to display a scatterplot chart.

To learn more about this algorithm, see KernelPCA.

Field name Field description
Fields to preprocess Select one or more fields.
K # of components/ # new fields to create Specify the number of principal components. K new fields are created with the prefix of "PC_".
Gamma Enter the kernel coefficient for the rbf kernel.
Tolerance Enter the convergence tolerance. If 0, an optimal value is chosen using arpack.
Max iteration Enter the maximum number of iterations. If not specified, an optimal value is chosen using arpack.
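These settings correspond to arguments of the `fit KernelPCA` command. A sketch under assumed names, using hypothetical numeric fields voltage and temperature from a hypothetical lookup sensors.csv, standardized first with StandardScaler (which produces the SS_ prefixed fields):

```spl
| inputlookup sensors.csv
| fit StandardScaler voltage temperature
| fit KernelPCA SS_voltage SS_temperature k=1 gamma=0.001 tolerance=0.000001 max_iteration=1000
```

The result adds a new field with the PC_ prefix that holds the extracted component.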

PCA

The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data. Use PCA or KernelPCA to reduce the number of dimensions, which can improve performance.

PCA and KernelPCA can also reduce the number of dimensions for visualization purposes, for example projecting into 2D in order to display a scatterplot chart.

To learn more about this algorithm, see PCA.

Field name Field description
Fields to preprocess Select one or more fields.
K # of components/ # new fields to create Specify the number of principal components. K new fields are created with the prefix of "PC_".
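As an illustration of the 2D projection use case, the table above maps to a `fit PCA` search such as the following sketch, assuming hypothetical numeric fields voltage, temperature, and pressure in a hypothetical lookup sensors.csv:

```spl
| inputlookup sensors.csv
| fit PCA voltage temperature pressure k=2
| table PC_1 PC_2
```

With k=2, the two new PC_ fields can be charted directly as a scatterplot.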

StandardScaler

The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize the data fields by scaling their mean and standard deviation to 0 and 1, respectively. This standardization helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms.

StandardScaler is useful when the fields have very different scales. StandardScaler standardizes numeric fields by centering about the mean, rescaling to have a standard deviation of one, or both.

To learn more about this algorithm, see StandardScaler.

Field name Field description
Fields to preprocess Select one or more fields. Any new fields are created with the prefix of "SS_".
Standardize fields Specify whether to center values with respect to mean, to scale values with respect to the standard deviation, or both.
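The two options in this table correspond to the with_mean and with_std arguments of the `fit StandardScaler` command. A sketch, again assuming hypothetical fields voltage and temperature in a hypothetical lookup sensors.csv:

```spl
| inputlookup sensors.csv
| fit StandardScaler voltage temperature with_mean=true with_std=true
```

This produces the new fields SS_voltage and SS_temperature, each centered about the mean and rescaled to unit standard deviation.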

TFIDF

The TFIDF algorithm converts raw text into numeric fields, making it possible to use that data with other machine learning algorithms. The TFIDF algorithm selects N-grams, which are groups of N consecutive terms, from fields containing free-form text and converts them into numeric fields amenable to machine learning. For example, running TFIDF on a field containing email Subjects might select the bi-gram 'project proposal' and create a field indicating the weighted frequency of that bi-gram in each Subject.

To learn more about this algorithm, see TFIDF.

Field name Field description
Fields to preprocess Select one or more fields.
Max features Build a vocabulary that only considers the top K features ordered by term frequency.
Max document frequency Ignore terms that have a document frequency strictly higher than the given threshold. Field supports a value type of integer or float.
Min document frequency (cut-off) Ignore terms that have a document frequency strictly lower than the given threshold. Field supports a value type of integer or float.
N-gram range The lower and upper boundary of the range of N-values for different N-grams to be extracted.
Analyzer Select whether the feature is made of word or character N-grams. This field defaults to word. Choose "char" to treat each letter like a word, resulting in sub-strings of N consecutive characters, including spaces.
Norm Norm used to normalize term vectors.
Token pattern Regular expression denoting what constitutes a "token".
Stop words Enter any words you want to omit from the analysis. Stop words typically include common words such as "the" or "an".
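These options map onto arguments of the `fit TFIDF` command. A sketch for the email Subject example above, assuming a hypothetical lookup emails.csv with a text field named subject, extracting unigrams and bi-grams:

```spl
| inputlookup emails.csv
| fit TFIDF subject max_features=100 ngram_range=1-2 analyzer=word norm=l2 stop_words=english
```

Each selected N-gram becomes a new numeric field holding its weighted frequency per event, which downstream algorithms can then consume.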
Last modified on 10 August, 2020

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 5.2.0

