Preprocessing machine data using MLTK Assistants

Preprocessing steps transform your machine data into fields ready for modeling or visualization. Preprocessing steps include algorithms that reduce the number of fields, produce numeric fields from unstructured text, or re-scale numeric fields.

This document covers preprocessing in the Experiments and Classic Assistant context. For information on all the available preprocessing algorithm options to use in the MLTK, please see the preprocessing section of the algorithms document.

When you use the Predict Numeric Fields, Predict Categorical Fields, or Cluster Numeric Events Assistants in the MLTK, you have the option to apply one or more preprocessing algorithms. Choose the algorithm that best suits the preprocessing needs of your data. There are five preprocessing algorithm options: FieldSelector, KernelPCA, PCA, StandardScaler, and TFIDF.

Preprocessing steps are included in the Smart Forecasting Assistant but not covered in this document. For details, see Smart Forecasting Assistant.

Apply preprocessing to your data

The following steps are the same for both Experiment and Classic Assistant workflows:

1. Under the Preprocessing Steps section, click +Add a step.
2. From the Preprocess method drop-down menu, choose FieldSelector, KernelPCA, PCA, StandardScaler, or TFIDF. Fill in the fields for the selected method.
3. Click Apply to perform the specified preprocessing.
4. Click Preview Results to see a table with the preprocessing results, including any newly created fields.

If	Then
You wish to add more than one preprocessing step:	Click +Add a step again in the Preprocessing Steps section. Each additional preprocessing step includes any new fields or settings generated by previous preprocessing steps, so the transformations are applied sequentially. After you apply each preprocessing step, review the results against your data. You can modify a preprocessing step as long as no other steps have been added after it. Once you add another preprocessing step, previous steps cannot be modified. You can remove a preprocessing step, but doing so also removes all subsequent preprocessing steps as well as any fields selected in later sections of the Assistant for model training and fitting.
You are satisfied with the preprocessing results:	Continue through the remaining Assistant sections to complete training and testing the model. Clicking the `fit` button fits both the model and any preprocessing steps. When you save the main model (available in the Experiment workflows), the preprocessing models are saved as well.
You are not satisfied with the results of your preprocessing:	Try to remove the preprocessing step, modify the settings in the preprocessing step, or add more preprocessing steps to apply more data transformations.
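
Outside the Assistants, the idea of sequential preprocessing steps that are fit together with the model can be sketched with a scikit-learn Pipeline. This is an illustrative analogy only, not the MLTK's internal implementation. The synthetic data, the step names, and the choice of LinearRegression as the final model are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 200 events, 5 numeric fields on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
y = X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Each step consumes the output of the step before it.
pipeline = Pipeline([
    ("scale", StandardScaler()),    # step 1: standardize the fields
    ("pca", PCA(n_components=3)),   # step 2: reduce the standardized fields to 3 components
    ("model", LinearRegression()),  # final model (hypothetical choice)
])

# A single fit call fits the preprocessing steps and the model together.
pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions.shape)
```

Because the PCA step only ever sees the scaled fields, removing the first step changes the input to every later step, which mirrors why removing a preprocessing step in the Assistant also removes the steps that follow it.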

About the preprocessing algorithms

Choose the algorithm to best suit the preprocessing needs of your data. Refer to this list to guide your decision:

FieldSelector
The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.

KernelPCA
The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting new uncorrelated features out of the data. It is strongly recommended to standardize fields using StandardScaler before using the KernelPCA method.
Reducing the number of dimensions with KernelPCA or PCA can improve the performance of subsequent algorithms. KernelPCA and PCA can also be used to reduce the number of dimensions for visualization purposes, for example, to project the data into 2D in order to display a scatterplot chart.

PCA
The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data. Reducing the number of dimensions with PCA or KernelPCA can improve the performance of subsequent algorithms. PCA and KernelPCA can also be used to reduce the number of dimensions for visualization purposes, for example, to project the data into 2D in order to display a scatterplot chart.

StandardScaler
The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize the data fields so that each field has a mean of 0 and a standard deviation of 1. This standardization helps to prevent one or more fields from dominating the others in subsequent machine learning algorithms.
StandardScaler is useful when the fields have very different scales. StandardScaler standardizes numeric fields by centering about the mean, rescaling to have a standard deviation of one, or both.

TFIDF
The TFIDF algorithm converts raw text into numeric fields, making it possible to use that data with other machine learning algorithms. The TFIDF algorithm selects N-grams, which are groups of N consecutive strings (or terms), from fields containing free-form text and converts them into numeric fields amenable to machine learning. For example, running TFIDF on a field containing email Subjects might select the bi-gram 'project proposal' and create a field indicating the weighted frequency of that bi-gram in each Subject.
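
To make the notion of an N-gram concrete, here is a minimal pure-Python sketch of extracting word and character N-grams from a string. The helper names word_ngrams and char_ngrams are hypothetical, and lowercasing plus whitespace splitting are simplifying assumptions. The actual tokenization used by TFIDF is governed by its own settings.

```python
def word_ngrams(text, n):
    """Return groups of n consecutive whitespace-separated terms."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return sub-strings of n consecutive characters, spaces included."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


subject = "Project proposal review"
print(word_ngrams(subject, 2))  # ['project proposal', 'proposal review']
print(char_ngrams("an ox", 3))  # ['an ', 'n o', ' ox']
```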

Fields specific to each preprocessing algorithm

Each of the preprocessing algorithms has its own unique fields and field order. Within the MLTK, hover over any field to see more information, or review the content below.

FieldSelector

Select the field to predict: Make a single selection from the drop-down menu.
Select the predictor fields: Click to choose one at a time, or choose to select all.
Type: Select categorical or numeric as the type of the field to predict.
Mode: Select the mode for field selection.
- Percentile: Select a percentage of fields with the highest scores.
- K-best: Select the K fields with the highest scores.
- False positive rate: Select fields with p-values below alpha based on a false positive rate (FPR) test.
- False discovery rate: Select fields with p-values below alpha for an estimated false discovery rate (FDR) test.
- Family-wise error rate: Select fields with p-values below alpha based on family-wise error rate (FWER).
Percent: Input a percent value of features to return.
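
The selection modes above correspond to the mode values of scikit-learn's GenericUnivariateSelect ("percentile", "k_best", "fpr", "fdr", and "fwe"). The following is a minimal sketch using K-best on synthetic data. The data, the field names, and the use of f_classif as the scoring function for a categorical target are assumptions for illustration; the scoring function the MLTK applies is not stated here.

```python
import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

# Synthetic data (assumption): a binary field to predict, 2 informative
# predictor fields that shift with the target, and 8 pure-noise fields.
rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)
informative = target[:, None] * 3.0 + rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 8))
X = np.hstack([informative, noise])
field_names = [f"field_{i}" for i in range(X.shape[1])]  # hypothetical names

# mode="k_best" keeps the `param` highest-scoring fields.
# Other accepted modes: "percentile", "fpr", "fdr", "fwe".
selector = GenericUnivariateSelect(score_func=f_classif, mode="k_best", param=2)
selector.fit(X, target)

selected = [name for name, keep in zip(field_names, selector.get_support()) if keep]
print(selected)
```

For the percentile mode, param is the percentage of fields to keep; for the fpr, fdr, and fwe modes, param is the alpha threshold applied to the p-values.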

KernelPCA

If you select KernelPCA as your preprocess method, the processed fields are renamed with the prefix PC_, for example, PC_1, PC_2.

Select the fields to preprocess: Click to choose one at a time, or choose to select all.
K (# of Components): Specify the number of principal components.
Gamma: Enter the kernel coefficient for the rbf kernel.
Tolerance: Enter the convergence tolerance. If 0, an optimal value is chosen using arpack.
Max iteration: Enter the maximum number of iterations. If not specified, an optimal value is chosen using arpack.
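
As a rough illustration of how these settings relate to the underlying scikit-learn class, the sketch below standardizes some synthetic fields and then applies KernelPCA with an rbf kernel. The data and the specific values for gamma and the number of components are assumptions; the PC_ column names are built by hand to mirror the naming described above.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 100 events, 4 fields on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([1.0, 50.0, 0.01, 7.0])

# Standardize first, as recommended before KernelPCA.
X_scaled = StandardScaler().fit_transform(X)

kpca = KernelPCA(
    n_components=2,         # K (# of Components)
    kernel="rbf",
    gamma=0.1,              # Gamma: kernel coefficient for the rbf kernel
    eigen_solver="arpack",
    tol=0,                  # Tolerance: 0 lets arpack choose
    max_iter=None,          # Max iteration: None lets arpack choose
    random_state=0,
)
components = kpca.fit_transform(X_scaled)

# Name the outputs PC_1, PC_2 to mirror the renamed fields.
named = {f"PC_{i + 1}": components[:, i] for i in range(components.shape[1])}
print(sorted(named))
```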

PCA

If you select PCA as your preprocess method, the processed fields are renamed PC_, for example: PC_1, PC_2.

Select the fields to preprocess: Click to choose one at a time, or choose to select all.
K (# of Components): Specify the number of principal components.
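
The following sketch shows, on synthetic data, what "extracting new uncorrelated features" means in practice: three strongly correlated input fields are reduced to two principal components whose mutual correlation is effectively zero. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): three fields that are all driven by one signal.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    -base + rng.normal(scale=0.1, size=(200, 1)),
])

pca = PCA(n_components=2)  # K (# of Components)
components = pca.fit_transform(X)

# Correlation between the two new fields (PC_1 and PC_2).
corr = np.corrcoef(components[:, 0], components[:, 1])[0, 1]
print(components.shape, abs(corr) < 1e-8)
```

Two components are also a convenient choice when the goal is a 2D scatterplot of the data.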

StandardScaler

Fields processed using StandardScaler are prefixed with SS_. For example, if you select StandardScaler as the preprocessing method and the crime_rate field for preprocessing, the standardized field is named SS_crime_rate.

Select the fields to preprocess: Click to choose one at a time, or choose to select all.
Standardize Fields: Select whether to center values with respect to the mean, scale them with respect to the standard deviation, or both.
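
The three Standardize Fields choices can be sketched with the with_mean and with_std parameters of scikit-learn's StandardScaler. The crime_rate values below are made up for illustration, and the mapping of the Assistant options to these parameters is an assumption based on the description above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single field (assumption).
crime_rate = np.array([[0.5], [2.0], [3.5], [8.0], [11.0]])

# Center and scale: mean 0, standard deviation 1.
both = StandardScaler(with_mean=True, with_std=True).fit_transform(crime_rate)

# Center only: subtract the mean, keep the original spread.
centered_only = StandardScaler(with_mean=True, with_std=False).fit_transform(crime_rate)

# Scale only: divide by the standard deviation, keep the original offset.
scaled_only = StandardScaler(with_mean=False, with_std=True).fit_transform(crime_rate)

# Mirror the SS_ prefix used for the standardized field.
results = {"SS_crime_rate": both.ravel()}
print(results["SS_crime_rate"].mean().round(6), results["SS_crime_rate"].std().round(6))
```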

TFIDF

Select the field to preprocess: Make a single selection from the drop-down menu.
Max features: Build a vocabulary that only considers the top K features ordered by term frequency.
Max document frequency: Ignore terms that have a document frequency strictly higher than the given threshold.
This field supports one of the following value types:
- Integer: absolute count
- Float: a frequency of documents (between 0 and 1)
Min document frequency (cut-off): Ignore terms that have a document frequency strictly lower than the given threshold.
This field supports one of the following value types:
- Integer: absolute count
- Float: a frequency of documents (between 0 and 1)
N-gram range: The lower and upper boundary of the range of N-values for different N-grams to be extracted.
Analyzer: Select whether the feature is made of word or character N-grams. This field defaults to word. Choose "char" to treat each letter like a word, resulting in sub-strings of N consecutive characters, including spaces.
Norm: Norm used to normalize term vectors.
Token pattern: Regular expression denoting what constitutes a "token".
Stop words: Enter any words you want to omit from the analysis. Stop words typically include common words such as "the" or "an".
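
Most of these fields have similarly named parameters in scikit-learn's TfidfVectorizer. The sketch below is one plausible mapping, not a statement of how the MLTK configures it. The email subjects, the stop-word list, and the parameter values are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up email subjects (assumption).
subjects = [
    "Project proposal for the new dashboard",
    "Re: project proposal feedback",
    "Lunch on Friday",
    "Updated project proposal attached",
]

vectorizer = TfidfVectorizer(
    max_features=10,                 # Max features
    max_df=1.0,                      # Max document frequency (float: proportion of documents)
    min_df=2,                        # Min document frequency (integer: absolute count)
    ngram_range=(1, 2),              # N-gram range: unigrams and bi-grams
    analyzer="word",                 # Analyzer
    norm="l2",                       # Norm
    token_pattern=r"(?u)\b\w\w+\b",  # Token pattern: words of 2 or more characters
    stop_words=["the", "for", "on"], # Stop words
)
tfidf = vectorizer.fit_transform(subjects)

# Only terms that appear in at least 2 subjects survive the min_df cut-off.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['project', 'project proposal', 'proposal']
```

Each surviving term becomes one numeric column, and a subject that contains none of the terms (here, "Lunch on Friday") gets zeros in every column.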
