Splunk® Machine Learning Toolkit

User Guide


Preprocessing Methods

Preprocessing methods help transform your machine data into fields amenable to modeling or visualization. These include algorithms for reducing the number of fields, producing numeric fields from unstructured text, or rescaling numeric fields.

Three assistants in the Splunk Machine Learning Toolkit include an option to perform preprocessing of your data:

  • Predict Numeric Fields
  • Predict Categorical Fields
  • Cluster Numeric Events

In each of these assistants, there are five preprocessing algorithm options:

  • FieldSelector
  • KernelPCA
  • PCA
  • StandardScaler
  • TFIDF

Choose the algorithm(s) that are appropriate for your data and your goals.

The transformed data is then passed to the model step.
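
Each preprocessing step corresponds under the hood to an MLTK fit command whose output is piped into the next step. The following is a minimal sketch of that pattern, assuming a hypothetical lookup my_data.csv with numeric fields field_a and field_b:

  | inputlookup my_data.csv
  | fit StandardScaler field_a field_b
  | fit PCA SS_field_a SS_field_b k=1

The StandardScaler step creates the SS_-prefixed fields that the PCA step then consumes, illustrating how each step builds on the fields generated by the previous one.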

Apply preprocessing to your data

Preprocessing methods are available through the MLTK assistants. Follow the steps below:

  1. Log in to the MLTK app and choose Experiments from the navigation bar. Choose Create New Experiment from the top right.
  2. From the resulting modal, choose the Experiment Type (assistant) of Predict Numeric Fields, Predict Categorical Fields, or Cluster Numeric Events.
  3. Add a title, and optionally add a description. Click Create when ready.
  4. You land on the Experiments Settings tab. Enter a search to bring in your data.
  5. Under the Preprocessing Steps section, click +Add a step.
  6. From the Preprocess method dropdown menu, choose FieldSelector, KernelPCA, PCA, StandardScaler, or TFIDF. Fill in the fields related to the selected method.
  7. Click Apply to perform the specified preprocessing.
  8. Click Preview Results to see a table with the results of the preprocessing, including any newly created fields.
  9. Optionally add more than one preprocessing step by clicking +Add a step again in the Preprocessing Steps section of the page. The fields available for each preprocessing step include new fields or settings generated by previous preprocessing steps.
    Additional preprocessing steps apply sequential transformations. After each preprocessing step is applied, review the results to see whether you have obtained the desired output.
    If you are not satisfied with the results, you can:
    • Remove the preprocessing step
    • Modify the settings in the preprocessing step
    • Add further preprocessing steps to apply more transformations

    After a preprocessing step has been applied, it can be modified as long as no other steps have been added. Once another preprocessing step has been added, previous steps cannot be modified. Steps can be removed, but removing a step also removes all subsequent preprocessing steps as well as any fields that have been selected in later sections of the assistant for model training and fitting.

    Preprocessing steps, as well as other experiment settings, are saved in the Experiments History tab.

  10. If you are satisfied with the preprocessing results, continue through the remaining experiment sections to fit the model and save the experiment.
    When you click the fit button, it fits both that model and any preprocessing steps. When you save the main model, the preprocessing models are saved as well. If you are satisfied with the results of the preprocessing, you can use the fields created during preprocessing for further training and fitting of the model.
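
    Conceptually, fitting and saving are equivalent to chained fit ... into commands in SPL. A sketch, with hypothetical field and model names:

      | inputlookup my_data.csv
      | fit StandardScaler field_a field_b into example_scaler
      | fit LinearRegression target from SS_field_a SS_field_b into example_model

    The saved models can later be applied to new data in the same order with the apply command:

      | inputlookup new_data.csv
      | apply example_scaler
      | apply example_model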


About the preprocessing algorithms

When you use the Predict Numeric Fields, Predict Categorical Fields, or Cluster Numeric Events assistants in MLTK, you have the option to apply one or more preprocessing algorithms. Choose the algorithm that best suits the preprocessing needs of your data. Refer to the descriptions below to help guide your decision:

FieldSelector: The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect algorithm to select the best predictor fields based on univariate statistical tests.

KernelPCA: The KernelPCA algorithm uses the scikit-learn KernelPCA algorithm to reduce the number of fields by extracting new uncorrelated features from the data. It is strongly recommended that you standardize fields using StandardScaler before using the KernelPCA method. Use KernelPCA or PCA to reduce the number of dimensions and increase performance, or for visualization purposes, for example, to project data into 2D in order to display a scatterplot chart.

PCA: The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features from the data. Use PCA or KernelPCA to reduce the number of dimensions and increase performance, or for visualization purposes, for example, to project data into 2D in order to display a scatterplot chart.

StandardScaler: The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize numeric fields by centering them about the mean, rescaling them to have a standard deviation of one, or both. This standardization helps prevent one or more fields from dominating others in subsequent machine learning algorithms, and is useful when fields have very different scales.

TFIDF: The TFIDF algorithm converts fields containing raw, free-form text into numeric fields amenable to machine learning by selecting N-grams, which are groups of N consecutive strings (or terms), and weighting them. For example, running TFIDF on a field containing email subjects might select the bi-gram "project proposal" and create a field indicating the weighted frequency of that bi-gram in each subject.

Fields specific to each preprocessing algorithm

FieldSelector

The FieldSelector preprocessing method provides the following fields:

  1. Select the field to predict: Make a single selection from the dropdown menu.
  2. Select the predictor fields: Click to choose one at a time, or choose to select all.
  3. Type: Specify whether the field to predict is categorical or numeric.
  4. Mode: Select the mode for field selection.
    • Percentile: Select a percentage of fields with the highest scores.
    • K-best: Select the k fields with the highest scores.
    • False positive rate: Select fields with p-values below alpha based on a False Positive Rate (FPR) test.
    • False discovery rate: Select fields with p-values below alpha for an estimated False Discovery Rate (FDR) test.
    • Family-wise error rate: Select fields with p-values below alpha based on Family-Wise Error (FWE) rate.
  5. Percent: Enter the percentage of features to return.
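
These settings correspond to options of the FieldSelector algorithm in the fit command. A sketch, assuming a hypothetical numeric field to predict named target and candidate predictors field_a, field_b, and field_c:

  | inputlookup my_data.csv
  | fit FieldSelector type=numeric mode=percentile param=50 target from field_a field_b field_c

Here mode=percentile with param=50 keeps the top 50 percent of predictor fields by score.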

KernelPCA

The KernelPCA preprocessing method provides the following fields:

If you select KernelPCA as your preprocessing method, the processed fields are renamed with the prefix PC_, for example, PC_1 and PC_2.

  1. Select the fields to preprocess: Click to choose one at a time, or choose to select all.
  2. K (# of Components): Specify the number of principal components. K new fields will be created with the prefix "PC_".
  3. Gamma: Kernel coefficient for the rbf kernel.
  4. Tolerance: Convergence tolerance. If 0, an optimal value will be chosen using arpack.
  5. Max iteration: Maximum number of iterations. If not specified, an optimal value will be chosen using arpack.
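
These settings correspond to options of the KernelPCA algorithm in the fit command. A sketch with hypothetical fields, standardizing first as recommended:

  | inputlookup my_data.csv
  | fit StandardScaler field_a field_b field_c
  | fit KernelPCA SS_field_a SS_field_b SS_field_c k=2 gamma=0.001

This reduces the three standardized fields to two new fields, PC_1 and PC_2.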

PCA

The PCA preprocessing method provides the following fields:

If you select PCA as your preprocessing method, the processed fields are renamed with the prefix PC_, for example, PC_1 and PC_2.

  1. Select the fields to preprocess: Click to choose one at a time, or choose to select all.
  2. K (# of Components): Specify the number of principal components. K new fields will be created with the prefix "PC_".
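
These settings correspond to options of the PCA algorithm in the fit command. A sketch with hypothetical fields:

  | inputlookup my_data.csv
  | fit PCA field_a field_b field_c k=2

This reduces the three input fields to two new fields, PC_1 and PC_2.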

StandardScaler

The StandardScaler preprocessing method provides the following fields:

Fields processed using StandardScaler are prefixed with SS_. For example, if you select StandardScaler as the preprocessing method and choose the crime_rate field for preprocessing, the standardized field is named SS_crime_rate.

  1. Select the fields to preprocess: For each selected field, a new field will be created with the prefix "SS_".
  2. Standardize Fields: Select whether to center values with respect to the mean, scale them with respect to the standard deviation, or both.
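
Using the crime_rate example above, a sketch of the corresponding fit command, assuming the hypothetical lookup my_data.csv from the earlier sketches:

  | inputlookup my_data.csv
  | fit StandardScaler crime_rate with_mean=true with_std=true

The with_mean and with_std options correspond to the centering and scaling choices under Standardize Fields, and the output field is SS_crime_rate.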

TFIDF

The TFIDF preprocessing method provides the following fields:

  1. Select the field to preprocess: Make a single selection from the dropdown menu.
  2. Max features: Build a vocabulary that only considers the top K features ordered by term frequency.
  3. Max document frequency: Ignore terms that have a document frequency strictly higher than the given threshold.
    This field supports one of the following value types:
    • integer - absolute count
    • float - a frequency of documents (between 0 and 1)
  4. Min document frequency (cut-off): Ignore terms that have a document frequency strictly lower than the given threshold.
    This field supports one of the following value types:
    • integer - absolute count
    • float - a frequency of documents (between 0 and 1)
  5. N-gram range: The lower and upper boundary of the range of N-values for different N-grams to be extracted.
  6. Analyzer: Whether the feature should be made of word or character N-grams. Defaults to word. Choose "char" to treat each letter like a word, resulting in sub-strings of N consecutive characters, including spaces.
  7. Norm: Norm used to normalize term vectors.
  8. Token pattern: Regular expression denoting what constitutes a "token".
  9. Stop words: The list of words to omit from the analysis. Stop words typically include common words such as "the" or "an".
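
These settings correspond to options of the TFIDF algorithm in the fit command. A sketch, assuming a hypothetical lookup email_data.csv with a free-form text field named subject:

  | inputlookup email_data.csv
  | fit TFIDF subject max_features=100 ngram_range=1-2 stop_words=english

This extracts up to 100 weighted uni-gram and bi-gram features from the subject field, ignoring common English stop words.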