Preprocessing machine data using MLTK Assistants
Preprocessing steps transform your machine data into fields ready for modeling or visualization. Preprocessing steps include algorithms that reduce the number of fields, produce numeric fields from unstructured text, or rescale numeric fields.
This document covers preprocessing in the Experiments and Classic Assistant context. For information on all the available preprocessing algorithm options to use in the MLTK, please see the preprocessing section of the algorithms document.
When you use the Predict Numeric Fields, Predict Categorical Fields, or Cluster Numeric Events Assistants in the MLTK, you have the option to apply one or more preprocessing algorithms. Choose the algorithm that best suits the preprocessing needs of your data. There are five preprocessing algorithm options: FieldSelector, KernelPCA, PCA, StandardScaler, and TFIDF.
Preprocessing steps are included in the Smart Forecasting Assistant but not covered in this document. For details, see Smart Forecasting Assistant.
Apply preprocessing to your data
The following steps are the same for both Experiment and Classic Assistant workflows:
 Under the Preprocessing Steps section, click +Add a step.
 From the Preprocess method dropdown menu, choose FieldSelector, KernelPCA, PCA, StandardScaler or TFIDF. Fill in the fields for the selected method.
 Click Apply to perform the specified preprocessing.
 Click Preview Results to see a table with the preprocessing results, including any newly created fields.
If: You want to add more than one preprocessing step.
Then: Click +Add a step again in the Preprocessing Steps section. The new step includes any fields or settings generated by previous preprocessing steps, so additional steps apply sequential transformations. After you apply each preprocessing step, review the results against your data. You can modify a preprocessing step as long as no other steps have been added after it. Once you add another preprocessing step, previous steps cannot be modified. You can remove a preprocessing step, but doing so also removes all subsequent preprocessing steps, as well as any fields selected in later sections of the Assistant for model training and fitting.
If: You are satisfied with the preprocessing results.
Then: Continue through the remaining Assistant sections to complete training and testing the model. Clicking the fit button fits both the model and any preprocessing steps. When you save the main model (available in the Experiment workflows), the preprocessing models are saved as well.
If: You are not satisfied with the preprocessing results.
Then: Remove the preprocessing step, modify the settings in the preprocessing step, or add more preprocessing steps to apply more data transformations.
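The rules for modifying and removing steps can be pictured with a small toy model. This is only an illustrative sketch in Python, not MLTK code: the PreprocessingChain class, its method names, and the tax_rate field are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PreprocessingChain:
    """Toy model of an ordered list of preprocessing steps (hypothetical, not MLTK code)."""
    steps: list = field(default_factory=list)

    def add_step(self, method, input_fields):
        # Adding a step locks every earlier step: only the last one stays editable.
        self.steps.append({"method": method, "fields": list(input_fields)})

    def can_modify(self, index):
        # A step can be modified only while no later step exists.
        return index == len(self.steps) - 1

    def remove_step(self, index):
        # Removing a step also discards every step that was added after it.
        removed = self.steps[index:]
        self.steps = self.steps[:index]
        return removed


chain = PreprocessingChain()
chain.add_step("StandardScaler", ["crime_rate", "tax_rate"])
chain.add_step("PCA", ["SS_crime_rate", "SS_tax_rate"])   # uses fields from step 1
chain.add_step("FieldSelector", ["PC_1", "PC_2"])         # uses fields from step 2

print(chain.can_modify(0))   # the first step is locked
print(chain.can_modify(2))   # the last step is still editable
removed = chain.remove_step(1)
print([s["method"] for s in removed])
print([s["method"] for s in chain.steps])
```

In this sketch, removing the PCA step also drops the FieldSelector step that depended on its output fields, which mirrors the behavior described above.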
About the preprocessing algorithms
Choose the algorithm to best suit the preprocessing needs of your data. Refer to this list to guide your decision:
FieldSelector
The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.
KernelPCA
The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting new uncorrelated features out of the data. It is strongly recommended to standardize fields using StandardScaler before using the KernelPCA method.
To increase performance, use the KernelPCA or PCA algorithms to reduce the number of dimensions. KernelPCA and PCA can also be used to reduce the number of dimensions for visualization purposes, for example, to project into 2D in order to display a scatterplot chart.
PCA
The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data. To increase performance, use the PCA or KernelPCA algorithms to reduce the number of dimensions. PCA and KernelPCA can also be used to reduce the number of dimensions for visualization purposes, for example, to project into 2D in order to display a scatterplot chart.
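As a rough illustration of the 2D projection, the following Python sketch calls scikit-learn's PCA directly on synthetic data. The data, the choice to standardize first (carried over from the KernelPCA recommendation), and the way the output names are built are assumptions for this example, not a description of the Assistant's internals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 100 events with 5 numeric fields on very different scales.
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])

# Standardize first so that no single field dominates the components (assumption).
X_scaled = StandardScaler().fit_transform(X)

# Keep two components, for example to draw a 2D scatterplot.
pca = PCA(n_components=2)
projected = pca.fit_transform(X_scaled)

# Build output names in the PC_1, PC_2 style used by the Assistant.
field_names = [f"PC_{i + 1}" for i in range(projected.shape[1])]
print(field_names, projected.shape)

# The extracted components are uncorrelated: the off-diagonal covariance is ~0.
covariance = np.cov(projected, rowvar=False)
print(np.round(covariance, 6))
```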
StandardScaler
The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize the data fields by scaling their mean and standard deviation to 0 and 1, respectively. This standardization helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms.
StandardScaler is useful when the fields have very different scales. StandardScaler standardizes numeric fields by centering about the mean, rescaling to have a standard deviation of one, or both.
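The arithmetic behind standardization is small enough to sketch with the Python standard library. The standardize function, its with_mean and with_std flags, and the sample values below are illustrative assumptions; the population standard deviation is used, which matches how scikit-learn's StandardScaler computes its scale.

```python
from statistics import fmean, pstdev


def standardize(values, with_mean=True, with_std=True):
    """Center and/or rescale one numeric field (population standard deviation)."""
    mean = fmean(values) if with_mean else 0.0
    std = pstdev(values) if with_std else 1.0
    if std == 0:
        std = 1.0  # constant field: avoid division by zero
    return [(v - mean) / std for v in values]


crime_rate = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values: mean 5, std 2

event_fields = {"crime_rate": crime_rate}
# The standardized copy of the field gets the SS_ prefix.
event_fields["SS_crime_rate"] = standardize(crime_rate)

print(event_fields["SS_crime_rate"])
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Passing with_std=False only centers the values, and with_mean=False only rescales them, which corresponds to the "mean", "standard deviation", or "both" choice.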
TFIDF
The TFIDF algorithm converts raw text into numeric fields, making it possible to use that data with other machine learning algorithms. The TFIDF algorithm selects N-grams, which are groups of N consecutive strings (or terms), from fields containing free-form text and converts them into numeric fields amenable to machine learning. For example, running TFIDF on a field containing email Subjects might select the bigram 'project proposal' and create a field indicating the weighted frequency of that bigram in each Subject.
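To make the bigram example concrete, here is a minimal Python sketch that weights word bigrams by term frequency times a smoothed inverse document frequency. The helper names, the sample subjects, and the exact weighting formula (one common smoothed variant, without length normalization) are assumptions for illustration, not necessarily what the MLTK computes.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the word N-grams of a lowercased, whitespace-split string."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tfidf(documents, n=2):
    """Weight each N-gram by term frequency times a smoothed inverse document frequency."""
    counts = [Counter(ngrams(doc, n)) for doc in documents]
    # Number of documents in which each N-gram appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    total = len(documents)
    weights = []
    for c in counts:
        row = {
            term: tf * (math.log((1 + total) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        weights.append(row)
    return weights


subjects = [
    "project proposal for review",
    "updated project proposal",
    "lunch on friday",
]
rows = tfidf(subjects)
for subject, row in zip(subjects, rows):
    # One numeric value per subject for the bigram 'project proposal'.
    print(subject, "->", round(row.get("project proposal", 0.0), 3))
```

A bigram that appears in fewer subjects, such as 'for review', receives a higher weight than 'project proposal', which appears in two of the three.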
Fields specific to each preprocessing algorithm
Each of the preprocessing algorithms has its own unique fields and field order. Within the MLTK, hover over any field to see more information, or review the content below.
FieldSelector
 Select the field to predict: Make a single selection from the dropdown menu.
 Select the predictor fields: Click to choose one at a time, or choose to select all.
 Type: Select whether the field to predict is categorical or numeric.
 Mode: Select the mode for field selection:
   Percentile: Select a percentage of fields with the highest scores.
   K-best: Select the K fields with the highest scores.
   False positive rate: Select fields with p-values below alpha based on a false positive rate (FPR) test.
   False discovery rate: Select fields with p-values below alpha for an estimated false discovery rate (FDR).
   Family-wise error rate: Select fields with p-values below alpha based on the family-wise error rate (FWER).
 Percent: Enter the percentage of features to return.
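Because FieldSelector is built on scikit-learn's GenericUnivariateSelect, the modes can be explored directly in Python. The sketch below is a hedged illustration: the mapping from Assistant mode labels to scikit-learn mode strings, the use of f_regression for a numeric field to predict, the parameter values, and the synthetic f0 to f3 fields are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, f_regression

rng = np.random.default_rng(42)
field_names = ["f0", "f1", "f2", "f3"]  # hypothetical predictor fields
X = rng.normal(size=(200, 4))
# The target depends strongly on f0 and f2; f1 and f3 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Assistant mode label -> (scikit-learn mode, example parameter). Assumed mapping.
modes = {
    "Percentile": ("percentile", 50),         # keep the top 50 percent of fields
    "K-best": ("k_best", 2),                  # keep the 2 highest-scoring fields
    "False positive rate": ("fpr", 0.05),     # alpha = 0.05
    "False discovery rate": ("fdr", 0.05),    # alpha = 0.05
    "Family-wise error rate": ("fwe", 0.05),  # alpha = 0.05
}

selected = {}
for label, (mode, param) in modes.items():
    selector = GenericUnivariateSelect(score_func=f_regression, mode=mode, param=param)
    selector.fit(X, y)
    mask = selector.get_support()  # boolean mask of the fields that were kept
    selected[label] = [name for name, keep in zip(field_names, mask) if keep]
    print(f"{label}: {selected[label]}")
```

For a categorical field to predict, a classification score function such as f_classif would take the place of f_regression in this sketch.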
KernelPCA
If you select KernelPCA as your preprocess method, the processed fields are renamed with the prefix PC_, for example, PC_1, PC_2.
 Select the fields to preprocess: Click to choose one at a time, or choose to select all.
 K (# of Components): Specify the number of principal components.
 Gamma: Enter the kernel coefficient for the rbf kernel.
 Tolerance: Enter the convergence tolerance. If 0, an optimal value is chosen using arpack.
 Max iteration: Enter the maximum number of iterations. If not specified, an optimal value is chosen using arpack.
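The KernelPCA fields line up with constructor arguments of scikit-learn's KernelPCA. The following Python sketch shows one plausible mapping; the synthetic data, the gamma value, the explicit arpack solver choice, and the random_state are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))  # synthetic numeric fields

# Standardize first, as recommended before using KernelPCA.
X_scaled = StandardScaler().fit_transform(X)

kpca = KernelPCA(
    n_components=2,         # K (# of Components)
    kernel="rbf",
    gamma=0.25,             # Gamma: kernel coefficient for the rbf kernel (made-up value)
    tol=0,                  # Tolerance: 0 lets arpack choose the value
    max_iter=None,          # Max iteration: None lets arpack choose the value
    eigen_solver="arpack",  # assumption: force the arpack solver
    random_state=0,         # makes the arpack starting vector reproducible
)
components = kpca.fit_transform(X_scaled)

# Build output names in the PC_1, PC_2 style used by the Assistant.
field_names = [f"PC_{i + 1}" for i in range(components.shape[1])]
print(field_names, components.shape)
```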
PCA
If you select PCA as your preprocess method, the processed fields are renamed with the prefix PC_, for example, PC_1, PC_2.
 Select the fields to preprocess: Click to choose one at a time, or choose to select all.
 K (# of Components): Specify the number of principal components.
StandardScaler
Fields processed using StandardScaler are prefixed with SS_. For example, if you select StandardScaler as the preprocessing method and the crime_rate field for preprocessing, the standardized field is named SS_crime_rate.
 Select the fields to preprocess: Click to choose one at a time, or choose to select all.
 Standardize Fields: Select whether to center values with respect to the mean, scale them with respect to the standard deviation, or both.
TFIDF
 Select the field to preprocess: Make a single selection from the dropdown menu.
 Max features: Build a vocabulary that only considers the top K features ordered by term frequency.
 Max document frequency: Ignore terms that have a document frequency strictly higher than the given threshold.
This field supports one of the following value types:
 Integer: absolute count of documents
 Float: a proportion of documents (between 0 and 1)
 Min document frequency (cutoff): Ignore terms that have a document frequency strictly lower than the given threshold.
This field supports one of the following value types:
 Integer: absolute count of documents
 Float: a proportion of documents (between 0 and 1)
 Ngram range: The lower and upper boundary of the range of N-values for the different N-grams to be extracted.
 Analyzer: Select whether the feature is made of word or character N-grams. The default is word. Choose "char" to treat each letter like a word, resulting in substrings of N consecutive characters, including spaces.
 Norm: Norm used to normalize term vectors.
 Token pattern: Regular expression denoting what constitutes a "token".
 Stop words: Enter any words you want to omit from the analysis. Stop words typically include common words such as "the" or "an".
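The TFIDF fields resemble the parameters of scikit-learn's TfidfVectorizer, so their effect can be tried out in Python. Treat this as a sketch under assumptions: the source does not state which class the MLTK uses for TFIDF, and the sample subjects and parameter values are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

subjects = [
    "Project proposal for review",
    "Updated project proposal",
    "The project kickoff agenda",
    "Lunch on Friday",
]

vectorizer = TfidfVectorizer(
    max_features=10,                 # Max features: keep at most the top 10 terms
    max_df=0.9,                      # Max document frequency (float: proportion)
    min_df=2,                        # Min document frequency (integer: absolute count)
    ngram_range=(1, 2),              # Ngram range: unigrams and bigrams
    analyzer="word",                 # Analyzer: word N-grams
    norm="l2",                       # Norm: scale each row to unit length
    token_pattern=r"(?u)\b\w\w+\b",  # Token pattern: scikit-learn's default
    stop_words=["the", "an"],        # Stop words to omit
)
matrix = vectorizer.fit_transform(subjects)

# Column order follows the alphabetically sorted vocabulary.
terms = sorted(vectorizer.vocabulary_)
print(terms)
print(matrix.toarray().round(3))
```

With min_df=2, only terms that occur in at least two subjects survive, so the last subject ends up with a row of zeros.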
This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 4.4.0, 4.4.1, 4.4.2, 4.5.0, 5.0.0, 5.1.0