Preprocessing

Most machine learning algorithms require input in the form of numeric matrices; most machine data is not initially in that form. Therefore, preprocessing is often required to transform events into a consumable form, while additionally addressing issues like large numbers of fields or numeric fields of wildly differing scales.

The following assistants in the Splunk Machine Learning Toolkit include an option to perform preprocessing of your data:
- Predict Numeric Fields
- Predict Categorical Fields
- Cluster Numeric Events

Preprocessing steps are saved in the Load Existing Settings tab for these assistants.

Preprocessing algorithms

The Machine Learning Toolkit uses the following algorithms to preprocess data:

Algorithm	Description
FieldSelector	The FieldSelector algorithm selects the best predictor fields based on univariate statistical tests. For example, you could use this algorithm to select features that would optimize for reducing the false positive rate.
KernelPCA	The KernelPCA algorithm reduces the number of fields by extracting new uncorrelated features out of the data. Standardizing fields with StandardScaler before using KernelPCA is generally good practice, though it is not required. Use KernelPCA or PCA to reduce the number of dimensions, which can improve performance. Both algorithms can also reduce the number of dimensions for visualization purposes, for example, to project into 2D in order to display a scatterplot chart.
PCA	The PCA algorithm reduces the number of fields by extracting new uncorrelated features out of the data. It is strongly recommended to standardize fields using StandardScaler before using the PCA method. Use PCA or KernelPCA to reduce the number of dimensions, which can improve performance. Both algorithms can also reduce the number of dimensions for visualization purposes, for example, to project into 2D in order to display a scatterplot chart.
StandardScaler	The StandardScaler algorithm standardizes numeric fields by centering them about the mean, rescaling them to have a standard deviation of 1, or both. When both are applied, each field has a mean of 0 and a standard deviation of 1. This standardization helps prevent one or more fields from dominating the others in subsequent machine learning algorithms, and is useful when the fields have very different scales.
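
As a rough illustration of the standardization that StandardScaler performs, the following Python sketch centers and rescales a single numeric field. It is not the toolkit's implementation. The function name standardize and the sample fields are hypothetical, the with_mean and with_std flags mirror the "centering, rescaling, or both" choice described above, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev

def standardize(values, with_mean=True, with_std=True):
    """Center and/or rescale one numeric field (illustrative sketch only)."""
    mean = fmean(values) if with_mean else 0.0
    # Assumption: population standard deviation. A constant field keeps a
    # scale of 1 so that the division below is always defined.
    scale = (pstdev(values) or 1.0) if with_std else 1.0
    return [(v - mean) / scale for v in values]

# Hypothetical fields on very different scales.
crime_rate = [0.1, 0.3, 0.2, 0.6, 0.8]
median_income = [42000.0, 58000.0, 51000.0, 39000.0, 75000.0]

ss_crime_rate = standardize(crime_rate)
ss_median_income = standardize(median_income)
```

After standardization both fields have a mean of 0 and a standard deviation of 1, so neither dominates a later algorithm purely because of its units.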

Apply preprocessing to your data

Use an Assistant to apply preprocessing to the results of a search with a specified Preprocess method, also known as an algorithm. You can add one or more preprocessing steps, resulting in a set of sequential transformations. The transformed data is suitable for machine learning.
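
The idea of sequential transformations can be sketched in Python as a list of steps, where each step receives the events produced by the previous one. This is only a conceptual sketch, not how the Assistant works internally. The step functions, the run_steps helper, and the rooms and area fields are hypothetical, and the PCA step is simplified to two input fields.

```python
import math
from statistics import fmean, pstdev

def standard_scaler_step(events, fields):
    """Add an SS_<field> value to every event for each listed field."""
    out = [dict(e) for e in events]
    for f in fields:
        col = [e[f] for e in events]
        mean, scale = fmean(col), pstdev(col) or 1.0
        for e in out:
            e["SS_" + f] = (e[f] - mean) / scale
    return out

def pca_2d_step(events, fields):
    """Rotate two fields onto their principal axes, adding PC_1 and PC_2."""
    x_name, y_name = fields
    xs = [e[x_name] for e in events]
    ys = [e[y_name] for e in events]
    mx, my, n = fmean(xs), fmean(ys), len(events)
    cxx = sum((x - mx) ** 2 for x in xs) / n
    cyy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Angle of the major axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    out = [dict(e) for e in events]
    for e, x, y in zip(out, xs, ys):
        e["PC_1"] = (x - mx) * c + (y - my) * s
        e["PC_2"] = -(x - mx) * s + (y - my) * c
    return out

def run_steps(events, steps):
    """Apply each (step, fields) pair in order."""
    for step, fields in steps:
        events = step(events, fields)
    return events

events = [
    {"rooms": 2.0, "area": 45.0},
    {"rooms": 3.0, "area": 70.0},
    {"rooms": 4.0, "area": 85.0},
    {"rooms": 5.0, "area": 120.0},
    {"rooms": 6.0, "area": 130.0},
]
steps = [
    (standard_scaler_step, ["rooms", "area"]),
    (pca_2d_step, ["SS_rooms", "SS_area"]),  # consumes fields from step 1
]
result = run_steps(events, steps)
```

The second step operates on the SS_ fields created by the first step, which is why the order of the steps matters.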

In the Splunk Machine Learning Toolkit app, select an Assistant to add preprocessing steps to your search:
- Select Assistants > Predict Numeric Fields
- Select Assistants > Predict Categorical Fields
- Select Assistants > Cluster Numeric Events
In the Create New Model tab, run a search.
Under the Preprocessing Steps section, click the + Add a step link.

Select the Preprocess method, also known as an algorithm.

Algorithm	Description
FieldSelector	The FieldSelector algorithm uses the scikit-learn GenericUnivariateSelect to select the best predictor fields based on univariate statistical tests.
KernelPCA	The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of data.
PCA	The PCA algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new uncorrelated features out of the data.
StandardScaler	The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize the data fields so that their mean and standard deviation are 0 and 1, respectively.

Fill out the applicable fields for each preprocess method. Each field is described by a tooltip that can be viewed by hovering over the field name.
Click Apply to perform the specified preprocessing.
Click Preview Results to see a table with the preprocessing results.
Any fields newly created by the preprocessing appear in the results. Fields processed using StandardScaler are prefixed with SS_. For example, if you selected StandardScaler as the preprocess method and the crime_rate field for preprocessing, the standardized field is named SS_crime_rate. If you selected PCA or KernelPCA as your preprocess method, the processed fields are named PC_<n>, for example, PC_1 and PC_2. If you selected FieldSelector as the preprocess method, the processed fields are prefixed with fs_. Not all preprocessing algorithms generate a prefix.
You can also add more than one preprocessing step. If you are not satisfied with the results, edit the preprocessing settings. You may wish to try a different method, change the fields, or change the algorithm parameters to apply further transformations to your data. The fields available for each preprocessing step include new fields or settings generated by previous preprocessing steps.
After each preprocessing step is applied, review the output to see if you have obtained the desired results. If you are not satisfied with the results, you can remove the preprocessing step, modify the settings in the preprocessing step, or add preprocessing steps to apply additional transformations.
If you add more than one preprocessing step, click the icon to view the incremental results of each step.
The Preview Results button, located below the preprocessing steps, shows the results after a preprocessing step has been applied.
- Only the last preprocessing step can be modified. Remove it to edit the previous step.
- Removing a preprocessing step will remove any subsequent preprocessing steps, as well as any fields selected in later sections of the assistant.
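
To make the FieldSelector output naming from the steps above more concrete, the following Python sketch scores each candidate field against a target and keeps the best ones under an fs_ prefix. It is an illustration under stated assumptions, not the toolkit's method. The absolute Pearson correlation is used here as a simple stand-in for a univariate statistical test, and the function names and sample fields are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant field carries no linear signal
    return sxy / math.sqrt(sxx * syy)

def select_k_best(columns, target, k):
    """Keep the k fields with the largest |r| against the target."""
    scores = {name: abs(pearson_r(vals, target)) for name, vals in columns.items()}
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    # Selected fields are returned under an fs_ prefix.
    return {"fs_" + name: columns[name] for name in best}

# Hypothetical candidate fields and target.
target = [10.0, 20.0, 30.0, 40.0, 50.0]
columns = {
    "area": [1.0, 2.1, 2.9, 4.2, 5.0],
    "age": [5.0, 3.0, 4.0, 1.0, 2.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = select_k_best(columns, target, k=2)
```

In this sample, area and age are more strongly related to the target than noise, so they are the two fields kept.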

When you click the fit model button, both the model and any preprocessing steps are fit. When you save the main model, the preprocessing models are saved as well. If you are satisfied with the results of the preprocessing, you can use the fields created during preprocessing for further training and fitting of the model.
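
One way to picture why preprocessing models are saved with the main model is that new data must pass through the same learned transformation before the model can be applied. The Python sketch below illustrates that idea under simplifying assumptions: a single input field, a scaler, and a least-squares line. The function names and the dictionary layout are hypothetical and do not reflect how the toolkit stores models.

```python
from statistics import fmean, pstdev

def fit_with_preprocessing(xs, ys):
    """Fit a scaler and a one-variable least-squares line, kept together."""
    mean, scale = fmean(xs), pstdev(xs) or 1.0
    zs = [(x - mean) / scale for x in xs]
    z_mean, y_mean = fmean(zs), fmean(ys)
    szz = sum((z - z_mean) ** 2 for z in zs)
    slope = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys)) / szz if szz else 0.0
    return {
        "scaler": {"mean": mean, "scale": scale},
        "model": {"slope": slope, "intercept": y_mean - slope * z_mean},
    }

def apply_saved(saved, xs):
    """Reapply the saved scaler, then the saved model, to new values."""
    s, m = saved["scaler"], saved["model"]
    return [m["intercept"] + m["slope"] * ((x - s["mean"]) / s["scale"]) for x in xs]

# Training data follows y = 2x + 1.
saved = fit_with_preprocessing([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predictions = apply_saved(saved, [5.0, 6.0])
```

Because the scaler's mean and scale were learned from the training data, applying the model to new values without them would feed it inputs on the wrong scale.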
