Preprocessing

A raw data set often cannot be used directly as a training set. Preprocessing lets you transform your data to prepare it for machine learning. For example, preprocessing is useful if your data contains a large number of fields, or if the fields have very different scales.

You can add one or more preprocessing steps to apply sequential transformations to your data and produce a data set that is suitable for machine learning. After you apply each preprocessing step, you can review the results to confirm that the transformation did what you intended. If you are not satisfied with the results, you can remove the preprocessing step, modify its settings, or add more preprocessing steps to apply further transformations. Preprocessing steps are saved in the assistant history along with all the other settings used in the assistant and can be accessed from the Load Existing Settings tab.

The following assistants in the Splunk Machine Learning Toolkit include an option to perform preprocessing of your data:

Predict Numeric Fields
Predict Categorical Fields
Cluster Numeric Events

Preprocessing Algorithms

The Preprocessing assistant uses the following algorithms:

StandardScaler
PCA
KernelPCA

Apply preprocessing to your data

Follow the steps below to perform preprocessing in an assistant.

Run a search.
Click the + Add a step link in the Preprocessing Steps section.
Select the preprocessing method.

StandardScaler is useful when the fields have very different scales. StandardScaler standardizes numeric fields by centering about the mean, rescaling to have a standard deviation of one, or both. If you have too many fields, the performance of some algorithms can drop drastically. In that case, use PCA or KernelPCA to reduce the number of dimensions. PCA and KernelPCA can also be used to reduce the number of dimensions for visualization purposes, for example, to project the data into 2D in order to display a scatterplot chart.

It is strongly recommended to standardize fields using StandardScaler before using the KernelPCA method.
Specify one or more fields to preprocess.
Click in the field to select from a dropdown list of the fields in your data. You can also type in field names and use wildcards (*).

Select any options you would like to use with the selected preprocessing method.

Method	Specifications
StandardScaler	Select the with_mean check box to center each field about its mean, the with_std check box to rescale each field to a standard deviation of one, or both.
PCA	(Optional) Specify the number of features to extract from the data in the K (# of Components) field. This number determines the number of new fields created. If no value is provided, the number of fields created is equal to the number of fields selected.
KernelPCA	(Optional) Specify the number of features to extract from the data in the K (# of Components) field. This number must be less than or equal to the number of fields selected and determines the number of new fields created. If no value is provided, the number of fields created is two. The other parameters are optional for fine tuning of the kernel.
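
To make the effect of the with_mean and with_std options concrete, here is a minimal Python sketch of standardization. It is an illustration only, not the Toolkit's implementation: the function name standardize and the sample values are hypothetical, and it assumes the population standard deviation is used, as scikit-learn's StandardScaler does.

```python
from statistics import fmean, pstdev


def standardize(values, with_mean=True, with_std=True):
    """Return a standardized copy of one numeric field (hypothetical helper)."""
    # with_mean: subtract the field's mean so the result is centered on zero.
    center = fmean(values) if with_mean else 0.0
    # with_std: divide by the population standard deviation (assumption).
    scale = pstdev(values) if with_std else 1.0
    if scale == 0.0:
        scale = 1.0  # constant field: avoid dividing by zero
    return [(v - center) / scale for v in values]


# Made-up sample values for a field named crime_rate (mean 5, std 2).
crime_rate = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

ss_both = standardize(crime_rate)                         # center and rescale
ss_mean_only = standardize(crime_rate, with_std=False)    # center only
ss_std_only = standardize(crime_rate, with_mean=False)    # rescale only
```

With both options selected, the resulting field has a mean of zero and a standard deviation of one, which puts fields with very different scales on a common footing.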

Click Apply to perform the specified preprocessing.
Click Preview Results to see a table with the preprocessing results. The table includes any fields created by the preprocessing. Fields processed using StandardScaler are prefixed with SS_. For example, if you selected StandardScaler as the preprocessing method and the crime_rate field for preprocessing, the standardized field is named SS_crime_rate. If you selected PCA or KernelPCA as your preprocessing method, the new fields are named PC_<n>, for example, PC_1, PC_2.
If you are not satisfied with the results, edit the preprocessing settings. You can try a different method, select different fields, or change the settings. You can also add another preprocessing step to apply further transformations to your data, or you can remove the preprocessing step. The fields available for each preprocessing step include new fields generated by previous preprocessing steps.
If you add more than one preprocessing step, click the icon to view the incremental results of each step.
The Preview Results link below the steps shows the results after all preprocessing steps have been applied.
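
The field selection and naming rules described in the steps above can be sketched in Python. This is a rough model under stated assumptions, not the assistant's code: the helper names select_fields and output_fields are hypothetical, every field name other than crime_rate is made up, and Python's fnmatch supports more pattern characters than the * wildcard mentioned above.

```python
from fnmatch import fnmatchcase


def select_fields(available, patterns):
    """Return the available fields that match any pattern, in original order."""
    return [f for f in available if any(fnmatchcase(f, p) for p in patterns)]


def output_fields(method, selected, k=None):
    """Return the names of the fields a preprocessing step would create."""
    if method == "StandardScaler":
        # One SS_-prefixed field per selected field.
        return ["SS_" + f for f in selected]
    if method == "PCA":
        # Without K, PCA creates as many fields as were selected.
        count = len(selected) if k is None else k
    elif method == "KernelPCA":
        # Without K, KernelPCA creates two fields.
        count = 2 if k is None else k
        if count > len(selected):
            raise ValueError("K must not exceed the number of selected fields")
    else:
        raise ValueError(f"unknown method: {method}")
    return [f"PC_{i}" for i in range(1, count + 1)]


# Hypothetical field list; only crime_rate appears in the text above.
available = ["crime_rate", "avg_rooms", "avg_income", "median_value"]
selected = select_fields(available, ["crime_rate", "avg_*"])

scaled_names = output_fields("StandardScaler", selected)
pca_names = output_fields("PCA", selected)
kernel_pca_names = output_fields("KernelPCA", selected)
```

Here selected holds crime_rate, avg_rooms, and avg_income, so StandardScaler yields three SS_ fields, PCA without a K value yields PC_1 through PC_3, and KernelPCA without a K value yields PC_1 and PC_2.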

After a preprocessing step has been applied, it can be modified as long as no other steps have been added. Once another preprocessing step has been added, previous steps cannot be modified. Steps can be removed, but removing a step also removes all subsequent preprocessing steps as well as any fields that have been selected in later sections of the assistant for model training and fitting.
If you are satisfied with the results of the preprocessing, you can use the fields created during preprocessing for training and fitting the model.
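
The rules for modifying and removing steps can be modeled with a small Python class. This is only a sketch of the behavior described above: the class PreprocessingSteps and its methods are hypothetical and do not correspond to any Toolkit API, and the clearing of fields selected in later sections of the assistant is not modeled.

```python
class PreprocessingSteps:
    """Hypothetical model of an ordered chain of preprocessing steps."""

    def __init__(self):
        self._steps = []

    def add(self, method, fields, **options):
        """Append a new step to the end of the chain."""
        self._steps.append(
            {"method": method, "fields": list(fields), "options": dict(options)}
        )

    def _check_index(self, index):
        if not 0 <= index < len(self._steps):
            raise IndexError("no such step")

    def modify(self, index, **options):
        """Change a step's options; only the most recent step is editable."""
        self._check_index(index)
        if index != len(self._steps) - 1:
            raise ValueError("a step cannot be modified once another is added")
        self._steps[index]["options"].update(options)

    def remove(self, index):
        """Remove a step together with every step that follows it."""
        self._check_index(index)
        removed = self._steps[index:]
        del self._steps[index:]
        return removed

    def methods(self):
        return [step["method"] for step in self._steps]


steps = PreprocessingSteps()
steps.add("StandardScaler", ["crime_rate"], with_mean=True, with_std=True)
steps.add("PCA", ["SS_crime_rate"])
steps.modify(1, k=1)        # allowed: PCA is the most recent step
removed = steps.remove(0)   # removes StandardScaler and the PCA step after it
```

Because each step can consume fields produced by earlier steps, removing an early step leaves the later ones without their inputs, which is one plausible reason for removing them together.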
