Preprocessing

Some of the assistants in the Splunk Machine Learning Toolkit include an option to preprocess your data. Very often a raw data set cannot be used directly as a training set. Preprocessing allows you to run certain algorithms on your data to prepare it for machine learning. For example, preprocessing is useful if your data contains a large number of fields or if the fields have very different scales.

You can add one or more preprocessing steps that apply sequential transformations to your data, producing a data set that is suitable for machine learning. After you apply each preprocessing step, you can review the transformed data to check whether the step produced the result you wanted. If it did not, you can remove the preprocessing step, modify its settings, or add further preprocessing steps to apply additional transformations. Preprocessing steps are saved in the assistant history along with all the other settings used in the assistant and can be accessed from the Load Existing Settings tab.

Preprocessing Algorithms

Apply preprocessing to your data

To perform preprocessing in an assistant, do the following:

After entering a search, click the + Add a step link under the Preprocessing Steps section.
Under New Preprocess Step, select the preprocessing method to use: StandardScaler, PCA, or KernelPCA.
- StandardScaler is useful when the fields have very different scales. StandardScaler standardizes numeric fields by centering about the mean, rescaling to have a standard deviation of one, or both.
- If you have too many fields, the performance of some algorithms can drop drastically. For this case, use PCA or KernelPCA to reduce the number of dimensions. PCA and KernelPCA can also be used to reduce the number of dimensions for visualization purposes, for example, to project into 2D in order to display a scatterplot chart.
  
  It is strongly recommended to standardize fields using StandardScaler before using the KernelPCA method.
Specify one or more fields to preprocess. Click in the field to select from a dropdown list of the fields in your data. You can also type in field names and use wildcards (*).
Select any options you would like to use with the selected preprocess method.
- For StandardScaler, select the with_mean option to center each field about its mean, the with_std option to rescale each field to a standard deviation of one, or both.
- For PCA, optionally specify the number of features to extract from the data in the K (# of Components) field. This number must be less than or equal to the number of fields selected and determines the number of new fields created. If no value is provided, the number of fields created is equal to the number of fields selected.
- For KernelPCA, optionally specify the number of features to extract from the data in the K (# of Components) field. This number must be less than or equal to the number of fields selected and determines the number of new fields created. If no value is provided, the number of fields created is two. The other parameters are optional for fine tuning of the kernel.
Click Apply to perform the specified preprocessing.
Click Preview Results to see a table with the preprocessing results. The table includes any fields newly created by the preprocessing.
Fields processed using StandardScaler are prefixed with "SS_". For example, if you select StandardScaler as the preprocess method and the crime_rate field for preprocessing, the standardized field is named SS_crime_rate.
If you selected PCA or KernelPCA as your preprocess method, the new fields are named "PC_<n>", for example, PC_1 and PC_2.
If you are not satisfied with the results, you have three options. You can edit the preprocessing step to try a different method or to change the fields or settings. You can add another preprocessing step to apply further transformations to your data. Or you can remove the preprocessing step. The fields available for each preprocessing step include new fields generated by previous preprocessing steps.
If you add more than one preprocessing step, you can view the incremental results of each step by clicking the icon on the right side of the step. The Preview Results link below the steps shows the results after all preprocessing steps have been applied.
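
To make the StandardScaler options from the steps above concrete, the following is a minimal sketch in plain Python of what centering and rescaling do to a field. It is not the toolkit's implementation. The function name standard_scale, the row layout (a list of dictionaries), the use of the population standard deviation, and the handling of constant fields are all assumptions made for illustration. Only the with_mean and with_std options and the "SS_" prefix come from the description above.

```python
from statistics import fmean, pstdev


def standard_scale(rows, fields, with_mean=True, with_std=True):
    """Return copies of rows with an SS_<field> value added for each field.

    Hypothetical helper for illustration only.
    """
    out = [dict(row) for row in rows]
    for name in fields:
        values = [float(row[name]) for row in rows]
        # with_mean: center about the mean; otherwise leave values uncentered.
        center = fmean(values) if with_mean else 0.0
        # with_std: rescale to a standard deviation of one.
        # Assumption: population standard deviation, and a constant field
        # (deviation of zero) is left unscaled to avoid dividing by zero.
        scale = (pstdev(values) or 1.0) if with_std else 1.0
        for row, value in zip(out, values):
            row["SS_" + name] = (value - center) / scale
    return out


rows = [{"crime_rate": 1.0}, {"crime_rate": 2.0}, {"crime_rate": 3.0}]
scaled = standard_scale(rows, ["crime_rate"])
for row in scaled:
    print(row)
```

Each output row keeps the original crime_rate value and gains an SS_crime_rate value, which mirrors how the preview table shows newly created fields alongside the existing ones.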

After a preprocessing step has been applied, it can be modified as long as no other steps have been added. Once another preprocessing step has been added, previous steps cannot be modified. Steps can be removed, but removing a step also removes all subsequent preprocessing steps as well as any fields that have been selected in later sections of the assistant for model training and fitting.
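
The editing and removal rules in the note above can be modeled in a few lines. This is only a sketch of the behavior as described. The AssistantState class and its method names are hypothetical, and it assumes that removing a step clears every field selected in later sections.

```python
from dataclasses import dataclass, field


@dataclass
class AssistantState:
    """Hypothetical model of the preprocessing steps in an assistant."""

    steps: list = field(default_factory=list)
    selected_fields: list = field(default_factory=list)

    def add_step(self, step):
        self.steps.append(step)

    def can_modify(self, index):
        # Only the most recently added step can still be modified.
        return index == len(self.steps) - 1

    def remove_step(self, index):
        # Removing a step also removes every step added after it ...
        del self.steps[index:]
        # ... and clears fields already selected for training and fitting
        # (assumption: the whole selection is cleared).
        self.selected_fields.clear()


state = AssistantState()
state.add_step("StandardScaler")
state.add_step("PCA")
state.selected_fields = ["PC_1", "PC_2"]

print(state.can_modify(0))  # the first step is locked once PCA is added
state.remove_step(0)
print(state.steps, state.selected_fields)
```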
Once you are satisfied with the results of the preprocessing, you can use the fields created during preprocessing for training and fitting the model.
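
As a quick check on which fields to expect from PCA or KernelPCA before you select them for training, the rules for the K (# of Components) option described earlier can be written as a small function. This is a sketch under stated assumptions, not toolkit code. The function name component_fields is hypothetical, and the source does not say what happens when KernelPCA's default of two exceeds the number of selected fields, so this sketch simply rejects that case.

```python
def component_fields(method, selected_fields, k=None):
    """Return the names of the fields a PCA or KernelPCA step would create.

    Hypothetical helper for illustration only.
    """
    if method not in ("PCA", "KernelPCA"):
        raise ValueError("method must be 'PCA' or 'KernelPCA'")
    n = len(selected_fields)
    if k is None:
        # Defaults from the description: PCA creates as many fields as were
        # selected, KernelPCA creates two.
        k = n if method == "PCA" else 2
    # K must be less than or equal to the number of fields selected.
    # Assumption: K must also be at least one.
    if not 1 <= k <= n:
        raise ValueError(f"K must be between 1 and {n}, got {k}")
    return [f"PC_{i}" for i in range(1, k + 1)]


fields = ["SS_crime_rate", "SS_income", "SS_age"]
print(component_fields("PCA", fields))
print(component_fields("KernelPCA", fields))
print(component_fields("PCA", fields, k=2))
```

The field names SS_income and SS_age are made up for the example; only SS_crime_rate appears in the text above.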
