Splunk® Machine Learning Toolkit

ML-SPL API Guide

# Correlation Matrix

This example covers the following tasks:

• using the `BaseAlgo` class
• validating search syntax
• converting parameters

In this example, you use the Python library pandas, which is part of the Python for Scientific Computing app. The `DataFrame.corr` method constructs a correlation matrix. See the pandas library documentation for more information on this method. In addition to constructing the correlation matrix, you pass a parameter to the algorithm to switch between Pearson, Kendall and Spearman correlations.
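As a minimal sketch of what `DataFrame.corr` does with the `method` argument, the following snippet uses a small, made-up DataFrame (the column names `a`, `b`, and `c` are illustrative only). It assumes pandas is importable outside of Splunk. The `kendall` method is also accepted by pandas but is left out here because it may depend on SciPy being installed.

```python
import pandas as pd

# Hypothetical sample data: b rises with a, and c falls as a rises.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

# The method argument selects the correlation type; "pearson" is the default.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# The result is a square DataFrame whose index and columns are the field names.
print(pearson)
```

Each cell holds the correlation between the row field and the column field, so the diagonal is always 1.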

This example uses the ML-SPL API available in the Splunk Machine Learning Toolkit version 2.2.0 and later. Verify your Splunk Machine Learning Toolkit version before using this example.

A search using this custom algorithm can look like this:

`index=foo sourcetype=bar | fit CorrelationMatrix method=kendall <fields>`

## Steps

Fit a correlation matrix on all `<fields>`:

1. Register the algorithm in `algos.conf` using one of the following methods.
   1. Register the algorithm using the REST API:
   ```
   $ curl -k -u admin:<admin pass> https://localhost:8089/servicesNS/nobody/Splunk_ML_Toolkit/configs/conf-algos -d name="CorrelationMatrix"
   ```
   2. Register the algorithm manually:
   Modify or create the `algos.conf` file located in `$SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/` and add the following stanza to register your algorithm:
   ```
   [CorrelationMatrix]
   ```

When you register the algorithm with this method, you must restart Splunk Enterprise.

2. Create the Python file in the `algos` folder. For this example, you create `$SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/bin/algos/CorrelationMatrix.py`.
Import the relevant modules. In this case, use the `BaseAlgo` class, which provides a skeleton class to catch errors.
`from base import BaseAlgo`
3. Define the class.
Inherit from `BaseAlgo`. The class name is the name of the algorithm.
```python
class CorrelationMatrix(BaseAlgo):
    """Compute and return a correlation matrix."""
```
4. Define the `__init__` method.
The `__init__` method passes the options from the search to the algorithm. Ensure that fields are present, that there is no `from` clause, and that only valid methods are used, raising `RuntimeError` where appropriate:
```python
    def __init__(self, options):
        """Check for valid correlation type, and save it to an attribute on self."""

        feature_variables = options.get('feature_variables', {})
        target_variable = options.get('target_variable', {})

        if len(feature_variables) == 0:
            raise RuntimeError('You must supply one or more fields')

        if len(target_variable) > 0:
            raise RuntimeError('CorrelationMatrix does not support the from clause')

        valid_methods = ['spearman', 'kendall', 'pearson']

        # Check to see if parameters exist
        params = options.get('params', {})

        # Check if method is in parameters in search
        if 'method' in params:
            if params['method'] not in valid_methods:
                error_msg = 'Invalid value for method: must be one of {}'.format(
                    ', '.join(valid_methods))
                raise RuntimeError(error_msg)

            # Assign method to self for later usage
            self.method = params['method']

        # Assign default method & ensure no other parameters are present
        else:
            # Default method for correlation
            self.method = 'pearson'

            if len(params) > 0:
                raise RuntimeError('The only valid parameter is method.')
```

The options that are passed to this method are closely related to the SPL search query being used.

For a simple query such as:

`| fit LinearRegression sepal_width from petal* fit_intercept=t`

The options returned are:

```
{
    'args': [u'sepal_width', u'petal*'],
    'params': {u'fit_intercept': u't'},
    'feature_variables': ['petal*'],
    'target_variable': ['sepal_width'],
    'algo_name': u'LinearRegression',
}
```

This dictionary of options includes:

```
- args (list) - a list of the fields used
- params (dict) - any parameters (key-value pairs) in the search
- feature_variables (list) - fields to be used as features
- target_variable (list) - the target field for prediction
- algo_name (str) - the name of the algorithm
```

Other keys that may exist depending on the search:

```
- model_name (str) - the name of the model being saved ('into' clause)
- output_name (str) - the name of the output ('as' clause)
```
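To illustrate how the options dictionary drives the checks in `__init__`, here is a minimal standalone sketch. The function names `check_options` and `outcome` are hypothetical, and the option dictionaries are hand-written guesses modeled on the `LinearRegression` example above, not output captured from Splunk.

```python
VALID_METHODS = ['spearman', 'kendall', 'pearson']


def check_options(options):
    """Return the correlation method, or raise RuntimeError for an invalid search."""
    if len(options.get('feature_variables', [])) == 0:
        raise RuntimeError('You must supply one or more fields')
    if len(options.get('target_variable', [])) > 0:
        raise RuntimeError('CorrelationMatrix does not support the from clause')

    params = options.get('params', {})
    # Variation on the guide's code: reject extra parameters in every case.
    if set(params) - {'method'}:
        raise RuntimeError('The only valid parameter is method.')

    method = params.get('method', 'pearson')
    if method not in VALID_METHODS:
        raise RuntimeError('Invalid value for method: must be one of {}'.format(
            ', '.join(VALID_METHODS)))
    return method


def outcome(options):
    """Return the chosen method, or the error message the check raises."""
    try:
        return check_options(options)
    except RuntimeError as err:
        return 'error: {}'.format(err)


# Hypothetical options for: | fit CorrelationMatrix method=kendall a b
kendall_search = {
    'args': ['a', 'b'],
    'params': {'method': 'kendall'},
    'feature_variables': ['a', 'b'],
    'algo_name': 'CorrelationMatrix',
}

# Hypothetical options for: | fit CorrelationMatrix a from b
from_search = {
    'args': ['a', 'b'],
    'params': {},
    'feature_variables': ['b'],
    'target_variable': ['a'],
    'algo_name': 'CorrelationMatrix',
}

# Hypothetical options for: | fit CorrelationMatrix method=cosine a b
bad_method_search = dict(kendall_search, params={'method': 'cosine'})

for opts in (kendall_search, from_search, bad_method_search):
    print(outcome(opts))
```

Unlike the `__init__` method shown in this step, the sketch rejects unknown parameters even when `method` is supplied. Whether that stricter behavior is desirable depends on your algorithm.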

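The `feature_variables` and `target_variable` values depend on the syntax of the search. If a `from` clause is present, the field before `from` is the `target_variable` and the fields after it are the `feature_variables`:

`| fit LinearRegression target_variable from feature_variables`

With an unsupervised algorithm such as KMeans, there is no `from` clause, and all of the fields are `feature_variables`:

`| fit KMeans feature_variables`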

The `feature_variables` in the options have not been wildcard matched against the available data. If there are wildcards (*) in the field names, the wildcards are present in the `feature_variables`.
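To make the wildcard point concrete, the sketch below expands patterns such as `petal*` against a list of available field names using Python's standard `fnmatch` module. This is only an illustration of the idea. The helper name `expand_wildcards` is hypothetical, it is not how the toolkit itself resolves wildcards, and `fnmatch` also honors `?` and `[...]` patterns, which may not match SPL semantics.

```python
from fnmatch import fnmatchcase


def expand_wildcards(patterns, available_fields):
    """Return the available fields matching any pattern, in data order."""
    matched = []
    for field in available_fields:
        # Keep a field once, even if several patterns match it.
        if field not in matched and any(fnmatchcase(field, p) for p in patterns):
            matched.append(field)
    return matched


# Hypothetical field names, as they might appear in the search results.
fields = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# The options still contain the raw pattern from the search.
raw_feature_variables = ['petal*']

print(expand_wildcards(raw_feature_variables, fields))
```

In `__init__`, only the raw patterns are available, so checks there should not assume that a name in `feature_variables` is an actual column.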

5. Define the `fit` method.
The `fit` method is where you compute the correlations. Afterwards, return the DataFrame.
```python
    def fit(self, df, options):
        """Compute the correlations and return a DataFrame."""

        # df contains all the search results, including hidden fields
        # but the requested fields are saved as self.feature_variables
        requested_columns = df[self.feature_variables]

        # Get correlations
        correlations = requested_columns.corr(method=self.method)

        # Reset index so that all the data are in columns
        # (this is usually not necessary, but is for the corr method)
        output_df = correlations.reset_index()

        return output_df
```

When defining the `fit` method, you have the option to either return values or to do nothing, which returns `None`. If you return the dataframe, no `apply` method is needed. The `apply` method is only needed when a saved model must make predictions on unseen data.
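The reason for the `reset_index` call in `fit` can be seen in a small standalone sketch. The DataFrame below is made up for illustration, and the snippet assumes pandas is importable outside of Splunk.

```python
import pandas as pd

# Hypothetical search results with two numeric fields.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]})

correlations = df.corr(method="pearson")
# Here the field names label the rows through the index, not through a column.
print(correlations)

output_df = correlations.reset_index()
# reset_index moves the row labels into a regular column named "index".
print(output_df)
```

Without the reset, the row labels exist only in the index, so a result built from the columns alone would not show which field each row describes.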

## Finished example

```python
from base import BaseAlgo


class CorrelationMatrix(BaseAlgo):
    """Compute and return a correlation matrix."""

    def __init__(self, options):
        """Check for valid correlation type, and save it to an attribute on self."""

        feature_variables = options.get('feature_variables', {})
        target_variable = options.get('target_variable', {})

        if len(feature_variables) == 0:
            raise RuntimeError('You must supply one or more fields')

        if len(target_variable) > 0:
            raise RuntimeError('CorrelationMatrix does not support the from clause')

        valid_methods = ['spearman', 'kendall', 'pearson']

        # Check to see if parameters exist
        params = options.get('params', {})

        # Check if method is in parameters in search
        if 'method' in params:
            if params['method'] not in valid_methods:
                error_msg = 'Invalid value for method: must be one of {}'.format(
                    ', '.join(valid_methods))
                raise RuntimeError(error_msg)

            # Assign method to self for later usage
            self.method = params['method']

        # Assign default method and ensure no other parameters are present
        else:
            # Default method for correlation
            self.method = 'pearson'

            if len(params) > 0:
                raise RuntimeError('The only valid parameter is method.')

    def fit(self, df, options):
        """Compute the correlations and return a DataFrame."""

        # df contains all the search results, including hidden fields
        # but the requested fields are saved as self.feature_variables
        requested_columns = df[self.feature_variables]

        # Get correlations
        correlations = requested_columns.corr(method=self.method)

        # Reset index so that all the data are in columns
        # (this is necessary for the corr method)
        output_df = correlations.reset_index()

        return output_df
```

## Example search

You might have to reorder your fields with the `fields` or `table` command.