Splunk® Machine Learning Toolkit

ML-SPL API Guide

Correlation Matrix example

This example uses the Python library pandas, which is part of the Python for Scientific Computing app. This Correlation Matrix example covers the following tasks:

  • Using the BaseAlgo class
  • Validating search syntax
  • Converting parameters

The pandas DataFrame.corr method constructs a correlation matrix. In this example, you also pass a method parameter to the algorithm to switch between Pearson, Kendall, and Spearman correlations. See the pandas library documentation for more information on this method.
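As a rough illustration of what the method parameter changes, the sketch below calls DataFrame.corr directly on a small made-up DataFrame. The column names and values are invented for this example, and only the pearson and spearman methods are shown.

```python
import pandas as pd

# Made-up data: "squared" grows monotonically with "x", but not linearly.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "doubled": [2, 4, 6, 8, 10],
    "squared": [1, 4, 9, 16, 25],
})

pearson = df.corr(method="pearson")    # linear correlation
spearman = df.corr(method="spearman")  # rank-based correlation

# Pearson is below 1 for the nonlinear column, while Spearman treats any
# strictly increasing relationship as a perfect correlation.
print(pearson.round(3))
print(spearman.round(3))
```

Both results are square DataFrames with one row and one column per input field, which is the shape that the custom algorithm returns to the search.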

A search using this custom algorithm looks like this:

index=foo sourcetype=bar | fit CorrelationMatrix method=kendall <fields>

Steps

Follow these steps to add the Correlation Matrix algorithm.

Fit a correlation matrix on all <fields>:

  1. Register the algorithm in algos.conf using one of the following methods.
    1. Register the algorithm using the REST API:
      $ curl -k -u admin:<admin pass> https://localhost:8089/servicesNS/nobody/Splunk_ML_Toolkit/configs/conf-algos -d name="CorrelationMatrix"
      
    2. Register the algorithm manually:
      Modify or create the algos.conf file located in $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/ and add the following stanza to register your algorithm:
       [CorrelationMatrix]
      

      When you register the algorithm with this method, you must restart Splunk Enterprise.
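If you register the algorithm manually, one quick way to confirm that the stanza is in place is to parse the file contents. The sketch below assumes that algos.conf can be read as a simple INI-style file, which holds for the single empty stanza in this example but might not hold for every Splunk configuration file. The helper name is_registered and the sample text are hypothetical.

```python
import configparser


def is_registered(conf_text, algo_name):
    """Return True if conf_text contains a stanza named algo_name."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.has_section(algo_name)


# Sample contents, standing in for the text of local/algos.conf.
sample_conf = "[CorrelationMatrix]\n"

print(is_registered(sample_conf, "CorrelationMatrix"))
print(is_registered(sample_conf, "AgglomerativeClustering"))
```

Stanza names are compared exactly, so the name in algos.conf has to match the class name used in the fit command.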

  2. Create the Python file in the algos folder. For this example, create $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/bin/algos/CorrelationMatrix.py.
    Import the relevant modules. In this case, use the BaseAlgo class which provides a skeleton class to catch errors.
    from base import BaseAlgo
  3. Define the class.
    Inherit from BaseAlgo. The class name is the name of the algorithm.
    class CorrelationMatrix(BaseAlgo):
        """Compute and return a correlation matrix."""
    
  4. Define the __init__ method.
    The __init__ method receives the options from the search. Use it to validate the search syntax by raising a RuntimeError if no fields are supplied, if a from clause is present, or if the method parameter is not one of the valid methods:
        def __init__(self, options):
            """Check for valid correlation type, and save it to an attribute on self."""
    
            feature_variables = options.get('feature_variables', {})
            target_variable = options.get('target_variable', {})
    
            if len(feature_variables) == 0:
                raise RuntimeError('You must supply one or more fields')
    
            if len(target_variable) > 0:
                raise RuntimeError('CorrelationMatrix does not support the from clause')
    
            valid_methods = ['spearman', 'kendall', 'pearson']
    
            # Check to see if parameters exist
            params = options.get('params', {})
    
            # Check if method is in parameters in search
            if 'method' in params:
                if params['method'] not in valid_methods:
                    error_msg = 'Invalid value for method: must be one of {}'.format(
                        ', '.join(valid_methods))
                    raise RuntimeError(error_msg)
    
                # Assign method to self for later usage
                self.method = params['method']
    
            # Assign default method & ensure no other parameters are present
            else:
                # Default method for correlation
                self.method = 'pearson'
    
                # Check for bad parameters
                if len(params) > 0:
                    raise RuntimeError('The only valid parameter is method.')
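To see how these checks behave without running a search, the following standalone sketch restates the same validation as a plain function and calls it with hand-built options dictionaries. The names resolve_method and try_options and the sample dictionaries are made up for illustration. In the toolkit, the checks live in __init__ as shown above.

```python
VALID_METHODS = ['spearman', 'kendall', 'pearson']


def resolve_method(options):
    """Mirror the __init__ checks and return the correlation method."""
    if len(options.get('feature_variables', [])) == 0:
        raise RuntimeError('You must supply one or more fields')
    if len(options.get('target_variable', [])) > 0:
        raise RuntimeError('CorrelationMatrix does not support the from clause')

    params = options.get('params', {})
    if 'method' in params:
        if params['method'] not in VALID_METHODS:
            raise RuntimeError('Invalid value for method: must be one of {}'.format(
                ', '.join(VALID_METHODS)))
        return params['method']
    if len(params) > 0:
        raise RuntimeError('The only valid parameter is method.')
    return 'pearson'  # default when no method is given


def try_options(options):
    """Return the resolved method, or the error message if validation fails."""
    try:
        return resolve_method(options)
    except RuntimeError as error:
        return 'error: {}'.format(error)


print(try_options({'feature_variables': ['a', 'b'], 'params': {'method': 'kendall'}}))
print(try_options({'feature_variables': ['a', 'b']}))
print(try_options({'feature_variables': ['a'], 'params': {'method': 'cosine'}}))
print(try_options({'feature_variables': ['a'], 'target_variable': ['y']}))
```

The first two calls resolve to a method name, and the last two return the error text that the search would report.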
    

    The options that are passed to this method are closely related to the SPL search query being used.

    For a simple query such as:

    | fit LinearRegression sepal_width from petal* fit_intercept=t

    The options returned are:

    {
        'args': [u'sepal_width', u'petal*'],
        'params': {u'fit_intercept': u't'},
        'feature_variables': ['petal*'],
        'target_variable': ['sepal_width'],
        'algo_name': u'LinearRegression',
    }
    

    This dictionary of options includes:

    - args (list) - a list of the fields used
    - params (dict) - any parameters (key-value pairs) in the search
    - feature_variables (list) - fields to be used as features
    - target_variable (list) - the target field for prediction
    - algo_name (str) - the name of the algorithm
    

    Other keys that may exist depending on the search:

    - model_name (str) - the name of the model being saved ('into' clause)
    - output_name (str) - the name of the output ('as' clause)
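Because model_name and output_name appear only when the search includes the corresponding clause, code that reads them should allow for their absence. The following is a minimal sketch that uses the example dictionary above. The helper summarize_options is hypothetical and not part of the toolkit.

```python
def summarize_options(options):
    """Collect the documented keys, using None for optional keys that are absent."""
    return {
        'algo_name': options['algo_name'],
        'features': list(options.get('feature_variables', [])),
        'target': list(options.get('target_variable', [])),
        'params': dict(options.get('params', {})),
        'model_name': options.get('model_name'),    # set by an 'into' clause
        'output_name': options.get('output_name'),  # set by an 'as' clause
    }


options = {
    'args': ['sepal_width', 'petal*'],
    'params': {'fit_intercept': 't'},
    'feature_variables': ['petal*'],
    'target_variable': ['sepal_width'],
    'algo_name': 'LinearRegression',
}

summary = summarize_options(options)
print(summary['model_name'])  # None: the search has no 'into' clause
print(summary['params'])
```

Parameter values arrive as strings, such as 't' for fit_intercept, so an algorithm that needs a boolean or a number has to convert them itself.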
    

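    The feature_variables and target_variable values depend on the syntax of the search. If a from clause is present, the field before from is the target and the fields after it are the features:

    | fit LinearRegression target_variable from feature_variables

    With an unsupervised algorithm such as KMeans, there is no from clause and all of the fields are features:

    | fit KMeans feature_variables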

    The feature_variables in the options have not been wildcard matched against the available data. If there are wildcards (*) in the field names, the wildcards are present in the feature_variables.
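If your own code needs concrete field names from the raw options, it has to resolve the wildcards itself. The sketch below shows one possible way to do that with the standard fnmatch module. The helper expand_wildcards is hypothetical, and this is not necessarily how the toolkit matches fields internally.

```python
from fnmatch import fnmatchcase


def expand_wildcards(patterns, available_fields):
    """Return the available fields that match any pattern, in data order."""
    matched = []
    for field in available_fields:
        if any(fnmatchcase(field, pattern) for pattern in patterns):
            matched.append(field)
    return matched


fields_in_data = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
print(expand_wildcards(['petal*'], fields_in_data))
# ['petal_length', 'petal_width']
```

Note that fnmatchcase also treats ? and [ ] as pattern characters, which might differ from the wildcard rules of the search language.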

  5. Define the fit method.
    The fit method is where you compute the correlations. Afterwards, return the DataFrame.
        def fit(self, df, options):
            """Compute the correlations and return a DataFrame."""
    
            # df contains all the search results, including hidden fields,
            # but the requested fields are saved as self.feature_variables
            requested_columns = df[self.feature_variables]
    
            # Get correlations
            correlations = requested_columns.corr(method=self.method)
    
            # Reset index so that all the data are in columns
            # (this is usually not necessary, but is for the corr method)
            output_df = correlations.reset_index()
    
            return output_df
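To see what resetting the index does, the sketch below runs the same two calls on a small made-up DataFrame outside Splunk. The field names and values are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "cpu": [10, 20, 30, 40],
    "mem": [15, 25, 20, 45],
    "host": ["a", "b", "c", "d"],  # not requested, so it is left out below
})

requested_columns = df[["cpu", "mem"]]
correlations = requested_columns.corr(method="pearson")

# corr() labels the rows through the index, not through a column.
print(list(correlations.columns))  # ['cpu', 'mem']

# reset_index() moves those row labels into a regular column named "index",
# so the field names become part of the returned data.
output_df = correlations.reset_index()
print(list(output_df.columns))   # ['index', 'cpu', 'mem']
print(list(output_df["index"]))  # ['cpu', 'mem']
```

Without the reset, the row labels exist only in the DataFrame index, which is why the example moves them into a column before returning.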
    

    When defining the fit method, you can either return values or do nothing, in which case the method returns None. If you return the DataFrame, no apply method is needed. The apply method is only needed when a saved model must make predictions on unseen data.

Finished example

from base import BaseAlgo


class CorrelationMatrix(BaseAlgo):
    """Compute and return a correlation matrix."""

    def __init__(self, options):
        """Check for valid correlation type, and save it to an attribute on self."""

        feature_variables = options.get('feature_variables', {})
        target_variable = options.get('target_variable', {})

        if len(feature_variables) == 0:
            raise RuntimeError('You must supply one or more fields')

        if len(target_variable) > 0:
            raise RuntimeError('CorrelationMatrix does not support the from clause')

        valid_methods = ['spearman', 'kendall', 'pearson']

        # Check to see if parameters exist
        params = options.get('params', {})

        # Check if method is in parameters in search
        if 'method' in params:
            if params['method'] not in valid_methods:
                error_msg = 'Invalid value for method: must be one of {}'.format(
                    ', '.join(valid_methods))
                raise RuntimeError(error_msg)

            # Assign method to self for later usage
            self.method = params['method']

        # Assign default method and ensure no other parameters are present
        else:
            # Default method for correlation
            self.method = 'pearson'

            # Check for bad parameters
            if len(params) > 0:
                raise RuntimeError('The only valid parameter is method.')

    def fit(self, df, options):
        """Compute the correlations and return a DataFrame."""

        # df contains all the search results, including hidden fields,
        # but the requested fields are saved as self.feature_variables
        requested_columns = df[self.feature_variables]

        # Get correlations
        correlations = requested_columns.corr(method=self.method)

        # Reset index so that all the data are in columns
        # (this is necessary for the corr method)
        output_df = correlations.reset_index()

        return output_df

Example search

[Image: Search correlation.png]

You might have to reorder your fields with the fields or table command.

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 4.4.0, 4.4.1, 4.4.2, 5.0.0