Splunk® Machine Learning Toolkit

ML-SPL API Guide

This documentation does not apply to the most recent version of Splunk® Machine Learning Toolkit. For documentation on the most recent version, go to the latest release.

Saving models

Codecs in the Splunk Machine Learning Toolkit

The Splunk Machine Learning Toolkit uses codecs to serialize (save or encode) and deserialize (load or decode) algorithm models. A codec handles the core step of that process: converting a Python object in memory to a form that can be written to a file, and back again.

The Splunk Machine Learning Toolkit does not use pickles to serialize objects in Python. Instead, it uses a string representation of __dict__, or uses __getstate__ and __setstate__, to save and recreate objects. Python objects are converted to JSON objects, saved into CSV files, and then used as lookups within Splunk Enterprise.
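
The idea of saving an object through its __dict__ instead of a pickle can be illustrated with plain Python. The following is a minimal sketch using only the standard library. The Point class and the encode_simple() and decode_simple() helpers are hypothetical names chosen for illustration and are not part of the toolkit.

```python
import json


class Point(object):
    """Hypothetical stand-in for a model object with simple attributes."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def encode_simple(obj):
    """Serialize an object to a JSON string built from its __dict__."""
    payload = {
        'type': [type(obj).__module__, type(obj).__name__],
        'dict': obj.__dict__,
    }
    return json.dumps(payload, sort_keys=True)


def decode_simple(text, cls):
    """Recreate an instance of cls without calling __init__, then restore its attributes."""
    payload = json.loads(text)
    new_obj = cls.__new__(cls)
    new_obj.__dict__.update(payload['dict'])
    return new_obj


original = Point(1.5, -2)
restored = decode_simple(encode_simple(original), Point)
print(restored.x, restored.y)
```

This only works when every value in __dict__ is something JSON can represent, which is why objects with more complex attributes need dedicated codecs.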

To save the model of an algorithm, the algorithm must implement the register_codecs() method. This method is invoked when algorithm.save_model() is called. The save_model() call then uses the following process to find the right codec for your algorithm class:

[Figure: decision flow diagram for selecting a codec when saving a model]

The Splunk Machine Learning Toolkit ships with built-in codecs. This documentation shows some examples of how to use them to implement the register_codecs() method in your custom algorithm.

Built-in codecs

Pre-registered classes

The following classes are always loaded into the codec manager, so there is no need to explicitly define objects of these classes in register_codecs().

__builtin__.object
__builtin__.slice
__builtin__.set
__builtin__.type
numpy.ndarray
numpy.int8
numpy.int16
numpy.int32
numpy.int64
numpy.uint8
numpy.uint16
numpy.uint32
numpy.uint64
numpy.float16
numpy.float32
numpy.float64
numpy.float128
numpy.complex64
numpy.complex128
numpy.complex256
numpy.dtype
pandas.core.frame.DataFrame
pandas.core.index.Index
pandas.core.index.Int64Index
pandas.core.internals.BlockManager

The list of pre-registered codecs can be found in $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/bin/codec/codecs.py.
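
The add_codec(module, name, codec) calls used in this topic suggest a registry keyed by module path and class name. The following toy registry is a sketch of that idea only. It assumes nothing about the toolkit's real internals, and ToyCodecManager, get_codec(), and SetCodec are hypothetical names.

```python
class ToyCodecManager(object):
    """Hypothetical registry mapping (module, class name) to a codec."""

    def __init__(self):
        self._codecs = {}

    def add_codec(self, module_name, class_name, codec):
        self._codecs[(module_name, class_name)] = codec

    def get_codec(self, obj):
        key = (type(obj).__module__, type(obj).__name__)
        try:
            return self._codecs[key]
        except KeyError:
            raise TypeError('No codec registered for %s.%s' % key)


class SetCodec(object):
    """Toy codec: JSON has no set type, so store a sorted list instead."""

    @classmethod
    def encode(cls, obj):
        return {'items': sorted(obj)}

    @classmethod
    def decode(cls, data):
        return set(data['items'])


manager = ToyCodecManager()
# set.__module__ is '__builtin__' on Python 2 and 'builtins' on Python 3.
manager.add_codec(set.__module__, 'set', SetCodec)

codec = manager.get_codec({3, 1, 2})
encoded = codec.encode({3, 1, 2})
decoded = codec.decode(encoded)
```

Pre-registering a class amounts to making entries like this one before any algorithm runs, so that register_codecs() only has to add what is missing.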

SimpleObjectCodec

SimpleObjectCodec can be used for any object that can be represented as a dictionary or a list.

We can show this with the Support Vector Regressor example on importing a custom algorithm.

In this custom algorithm, the codecs have already been configured:

@staticmethod
def register_codecs():
    from codec.codecs import SimpleObjectCodec
    from codec import codecs_manager
    codecs_manager.add_codec('algos.SVR', 'SVR', SimpleObjectCodec)
    codecs_manager.add_codec('sklearn.svm.classes', 'SVR', SimpleObjectCodec)

You need codecs for both algos.SVR.SVR and sklearn.svm.classes.SVR. How do you know which codecs to use?

In most situations, you can use SimpleObjectCodec for the wrapper class (algos.SVR.SVR). For the SVR class imported from sklearn, you must verify that the fitted algorithm object has a proper __dict__. For this example, you can try the following in a Python terminal:

>>> from sklearn.svm import SVR
>>> classifier = SVR()
>>> X = [[1,2],[3,4]]
>>> y = [55, 66]
>>> classifier.fit(X, y)
>>> classifier.__dict__

which returns:

{'C': 1.0,
 '_dual_coef_': array([[-1.,  1.]]),
 '_gamma': 0.5,
 '_impl': 'epsilon_svr',
 '_intercept_': array([ 60.5]),
 '_sparse': False,
 'cache_size': 200,
 'class_weight': None,
 'class_weight_': array([], dtype=float64),
 'coef0': 0.0,
 'degree': 3,
 'dual_coef_': array([[-1.,  1.]]),
 'epsilon': 0.1,
 'fit_status_': 0,
 'gamma': 'auto',
 'intercept_': array([ 60.5]),
 'kernel': 'rbf',
 'max_iter': -1,
 'n_support_': array([         0, 1073741824], dtype=int32),
 'nu': 0.0,
 'probA_': array([], dtype=float64),
 'probB_': array([], dtype=float64),
 'probability': False,
 'random_state': None,
 'shape_fit_': (2, 2),
 'shrinking': True,
 'support_': array([0, 1], dtype=int32),
 'support_vectors_': array([[ 1.,  2.],
        [ 3.,  4.]]),
 'tol': 0.001,
 'verbose': False}

Every object or value in the returned __dict__ is either supported by json.JSONEncoder or is an instance of one of the pre-registered classes listed above.
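
Checking a large __dict__ by eye is error-prone, so a small helper can do the first pass. The sketch below uses only the standard library. The find_unencodable() function, its extra_supported parameter, and the FakeModel and FakeTree classes are hypothetical and exist only for this illustration.

```python
import json


def find_unencodable(obj, extra_supported=()):
    """Return {attribute name: type name} for values json cannot encode.

    extra_supported is a tuple of classes assumed to be handled elsewhere
    (for example, by pre-registered codecs).
    """
    problems = {}
    for name, value in vars(obj).items():
        if isinstance(value, extra_supported):
            continue
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            problems[name] = type(value).__name__
    return problems


class FakeTree(object):
    """Hypothetical stand-in for an embedded object without codec support."""


class FakeModel(object):
    """Hypothetical stand-in for a fitted estimator."""

    def __init__(self):
        self.C = 1.0
        self.kernel = 'rbf'
        self.shape_fit_ = (2, 2)
        self._tree = FakeTree()


report = find_unencodable(FakeModel())
print(report)  # {'_tree': 'FakeTree'}
```

With a real estimator you would likely pass the pre-registered classes, such as numpy.ndarray, in extra_supported so that only the attributes that truly lack a codec are reported.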

If one or more objects in __dict__ do not have built-in codec support, you can write a custom codec for them.

Write a custom codec

This example shows you how to write a custom codec for the KNeighborsClassifier algorithm. First, you can try to use SimpleObjectCodec:

KNClassifier.py
#!/usr/bin/env python
 
from sklearn.neighbors import KNeighborsClassifier
 
from codec import codecs_manager
from base import BaseAlgo, ClassifierMixin
from util.param_util import convert_params
 
 
class KNClassifier(ClassifierMixin, BaseAlgo):
 
    def __init__(self, options):
        self.handle_options(options)
        params = options.get('params', {})
        out_params = convert_params(
            params,
            ints=['k'],
            aliases={'k': 'n_neighbors'}
        )
        self.estimator = KNeighborsClassifier(**out_params)
 
    @staticmethod
    def register_codecs():
        from codec.codecs import SimpleObjectCodec
        codecs_manager.add_codec('algos.KNClassifier', 'KNClassifier', SimpleObjectCodec)
        codecs_manager.add_codec('sklearn.neighbors.classification', 'KNeighborsClassifier', SimpleObjectCodec)

In this case, SimpleObjectCodec is not sufficient. When you run the command ... | fit KNClassifier into my_model, it returns an error like this:

[Figure: error message reporting that the KDTree object is not serializable]

Investigate an object for a custom codec

The error message indicates that part of the model, sklearn.neighbors.kd_tree.KDTree, is not serializable. You can investigate the object in a Python terminal:

>>> from sklearn.datasets import load_iris
>>> from sklearn.neighbors import KNeighborsClassifier

>>> iris = load_iris()
>>> X = iris.data
>>> y = iris.target
>>> classifier = KNeighborsClassifier()

>>> classifier.fit(X, y)
>>> classifier.__dict__

which gives us back:

{'_fit_X': array([[ 5.1,  3.5,  1.4,  0.2],
    ...
        [ 5.9,  3. ,  5.1,  1.8]]),
 '_fit_method': 'kd_tree',
 '_tree': <sklearn.neighbors.kd_tree.KDTree at 0x7ffe07902500>,
 '_y': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    ...
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 'algorithm': 'auto',
 'classes_': array([0, 1, 2]),
 'effective_metric_': 'euclidean',
 'effective_metric_params_': {},
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': 1,
 'n_neighbors': 5,
 'outputs_2d_': False,
 'p': 2,
 'radius': None,
 'weights': 'uniform'}

In this case, '_tree': <sklearn.neighbors.kd_tree.KDTree at 0x7ffe07902500> is not an object SimpleObjectCodec can encode or decode.

You have two options to deal with this.


Option 1: Avoid writing the codec by limiting the algorithm choice

A simple and quick solution, and a way to avoid writing a custom codec, is to add a parameter to the estimator to avoid using a KDTree:

self.estimator = KNeighborsClassifier(algorithm='brute', **out_params)
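
One detail to watch with this approach: if out_params could already contain an algorithm key, passing algorithm='brute' together with **out_params raises a TypeError for the duplicate keyword argument. The sketch below shows one way to guard against that. The force_brute() helper is a hypothetical name, not part of the toolkit.

```python
def force_brute(out_params):
    """Return a copy of out_params with algorithm pinned to 'brute'."""
    params = dict(out_params)
    params['algorithm'] = 'brute'
    return params


user_params = {'n_neighbors': 3, 'algorithm': 'kd_tree'}
params = force_brute(user_params)
print(params)
```

The estimator could then be built with KNeighborsClassifier(**force_brute(out_params)), which overrides any user-supplied value instead of failing.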


Option 2: Write a Custom Codec

If you must use a KDTree, you can save its state and reconstruct it using a custom codec. In a Python terminal, run:

>>> kdtree_in_memory = classifier.__dict__['_tree']
>>> kdtree_in_memory.__getstate__()

which prints the state of "_tree" in classifier:

(array([[ 5.1,  3.5,  1.4,  0.2],
    ...
        [ 5.9,  3. ,  5.1,  1.8]]),
 array([  2,  13,  14,  16,  22,  35,  36,  38,  40,  41,  42,  49,  12,
    ...
        143, 144, 145, 107, 120, 102, 122]),
 array([(0, 150, 0, 10.29635857961444), (0, 75, 0, 3.5263295365010903),
        (75, 150, 0, 4.506106967216822), (0, 37, 1, 0.8774964387392121),
        (37, 75, 1, 3.0364452901377956), (75, 112, 1, 3.0401480227120525),
        (112, 150, 1, 2.874456470360963)],
       dtype=[('idx_start', '<i8'), ('idx_end', '<i8'), ('is_leaf', '<i8'), ('radius', '<f8')]),
 array([[[ 4.3,  2. ,  1. ,  0.1],
    ...
         [ 7.9,  3.8,  6.9,  2.5]]]),
 30,
 3,
 7,
 0,
 0,
 0,
 0,
 <sklearn.neighbors.dist_metrics.EuclideanDistance at 0x10d94d320>)

Most of the objects are numbers and arrays, which are covered by Python built-in and pre-registered codecs. At the end of the printed state, however, there is a second embedded object that is not supported by either:

<sklearn.neighbors.dist_metrics.EuclideanDistance at 0x10d94d320>

We can investigate the state of the embedded object in a Python terminal:

>>> dist_metric = kdtree_in_memory.__getstate__()[-1]
>>> dist_metric.__getstate__()

which returns:

(2.0, array([ 0.]), array(0.))
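
State tuples like these are what __setstate__() consumes when an object is rebuilt. The round trip can be tried on a toy class before touching the real one. In this sketch the Counter class is hypothetical and merely imitates an object that exposes its state as a tuple.

```python
class Counter(object):
    """Hypothetical class that exposes its state as a tuple."""

    def __init__(self, start, step):
        self.start = start
        self.step = step

    def __getstate__(self):
        return (self.start, self.step)

    def __setstate__(self, state):
        self.start, self.step = state


original = Counter(10, 2)
state = original.__getstate__()

# Create an empty instance without running __init__, then restore the state.
clone = Counter.__new__(Counter)
clone.__setstate__(state)
print(clone.start, clone.step)
```

Using __new__() skips __init__(), which matters for classes whose constructor requires arguments that are not available at load time.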

Custom codec implementation

All codecs must inherit from BaseCodec in bin/codec/codecs.py.

A custom codec based on BaseCodec must define two class methods: encode() and decode().

class KDTreeCodec(BaseCodec):
    @classmethod
    def encode(cls, obj):
        # Let's ensure the object is the one we think it is
        import sklearn.neighbors
        assert type(obj) == sklearn.neighbors.kd_tree.KDTree
 
        # Let's retrieve our state from our previous exploration
        state = obj.__getstate__()
 
        # Return a dictionary
        return {
            '__mlspl_type': [type(obj).__module__, type(obj).__name__],
            'state': state
        }
 
 
    @classmethod
    def decode(cls, obj):
        # Import the class we want to initialize
        from sklearn.neighbors.kd_tree import KDTree
 
 
        # Get our state from our saved obj
        state = obj['state']
 
        # Here is where we create the new object
        # doing whatever is required for this particular class
        t = KDTree.__new__(KDTree)
 
 
        # Set the state
        t.__setstate__(state)
 
 
        # And we're done!
        return t

Next, write a codec for sklearn.neighbors.dist_metrics.EuclideanDistance:

class EuclideanDistanceCodec(BaseCodec):
    @classmethod
    def encode(cls, obj):
        import sklearn.neighbors.dist_metrics
        assert type(obj) == sklearn.neighbors.dist_metrics.EuclideanDistance
 
        state = obj.__getstate__()
 
        return {
            '__mlspl_type': [type(obj).__module__, type(obj).__name__],
            'state': state
        }
 
 
    @classmethod
    def decode(cls, obj):
        import sklearn.neighbors.dist_metrics
 
        state = obj['state']
 
        d = sklearn.neighbors.dist_metrics.EuclideanDistance()
        d.__setstate__(state)
 
        return d

The last step is to make sure that all of the necessary codecs are registered in the register_codecs() method of the algorithm:

@staticmethod
def register_codecs():
    from codec.codecs import SimpleObjectCodec
    codecs_manager.add_codec('algos.KNClassifier', 'KNClassifier', SimpleObjectCodec)
    codecs_manager.add_codec('sklearn.neighbors.classification', 'KNeighborsClassifier', SimpleObjectCodec)
    codecs_manager.add_codec('sklearn.neighbors.kd_tree', 'KDTree', KDTreeCodec)
    codecs_manager.add_codec('sklearn.neighbors.dist_metrics', 'EuclideanDistance', EuclideanDistanceCodec)
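
Both inner codecs must be registered because encoding is recursive: the state returned for the tree contains the distance metric, which needs its own codec. The sketch below imitates that behavior with the standard json module. It is an assumption about the general mechanism, not the toolkit's actual implementation. Tree, Metric, DictCodec, REGISTRY, ToyEncoder, and toy_object_hook are hypothetical names, and decode() takes an extra target_cls argument as a simplification.

```python
import json


class Metric(object):
    """Hypothetical inner object (plays the role of the distance metric)."""

    def __init__(self, p):
        self.p = p


class Tree(object):
    """Hypothetical outer object (plays the role of the tree)."""

    def __init__(self, leaf_size, metric):
        self.leaf_size = leaf_size
        self.metric = metric


class DictCodec(object):
    """Toy codec that stores __dict__ plus a type tag."""

    @classmethod
    def encode(cls, obj):
        return {
            '__mlspl_type': [type(obj).__module__, type(obj).__name__],
            'dict': obj.__dict__,
        }

    @classmethod
    def decode(cls, data, target_cls):
        new_obj = target_cls.__new__(target_cls)
        new_obj.__dict__.update(data['dict'])
        return new_obj


# Registry keyed by (module, class name), mirroring the add_codec arguments.
REGISTRY = {
    (Metric.__module__, 'Metric'): (DictCodec, Metric),
    (Tree.__module__, 'Tree'): (DictCodec, Tree),
}


class ToyEncoder(json.JSONEncoder):
    def default(self, obj):
        # json calls this only for objects it cannot encode natively.
        key = (type(obj).__module__, type(obj).__name__)
        if key in REGISTRY:
            return REGISTRY[key][0].encode(obj)
        return json.JSONEncoder.default(self, obj)


def toy_object_hook(data):
    # json calls this for every decoded JSON object, innermost first.
    if '__mlspl_type' in data:
        codec, target_cls = REGISTRY[tuple(data['__mlspl_type'])]
        return codec.decode(data, target_cls)
    return data


text = json.dumps(Tree(30, Metric(2.0)), cls=ToyEncoder)
restored = json.loads(text, object_hook=toy_object_hook)
print(type(restored).__name__, type(restored.metric).__name__)
```

If the Metric entry were removed from REGISTRY, json.dumps() would raise a TypeError when it reached the inner object, which is analogous to the serialization error shown earlier for the KDTree.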

Complete example

KNClassifier.py
#!/usr/bin/env python
 
from sklearn.neighbors import KNeighborsClassifier
 
from codec import codecs_manager
from codec.codecs import BaseCodec
 
from base import BaseAlgo, ClassifierMixin
from util.param_util import convert_params
 
 
class KNClassifier(ClassifierMixin, BaseAlgo):
 
    def __init__(self, options):
        self.handle_options(options)
        params = options.get('params', {})
        out_params = convert_params(
            params,
            ints=['k'],
            strs=['algorithm'],
            aliases={'k': 'n_neighbors'}
        )
 
        if 'algorithm' in out_params:
            # 'kd_tree' is the value scikit-learn accepts for the KD tree algorithm
            if out_params['algorithm'] not in ['brute', 'kd_tree']:
                raise RuntimeError("algorithm must be either 'brute' or 'kd_tree'")
 
        self.estimator = KNeighborsClassifier(**out_params)
 
    @staticmethod
    def register_codecs():
        from codec.codecs import SimpleObjectCodec
        codecs_manager.add_codec('algos.KNClassifier', 'KNClassifier', SimpleObjectCodec)
        codecs_manager.add_codec('sklearn.neighbors.classification', 'KNeighborsClassifier', SimpleObjectCodec)
        codecs_manager.add_codec('sklearn.neighbors.kd_tree', 'KDTree', KDTreeCodec)
        codecs_manager.add_codec('sklearn.neighbors.dist_metrics', 'EuclideanDistance', EuclideanDistanceCodec)
 
 
class KDTreeCodec(BaseCodec):
    @classmethod
    def encode(cls, obj):
        import sklearn.neighbors
        assert type(obj) == sklearn.neighbors.kd_tree.KDTree
        state = obj.__getstate__()
        return {
            '__mlspl_type': [type(obj).__module__, type(obj).__name__],
            'state': state
        }
 
    @classmethod
    def decode(cls, obj):
        from sklearn.neighbors.kd_tree import KDTree
        state = obj['state']
        t = KDTree.__new__(KDTree)
        t.__setstate__(state)
        return t
 
 
class EuclideanDistanceCodec(BaseCodec):
    @classmethod
    def encode(cls, obj):
        import sklearn.neighbors.dist_metrics
        assert type(obj) == sklearn.neighbors.dist_metrics.EuclideanDistance
        state = obj.__getstate__()
        return {
            '__mlspl_type': [type(obj).__module__, type(obj).__name__],
            'state': state
        }
 
    @classmethod
    def decode(cls, obj):
        import sklearn.neighbors.dist_metrics
        state = obj['state']
        d = sklearn.neighbors.dist_metrics.EuclideanDistance()
        d.__setstate__(state)
        return d
Last modified on 03 July, 2019

This documentation applies to the following versions of Splunk® Machine Learning Toolkit: 2.3.0, 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0, 4.0.0, 4.1.0, 4.2.0, 4.3.0

