Extend the Splunk App for Data Science and Deep Learning with custom notebooks
The Splunk App for Data Science and Deep Learning (DSDL) allows you to define custom notebooks for specialized machine learning or deep learning tasks. By writing your own Jupyter notebooks, you can incorporate custom algorithms, advanced Python libraries, and domain-specific logic, and pull data from the Splunk platform into the same environment.
Learn how to create, export, and maintain notebooks so that they integrate seamlessly with the ML-SPL commands `fit`, `apply`, and `summary`.
Overview
When you develop a notebook in DSDL you can perform the following tasks:
- Write Python code for data preprocessing, model training, or inference.
- Expose that code to ML-SPL by defining functions such as `fit` and `apply` within special notebook cells.
- Automatically export the code into a Python module at runtime.
- Call those functions from Splunk platform searches as shown in this example:
```
| fit MLTKContainer algo=my_notebook ... into app:MyModel | apply my_notebook
```
- Pull data directly from the Splunk platform using the Splunk Search API integration, allowing for more interactive data exploration within your Jupyter notebook environment.
Your custom code operates in the external container environment while staying fully integrated with Splunk platform search processing.
DSDL notebook components
A DSDL notebook typically includes the following components:
Component | Description |
---|---|
Imports and setup | Import common libraries such as NumPy, Pandas, and PyTorch. Can define global constants or utility functions. |
`fit` function | A Python function that trains or fits your model. Accepts data as a Pandas DataFrame and hyperparameters, and returns model artifacts. |
`apply` function (optional) | Used for inference or prediction. Accepts new data and the trained model, and returns predictions. |
`summary` function (optional) | Provides metadata about the model such as hyperparameters or training stats. |
Other utility functions (optional) | Data cleaning, advanced transforms, or direct data pulls using the Splunk Search API. |
When you save a notebook, DSDL automatically generates a corresponding .py file in the `/srv/notebooks/app/` directory or a similar directory. The .py file uses the same base name as the notebook, for example `my_notebook.py`. Once saved, the `fit`, `apply`, and `summary` functions become callable from ML-SPL.
The following is an example notebook composed of the different components:
```python
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py
#     notebook_metadata_filter: all
# ---

import json
import os
import numpy as np
import pandas as pd

MODEL_DIRECTORY = "/srv/app/model/data/"

# Pull data interactively from the Splunk platform (requires Splunk access
# to be configured on the DSDL Setup page).
from dsdlsupport import SplunkSearch as SplunkSearch
search = SplunkSearch.SplunkSearch()
df = search.as_df()
df

# Placeholder parameters for interactive runs. At search time, ML-SPL supplies
# param; during development you can also load it with the staging helper
# described later in this topic.
param = {}

def init(df, param):
    # Initialize the model structure and any hyperparameters.
    model = {}
    model['hyperparameter'] = 42.0
    return model

model = init(df, param)
print(model)

def fit(model, df, param):
    # Train the model and return information about the training run.
    info = {"message": "model trained"}
    return info

print(fit(model, df, param))

def apply(model, df, param):
    # Run inference and return the results as a DataFrame.
    y_hat = df.index
    result = pd.DataFrame(y_hat, columns=['index'])
    return result

print(apply(model, df, param))

def summary(model=None):
    # Report library versions and other model metadata.
    returns = {"version": {"numpy": np.__version__, "pandas": pd.__version__}}
    return returns
```
Notebook-to-module mechanism
DSDL runs the following internal mechanism that scans the notebook for functions named `fit`, `apply`, and `summary`:
- Autosave trigger: Each time you save the notebook in JupyterLab, a conversion step occurs.
- Export Python: Relevant Python cells, such as the cell containing the `fit` function, are written into a .py module, for example `/srv/notebooks/app/<notebook_name>.py`.
- ML-SPL lookup: The `MLTKContainer` command dynamically imports `<notebook_name>` at runtime to invoke the `fit`, `apply`, and `summary` functions. A simplified sketch of this import step follows at the end of this section.
You can help ensure this internal mechanism runs well in the following ways:
- Avoid function name collisions, for example two separate `fit` definitions in the same notebook.
- If you rename your notebook file, a new .py module is created. Older references might linger unless you remove them.
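The dynamic import step in the ML-SPL lookup is conceptually similar to the following simplified sketch. This is an illustration only, not the actual DSDL implementation; the `run_fit` helper and its arguments are hypothetical.
```python
import importlib

def run_fit(notebook_name, df, param):
    # Hypothetical illustration: import /srv/notebooks/app/<notebook_name>.py
    # (assumed to be on sys.path) and invoke its exported functions.
    module = importlib.import_module(notebook_name)
    model = module.init(df, param)        # optional initialization step
    info = module.fit(model, df, param)   # must be defined in the notebook
    return model, info
```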
Defining and passing parameters
Document your notebook's expected parameters so users know which SPL arguments to provide. Use sensible defaults to avoid a Python KeyError when a parameter is missing from `param`.
In the following example, all ML-SPL arguments after `algo=<my_notebook>` are passed to your notebook's Python code as the `param` dictionary:
```
| fit MLTKContainer algo=my_notebook alpha=0.01 epochs=10 ...
```
```python
def fit(df, param):
    alpha = float(param.get('alpha', 0.001))
    epochs = int(param.get('epochs', 10))
    ...
```
Use `param.get('key', default_value)` to handle optional arguments.
Using iterative development
You can use `mode=stage` for iterative development. Complete the following steps:
- Data staging: If you only want to push a subset of Splunk platform data to your notebook without training, use the following syntax:
```
| fit MLTKContainer mode=stage algo=my_notebook features_* into app:MyDevModel
```
This sends .csv data to the container but does not run the notebook's `fit` function.
- You can then open JupyterLab, and define or call a helper function as follows:
```python
def stage(name):
    # Load the staged data and parameters that ML-SPL wrote to the container.
    with open("data/" + name + ".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/" + name + ".json", 'r') as f:
        param = json.load(f)
    return df, param

df, param = stage("MyDevModel")
```
- Debugging: Open `my_notebook.ipynb` in JupyterLab to test or modify code using the staged data.
- Manually call your `init`, `fit`, or `apply` functions on that data to debug as needed, as shown in the sketch after these steps.
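For example, after staging data you can exercise the exported functions interactively in a notebook cell. The function names below follow the barebone example shown earlier; substitute your own.
```python
# Interactive debugging against staged data.
df, param = stage("MyDevModel")

model = init(df, param)
print(fit(model, df, param))
print(apply(model, df, param).head())
```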
Pulling data directly using the Splunk Search API
In addition to staging data with `mode=stage`, you can pull data directly using the Splunk Search API.
Complete the following steps:
- Turn on access to the Splunk platform on the DSDL Setup page. Provide your Splunk host, port 8089, and a valid token.
- Import `SplunkSearch` in your notebook, then either use an interactive search widget or define a predefined query.
- Execute the query to retrieve data into a Pandas DataFrame right inside your notebook. Example:
```python
from dsdlsupport import SplunkSearch
search = SplunkSearch.SplunkSearch()
df = search.as_df()
df
```
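If you prefer a predefined query over the interactive widget, you can pass a search string when constructing the helper. The `search` keyword argument shown here is an assumption based on common DSDL notebook examples; check the `SplunkSearch` class in your DSDL version for the exact signature.
```python
from dsdlsupport import SplunkSearch

# Assumed constructor argument: a predefined SPL query instead of the
# interactive search widget. Verify against your DSDL version.
search = SplunkSearch.SplunkSearch(
    search='| makeresults count=10 | streamstats count as i')
df = search.as_df()
df.head()
```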
If you encounter connectivity issues, confirm firewall rules or check the `_internal` logs for `mltk-container` errors referencing timeouts.
Version control and collaboration
Review the following methods for version control and collaboration with custom models:
- Git integration: Store notebooks in a Git repository, allowing for merges, pull requests, and versioning.
- Notebook storage: By default, notebooks are stored in `/srv/notebooks/`. You can organize them by project or by team.
- Autosave: Jupyter saves automatically, but consider manually committing .ipynb and .py changes to Git for auditing.
Advanced notebook patterns
Review the following advanced notebook patterns you can use with custom models:
Notebook pattern | Description |
---|---|
Multiple models per notebook | You can define multiple training algorithms in a single .ipynb file, but only one `fit` function is recognized. If you want to differentiate between them, parse extra arguments in `param` or create separate notebooks for clarity. |
Additional utility functions | You can define custom data preparation, feature engineering, or advanced plotting in separate Python cells. As long as they're not named `fit`, `apply`, or `summary`, they won't be exposed to ML-SPL. |
Auto-generating additional metrics | You can log metrics or epoch-by-epoch logs to the Splunk platform, for example by writing them to a .csv file that is forwarded, or by sending them to HEC in real time. |
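For example, a single `fit` function can branch on an extra argument to differentiate training paths. The `algo_variant` parameter name below is hypothetical; pass it from SPL like any other argument.
```python
def fit(model, df, param):
    # Hypothetical "algo_variant" argument selects the training path, for
    # example: | fit MLTKContainer algo=my_notebook algo_variant=advanced ...
    variant = param.get('algo_variant', 'baseline')
    if variant == 'baseline':
        model['type'] = 'baseline'
        model['means'] = df.select_dtypes('number').mean().to_dict()
    else:
        model['type'] = 'advanced'
        # ... train a more complex model here ...
    return {"message": "model trained", "variant": variant}
```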
Custom notebook best practices
Consider the following when creating custom notebooks:
- Function naming: DSDL only recognizes exact function names. Be mindful of typos when defining `init`, `fit`, `apply`, and `summary`.
- Parameter values: All parameter values from ML-SPL arrive as strings. Convert them to `int`, `float`, or `bool` as needed, as shown in the sketch after this list.
- Large libraries: If your container image lacks a required library, you get an `ImportError`. Add large libraries through Docker.
- Notebook naming: Use unique .ipynb filenames to help avoid conflicts or overwriting in `/srv/notebooks/app/`.
- Connectivity: If you rely on Splunk Search, ensure the container can reach the Splunk platform. Check firewall, DNS, and TLS settings.
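The following sketch shows explicit parameter casts; the `epochs`, `alpha`, and `shuffle` parameter names are illustrative only.
```python
def fit(model, df, param):
    # ML-SPL passes every parameter value as a string, so cast explicitly.
    epochs = int(param.get('epochs', '10'))
    alpha = float(param.get('alpha', '0.001'))
    # Booleans arrive as strings too; compare the lowercased value.
    shuffle = str(param.get('shuffle', 'true')).lower() in ('1', 'true', 'yes')
    return {"epochs": epochs, "alpha": alpha, "shuffle": shuffle}
```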
Example: Creating a custom notebook
The following is an example workflow of creating a custom notebook:
- Open JupyterLab: Start a dev container in DSDL, then open JupyterLab.
- Create a notebook: Save it as `my_custom_algo.ipynb` in JupyterLab.
- Define code: Write cells for `init`, `fit`, `apply`, and `summary`, optionally using the Splunk Search API. A sketch of these functions follows after these steps.
- Pull data: Use `df, param = stage("MyTestModel")` or use Splunk Search. Test your logic interactively.
- Save: DSDL exports your code to `my_custom_algo.py`.
- Train in Splunk:
```
index=my_data | fit MLTKContainer algo=my_custom_algo features_* into app:MyProdModel
```
- Apply:
```
index=my_data | apply my_custom_algo
```
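For reference, the following is a minimal sketch of what the exported functions in `my_custom_algo.py` could look like. It assumes scikit-learn is available in the container image and that the search supplies `features_*` fields plus a numeric `target` field; adjust the names to your own data.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def init(df, param):
    # Create an untrained estimator; hyperparameters could be read from param.
    return {"estimator": LinearRegression(), "features": []}

def fit(model, df, param):
    # Assumes feature columns named features_* and a numeric "target" column.
    model["features"] = [c for c in df.columns if c.startswith("features_")]
    model["estimator"].fit(df[model["features"]], df["target"])
    return {"message": "model trained", "n_samples": len(df)}

def apply(model, df, param):
    y_hat = model["estimator"].predict(df[model["features"]])
    return pd.DataFrame(y_hat, columns=["prediction"])

def summary(model=None):
    return {"algorithm": "LinearRegression"}
```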