Extend the Splunk App for Data Science and Deep Learning with custom notebooks
The Splunk App for Data Science and Deep Learning (DSDL) allows you to define custom notebooks for specialized machine learning or deep learning tasks. By writing your own Jupyter notebooks, you can incorporate custom algorithms, advanced Python libraries, and domain-specific logic, and pull data from the Splunk platform into the same environment.
Learn how to create, export, and maintain notebooks so that they integrate seamlessly with the ML-SPL commands `fit`, `apply`, and `summary`.
Overview
When you develop a notebook in DSDL you can perform the following tasks:
- Write Python code for data preprocessing, model training, or inference.
- Expose that code to ML-SPL by defining functions such as `fit` and `apply` within special notebook cells.
- Automatically export the code into a Python module at runtime.
- Call those functions from Splunk platform searches as shown in this example:
```
| fit MLTKContainer algo=my_notebook ... into app:MyModel | apply my_notebook
```
- Pull data directly from the Splunk platform using the Splunk Search API integration, allowing for more interactive data exploration within your Jupyter notebook environment.
Your custom code operates in the external container environment while staying fully integrated with Splunk platform search processing.
DSDL notebook components
A DSDL notebook typically includes the following components:
Component | Description |
---|---|
Imports and setup | Import common libraries such as NumPy, Pandas, and PyTorch. Can define global constants or utility functions. |
`fit` function | A Python function that trains or fits your model. Accepts data as a Pandas DataFrame and hyperparameters, and returns model artifacts. |
`apply` function (optional) | Used for inference or prediction. Accepts new data and the trained model, and returns predictions. |
`summary` function (optional) | Provides metadata about the model such as hyperparameters or training stats. |
Other utility functions (optional) | Data cleaning, advanced transforms, or direct data pulls using the Splunk Search API. |
When you save a notebook, DSDL automatically generates a corresponding .py file in the `/srv/notebooks/app/` directory or a similar directory. The .py file uses the same base name as the notebook, for example `my_notebook.py`. Once saved, the `fit`, `apply`, and `summary` functions become callable from ML-SPL.
The following is an example notebook composed of the different components:
```python
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py
#     notebook_metadata_filter: all
# ---

import json
import os
import numpy as np
import pandas as pd

MODEL_DIRECTORY = "/srv/app/model/data/"

# Pull data interactively from the Splunk platform (requires Splunk access
# to be configured on the DSDL Setup page).
from dsdlsupport import SplunkSearch as SplunkSearch
search = SplunkSearch.SplunkSearch()
df = search.as_df()
df

# Placeholder parameters for interactive runs. At search time, ML-SPL supplies
# param; during development you can also load it with the staging helper
# described later in this topic.
param = {}

def init(df, param):
    # Initialize the model structure and any hyperparameters.
    model = {}
    model['hyperparameter'] = 42.0
    return model

model = init(df, param)
print(model)

def fit(model, df, param):
    # Train the model and return information about the training run.
    info = {"message": "model trained"}
    return info

print(fit(model, df, param))

def apply(model, df, param):
    # Run inference and return the results as a DataFrame.
    y_hat = df.index
    result = pd.DataFrame(y_hat, columns=['index'])
    return result

print(apply(model, df, param))

def summary(model=None):
    # Report library versions and other model metadata.
    returns = {"version": {"numpy": np.__version__, "pandas": pd.__version__}}
    return returns
```
Notebook-to-module mechanism
DSDL runs the following internal mechanism that scans the notebook for functions named `fit`, `apply`, and `summary`:
- Autosave trigger: Each time you save the notebook in JupyterLab, a conversion step occurs.
- Export Python: Relevant Python cells, such as the cell containing the `fit` function, are written into a .py module, for example `/srv/notebooks/app/<notebook_name>.py`.
- ML-SPL lookup: The `MLTKContainer` command dynamically imports `<notebook_name>` at runtime to invoke the `fit`, `apply`, and `summary` functions. A simplified sketch of this import step follows at the end of this section.
You can help ensure this internal mechanism runs well in the following ways:
- Avoid function name collisions, for example two separate `fit` definitions in the same notebook.
- If you rename your notebook file, a new .py module is created. Older references might linger unless you remove them.
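The dynamic import step in the ML-SPL lookup is conceptually similar to the following simplified sketch. This is an illustration only, not the actual DSDL implementation; the `run_fit` helper and its arguments are hypothetical.
```python
import importlib

def run_fit(notebook_name, df, param):
    # Hypothetical illustration: import /srv/notebooks/app/<notebook_name>.py
    # (assumed to be on sys.path) and invoke its exported functions.
    module = importlib.import_module(notebook_name)
    model = module.init(df, param)        # optional initialization step
    info = module.fit(model, df, param)   # must be defined in the notebook
    return model, info
```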
Defining and passing parameters
Document your notebook's expected parameters so users know which SPL arguments to provide. Use sensible defaults to avoid a Python KeyError when a parameter is missing from `param`.
In the following example, all ML-SPL arguments after `algo=<my_notebook>` are passed to your notebook's Python code as the `param` dictionary:
```
| fit MLTKContainer algo=my_notebook alpha=0.01 epochs=10 ...
```
```python
def fit(df, param):
    alpha = float(param.get('alpha', 0.001))
    epochs = int(param.get('epochs', 10))
    ...
```
Use `param.get('key', default_value)` to handle optional arguments.
Using iterative development
You can use `mode=stage` for iterative development. Complete the following steps:
- Data staging: If you only want to push a subset of Splunk platform data to your notebook without training, use the following syntax:
```
| fit MLTKContainer mode=stage algo=my_notebook features_* into app:MyDevModel
```
This sends .csv data to the container but does not run the notebook's `fit` function.
- You can then open JupyterLab, and define or call a helper function as follows:
```python
def stage(name):
    # Load the staged data and parameters that ML-SPL wrote to the container.
    with open("data/" + name + ".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/" + name + ".json", 'r') as f:
        param = json.load(f)
    return df, param

df, param = stage("MyDevModel")
```
- Debugging: Open `my_notebook.ipynb` in JupyterLab to test or modify code using the staged data.
- Manually call your `init`, `fit`, or `apply` functions on that data to debug as needed, as shown in the sketch after these steps.
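For example, after staging data you can exercise the exported functions interactively in a notebook cell. The function names below follow the barebone example shown earlier; substitute your own.
```python
# Interactive debugging against staged data.
df, param = stage("MyDevModel")

model = init(df, param)
print(fit(model, df, param))
print(apply(model, df, param).head())
```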
Pulling data directly using the Splunk Search API
In addition to staging data with `mode=stage`, you can pull data directly using the Splunk Search API.
Complete the following steps:
- Turn on access to the Splunk platform on the DSDL Setup page. Provide your Splunk host, port 8089, and a valid token.
- Import `SplunkSearch` in your notebook, then either use an interactive search widget or define a predefined query.
- Execute the query to retrieve data into a Pandas DataFrame right inside your notebook. Example:
```python
from dsdlsupport import SplunkSearch
search = SplunkSearch.SplunkSearch()
df = search.as_df()
df
```
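If you prefer a predefined query over the interactive widget, you can pass a search string when constructing the helper. The `search` keyword argument shown here is an assumption based on common DSDL notebook examples; check the `SplunkSearch` class in your DSDL version for the exact signature.
```python
from dsdlsupport import SplunkSearch

# Assumed constructor argument: a predefined SPL query instead of the
# interactive search widget. Verify against your DSDL version.
search = SplunkSearch.SplunkSearch(
    search='| makeresults count=10 | streamstats count as i')
df = search.as_df()
df.head()
```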
If you encounter connectivity issues, confirm firewall rules or check the `_internal` logs for `mltk-container` errors referencing timeouts.
Version control and collaboration
Review the following methods for version control and collaboration with custom models:
- Git integration: Store notebooks in a Git repository, allowing for merges, pull requests, and versioning.
- Notebook storage: By default, notebooks are stored in `/srv/notebooks/`. You can organize them by project or by team.
- Autosave: Jupyter saves automatically, but consider manually committing .ipynb and .py changes to Git for auditing.
Advanced notebook patterns
Review the following advanced notebook patterns you can use with custom models:
Notebook pattern | Description |
---|---|
Multiple models per notebook | You can define multiple training algorithms in a single .ipynb file, but only one `fit` function is recognized. If you want to differentiate between them, parse extra arguments in `param` or create separate notebooks for clarity. |
Additional utility functions | You can define custom data preparation, feature engineering, or advanced plotting in separate Python cells. As long as they're not named `fit`, `apply`, or `summary`, they won't be exposed to ML-SPL. |
Auto-generating additional metrics | You can log metrics or epoch-by-epoch logs to the Splunk platform, for example by writing them to a .csv file that is forwarded, or by sending them to HEC in real time. |
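For example, a single `fit` function can branch on an extra argument to differentiate training paths. The `algo_variant` parameter name below is hypothetical; pass it from SPL like any other argument.
```python
def fit(model, df, param):
    # Hypothetical "algo_variant" argument selects the training path, for
    # example: | fit MLTKContainer algo=my_notebook algo_variant=advanced ...
    variant = param.get('algo_variant', 'baseline')
    if variant == 'baseline':
        model['type'] = 'baseline'
        model['means'] = df.select_dtypes('number').mean().to_dict()
    else:
        model['type'] = 'advanced'
        # ... train a more complex model here ...
    return {"message": "model trained", "variant": variant}
```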
Custom notebook best practices
Consider the following when creating custom notebooks:
- Function naming: DSDL only recognizes exact function names. Be mindful of typos when defining `init`, `fit`, `apply`, and `summary`.
- Parameter values: All parameter values from ML-SPL arrive as strings. Convert them to `int`, `float`, or `bool` as needed, as shown in the sketch after this list.
- Large libraries: If your container image lacks a required library, you get an `ImportError`. Add large libraries through Docker.
- Notebook naming: Use unique .ipynb filenames to help avoid conflicts or overwriting in `/srv/notebooks/app/`.
- Connectivity: If you rely on Splunk Search, ensure the container can reach the Splunk platform. Check firewall, DNS, and TLS settings.
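The following sketch shows explicit parameter casts; the `epochs`, `alpha`, and `shuffle` parameter names are illustrative only.
```python
def fit(model, df, param):
    # ML-SPL passes every parameter value as a string, so cast explicitly.
    epochs = int(param.get('epochs', '10'))
    alpha = float(param.get('alpha', '0.001'))
    # Booleans arrive as strings too; compare the lowercased value.
    shuffle = str(param.get('shuffle', 'true')).lower() in ('1', 'true', 'yes')
    return {"epochs": epochs, "alpha": alpha, "shuffle": shuffle}
```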
Example: Creating a custom notebook
The following is an example workflow of creating a custom notebook:
- Open JupyterLab: Start a dev container in DSDL, then open JupyterLab.
- Create a notebook: Save it as `my_custom_algo.ipynb` in JupyterLab.
- Define code: Write cells for `init`, `fit`, `apply`, and `summary`, optionally using the Splunk Search API. A sketch of these functions follows after these steps.
- Pull data: Use `df, param = stage("MyTestModel")` or use Splunk Search. Test your logic interactively.
- Save: DSDL exports your code to `my_custom_algo.py`.
- Train in Splunk:
```
index=my_data | fit MLTKContainer algo=my_custom_algo features_* into app:MyProdModel
```
- Apply:
```
index=my_data | apply my_custom_algo
```
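For reference, the following is a minimal sketch of what the exported functions in `my_custom_algo.py` could look like. It assumes scikit-learn is available in the container image and that the search supplies `features_*` fields plus a numeric `target` field; adjust the names to your own data.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def init(df, param):
    # Create an untrained estimator; hyperparameters could be read from param.
    return {"estimator": LinearRegression(), "features": []}

def fit(model, df, param):
    # Assumes feature columns named features_* and a numeric "target" column.
    model["features"] = [c for c in df.columns if c.startswith("features_")]
    model["estimator"].fit(df[model["features"]], df["target"])
    return {"message": "model trained", "n_samples": len(df)}

def apply(model, df, param):
    y_hat = model["estimator"].predict(df[model["features"]])
    return pd.DataFrame(y_hat, columns=["prediction"])

def summary(model=None):
    return {"algorithm": "LinearRegression"}
```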