Model governance and security in the Splunk App for Data Science and Deep Learning
The Splunk App for Data Science and Deep Learning (DSDL) lets you train advanced machine learning models in containerized environments. However, enterprise-grade machine learning might require model governance, secure container management, and strict access controls to ensure that data, models, and container images meet compliance and operational standards.
Learn how to handle model governance and security in your DSDL environment and how DSDL automatically persists notebooks and models to protect against ephemeral container storage losses.
Overview
DSDL extends the Splunk Machine Learning Toolkit (MLTK) with container-based execution. While MLTK offers access controls for model artifacts, DSDL adds external container images and Jupyter-based notebooks, which introduce additional governance considerations.
DSDL supports the following model governance features:
- Model permissions in Splunk including global, app-level, and user-level.
- Automatic sync of notebooks and models onto the Splunk instance, preventing data loss in ephemeral container volumes or NFS shares.
- Container image security including private registries, image scanning, restricted GPU usage, plus custom TLS certificates.
- Data encryption and TLS between Splunk and container traffic.
- Model lifecycle including versioning, auditing, re-deploying, or rolling back models.
Model lifecycle and sharing
Review the following for steps in the lifecycle of a DSDL model.
Model creation and training
Run the following to create and train a new model:
| fit MLTKContainer algo=my_notebook ... into app:MyModel
DSDL spins up a container, executes the training code, and saves the model artifacts under app:MyModel. The model is stored in the container environment during training, but references to it appear in the Splunk platform.
Model permissions
The following permissions are available with your models:
Permission | Description |
---|---|
App context | By default, model names such as app:MyModel are recognized by DSDL. |
Sharing | Splunk knowledge object sharing can be set to User, App, or Global. |
User | Visible only to the model creator. |
App | Shared by users of the same Splunk app. |
Global | Visible across the Splunk platform and suitable for widely used HPC or production models. |
Model retraining or versioning
Re-run the following with new data or parameters. This overwrites old artifacts:
| fit MLTKContainer algo=my_notebook ... into app:MyModel
To keep a separate version, for example MyModel_v2, specify a new name in the into app: clause.
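For example, the following search retrains on new data but writes the artifacts to a separate model name, leaving the original untouched. The notebook name and the elided input fields are placeholders carried over from the earlier example:
| fit MLTKContainer algo=my_notebook ... into app:MyModel_v2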
Keep your ML-SPL searches and .ipynb notebook code in Git so that you can easily revert if the new version is suboptimal.
Automatic notebook to model sync
DSDL automatically persists your notebooks and model files onto the Splunk platform instance. This prevents data loss if ephemeral or NFS volumes go offline and lets new containers retrieve the same notebooks and models.
Automation relies on internal sync scripts such as SyncHandler that kill orphaned containers and reconcile model stanzas.
Container image security
Review the following options to secure your container images.
Private registry and air-gapped images
You can use a private Docker registry or an air-gapped approach. Push images built from golden-cpu, golden-gpu, or custom images to your internal registry. In DSDL, go to Setup and then Container Settings, and specify the private registry URL so that DSDL pulls from it.
If your environment has no internet access, use docker save and docker load or bulk_build.sh. Keep a separate Git or artifact repository with the Dockerfiles and pinned requirements.
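The following commands sketch both approaches, assuming a locally built golden image; the registry host and image tags are placeholders for your environment:
# Retag a locally built image and push it to your private registry
docker tag mltk-container-golden-image-cpu:5.2.0 registry.example.com/dsdl/mltk-container-golden-image-cpu:5.2.0
docker push registry.example.com/dsdl/mltk-container-golden-image-cpu:5.2.0
# Air-gapped alternative: export the image to a file, transfer it, and load it on the target host
docker save -o golden-image-cpu-5.2.0.tar mltk-container-golden-image-cpu:5.2.0
docker load -i golden-image-cpu-5.2.0.tar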
Image scanning and hardening
Use scripts from the splunk-mltk-container-docker repository or tools such as Trivy to detect known common vulnerabilities and exposures (CVEs). Remove unneeded packages to keep images minimal, and regularly patch OS-level vulnerabilities in the base images you use, such as Debian or Red Hat UBI.
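For example, a Trivy scan of a built image before you push it to your registry might look like the following; the image tag is a placeholder:
# Report only high and critical CVEs found in the image's OS packages and layers
trivy image --severity HIGH,CRITICAL registry.example.com/dsdl/mltk-container-golden-image-cpu:5.2.0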
GPU resource restrictions
In Kubernetes or OpenShift, define resource requests so only authorized machine learning tasks can claim GPUs.
In single-host Docker, pass --gpus or --runtime=nvidia to docker run to control GPU usage.
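As a sketch for single-host Docker, the following grants a container access to a single GPU only; the image tag is a placeholder. In Kubernetes, the equivalent is a resources.limits entry such as nvidia.com/gpu: 1 in the pod specification.
# Expose only GPU device 0 to the container; use --gpus all to expose every GPU
docker run --rm --gpus device=0 registry.example.com/dsdl/mltk-container-golden-image-gpu:5.2.0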
Embedding custom certificates for production HTTPS
In production, you will need trusted HTTPS on container endpoints. DSDL images can include your own TLS certificates instead of the default, self-signed certificates. The splunk-mltk-container-docker repo includes a certificates folder demonstrating how to embed custom certificates.
For development environments, a self-signed certificate can suffice. For production, you might want your organization's CA-signed certificate.
Follow these steps:
- Clone the repo:
git clone https://github.com/splunk/splunk-mltk-container-docker
- Place your certificates in the certificates directory, named dltk.key (private key) and dltk.pem (certificate).
- (Optional) Generate self-signed certificates for testing, or create your own CA-signed certificates using the same file names:
openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -keyout dltk.key -out dltk.pem -subj "/CN=<your-container-hostname>"
- Build your container image using the build scripts:
./build.sh golden-cpu-custom splunk/ 5.2.0
The Docker build automatically copies dltk.key and dltk.pem into /dltk/.jupyter/, which sets up the container to serve HTTPS endpoints with your certificate.
Make sure the file names remain dltk.key and dltk.pem or adapt the Dockerfile references so the container recognizes them. Only these exact filenames are used at runtime.
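After the build, you can confirm which certificate the container serves. The following checks are a sketch; the hostname and port are placeholders for your container endpoint:
# Inspect the certificate you placed in the certificates directory
openssl x509 -in dltk.pem -noout -subject -dates
# Check the certificate presented by a running container endpoint
openssl s_client -connect dsdl-container.example.com:8888 </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer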
Roles, capabilities, and container access
Review the following for information on roles and permissions in DSDL.
DSDL roles and capabilities
DSDL offers the following container-related capabilities:
- configure_mltk_container: Manage container settings (Observability tokens, cert configs).
- list_mltk_container: List containers on the container dashboard.
- control_mltk_container: Start/stop containers from DSDL UI.
Consider limiting configure_mltk_container to Splunk admins, granting control_mltk_container to data-science roles, and leaving list_mltk_container available for general use.
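As a sketch, you might map these capabilities to a custom role in authorize.conf; the role name and inherited role are illustrative assumptions:
# authorize.conf on the search head (hypothetical role)
[role_dsdl_data_scientist]
importRoles = user
list_mltk_container = enabled
control_mltk_container = enabled
# configure_mltk_container is deliberately omitted; reserve it for the admin role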
Model permissions
By default, only the model creator sees the model. For HPC or large production usage, set model sharing to "Global."
Secure HEC, Observability, and container endpoints
Use Splunk HEC tokens carefully if you log partial training data. If Observability is enabled, guard your Observability Access Token. If you want production-level TLS in the container, embed custom certificates as described earlier.
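For example, if a notebook sends training metrics back to the Splunk platform through the HTTP Event Collector (HEC), keep the token in an environment variable and send the events over HTTPS. The host, index, sourcetype, and token variable in this sketch are placeholders:
# Send a training metric event to HEC over HTTPS, validating the server certificate against your CA
curl --cacert /path/to/your-ca.pem "https://splunk.example.com:8088/services/collector/event" \
  -H "Authorization: Splunk ${HEC_TOKEN}" \
  -d '{"event": {"model": "MyModel", "epoch": 5, "loss": 0.12}, "sourcetype": "dsdl:training", "index": "ml_metrics"}'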
Auditing and traceability
Review the following options for model auditing and traceability.
Use _internal logs for model creation
Use the _internal index to help track who trained which model and when. When you run | fit ... into app:MyModel, logs appear in _internal, referencing information including container staging.
Example:
index=_internal "mltk-container" "into=app:MyModel"
Use model summary and metadata
Running | summary MyModel returns model information such as hyperparameters and creation time.
You can build a "model catalog" or store these events in a dedicated Splunk index for extended auditing.
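As a sketch, the following search writes a model's summary into a dedicated index for longer-term auditing; the index name is a placeholder and must exist before you run the search:
| summary MyModel | collect index=model_catalog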
Use notebook versioning in Git
DSDL automatically syncs notebooks to the Splunk platform, but you can also store .ipynb files in Git for collaboration and rollback.
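For example, you might track the notebook alongside the searches that train the model; the repository layout and tag name are illustrative:
# Commit and tag the notebook that produced a new model version
git add notebooks/my_notebook.ipynb
git commit -m "Retrain MyModel with new data"
git tag mymodel-v2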
TLS and data encryption
Review the following table for information on Transport Layer Security (TLS) and data encryption in model governance and security:
Option | Description |
---|---|
TLS from Splunk to container | Dev containers might use self-signed certificates. Production containers must have properly signed certificates. If you use single-host Docker, the container endpoints themselves handle TLS. For Kubernetes, an Ingress often handles TLS termination. |
GPU data in transit | Data from the Splunk platform is still subject to TLS encryption, even if the container uses GPUs. Ephemeral GPU usage does not affect encryption, but it does matter for ephemeral volumes, which is mitigated by the sync to the Splunk platform. |
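To confirm that the connection from the Splunk platform to the container uses a certificate you trust, you can test the endpoint directly; the hostname, port, and CA path are placeholders:
# Fail unless the endpoint presents a certificate signed by your CA, then print the HTTP status code
curl --cacert /path/to/your-ca.pem -o /dev/null -w "%{http_code}\n" https://dsdl-container.example.com:8888/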
Automatic notebook and model sync
Containers are ephemeral by default. If ephemeral volumes or NFS shares go down, you risk losing code or trained models. The DSDL internal "sync" scripts store notebooks and models on the Splunk instance. If containers vanish or fail, you can re-launch them and retrieve the same notebooks/models.
The "SyncHandler", plus related scripts, kill orphaned containers, reconcile stanzas with actual containers, and ensure ephemeral data is re-synced. This preserves your environment from data loss, letting you focus on the machine learning workflow, rather than container lifecycle details.
Governance and security guidelines
Review the following guidelines for model governance and security:
- Least privilege: Restrict advanced container management capabilities to admin or power users.
- Minimal images: Use minimal images, adding only the libraries you need.
- Notebook and model sync plus Git: Rely on DSDL's automatic sync to avoid ephemeral data loss, but store .ipynb files in Git for version control.
- Scan container images: Use Trivy or the built-in scripts from splunk-mltk-container-docker.
- Custom certificates: For production HTTPS in containers, place dltk.key and dltk.pem in certificates/. Use openssl req -x509 for a quick way to generate a self-signed pair for development. For real certificates, place them with the same file names so your container's Dockerfile picks them up.
- Observability: If Observability is toggled on in DSDL, container endpoints are auto-instrumented with OpenTelemetry (OTel). Confirm your endpoint, token, and service name.
Troubleshooting model governance and security
See the following table for issues you might experience and how to resolve them:
Issue | Cause | How to investigate |
---|---|---|
"model not found: MyModel" | Model is private or in a different app context. | Adjust sharing or confirm container logs. Possibly search _internal for "mltk-container" references to your model. |
HPC node can't pull image | Private registry or TLS error. | Recheck your Docker or Kubernetes credentials, or your images.conf references to the registry. |
Observability instrumentation not active on endpoints | Observability toggled off or invalid token in DSDL under Setup, and then Observability Settings. | Revisit the Setup page. The container might need a restart with the new configuration. |
Notebooks vanish after container restarts | Ephemeral volume wiped or NFS gone. | Automatic Splunk-side sync should restore them. Check _internal "mltk-container" for any sync errors. |
"Invalid certificate" on container endpoint | Using a self-signed or misnamed certificate, or the container lacks your official CA. | Place your real certificate in certificates/dltk.pem and dltk.key and rebuild the container. Review Docker logs for TLS load errors. |
This documentation applies to the following versions of Splunk® App for Data Science and Deep Learning: 5.2.1