Container management and scaling
The Splunk App for Data Science and Deep Learning (DSDL) uses external containers to offload resource-heavy machine learning tasks from the Splunk search head. This architecture isolates potentially large workloads and enables horizontal scaling, GPU acceleration, and robust environment management.
Review the following guidelines to manage container lifecycles, scale concurrency, and optimize resource usage when running DSDL in Docker, Kubernetes, or OpenShift.
Overview
When you run DSDL commands such as `| fit MLTKContainer ...` or `| apply ...`, DSDL communicates with an external container platform: either Docker on a single host, or a cluster orchestrated by Kubernetes or OpenShift.
By default, development (DEV) mode containers include JupyterLab, TensorBoard, and other developer tools, while production (PROD) mode containers are minimal, running only the Python processes required for model training and inference.
Understanding how these containers are launched, monitored, and scaled can help with efficient resource usage and robust enterprise-grade workflows.
Container lifecycle
Review the following descriptions of a container lifecycle:
| Lifecycle stage | Description |
|---|---|
| Launch triggers | **User-initiated:** When a user runs an ML-SPL command in the Splunk platform, for example `| fit MLTKContainer ...`, DSDL checks whether a container or pod is already running with the correct configuration. If not, a new one is launched. **Scheduled searches:** If a scheduled search includes the `fit` or `apply` commands, DSDL might spin up containers in the background at regular intervals. **Manual start (optional):** In DSDL, under Configuration and then Containers, you can manually start or stop containers. This can be helpful for development tasks or when you need to use Jupyter. |
| Initialization | DSDL calls the Docker, Kubernetes, or OpenShift API to create a container or pod based on your chosen image, for example the golden-cpu image. Container environment variables are set in the DSDL configuration. A sketch of such a deployment follows this table. |
| Active and running | Once started, the container is available to handle `fit` or `apply` commands. In DEV mode, JupyterLab or TensorBoard endpoints can be accessed through mapped ports, Routes, or NodePorts. |
| Idle or stopped | After a period of inactivity, and depending on your setup, DSDL might automatically stop the container to free resources. |
| Removal and cleanup | Older or crashed containers are removed from the system. On Kubernetes or OpenShift, pods are ephemeral by design and are typically cleaned up automatically. |
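To make the initialization stage concrete, the following is a minimal sketch of what a DEV container deployment can look like on Kubernetes. The deployment name `dsdl-dev`, the image name, and the port numbers are illustrative assumptions; DSDL generates the actual specification from your configuration.

```yaml
# Minimal sketch of a Deployment for a DSDL development container.
# The name, labels, and image are assumptions for illustration;
# DSDL generates the real specification from its configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dsdl-dev              # matches the HPA target used later in this topic
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dsdl-dev
  template:
    metadata:
      labels:
        app: dsdl-dev
    spec:
      containers:
        - name: mltk-container
          image: splunk/mltk-container-golden-image-cpu:latest  # example image name, verify against your registry
          ports:
            - containerPort: 8888   # JupyterLab (DEV mode)
            - containerPort: 6006   # TensorBoard (DEV mode)
```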
About Docker, Kubernetes, and OpenShift
Review the following differences between Docker, Kubernetes, and OpenShift containers in DSDL:
| Container platform | Typical use case | Network setup | Scaling | Management |
|---|---|---|---|---|
| Single-host Docker | Smaller dev/test environments, or a single Splunk instance with Docker on the same machine. | By default, the Docker API might be reached on `unix://var/run/docker.sock` or `tcp://localhost:2375`, without TLS. | Usually limited to one container per host unless you manually script multiple Docker hosts. | DSDL starts and stops the container through the Docker API. Development containers might keep running until manually stopped. |
| Kubernetes or OpenShift | Production-scale or distributed environments. | TLS/HTTPS from the Splunk platform to the Kubernetes or OpenShift API server on port 6443. | You can define multiple replicas, or rely on the Kubernetes Horizontal Pod Autoscaler (HPA) for concurrency. | DSDL translates container requests into pod deployments. You can configure resource requests, node labeling (GPU), and advanced security contexts. |
Concurrency and scaling patterns
Review the following descriptions of concurrency and scaling patterns in DSDL:
| Pattern | Description |
|---|---|
| Horizontal Pod Autoscaler (HPA) on Kubernetes or OpenShift | HPA can automatically scale pods based on CPU or memory usage. For DSDL, you might define an HPA that spawns additional pods when usage surpasses certain thresholds. |
| Docker Compose or scripts | If you're using single-host Docker, scaling typically involves manually launching multiple containers or writing scripts to do so. DSDL won't automatically create multiple containers on Docker unless you handle it outside of the Splunk platform. |
| GPU scheduling | For GPU-based containers in Kubernetes or OpenShift, assign a label or request `nvidia.com/gpu: 1` so that the container lands on GPU-enabled nodes. In Docker, ensure the NVIDIA container runtime is configured, for example `runtime=nvidia`. See the sketch after this table. |
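To illustrate GPU scheduling on Kubernetes or OpenShift, the following is a minimal sketch of a pod specification that requests one GPU. It assumes an NVIDIA device plugin is installed on the cluster; the pod name, image, and node label are illustrative.

```yaml
# Sketch of a pod spec that requests one GPU.
# Assumes the NVIDIA device plugin is installed on the cluster;
# the name, image, and node label are placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-gpu-example
spec:
  containers:
    - name: mltk-container
      image: splunk/mltk-container-golden-image-gpu:latest  # example image name
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resources are requested under limits
  nodeSelector:
    accelerator: nvidia       # example node label; match your cluster's labeling scheme
```

With a request like this in place, Kubernetes schedules the pod only onto nodes that advertise an available `nvidia.com/gpu` resource.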
About development and production containers
Review the following descriptions of development (DEV) and production (PROD) containers:
| Container type | Description |
|---|---|
| DEV containers | Include JupyterLab for interactive notebook development. TensorBoard or other development tooling might be exposed on additional ports, such as 8888 for Jupyter or 6006 for TensorBoard. A Docker Compose sketch of such a container follows this table. |
| PROD containers | Minimal setup, containing only the Python environment necessary for model training and inference. No Jupyter or development ports are exposed, which reduces the attack surface and resource overhead. |
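If you manage a DEV container yourself on single-host Docker, a Docker Compose definition along the following lines can expose the development ports. The service name, image, and port mappings are illustrative assumptions; DSDL normally manages these containers through the Docker API.

```yaml
# Sketch of a Docker Compose service for a DSDL DEV container.
# The service name, image, and host ports are assumptions for illustration.
services:
  dsdl-dev:
    image: splunk/mltk-container-golden-image-cpu:latest  # example image name
    ports:
      - "8888:8888"   # JupyterLab
      - "6006:6006"   # TensorBoard
    restart: unless-stopped
```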
DSDL operations overview
Review this overview of the DSDL operations you can access within the DSDL user interface (UI):
| UI component | Description |
|---|---|
| Container dashboard | In DSDL, go to Configuration and then Containers. You can see currently running containers or pods, start or stop them, or open Jupyter or TensorBoard when in DEV mode. |
| Logs and diagnostics | Splunk `_internal` logs containing "mltk-container" can reveal container startup errors, such as network timeouts or Docker or Kubernetes API rejections. Container logs can be forwarded to Splunk, letting you see Python exceptions from `fit` or `apply`. |
| Cleanup | DSDL typically stops idle containers after a certain timeout, which is configurable in the DSDL app. In Kubernetes and OpenShift, old pods might remain in a "Completed" or "Terminated" state, but cluster housekeeping eventually prunes them. |
Resource allocation and scheduling
Review the following options for resource allocation and scheduling in DSDL:
| Option | Description |
|---|---|
| CPU and memory requests | In Kubernetes, define requests and limits in your pod or deployment specification. This ensures the container won't exceed certain CPU and memory usage. On Docker, you can constrain resources with options such as `--memory` and `--cpus`. A sketch follows this table. |
| GPU resources | On Kubernetes or OpenShift, you must configure GPU node drivers or device plugins, such as NVIDIA, so that pods requesting `nvidia.com/gpu: 1` schedule properly. On Docker, use the NVIDIA container runtime, for example `runtime=nvidia`. |
| Production best practices | Keep an eye on ephemeral storage usage. Large data staging or logs can fill ephemeral volumes. Monitor container logs for OOM killer events, which indicate insufficient memory limits. |
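The following is a minimal sketch of the resources portion of a pod or deployment specification. The container name, image, and values are illustrative assumptions; size them according to your machine learning workloads.

```yaml
# Sketch of the resources section of a pod spec.
# Names, image, and values are assumptions for illustration.
spec:
  containers:
    - name: mltk-container
      image: splunk/mltk-container-golden-image-cpu:latest  # example image
      resources:
        requests:
          cpu: "1"        # guaranteed CPU for scheduling
          memory: 2Gi     # guaranteed memory for scheduling
        limits:
          cpu: "2"        # hard CPU ceiling
          memory: 4Gi     # exceeding this can trigger the OOM killer
```

Setting the memory limit too low for your training data can trigger the OOM killer events described in the table.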
Typical container management and scaling use cases
See the following common use cases for container management and scaling:
| Use case | Description |
|---|---|
| Multiple development (DEV) containers | Each data scientist spawns a personal DEV container with Jupyter. They eventually merge and store code in Git. |
| One production container in Docker | Single-host environment with moderate concurrency. The Splunk platform calls one container to handle model training and inference sequentially. |
| Kubernetes high-performance computing (HPC) | Large-scale HPC environment with multiple GPU nodes. Kubernetes auto-scales pods so that concurrent machine learning tasks each get their own container. |
| OpenShift Container Platform | Similar HPC environment, but integrated with Red Hat's security and operator frameworks. |
Troubleshooting container management and scaling
See the following issues you might experience and how to resolve them:
| Issue | Resolution |
|---|---|
| Container fails to start | 1. Check `_internal` logs for mltk-container messages about Docker or Kubernetes API errors or timeouts. 2. Ensure the Docker REST socket or Kubernetes API server is reachable. |
| Development (DEV) container times out | If DEV containers auto-stop, adjust the idle timeout or manually keep them active in the DSDL UI. |
| Resource exhaustion | If logs indicate out-of-memory (OOM) kills or CPU throttling, raise the memory or CPU requests in Kubernetes, or refine Docker resource constraints. |
| GPU not recognized | Confirm that you have the correct GPU drivers, device plugins, or `runtime=nvidia` if using Docker. Check container logs, or run a tool such as `nvidia-smi` inside the container, to verify that the GPU is visible. |
Example: Kubernetes multi-pod setup
The following steps are for an example multi-pod setup in Kubernetes:
1. Configure DSDL: In DSDL, go to Setup, then connect to your Kubernetes cluster.
2. Define a Deployment: DSDL automatically creates a deployment for DEV or PROD containers. You can edit the resource specs or add a Horizontal Pod Autoscaler (HPA) as follows:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dsdl-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dsdl-dev
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

3. Use DSDL: The Splunk platform calls the `fit` or `apply` commands, and DSDL spawns pods. If CPU usage is high, the HPA scales up.
4. Observe container states: In the Kubernetes dashboard or on the DSDL Containers page, you can see how many pods are active.
This documentation applies to the following versions of Splunk® App for Data Science and Deep Learning: 5.2.1