Container management and scaling
The Splunk App for Data Science and Deep Learning (DSDL) uses external containers to offload resource-heavy machine learning tasks from the Splunk search head. This architecture isolates potentially large workloads and enables horizontal scaling, GPU acceleration, and robust environment management.
Review the following guidelines to manage container lifecycles, scale concurrency, and optimize resource usage when running DSDL in Docker, Kubernetes, or OpenShift.
Overview
When you run DSDL commands such as `| fit MLTKContainer ...` or `| apply ...`, DSDL communicates with an external container platform: either Docker on a single host, or a cluster orchestrated by Kubernetes or OpenShift.
By default, development (DEV) mode containers include JupyterLab, TensorBoard, and other developer tools, while production (PROD) mode containers are minimal, running only the Python processes required for model training and inference.
Understanding how these containers are launched, monitored, and scaled can help with efficient resource usage and robust enterprise-grade workflows.
Container lifecycle
Review the following descriptions of a container lifecycle:
| Lifecycle stage | Description |
|---|---|
| Launch triggers | **User-initiated:** When a user runs an ML-SPL command in the Splunk platform, for example `| fit MLTKContainer ...`, DSDL checks whether a container or pod is already running with the correct configuration. If not, a new one is launched. **Scheduled searches:** If a scheduled search includes the `fit` or `apply` commands, DSDL might spin up containers in the background at regular intervals. **Manual start (optional):** In DSDL, under Configuration and then Containers, you can manually start or stop containers. This can be helpful for development tasks or when you need to use Jupyter. |
| Initialization | DSDL calls the Docker, Kubernetes, or OpenShift API to create a container or pod based on your chosen image, for example the golden-cpu image. Container environment variables are set in the DSDL configuration. A sketch of such a deployment follows this table. |
| Active and running | Once started, the container is available to handle `fit` or `apply` commands. In DEV mode, JupyterLab or TensorBoard endpoints can be accessed through mapped ports, Routes, or NodePorts. |
| Idle or stopped | After a period of inactivity, and depending on your setup, DSDL might automatically stop the container to free resources. |
| Removal and cleanup | Older or crashed containers are removed from the system. On Kubernetes or OpenShift, pods are ephemeral by design and are typically cleaned up automatically. |
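To make the initialization stage concrete, the following is a minimal sketch of what a DEV container deployment can look like on Kubernetes. The deployment name `dsdl-dev`, the image name, and the port numbers are illustrative assumptions; DSDL generates the actual specification from your configuration.

```yaml
# Minimal sketch of a Deployment for a DSDL development container.
# The name, labels, and image are assumptions for illustration;
# DSDL generates the real specification from its configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dsdl-dev              # matches the HPA target used later in this topic
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dsdl-dev
  template:
    metadata:
      labels:
        app: dsdl-dev
    spec:
      containers:
        - name: mltk-container
          image: splunk/mltk-container-golden-image-cpu:latest  # example image name, verify against your registry
          ports:
            - containerPort: 8888   # JupyterLab (DEV mode)
            - containerPort: 6006   # TensorBoard (DEV mode)
```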
About Docker, Kubernetes, and OpenShift
Review the following differences between Docker, Kubernetes, and OpenShift containers in DSDL:
| Container platform | Typical use case | Network setup | Scaling | Management |
|---|---|---|---|---|
| Single-host Docker | Smaller dev/test environments, or a single Splunk instance with Docker on the same machine. | By default, the Docker API might be reached on `unix://var/run/docker.sock` or `tcp://localhost:2375`, without TLS. | Usually limited to one container per host unless you manually script multiple Docker hosts. | DSDL starts and stops the container through the Docker API. Development containers might keep running until manually stopped. |
| Kubernetes or OpenShift | Production-scale or distributed environments. | TLS/HTTPS from the Splunk platform to the Kubernetes or OpenShift API server on port 6443. | You can define multiple replicas, or rely on the Kubernetes Horizontal Pod Autoscaler (HPA) for concurrency. | DSDL translates container requests into pod deployments. You can configure resource requests, node labeling (GPU), and advanced security contexts. |
Concurrency and scaling patterns
Review the following descriptions of concurrency and scaling patterns in DSDL:
| Pattern | Description |
|---|---|
| Horizontal Pod Autoscaler (HPA) on Kubernetes or OpenShift | HPA can automatically scale pods based on CPU or memory usage. For DSDL, you might define an HPA that spawns additional pods when usage surpasses certain thresholds. |
| Docker Compose or scripts | If you're using single-host Docker, scaling typically involves manually launching multiple containers or writing scripts to do so. DSDL won't automatically create multiple containers on Docker unless you handle it outside of the Splunk platform. |
| GPU scheduling | For GPU-based containers in Kubernetes or OpenShift, assign a label or request `nvidia.com/gpu: 1` so that the container lands on GPU-enabled nodes. In Docker, ensure the NVIDIA container runtime is configured, for example `runtime=nvidia`. See the sketch after this table. |
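To illustrate GPU scheduling on Kubernetes or OpenShift, the following is a minimal sketch of a pod specification that requests one GPU. It assumes an NVIDIA device plugin is installed on the cluster; the pod name, image, and node label are illustrative.

```yaml
# Sketch of a pod spec that requests one GPU.
# Assumes the NVIDIA device plugin is installed on the cluster;
# the name, image, and node label are placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-gpu-example
spec:
  containers:
    - name: mltk-container
      image: splunk/mltk-container-golden-image-gpu:latest  # example image name
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resources are requested under limits
  nodeSelector:
    accelerator: nvidia       # example node label; match your cluster's labeling scheme
```

With a request like this in place, Kubernetes schedules the pod only onto nodes that advertise an available `nvidia.com/gpu` resource.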
About development and production containers
Review the following descriptions of development (DEV) and production (PROD) containers:
| Container type | Description |
|---|---|
| DEV containers | Include JupyterLab for interactive notebook development. TensorBoard or other development tooling might be exposed on additional ports, such as 8888 for Jupyter or 6006 for TensorBoard. A Docker Compose sketch of such a container follows this table. |
| PROD containers | Minimal setup, containing only the Python environment necessary for model training and inference. No Jupyter or development ports are exposed, which reduces the attack surface and resource overhead. |
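If you manage a DEV container yourself on single-host Docker, a Docker Compose definition along the following lines can expose the development ports. The service name, image, and port mappings are illustrative assumptions; DSDL normally manages these containers through the Docker API.

```yaml
# Sketch of a Docker Compose service for a DSDL DEV container.
# The service name, image, and host ports are assumptions for illustration.
services:
  dsdl-dev:
    image: splunk/mltk-container-golden-image-cpu:latest  # example image name
    ports:
      - "8888:8888"   # JupyterLab
      - "6006:6006"   # TensorBoard
    restart: unless-stopped
```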
DSDL operations overview
Review this overview of the DSDL operations you can access within the DSDL user interface (UI):
| UI component | Description |
|---|---|
| Container dashboard | In DSDL, go to Configuration and then Containers. You can see currently running containers or pods, start or stop them, or open Jupyter or TensorBoard when in DEV mode. |
| Logs and diagnostics | Splunk `_internal` logs containing "mltk-container" can reveal container startup errors, such as network timeouts or Docker or Kubernetes API rejections. Container logs can be forwarded to Splunk, letting you see Python exceptions from `fit` or `apply`. |
| Cleanup | DSDL typically stops idle containers after a certain timeout, which is configurable in the DSDL app. In Kubernetes and OpenShift, old pods might remain in a "Completed" or "Terminated" state, but cluster housekeeping eventually prunes them. |
Resource allocation and scheduling
Review the following options for resource allocation and scheduling in DSDL:
| Option | Description |
|---|---|
| CPU and memory requests | In Kubernetes, define requests and limits in your pod or deployment specification. This ensures the container won't exceed certain CPU and memory usage. On Docker, you can constrain resources with options such as `--memory` and `--cpus`. A sketch follows this table. |
| GPU resources | On Kubernetes or OpenShift, you must configure GPU node drivers or device plugins, such as NVIDIA, so that pods requesting `nvidia.com/gpu: 1` schedule properly. On Docker, use the NVIDIA container runtime, for example `runtime=nvidia`. |
| Production best practices | Keep an eye on ephemeral storage usage. Large data staging or logs can fill ephemeral volumes. Monitor container logs for OOM killer events, which indicate insufficient memory limits. |
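The following is a minimal sketch of the resources portion of a pod or deployment specification. The container name, image, and values are illustrative assumptions; size them according to your machine learning workloads.

```yaml
# Sketch of the resources section of a pod spec.
# Names, image, and values are assumptions for illustration.
spec:
  containers:
    - name: mltk-container
      image: splunk/mltk-container-golden-image-cpu:latest  # example image
      resources:
        requests:
          cpu: "1"        # guaranteed CPU for scheduling
          memory: 2Gi     # guaranteed memory for scheduling
        limits:
          cpu: "2"        # hard CPU ceiling
          memory: 4Gi     # exceeding this can trigger the OOM killer
```

Setting the memory limit too low for your training data can trigger the OOM killer events described in the table.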
Typical container management and scaling use cases
See the following common use cases for container management and scaling:
| Use case | Description |
|---|---|
| Multiple development (DEV) containers | Each data scientist spawns a personal DEV container with Jupyter. They eventually merge and store code in Git. |
| One production container in Docker | Single-host environment with moderate concurrency. The Splunk platform calls one container to handle model training and inference sequentially. |
| Kubernetes high-performance computing (HPC) | Large-scale HPC environment with multiple GPU nodes. Kubernetes auto-scales pods so that concurrent machine learning tasks each get their own container. |
| OpenShift Container Platform | Similar HPC environment, but integrated with Red Hat's security and operator frameworks. |
Troubleshooting container management and scaling
See the following issues you might experience and how to resolve them:
| Issue | Resolution |
|---|---|
| Container fails to start | 1. Check `_internal` logs for mltk-container messages about Docker or Kubernetes API errors or timeouts. 2. Ensure the Docker REST socket or Kubernetes API server is reachable. |
| Development (DEV) container times out | If DEV containers auto-stop, adjust the idle timeout or manually keep them active in the DSDL UI. |
| Resource exhaustion | If logs indicate out-of-memory (OOM) kills or CPU throttling, raise the memory or CPU requests in Kubernetes, or refine Docker resource constraints. |
| GPU not recognized | Confirm that you have the correct GPU drivers, device plugins, or `runtime=nvidia` if using Docker. Check container logs, or run a tool such as `nvidia-smi` inside the container, to verify that the GPU is visible. |
Example: Kubernetes multi-pod setup
The following steps are for an example multi-pod setup in Kubernetes:
1. Configure DSDL: In DSDL, go to Setup, then connect to your Kubernetes cluster.
2. Define a Deployment: DSDL automatically creates a deployment for DEV or PROD containers. You can edit the resource specs or add a Horizontal Pod Autoscaler (HPA) as follows:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dsdl-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dsdl-dev
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

3. Use DSDL: The Splunk platform calls the `fit` or `apply` commands, and DSDL spawns pods. If CPU usage is high, the HPA scales up.
4. Observe container states: In the Kubernetes dashboard or on the DSDL Containers page, you can see how many pods are active.
This documentation applies to the following versions of Splunk® App for Data Science and Deep Learning: 5.2.1