Splunk® App for Data Science and Deep Learning

Use the Splunk App for Data Science and Deep Learning

Container management and scaling

The Splunk App for Data Science and Deep Learning (DSDL) uses external containers to offload resource-heavy machine learning tasks from the Splunk search head. This architecture isolates potentially large workloads and also enables horizontal scaling, GPU acceleration, and robust environment management.

Review the following guidelines to manage container lifecycles, scale concurrency, and optimize resource usage when running DSDL in Docker, Kubernetes, or OpenShift.

Overview

When you run DSDL commands such as | fit MLTKContainer ... or | apply ..., DSDL communicates with an external container platform: either Docker on a single host, or a cluster orchestrated by Kubernetes or OpenShift.

By default, development (DEV) mode containers include JupyterLab, TensorBoard, and other developer tools, while production (PROD) mode containers are minimal, running only the Python processes required for model training and inference.

Understanding how these containers are launched, monitored, and scaled can help with efficient resource usage and robust enterprise-grade workflows.

Container lifecycle

Review the following descriptions of a container lifecycle:

Lifecycle stage Description
Launch triggers User-initiated: When a user runs an ML-SPL command in the Splunk platform, DSDL checks if a container or pod is already running with the correct configuration. If not, a new one is launched.

The following is an example ML-SPL command: | fit MLTKContainer algo=...

Scheduled searches: If a scheduled search includes the fit or apply commands, DSDL might spin up containers in the background at regular intervals.
Manual start (optional): In DSDL under Configuration and then Containers, you can manually start or stop containers.
This can be helpful for development tasks or when you need to use Jupyter.
Initialization DSDL calls the Docker, Kubernetes, or OpenShift API to create a container or pod based on your chosen image, for example the golden-cpu image.

The container environment variables are set in DSDL configuration. Example variables include JUPYTER_PASSWD and container_enable_https.

Active and running Once started, the container is available to handle fit or apply commands.

In DEV mode, JupyterLab or TensorBoard endpoints can be accessed through mapped ports, Routes, or NodePorts.

Idle or stopped After a period of inactivity and depending on your setup, DSDL might automatically stop the container to free resources.
Removal and cleanup Older or crashed containers are removed from the system. On Kubernetes or OpenShift, pods are ephemeral by design and are typically cleaned up automatically.
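As an illustration of the launch and initialization stages, the following sketch approximates what DSDL does through the Docker API when it starts a DEV container. The container name, image tag, and port mappings are assumptions for illustration, not the exact values DSDL uses in your environment; the environment variables match the examples named above.

```shell
# Illustrative only: approximates the DSDL launch sequence on single-host Docker.
# Image name/tag and port numbers are assumptions; verify against your setup.
docker run -d \
  --name mltk-container-dev \
  -p 8888:8888 \
  -p 6006:6006 \
  -e JUPYTER_PASSWD=changeme \
  -e container_enable_https=true \
  <your-dsdl-golden-cpu-image>

# Confirm the container reached the "active and running" stage
# before issuing | fit or | apply from the Splunk platform:
docker ps --filter name=mltk-container-dev
```

Port 8888 maps JupyterLab and 6006 maps TensorBoard, corresponding to the DEV-mode endpoints described later in this topic.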

About Docker, Kubernetes, and OpenShift

Review the following differences between Docker, Kubernetes, and OpenShift containers in DSDL:

Container Typical use case Network setup Scaling Management
Single-host Docker Smaller dev/test environments, or a single Splunk instance with Docker on the same machine. By default, the Docker daemon listens on unix:///var/run/docker.sock or tcp://localhost:2375 without TLS. Usually limited to one container per host unless you manually script multiple Docker hosts. DSDL starts and stops the container through the Docker API. Development containers might keep running until manually stopped.
Kubernetes or OpenShift Production-scale or distributed environments. TLS/HTTPS from the Splunk platform to the Kubernetes or OpenShift API server on port 6443. You can define multiple replicas, or rely on Kubernetes Horizontal Pod Autoscaler (HPA) for concurrency. DSDL translates container requests into pod deployments. You can configure resource requests, node labeling (GPU), and advanced security contexts.
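Before configuring either platform in DSDL, it can help to confirm that the relevant API endpoint is reachable from the search head. The following checks are a sketch, assuming curl and kubectl are installed; the namespace placeholder is hypothetical.

```shell
# Single-host Docker: the local socket should answer a version request.
curl --unix-socket /var/run/docker.sock http://localhost/version

# Kubernetes or OpenShift: the API server (typically port 6443) should be
# reachable with the credentials DSDL will use.
kubectl cluster-info
kubectl auth can-i create deployments --namespace <your-dsdl-namespace>
```

If either check fails here, DSDL container launches will fail for the same reason, so this is a useful first diagnostic.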

Concurrency and scaling patterns

Review the following descriptions of concurrency and scaling patterns in DSDL:

Pattern Description
Horizontal Pod Autoscaler (HPA)

on Kubernetes or OpenShift

HPA can auto-scale pods based on CPU or memory usage.

For DSDL, you might define an HPA that spawns additional pods if usage surpasses certain thresholds.
This helps handle multiple concurrent Splunk searches that call the fit or apply commands simultaneously.

Docker Compose or scripts If you're using single-host Docker, scaling typically involves manually launching multiple containers or writing scripts to do so.

DSDL won't automatically create multiple containers on Docker unless you handle it outside of the Splunk platform.

GPU scheduling For GPU-based containers in Kubernetes or OpenShift, assign a label or request nvidia.com/gpu: 1 so that the container lands on GPU-enabled nodes.

In Docker, ensure --gpus all or similar flags are used if you want GPU acceleration.
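The GPU scheduling pattern above can be sketched as a Kubernetes pod fragment. The pod name and image are assumptions for illustration; the nvidia.com/gpu resource name requires the NVIDIA device plugin to be installed on the cluster.

```yaml
# Illustrative fragment: requesting one GPU so the scheduler places the
# pod on a GPU-enabled node. Name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-gpu-example
spec:
  containers:
    - name: dsdl
      image: <your-dsdl-gpu-image>
      resources:
        limits:
          nvidia.com/gpu: 1
```

On single-host Docker, the equivalent is passing --gpus all (or a specific device) to docker run.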

About development and production containers

Review the following descriptions of development (DEV) and production (PROD) containers:

Container type Description
DEV containers JupyterLab for interactive notebook development.

TensorBoard or other development tooling might be exposed on additional ports such as 8888 for Jupyter, or 6006 for TensorBoard.
Typically used short-term to refine code, test data staging, or debug advanced logic.

PROD containers Minimal setup, containing only the necessary Python environment for model training and inference.

No Jupyter or development ports exposed. Reduces the attack surface and resource overhead.
Might run on multiple replicas if concurrency is high.
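To illustrate the multi-replica PROD pattern, the following is a minimal deployment sketch. DSDL generates its own deployment objects, so treat this only as an illustration of the replicas field; all names and the image reference are assumptions.

```yaml
# Illustrative only: a PROD-style deployment with several replicas to
# absorb concurrent fit/apply traffic. Names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dsdl-prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dsdl-prod
  template:
    metadata:
      labels:
        app: dsdl-prod
    spec:
      containers:
        - name: dsdl
          image: <your-dsdl-prod-image>
```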

DSDL operations overview

Review this overview of the DSDL operations you can access within the DSDL user interface (UI):

UI component Description
Container dashboard In DSDL, go to Configuration, then Containers.

You can view currently running containers or pods, start or stop them, or open JupyterLab or TensorBoard when in DEV mode.

Logs and diagnostics Splunk _internal logs containing "mltk-container" can reveal container startup errors, such as network timeouts or Docker or Kubernetes API rejections.

Container logs can be forwarded to Splunk, letting you see Python exceptions from fit or apply.

Cleanup DSDL typically stops idle containers after a certain timeout. This is configurable in the DSDL app.

In Kubernetes and OpenShift, old pods might remain in a "Completed" or "Terminated" state, but cluster housekeeping eventually prunes them.
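As a sketch, a search like the following can surface container startup errors in the _internal index. The exact log strings and fields can vary by DSDL version, so verify against your own events before relying on it:

```spl
index=_internal "mltk-container" (ERROR OR error OR timeout)
| table _time, host, source, _raw
```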

Resource allocation and scheduling

Review the following options for resource allocation and scheduling in DSDL:

Option Description
CPU and memory requests In Kubernetes, define requests and limits in your pod or deployment specification so that the container cannot exceed its allotted CPU and memory.

On Docker, you can specify --cpus or --memory if you manually run containers.

GPU resources On Kubernetes or OpenShift, you must configure GPU node drivers or device plugins such as NVIDIA so that pods requesting nvidia.com/gpu: 1 schedule properly.

On Docker, use the --gpus flag or the NVIDIA container runtime (--runtime=nvidia) so that GPU libraries are available inside the container.

Production best practices Keep an eye on ephemeral storage usage. Large data staging or logs can fill ephemeral volumes.

Monitor container logs for OOM killer events, which indicate insufficient memory limits.
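The requests-and-limits guidance above can be sketched as a container spec fragment. The specific values here are illustrative assumptions; size them to your training workloads.

```yaml
# Illustrative resource settings for a DSDL container spec.
# Requests reserve capacity; limits cap usage (exceeding the memory
# limit triggers an OOM kill, which you can watch for in container logs).
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi
```

On single-host Docker, the rough equivalent is docker run --cpus=4 --memory=8g.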

Typical container management and scaling use cases

See the following common use cases for container management and scaling:

Use case Description
Multiple development (DEV) containers Each data scientist spawns a personal DEV container with Jupyter. They eventually merge and store code in Git.
One production container in Docker Single-host environment with moderate concurrency. The Splunk platform calls one container to handle model training and inference sequentially.
Kubernetes high-performance computing (HPC) Large-scale HPC environment with multiple GPU nodes. Kubernetes auto-scales pods so concurrent machine learning tasks each get their own container.
OpenShift Enterprise Container Platform Similar HPC environment but integrated with Red Hat's security or operator frameworks.

Troubleshooting container management and scaling

See the following issues you might experience and how to resolve them:

Issue Resolution
Container fails to start 1. Check _internal logs for mltk-container messages about Docker or Kubernetes API errors or timeouts.

2. Ensure the Docker REST socket or Kubernetes API server is reachable.

Development (DEV) container times out If DEV containers auto-stop, adjust the idle timeout or manually keep them active in the DSDL UI.
Resource exhaustion If logs indicate out of memory (OOM) kills or CPU throttling, raise the memory or CPU requests in Kubernetes, or refine Docker resource constraints.
GPU not recognized Confirm that you have the correct GPU drivers, device plugins, or the --runtime=nvidia or --gpus setting if using Docker.

Check container logs or _internal for GPU scheduling errors.
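The resolutions above map to a few standard diagnostics. This is a sketch; the container name and pod name are assumptions, and the kubectl commands assume access to the cluster DSDL targets.

```shell
# Container fails to start / resource exhaustion / GPU not recognized:
docker logs mltk-container-dev                 # runtime and Python errors (name is an assumption)
kubectl get events --sort-by=.lastTimestamp    # scheduling failures, image pull errors
kubectl describe pod <dsdl-pod-name>           # OOM kills, unsatisfied GPU requests
nvidia-smi                                     # confirm the host driver sees the GPU
```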

Example: Kubernetes multi-pod setup

The following steps are for an example multi-pod setup in Kubernetes:

  1. Configure DSDL: In DSDL, go to Setup and connect to your Kubernetes cluster.
  2. Define a Deployment: DSDL automatically creates a deployment for dev or prod containers. You can edit the resource specs or add a Horizontal Pod Autoscaler as follows:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: dsdl-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: dsdl-dev
      minReplicas: 1
      maxReplicas: 5
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
    
  3. Use DSDL: The Splunk platform calls the fit or apply commands, and DSDL spawns pods. If CPU usage is high, Horizontal Pod Autoscaler (HPA) scales up.
  4. Observe container states: In the Kubernetes dashboard or on the DSDL Container page, you can see how many pods are active.
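To watch the autoscaling behavior from steps 3 and 4 at the command line, you can follow the HPA and its pods directly. The label selector here is an assumption; match it to the labels on the DSDL deployment in your cluster.

```shell
# Illustrative: observe HPA decisions and pod counts while searches
# run fit or apply commands.
kubectl get hpa dsdl-hpa --watch
kubectl get pods -l app=dsdl-dev --watch   # label is an assumption
```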
Last modified on 17 July, 2025

This documentation applies to the following versions of Splunk® App for Data Science and Deep Learning: 5.2.1

