Advanced HPC and GPU usage
The Splunk App for Data Science and Deep Learning (DSDL) integrates with containerized machine learning environments. DSDL supports graphics processing unit (GPU) acceleration, large-scale high-performance computing (HPC) clusters, and distributed training in Docker, Kubernetes, or OpenShift.
Learn how to optimize HPC workflows, including multi-GPU usage, node labeling, ephemeral volumes, and typical HPC environment considerations.
Overview
When you run the fit or apply commands with DSDL, the external container environment can tap into the following high-performance computing resources:
- GPUs for deep learning acceleration.
- Multi-node HPC clusters for distributed training or parallel inference.
- Custom scheduling policies to handle concurrency.
DSDL provides Splunk platform-based orchestration of data flows: HPC nodes perform the heavy machine learning tasks while the Splunk platform manages search, scheduling, and logging. DSDL also offers container images with advanced libraries such as TensorFlow and PyTorch, with optional GPU support.
With DSDL you can iterate on code in development containers, while production HPC containers run with minimal overhead.
HPC and GPU requirements
You must meet the following requirements to use advanced HPC and GPU in DSDL:
Requirement | Details |
---|---|
HPC cluster | Kubernetes or OpenShift typically orchestrates HPC resources. Alternatively, single-host Docker can drive a multi-GPU server. |
GPU drivers and libraries | NVIDIA GPU nodes typically need the NVIDIA device plugin in Kubernetes or OpenShift. On a single Docker host, you must install the NVIDIA drivers and the NVIDIA Container Toolkit so that containers can access the GPUs. |
HPC network and storage | HPC nodes typically use high-bandwidth interconnects such as InfiniBand or 10/40/100 GbE. DSDL containers can still rely on ephemeral volumes for short-lived data staging, but HPC contexts typically need persistent or parallel file systems such as NFS or GlusterFS. See the storage sketch after this table. |
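The exact storage setup depends on your cluster, but the following is a minimal sketch of exposing an existing NFS export to containers through a PersistentVolume and PersistentVolumeClaim. The server address, export path, namespace, and object names are placeholders for illustration, not values DSDL generates for you.

# Hypothetical sketch: bind an existing NFS export for HPC data staging.
# Replace server, path, namespace, and names with values from your environment.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hpc-nfs-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.hpc.local
    path: /export/dsdl-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hpc-nfs-claim
  namespace: dsdl
spec:
  storageClassName: ""   # request a classless PV such as the one above
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi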
Multi-GPU and node labeling
See the following for descriptions of multi-GPU and node labeling in DSDL.
GPU resource requests
In Kubernetes or OpenShift, define GPU requests in your pod or deployment specification as shown in the following example:
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
DSDL automatically uses the GPU if your container image includes GPU libraries, such as the golden-gpu image. When the Splunk platform calls | fit MLTKContainer ..., the pod is scheduled with nvidia.com/gpu: 1.
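For context, the following is a minimal sketch of where that GPU request sits inside a complete pod specification. The pod name, container name, and image path are placeholders; your DSDL container environment settings determine the actual specification.

# Hypothetical pod sketch showing the GPU request in context.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-gpu-example
spec:
  containers:
    - name: mltk-container
      image: myregistry.local/golden-gpu:latest   # placeholder image path
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1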
Node labeling and taints
HPC clusters might label GPU nodes, for example gpu=true, or add taints so that only GPU workloads land on them. In DSDL, set the appropriate nodeSelector or tolerations in your container environment and Kubernetes cluster configuration if you want certain tasks to run only on GPU nodes, as shown in the following sketch.
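For example, if your GPU nodes carry the label gpu=true and a matching taint, scheduling constraints along these lines restrict placement to those nodes. The taint key, value, and effect are assumptions; match them to how your cluster administrators labeled and tainted the nodes.

# Hypothetical scheduling constraints for GPU-only placement.
spec:
  nodeSelector:
    gpu: "true"
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"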
Single and multi-GPU tasks
For single-GPU tasks, request nvidia.com/gpu: 1.
For multi-GPU tasks such as distributed data-parallel training, request multiple GPUs in your pod specification or define a ReplicaSet with multiple GPU pods, as in the sketch that follows. This is more advanced and requires your notebook code, such as PyTorch DDP or Horovod, to handle multi-GPU scaling.
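As a sketch, a multi-GPU request differs only in the GPU count; the distribution logic itself lives in your notebook code. The count of four is an arbitrary illustration.

# Hypothetical request for four GPUs on a single node.
resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4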
Distributed training approaches
Review the following distributed training approaches for HPC and GPU usage:
Training approach | Description |
---|---|
Single container, multiple GPUs | This is the simplest approach for HPC. A single container sees multiple GPUs. Within your notebook code, use PyTorch's DataParallel or TensorFlow's MirroredStrategy to leverage multiple GPUs on one host. |
Multi-container, inter-node communication | For large HPC or multi-node distributed training, you might spawn multiple containers or pods that communicate through MPI or PyTorch's torch.distributed package. DSDL does not manage multi-node orchestration out of the box, but it can be set up with your HPC scheduler or an advanced Kubernetes operator. See the sketch after this table. |
DSDL integration | In advanced HPC workflows, you can still call the fit or apply commands, but your notebook code must handle the distributed logic. HPC node ephemeral logs or partial metrics can still route to the Splunk platform through HEC or container logs. |
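DSDL does not generate multi-node orchestration for you, but one common pattern on Kubernetes is an Indexed Job whose pods derive their rank from the completion index. The following is a rough sketch under those assumptions: it requires a Kubernetes version that supports Indexed Jobs, the image path and environment values are placeholders, and your notebook code must read the variables and initialize torch.distributed (or MPI) itself. A headless Service for the rendezvous address is not shown.

# Hypothetical sketch: two pods cooperating through torch.distributed.
# Kubernetes sets JOB_COMPLETION_INDEX automatically for Indexed Jobs;
# training code can use it as the RANK before calling init_process_group.
apiVersion: batch/v1
kind: Job
metadata:
  name: dsdl-distributed-train
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: myregistry.local/my-gpu-image:latest   # placeholder image path
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: WORLD_SIZE
              value: "2"
            - name: MASTER_ADDR
              value: "dsdl-distributed-train-rank0"      # placeholder rendezvous host
            - name: MASTER_PORT
              value: "29500"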
About development and production HPC containers
See the following for some key points about development and production HPC containers.
Development HPC containers
- JupyterLab plus GPU libraries.
- Lets data scientists refine code on a single HPC node with 1 to 2 GPUs or smaller data subsets.
- Potentially ephemeral: once the development session ends, the container is stopped.
Production HPC containers
- Minimal overhead: no Jupyter or other development tools are included.
- Can run multi-GPU or multi-node distributed tasks.
- Used by scheduled searches or repeated inference jobs in the Splunk platform.
- Must define GPU resource requests if you need GPU acceleration.
Splunk Observability and HPC
See the following for descriptions of Splunk Observability and HPC.
HPC monitoring
HPC clusters typically ship with node-level metrics collection such as Ganglia or Prometheus. You can forward these metrics to the Splunk platform or Splunk Observability to unify HPC usage with container-level insights.
GPU telemetry
For GPU usage, consider NVIDIA DCGM or device plugin exporters that feed GPU metrics into Splunk Observability. If you turn on Splunk Observability in DSDL, each container endpoint can be automatically instrumented, though HPC multi-node training might require custom tracing logic in your code.
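One common approach, shown here only as a rough sketch and not as a supported DSDL component, is to run NVIDIA's dcgm-exporter as a DaemonSet on GPU-labeled nodes and scrape or forward its Prometheus metrics. The image tag, port, and node label are assumptions; additional runtime settings such as GPU visibility and security context vary by cluster, so check NVIDIA's documentation for current values.

# Hypothetical sketch: run dcgm-exporter on GPU-labeled nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        gpu: "true"                                   # assumed GPU node label
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # verify the tag
          ports:
            - containerPort: 9400                     # default Prometheus metrics port
              name: metrics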
Security and governance
See the following for descriptions of security and governance options available for HPC and GPU usage:
Option | Description |
---|---|
Container registry and minimal GPU images | HPC clusters typically have a local Docker registry. You can build or pull GPU images such as golden-gpu and then push them to your HPC registry. Use minimal or specialized images to reduce overhead. An air-gapped DSDL setup might apply if the HPC environment has no external network access. See Install and configure the Splunk App for Data Science and Deep Learning in an air-gapped environment. |
Role-based access to GPU nodes | In Kubernetes and OpenShift, use RBAC or taints and tolerations so that only power users or HPC roles can schedule GPU containers. In Docker single-host HPC, you must rely on local user constraints or Docker group membership. See the RBAC sketch after this table. |
Automatic notebook sync | DSDL automatically syncs notebook code back to the Splunk platform, so HPC ephemeral volumes are not at risk of losing code. HPC operators can treat ephemeral container usage as stateless and let the Splunk platform manage the notebooks. |
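The details depend on your cluster policies, but as a rough sketch, a namespaced Role bound to an HPC group can limit who may create pods, and therefore GPU workloads, in the namespace where DSDL containers run. All names here, including the namespace and group, are placeholders.

# Hypothetical RBAC sketch: only members of the hpc-users group
# can create pods in the dsdl namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dsdl-gpu-scheduler
  namespace: dsdl
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dsdl-gpu-scheduler-binding
  namespace: dsdl
subjects:
  - kind: Group
    name: hpc-users
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dsdl-gpu-scheduler
  apiGroup: rbac.authorization.k8s.io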
Example: HPC workflow
The following is an example of a high-performance computing (HPC) workflow:
- Create a cluster: Create a Kubernetes HPC cluster with GPU nodes labeled gpu=true.
- Create a container image: Create a custom-built my-gpu-image with frameworks and libraries such as Torch, CUDA, and cuDNN.
- Check the images.conf file: Make sure the file points to myregistry.local/my-gpu-image:latest.
- Complete the DSDL setup fields:
  - Container type: GPU runtime.
  - Resource requests: nvidia.com/gpu: 1.
  - Container mode: DEV or PROD.
- Run the following search:
index=my_data | fit MLTKContainer algo=my_gpu_notebook features_* into app:MyHPCModel
- Kubernetes schedules a pod on a GPU node. The container loads your code, trains a PyTorch model, and streams logs or partial metrics to the Splunk platform.
- The model is now available in DSDL as app:MyHPCModel. HPC ephemeral volumes are not a concern because the code and final artifacts are synced back to DSDL.
Troubleshooting HPC and GPU usage
See the following table for issues you might experience and how to resolve them:
Issue | Cause | How to investigate |
---|---|---|
Container never schedules on a GPU node. | The nvidia.com/gpu: 1 request might be missing, or the HPC node is not labeled for GPU. | Check your Kubernetes pod spec or the images.conf file, confirm the HPC node is labeled gpu=true, and check the device plugin. |
Multi-GPU training fails silently. | The notebook code is not configuring multiple GPUs, or is missing distribution logic. | Check the container stdout and stderr logs. Verify that the PyTorch DataParallel or multi-node configuration is correct. |
Docker single-host HPC container sees no GPUs. | You might not be using --gpus all or runtime=nvidia with the docker run command. | Check the Docker CLI usage or logs for "no GPU devices found" errors. |
HPC cluster can't pull the GPU image. | Private registry authentication error, or air-gapped images might be missing. | Recheck credentials or make sure you loaded the .tar files into the HPC node registry. |
HPC ephemeral volumes lose notebook code. | DSDL sync scripts or configuration might be failing. | Check the _internal "mltk-container" logs for sync errors. The Splunk platform automatically persists notebooks, so ephemeral volumes are acceptable. |