Advanced HPC and GPU usage
The Splunk App for Data Science and Deep Learning (DSDL) integrates with containerized machine learning environments. DSDL supports graphics processing unit (GPU) acceleration, large-scale high-performance computing (HPC) clusters, and distributed training in Docker, Kubernetes, or OpenShift.
Learn how to optimize HPC workflows, including multi-GPU usage, node labeling, ephemeral volumes, and typical HPC environment considerations.
Overview
When you run the fit or apply commands with DSDL, the external container environment can tap into the following high-performance computing resources:
- GPUs for deep learning acceleration.
- Multi-node HPC clusters for distributed training or parallel inference.
- Custom scheduling policies to handle concurrency.
DSDL provides Splunk platform-based orchestration of data flows: HPC nodes perform the heavy machine learning tasks while the Splunk platform manages search, scheduling, and logging. DSDL also offers container images with advanced libraries such as TensorFlow and PyTorch, with optional GPU support.
With DSDL you can iterate on code in development containers, while production HPC containers run with minimal overhead.
HPC and GPU requirements
You must meet the following requirements to use advanced HPC and GPU in DSDL:
Requirement | Details |
---|---|
HPC cluster | Kubernetes or OpenShift typically orchestrates HPC resources. Alternatively, single-host Docker can drive a multi-GPU server. |
GPU drivers and libraries | NVIDIA GPU nodes typically need the NVIDIA device plugin in Kubernetes or OpenShift. On a single Docker host, you must install the NVIDIA drivers and the NVIDIA Container Toolkit so that containers can access the GPUs. |
HPC network and storage | HPC nodes typically use high-bandwidth interconnects such as InfiniBand or 10/40/100 GbE. DSDL containers can still rely on ephemeral volumes for short-lived data staging, but HPC contexts typically need persistent or parallel file systems such as NFS or GlusterFS. See the storage sketch after this table. |
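The exact storage setup depends on your cluster, but the following is a minimal sketch of exposing an existing NFS export to containers through a PersistentVolume and PersistentVolumeClaim. The server address, export path, namespace, and object names are placeholders for illustration, not values DSDL generates for you.

# Hypothetical sketch: bind an existing NFS export for HPC data staging.
# Replace server, path, namespace, and names with values from your environment.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hpc-nfs-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.hpc.local
    path: /export/dsdl-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hpc-nfs-claim
  namespace: dsdl
spec:
  storageClassName: ""   # request a classless PV such as the one above
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi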
Multi-GPU and node labeling
See the following for descriptions of multi-GPU and node labeling in DSDL.
GPU resource requests
In Kubernetes or OpenShift, define GPU requests in your pod or deployment specification as shown in the following example:
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
DSDL automatically uses the GPU if your container image includes GPU libraries, such as the golden-gpu image. When the Splunk platform calls | fit MLTKContainer ..., the pod is scheduled with nvidia.com/gpu: 1.
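For context, the following is a minimal sketch of where that GPU request sits inside a complete pod specification. The pod name, container name, and image path are placeholders; your DSDL container environment settings determine the actual specification.

# Hypothetical pod sketch showing the GPU request in context.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-gpu-example
spec:
  containers:
    - name: mltk-container
      image: myregistry.local/golden-gpu:latest   # placeholder image path
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1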
Node labeling and taints
HPC clusters might label GPU nodes, for example gpu=true, or add taints so that only GPU workloads land on them. In DSDL, set the appropriate nodeSelector or tolerations in your container environment and Kubernetes cluster configuration if you want certain tasks to run only on GPU nodes, as shown in the following sketch.
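For example, if your GPU nodes carry the label gpu=true and a matching taint, scheduling constraints along these lines restrict placement to those nodes. The taint key, value, and effect are assumptions; match them to how your cluster administrators labeled and tainted the nodes.

# Hypothetical scheduling constraints for GPU-only placement.
spec:
  nodeSelector:
    gpu: "true"
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"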
Single and multi-GPU tasks
For single-GPU tasks, request nvidia.com/gpu: 1.
For multi-GPU tasks such as distributed data-parallel training, request multiple GPUs in your pod specification or define a ReplicaSet with multiple GPU pods, as in the sketch that follows. This is more advanced and requires your notebook code, such as PyTorch DDP or Horovod, to handle multi-GPU scaling.
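As a sketch, a multi-GPU request differs only in the GPU count; the distribution logic itself lives in your notebook code. The count of four is an arbitrary illustration.

# Hypothetical request for four GPUs on a single node.
resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4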
Distributed training approaches
Review the following distributed training approaches for HPC and GPU usage:
Training approach | Description |
---|---|
Single container, multiple GPUs | This is the simplest approach for HPC. A single container sees multiple GPUs. Within your notebook code, use PyTorch's DataParallel or TensorFlow's MirroredStrategy to leverage multiple GPUs on one host. |
Multi-container, inter-node communication | For large HPC or multi-node distributed training, you might spawn multiple containers or pods that communicate through MPI or PyTorch's torch.distributed package. DSDL does not manage multi-node orchestration out of the box, but it can be set up with your HPC scheduler or an advanced Kubernetes operator. See the sketch after this table. |
DSDL integration | In advanced HPC workflows, you can still call the fit or apply commands, but your notebook code must handle the distributed logic. HPC node ephemeral logs or partial metrics can still route to the Splunk platform through HEC or container logs. |
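DSDL does not generate multi-node orchestration for you, but one common pattern on Kubernetes is an Indexed Job whose pods derive their rank from the completion index. The following is a rough sketch under those assumptions: it requires a Kubernetes version that supports Indexed Jobs, the image path and environment values are placeholders, and your notebook code must read the variables and initialize torch.distributed (or MPI) itself. A headless Service for the rendezvous address is not shown.

# Hypothetical sketch: two pods cooperating through torch.distributed.
# Kubernetes sets JOB_COMPLETION_INDEX automatically for Indexed Jobs;
# training code can use it as the RANK before calling init_process_group.
apiVersion: batch/v1
kind: Job
metadata:
  name: dsdl-distributed-train
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: myregistry.local/my-gpu-image:latest   # placeholder image path
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: WORLD_SIZE
              value: "2"
            - name: MASTER_ADDR
              value: "dsdl-distributed-train-rank0"      # placeholder rendezvous host
            - name: MASTER_PORT
              value: "29500"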
About development and production HPC containers
See the following for some key points about development and production HPC containers.
Development HPC containers
- JupyterLab plus GPU libraries.
- Lets data scientists refine code on a single HPC node with 1 to 2 GPUs or smaller data subsets.
- Potentially ephemeral: once the development session ends, the container is stopped.
Production HPC containers
- Minimal overhead: no Jupyter or other development tools are included.
- Can run multi-GPU or multi-node distributed tasks.
- Used by scheduled searches or repeated inference jobs in the Splunk platform.
- Must define GPU resource requests if you need GPU acceleration.
Splunk Observability and HPC
See the following for descriptions of Splunk Observability and HPC.
HPC monitoring
HPC clusters typically ship with node-level metrics collection such as Ganglia or Prometheus. You can forward these metrics to the Splunk platform or Splunk Observability to unify HPC usage with container-level insights.
GPU telemetry
For GPU usage, consider NVIDIA DCGM or device plugin exporters that feed GPU metrics into Splunk Observability. If you turn on Splunk Observability in DSDL, each container endpoint can be automatically instrumented, though HPC multi-node training might require custom tracing logic in your code.
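One common approach, shown here only as a rough sketch and not as a supported DSDL component, is to run NVIDIA's dcgm-exporter as a DaemonSet on GPU-labeled nodes and scrape or forward its Prometheus metrics. The image tag, port, and node label are assumptions; additional runtime settings such as GPU visibility and security context vary by cluster, so check NVIDIA's documentation for current values.

# Hypothetical sketch: run dcgm-exporter on GPU-labeled nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        gpu: "true"                                   # assumed GPU node label
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # verify the tag
          ports:
            - containerPort: 9400                     # default Prometheus metrics port
              name: metrics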
Security and governance
See the following for descriptions of security and governance options available for HPC and GPU usage:
Option | Description |
---|---|
Container registry and minimal GPU images | HPC clusters typically have a local Docker registry. You can build or pull GPU images such as golden-gpu and then push them to your HPC registry. Use minimal or specialized images to reduce overhead. An air-gapped DSDL setup might apply if the HPC environment has no external network access. See Install and configure the Splunk App for Data Science and Deep Learning in an air-gapped environment. |
Role-based access to GPU nodes | In Kubernetes and OpenShift, use RBAC or taints and tolerations so that only power users or HPC roles can schedule GPU containers. In Docker single-host HPC, you must rely on local user constraints or Docker group membership. See the RBAC sketch after this table. |
Automatic notebook sync | DSDL automatically syncs notebook code back to the Splunk platform, so HPC ephemeral volumes are not at risk of losing code. HPC operators can treat ephemeral container usage as stateless and let the Splunk platform manage the notebooks. |
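The details depend on your cluster policies, but as a rough sketch, a namespaced Role bound to an HPC group can limit who may create pods, and therefore GPU workloads, in the namespace where DSDL containers run. All names here, including the namespace and group, are placeholders.

# Hypothetical RBAC sketch: only members of the hpc-users group
# can create pods in the dsdl namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dsdl-gpu-scheduler
  namespace: dsdl
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dsdl-gpu-scheduler-binding
  namespace: dsdl
subjects:
  - kind: Group
    name: hpc-users
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dsdl-gpu-scheduler
  apiGroup: rbac.authorization.k8s.io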
Example: HPC workflow
The following is an example of a high-performance computing (HPC) workflow:
- Create a cluster: Create a Kubernetes HPC cluster with GPU nodes labeled gpu=true.
- Create a container image: Create a custom-built my-gpu-image with frameworks and libraries such as Torch, CUDA, and cuDNN.
- Check the images.conf file: Make sure the file points to myregistry.local/my-gpu-image:latest.
- Complete the DSDL setup fields:
  - Container type: GPU runtime.
  - Resource requests: nvidia.com/gpu: 1.
  - Container mode: DEV or PROD.
- Run the following search:
index=my_data | fit MLTKContainer algo=my_gpu_notebook features_* into app:MyHPCModel
- Kubernetes schedules a pod on a GPU node. The container loads your code, trains a PyTorch model, and streams logs or partial metrics to the Splunk platform.
- The model is now available in DSDL as app:MyHPCModel. HPC ephemeral volumes are not a concern because the code and final artifacts are synced back to DSDL.
Troubleshooting HPC and GPU usage
See the following table for issues you might experience and how to resolve them:
Issue | Cause | How to investigate |
---|---|---|
Container never schedules on a GPU node. | The nvidia.com/gpu: 1 request might be missing, or the HPC node is not labeled for GPU. | Check your Kubernetes pod spec or the images.conf file, confirm the HPC node is labeled gpu=true, and check the device plugin. |
Multi-GPU training fails silently. | The notebook code is not configuring multiple GPUs, or is missing distribution logic. | Check the container stdout and stderr logs. Verify that the PyTorch DataParallel or multi-node configuration is correct. |
Docker single-host HPC container sees no GPUs. | You might not be using --gpus all or runtime=nvidia with the docker run command. | Check the Docker CLI usage or logs for "no GPU devices found" errors. |
HPC cluster can't pull the GPU image. | Private registry authentication error, or air-gapped images might be missing. | Recheck credentials or make sure you loaded the .tar files into the HPC node registry. |
HPC ephemeral volumes lose notebook code. | DSDL sync scripts or configuration might be failing. | Check the _internal "mltk-container" logs for sync errors. The Splunk platform automatically persists notebooks, so ephemeral volumes are acceptable. |