Container monitoring and logging
The Splunk App for Data Science and Deep Learning (DSDL) leverages external containers for computationally intensive tasks. Monitoring these containers is crucial for debugging, operational awareness, seamless model development, and a stable production environment.
Learn about collecting logs, capturing performance metrics, automatically instrumenting containers with OpenTelemetry, and surfacing container health in the Splunk platform or Splunk Observability Cloud.
Overview
When you run the fit or apply commands in DSDL, a Docker, Kubernetes, or OpenShift container is spun up to execute model training or inference.
Monitoring container health with the Splunk platform and Splunk Observability Cloud includes the following:
- Collecting container logs (stdout/stderr, custom logs) in Splunk to debug errors or confirm job completion.
- Capturing performance metrics (CPU, memory, GPU usage) in real time to diagnose slow jobs or resource bottlenecks.
- Automatically enabling OpenTelemetry instrumentation for container endpoints in Splunk Observability Cloud (if toggled on in DSDL's Observability Settings).
- Ensuring enterprise-level reliability by setting up dashboards, alerts, or even autoscaling triggers based on these metrics.
DSDL includes the following logs and telemetry data to help inform container health:
- Splunk _internal logs about container management.
- Container stdout/stderr (ML library messages, Python prints).
- Custom logs or metrics you send to Splunk HEC.
- (Optional) OpenTelemetry data automatically sent to Splunk Observability once you enable Observability in the DSDL Setup.
Container logs in the Splunk platform
Review the following for descriptions of the container logs provided in the Splunk platform.
MLTK container logs
MLTK container logs are generated when DSDL tries to start or stop a container, or hits network issues. These logs are stored in the _internal index with "mltk-container" in the messages.
Example:
index=_internal "mltk-container"
If containers fail to launch, these logs show network errors or Docker or Kubernetes API rejections. Repeated connection attempts can indicate firewall or TLS misconfigurations.
Automatic container logs with the Splunk REST API
Once a container is successfully deployed, DSDL automatically collects logs through the Splunk REST API. Automatic container logs are useful for quick debugging or reviewing final outputs when a machine learning job is complete.
If the container fails before initialization, automatic container logs might not appear. Check _internal logs instead.
You can view the logs by navigating in DSDL to Configuration, then Containers, and then selecting the container name, for example __DEV__.
The selected container page shows the following details:
- Container Controls: The container image, cluster target, GPU runtime, and container mode of DEV or PROD.
- Container Details: Key-value pairs for api_url, runtime, mode, and others.
- Container Logs: A table or search result similar to the following:
| rest splunk_server=local services/mltk-container/logs/<container_name> | eval _time = strptime(_time,"%Y-%m-%dT%H:%M:%S.%9N") | sort - _time | fields - splunk_server
This table surfaces the real-time logs captured from the container once it's fully deployed.
Custom Python logging
If your notebook code logs with print(...) or with the Python logging module, the output is captured in the container's stdout or stderr streams.
HPC tasks with high verbosity can produce large volumes of logs. Consider limiting the log level to INFO or WARN.
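For example, the following is a minimal sketch of how you might configure the standard Python logging module in notebook code so that only INFO-level and higher messages reach the container's output streams. The logger name and messages are illustrative, not part of DSDL.
import sys
import logging

# Route log records to stdout so the container log collection picks them up,
# and drop anything below INFO to keep log volume manageable.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger("dsdl_notebook")  # illustrative logger name

logger.debug("per-batch details")  # suppressed at INFO level
logger.info("epoch finished")      # appears in the container logs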
Resource metrics
Tracking resource usage can help identify if your training jobs are hitting resource bottlenecks or if you need better scheduling or bigger node types.
See the following table for the metrics available with each container provider:
Container provider | Description |
---|---|
Docker | Use docker stats for ephemeral checks, or a cAdvisor-based approach to forward metrics into the Splunk platform or Splunk Observability Cloud. |
Kubernetes or OpenShift | Use the Kubernetes metrics API or Splunk Connect for Kubernetes for CPU, memory, and node metrics. GPU usage requires the NVIDIA device plugin or GPU operator. |
You can also set up alerts for abnormal usage patterns or container crash loops, improving reliability.
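For Docker, for example, the following sketch shows one way to take an ephemeral usage snapshot similar to docker stats by using the Docker SDK for Python (the docker package, assumed to be installed). The CPU calculation is an approximation, and how you forward the values to the Splunk platform or Splunk Observability Cloud depends on your own metrics pipeline.
import docker  # Docker SDK for Python; assumed to be installed (pip install docker)

client = docker.from_env()

for container in client.containers.list():
    stats = container.stats(stream=False)  # one-off snapshot instead of a stream

    # Memory usage and limit as reported by the Docker API.
    mem_usage = stats["memory_stats"].get("usage", 0)
    mem_limit = stats["memory_stats"].get("limit", 1)

    # Approximate host-wide CPU utilization from the cumulative counters.
    precpu = stats.get("precpu_stats", {})
    cpu_delta = (stats["cpu_stats"]["cpu_usage"]["total_usage"]
                 - precpu.get("cpu_usage", {}).get("total_usage", 0))
    system_delta = (stats["cpu_stats"].get("system_cpu_usage", 0)
                    - precpu.get("system_cpu_usage", 0))
    cpu_percent = (cpu_delta / system_delta * 100.0) if system_delta > 0 else 0.0

    print(f"{container.name}: cpu={cpu_percent:.1f}% mem={mem_usage / mem_limit * 100.0:.1f}%")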
Automatic OpenTelemetry instrumentation
OpenTelemetry instrumentation can provide advanced insights into container endpoint usage, request durations, and data flow for HPC or microservices-based machine learning pipelines.
Once you turn on Splunk Observability, DSDL can automatically instrument container endpoints with OpenTelemetry.
Complete the following steps:
- In DSDL go to Setup, and then Observability Settings.
- Select Yes to turn on Observability.
- Complete the required fields:
- Splunk Observability Access Token: Add your Observability ingest token.
- Open Telemetry Endpoint: For example, https://ingest.eu0.signalfx.com.
- Open Telemetry Servicename: For example, dsdl.
- Save your changes.
Upon completion, all container endpoints, including training and inference calls, generate Otel traces. These traces are automatically stored in Splunk Observability Cloud for deeper analysis, including request latency and container-level CPU and memory correlation.
Instrumentation is done automatically. You do not need to manually set up Otel libraries in your notebook code.
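The automatic instrumentation covers the container endpoints themselves. If you optionally want extra custom spans around specific steps in your notebook code, you can use the OpenTelemetry API directly. The following is a minimal sketch that assumes the opentelemetry-api package is available in the container and that the automatic instrumentation has already configured a tracer provider; the span and attribute names are illustrative.
from opentelemetry import trace

# Reuse the tracer provider configured by the automatic instrumentation.
tracer = trace.get_tracer("dsdl.notebook")  # illustrative instrumentation name

def train(model, df, param):
    # Wrap the training step in a custom span so it appears alongside
    # the automatically generated endpoint traces.
    with tracer.start_as_current_span("model_training") as span:
        span.set_attribute("dsdl.rows", len(df))  # illustrative attribute
        # ... training code ...
    return model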
Sending model or training logs to the Splunk platform
Review the following options for sending model or training logs to the Splunk platform as a DSDL user.
Splunk HEC in DSDL
The Splunk HTTP Event Collector (HEC) option in DSDL lets you view partial results or step-by-step logs. You can combine HEC with container logs for full details.
In DSDL, navigate to the Setup page and provide your HEC token in the Splunk HEC Settings panel. Save your changes.
In your notebook code, use the following:
from dsdlsupport import SplunkHEC

hec = SplunkHEC.SplunkHEC()
hec.send({'event': {'message': 'operation done'}, 'time': 1692812611})
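As a further illustration, the following sketch attaches the current epoch time and structured fields to an event. It builds on the SplunkHEC helper shown above; the field names are illustrative, and the index and sourcetype the events land in depend on your HEC token configuration.
import time

from dsdlsupport import SplunkHEC

hec = SplunkHEC.SplunkHEC()

# Send a structured status event stamped with the current time.
hec.send({
    'event': {
        'message': 'preprocessing finished',  # illustrative fields
        'rows_processed': 125000
    },
    'time': int(time.time())
})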
Logging epoch metrics
You can use epoch metrics to visualize model training progression in near real time.
Example
def fit(model, df, param):
    for epoch in range(10):
        # ...
        hec.send({
            'event': {
                'epoch': epoch,
                'loss': 0.1234,
                'status': 'in_progress'
            }
        })
    return {"message": "model trained"}
In Splunk, you can view epoch metrics with a search such as the following:
index=ml_logs status=in_progress | timechart avg(loss) by epoch
Example workflow
The following is an example workflow for monitoring container health:
- The Docker or Kubernetes environment is set up with Splunk Connect or a Splunk Observability agent.
- Container launched:
| fit MLTKContainer algo=...
- Once the container is launched DSDL automatically collects logs. In DSDL go to Configuration, then Containers, then <container_name> to see container details and logs.
- If Observability is enabled, container endpoints generate Otel traces. CPU and memory metrics flow to Splunk Observability.
- If DSDL calls hec.send(...), partial training logs appear in Splunk.
- All data including logs, traces, and metrics correlates for a 360° view.
Container monitoring guidelines
Consider the following guidelines when implementing container monitoring:
- Limit log verbosity: High-performance computing (HPC) tasks can produce large logs. Use moderate logging levels.
- Check _internal: For container startup or firewall issues, search index=_internal "mltk-container".
- Secure Observability: Use transport layer security (TLS) for container endpoints, and secure tokens for Splunk Observability.
- Combine with container management: For concurrency, GPU usage, or development or production containers, see Container management and scaling.
Troubleshooting container monitoring
See the following issues you might experience with container monitoring and how to resolve them:
Issue | How to troubleshoot |
---|---|
Container fails to launch | Likely caused by Docker or Kubernetes being unreachable, or by a firewall setting blocking the management port. Check the _internal logs with index=_internal "mltk-container". |
Observability is toggled on, but no Otel traces appear | Likely caused by an incorrect Observability token or endpoint, or by a container configuration that has not been updated. In DSDL go to Setup, then Observability Settings, and confirm a valid token and endpoint. Ensure the container restarts if you make any changes. |
HPC tasks with large logs flooding Splunk | Likely caused by overly verbose training prints or debug mode. Switch to a more moderate log level such as INFO or WARN. |
GPU usage not recognized in Observability dashboards | Likely caused by a missing NVIDIA device plugin or GPU operator in Kubernetes or OpenShift. Check node labeling, device plugin logs, and GPU operator deployment. |
HEC events missing in Splunk | Likely caused by the wrong HEC token, disabled HEC, or an endpoint mismatch. Check the Splunk HEC Settings on the DSDL Setup page and confirm the token and endpoint are correct. |