Container monitoring and logging
The Splunk App for Data Science and Deep Learning (DSDL) leverages external containers for computationally intensive tasks. Monitoring these containers is crucial for debugging, operational awareness, seamless model development, and a stable production environment.
Learn about collecting logs, capturing performance metrics, automatically instrumenting containers with OpenTelemetry, and surfacing container health in the Splunk platform or Splunk Observability Cloud.
Overview
When you run the fit or apply commands in DSDL, a Docker, Kubernetes, or OpenShift container is spun up to execute model training or inference.
Monitoring container health with the Splunk platform and Splunk Observability Cloud includes the following:
- Collecting container logs (stdout/stderr, custom logs) in Splunk to debug errors or confirm job completion.
- Capturing performance metrics (CPU, memory, GPU usage) in real time to diagnose slow jobs or resource bottlenecks.
- Automatically enabling OpenTelemetry instrumentation for container endpoints in Splunk Observability Cloud (if toggled on in DSDL's Observability Settings).
- Ensuring enterprise-level reliability by setting up dashboards, alerts, or even autoscaling triggers based on these metrics.
DSDL includes the following logs and telemetry data to help inform container health:
- Splunk _internal logs about container management.
- Container stdout/stderr (ML library messages, Python prints).
- Custom logs or metrics you send to Splunk HEC.
- (Optional) OpenTelemetry data automatically sent to Splunk Observability once you enable Observability in the DSDL Setup.
Container logs in the Splunk platform
Review the following for descriptions of the container logs provided in the Splunk platform.
MLTK container logs
MLTK container logs are generated when DSDL tries to start or stop a container, or hits network issues. These logs are stored in the _internal index with "mltk-container" in the messages.
Example:
index=_internal "mltk-container"
If containers fail to launch, these logs show network errors or Docker or Kubernetes API rejections. Repeated connection attempts can indicate firewall or TLS misconfigurations.
Automatic container logs with the Splunk REST API
Once a container is successfully deployed, DSDL automatically collects logs through the Splunk REST API. Automatic container logs are useful for quick debugging or reviewing final outputs when a machine learning job is complete.
If the container fails before initialization, automatic container logs might not appear. Check _internal logs instead.
You can view the logs by navigating in DSDL to Configuration, then Containers, and then selecting the container name, for example __DEV__.
The selected container page shows the following details:
- Container Controls: The container image, cluster target, GPU runtime, and container mode of DEV or PROD.
- Container Details: Key-value pairs for api_url, runtime, mode, and others.
- Container Logs: A table or search result similar to the following:
| rest splunk_server=local services/mltk-container/logs/<container_name> | eval _time = strptime(_time,"%Y-%m-%dT%H:%M:%S.%9N") | sort - _time | fields - splunk_server
This table surfaces the real-time logs captured from the container once it's fully deployed.
Custom Python logging
If your notebook code logs with print(...) or with the Python logging module, the output is captured in the container's stdout or stderr streams.
HPC tasks with high verbosity can produce large volumes of logs. Consider limiting the log level to INFO or WARN.
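For example, the following is a minimal sketch of how you might configure the standard Python logging module in notebook code so that only INFO-level and higher messages reach the container's output streams. The logger name and messages are illustrative, not part of DSDL.
import sys
import logging

# Route log records to stdout so the container log collection picks them up,
# and drop anything below INFO to keep log volume manageable.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger("dsdl_notebook")  # illustrative logger name

logger.debug("per-batch details")  # suppressed at INFO level
logger.info("epoch finished")      # appears in the container logs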
Resource metrics
Tracking resource usage can help identify if your training jobs are hitting resource bottlenecks or if you need better scheduling or bigger node types.
See the following table for the metrics available with each container provider:
Container provider | Description |
---|---|
Docker | Use docker stats for ephemeral checks, or a cAdvisor-based approach to forward metrics into the Splunk platform or Splunk Observability Cloud. |
Kubernetes or OpenShift | Use the Kubernetes metrics API or Splunk Connect for Kubernetes for CPU, memory, and node metrics. GPU usage requires the NVIDIA device plugin or GPU operator. |
You can also set up alerts for abnormal usage patterns or container crash loops, improving reliability.
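For Docker, for example, the following sketch shows one way to take an ephemeral usage snapshot similar to docker stats by using the Docker SDK for Python (the docker package, assumed to be installed). The CPU calculation is an approximation, and how you forward the values to the Splunk platform or Splunk Observability Cloud depends on your own metrics pipeline.
import docker  # Docker SDK for Python; assumed to be installed (pip install docker)

client = docker.from_env()

for container in client.containers.list():
    stats = container.stats(stream=False)  # one-off snapshot instead of a stream

    # Memory usage and limit as reported by the Docker API.
    mem_usage = stats["memory_stats"].get("usage", 0)
    mem_limit = stats["memory_stats"].get("limit", 1)

    # Approximate host-wide CPU utilization from the cumulative counters.
    precpu = stats.get("precpu_stats", {})
    cpu_delta = (stats["cpu_stats"]["cpu_usage"]["total_usage"]
                 - precpu.get("cpu_usage", {}).get("total_usage", 0))
    system_delta = (stats["cpu_stats"].get("system_cpu_usage", 0)
                    - precpu.get("system_cpu_usage", 0))
    cpu_percent = (cpu_delta / system_delta * 100.0) if system_delta > 0 else 0.0

    print(f"{container.name}: cpu={cpu_percent:.1f}% mem={mem_usage / mem_limit * 100.0:.1f}%")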
Automatic OpenTelemetry instrumentation
OpenTelemetry instrumentation can provide advanced insights into container endpoint usage, request durations, and data flow for HPC or microservices-based machine learning pipelines.
Once you turn on Splunk Observability, DSDL can automatically instrument container endpoints with OpenTelemetry.
Complete the following steps:
- In DSDL go to Setup, and then Observability Settings.
- Select Yes to turn on Observability.
- Complete the required fields:
- Splunk Observability Access Token: Add your Observability ingest token.
- Open Telemetry Endpoint: For example, https://ingest.eu0.signalfx.com.
- Open Telemetry Servicename: For example, dsdl.
- Save your changes.
Upon completion, all container endpoints, including training and inference calls, generate Otel traces. These traces are automatically stored in Splunk Observability Cloud for deeper analysis, including request latency and container-level CPU and memory correlation.
Instrumentation is done automatically. You do not need to manually set up Otel libraries in your notebook code.
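The automatic instrumentation covers the container endpoints themselves. If you optionally want extra custom spans around specific steps in your notebook code, you can use the OpenTelemetry API directly. The following is a minimal sketch that assumes the opentelemetry-api package is available in the container and that the automatic instrumentation has already configured a tracer provider; the span and attribute names are illustrative.
from opentelemetry import trace

# Reuse the tracer provider configured by the automatic instrumentation.
tracer = trace.get_tracer("dsdl.notebook")  # illustrative instrumentation name

def train(model, df, param):
    # Wrap the training step in a custom span so it appears alongside
    # the automatically generated endpoint traces.
    with tracer.start_as_current_span("model_training") as span:
        span.set_attribute("dsdl.rows", len(df))  # illustrative attribute
        # ... training code ...
    return model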
Sending model or training logs to the Splunk platform
Review the following options for sending model or training logs to the Splunk platform as a DSDL user.
Splunk HEC in DSDL
The Splunk HTTP Event Collector (HEC) option in DSDL lets you view partial results or step-by-step logs. You can combine HEC with container logs for full details.
In DSDL, navigate to the Setup page and provide your HEC token in the Splunk HEC Settings panel. Save your changes.
In your notebook code, use the following:
from dsdlsupport import SplunkHEC

hec = SplunkHEC.SplunkHEC()
hec.send({'event': {'message': 'operation done'}, 'time': 1692812611})
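As a further illustration, the following sketch attaches the current epoch time and structured fields to an event. It builds on the SplunkHEC helper shown above; the field names are illustrative, and the index and sourcetype the events land in depend on your HEC token configuration.
import time

from dsdlsupport import SplunkHEC

hec = SplunkHEC.SplunkHEC()

# Send a structured status event stamped with the current time.
hec.send({
    'event': {
        'message': 'preprocessing finished',  # illustrative fields
        'rows_processed': 125000
    },
    'time': int(time.time())
})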
Logging epoch metrics
You can use epoch metrics to visualize model training progression in near real time.
Example
def fit(model, df, param):
    for epoch in range(10):
        # ...
        hec.send({
            'event': {
                'epoch': epoch,
                'loss': 0.1234,
                'status': 'in_progress'
            }
        })
    return {"message": "model trained"}
In Splunk, you can view epoch metrics with a search such as the following:
index=ml_logs status=in_progress | timechart avg(loss) by epoch
Example workflow
The following is an example workflow for monitoring container health:
- The Docker or Kubernetes environment is set up with Splunk Connect or a Splunk Observability agent.
- Container launched:
| fit MLTKContainer algo=...
- Once the container is launched DSDL automatically collects logs. In DSDL go to Configuration, then Containers, then <container_name> to see container details and logs.
- If Observability is enabled, container endpoints generate Otel traces. CPU and memory metrics flow to Splunk Observability.
- If DSDL calls hec.send(...), partial training logs appear in Splunk.
- All data including logs, traces, and metrics correlates for a 360° view.
Container monitoring guidelines
Consider the following guidelines when implementing container monitoring:
- Limit log verbosity: High-performance computing (HPC) tasks can produce large logs. Use moderate logging levels.
- Check _internal: For container startup or firewall issues, search index=_internal "mltk-container".
- Secure Observability: Use transport layer security (TLS) for container endpoints, and secure tokens for Splunk Observability.
- Combine with container management: For concurrency, GPU usage, or development or production containers, see Container management and scaling.
Troubleshooting container monitoring
See the following issues you might experience with container monitoring and how to resolve them:
Issue | How to troubleshoot |
---|---|
Container fails to launch | Likely caused by Docker or Kubernetes being unreachable, or by a firewall setting blocking the management port. Check the _internal logs with index=_internal "mltk-container". |
Observability is toggled on, but no Otel traces appear | Likely caused by an incorrect Observability token or endpoint, or by a container configuration that has not been updated. In DSDL go to Setup, then Observability Settings, and confirm a valid token and endpoint. Ensure the container restarts if you make any changes. |
HPC tasks with large logs flooding Splunk | Likely caused by overly verbose training prints or debug mode. Switch to a more moderate log level such as INFO or WARN. |
GPU usage not recognized in Observability dashboards | Likely caused by a missing NVIDIA device plugin or GPU operator in Kubernetes or OpenShift. Check node labeling, device plugin logs, and GPU operator deployment. |
HEC events missing in Splunk | Likely caused by the wrong HEC token, disabled HEC, or an endpoint mismatch. Check the Splunk HEC Settings on the DSDL Setup page and confirm the token and endpoint are correct. |