Performance tuning and handling large datasets

The Splunk App for Data Science and Deep Learning (DSDL) lets you combine the power of Splunk platform search with container-based machine learning workloads. However, datasets with millions or billions of events, if not managed effectively, can push container memory and CPU usage to their limits and affect your costs and app performance.

When you run the fit or apply commands on multiple terabytes of data in the Splunk platform, the container environment must handle that data in memory or stream it in a distributed manner. This can lead to the following issues:

  • Container memory overruns if the container tries to load a large DataFrame at once.
  • Time-outs if the job exceeds the maximum search or container run time.
  • Excessive CPU usage that over-taxes the HPC or container node when data is not sampled or partitioned.

Review best practices for performance tuning, data partitioning, data preprocessing, sampling strategies, and resource configuration in Docker, Kubernetes, or OpenShift when working with large datasets.

Data filtering and partitioning

Review the options to filter and partition your data.

Use SPL to filter and summarize data

You can filter your data with SPL before sending it to the container as shown in the following example:

index=your_big_index
| search eventtype=anomalies source="some_path"
| stats count by user
| fit MLTKContainer ...

Aggregate or summarize large logs if you only need aggregated features. For time-series data, consider summarizing by the minute or hour, for example with a bin or timechart command.

Running the fit command on raw events can exhaust container memory. Follow best practices for data preparation.

Use data splitting or partitioning

For large training sets, partition the data across multiple Splunk platform searches or chunked time intervals:

  1. Search and stage the first chunked interval, as shown in the following example:
    index=your_big_index earliest=-30d latest=-15d
    | fit MLTKContainer mode=stage ...
    
  2. Then run the same search over the next chunked interval, from latest=-15d to the present.

    If your algorithm or code supports incremental or partial training, your notebook can handle partial merges or checkpointing, as shown in the sketch following this list.
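
The following is a minimal, hypothetical notebook-side sketch of that checkpointing pattern. It assumes a scikit-learn estimator that supports partial_fit and a writable checkpoint path inside the container; the function name, paths, and column names are illustrative, not the exact DSDL notebook template.

import os
import joblib
import pandas as pd
from sklearn.linear_model import SGDRegressor

# Hypothetical checkpoint location inside the container.
CHECKPOINT = "/srv/notebooks/data/bigmodel_checkpoint.pkl"

def fit_chunk(df: pd.DataFrame):
    # Resume from the previous run's checkpoint if one exists,
    # otherwise start a fresh incremental model.
    if os.path.exists(CHECKPOINT):
        model = joblib.load(CHECKPOINT)
    else:
        model = SGDRegressor()

    # "target" and the remaining feature columns are placeholders for your own schema.
    X = df.drop(columns=["target"])
    y = df["target"]
    model.partial_fit(X, y)

    joblib.dump(model, CHECKPOINT)
    return model

Each staged time range then contributes one partial_fit pass, and the checkpoint from the last run holds the combined model.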

Container resource tuning

Review the following options for container resource tuning.

CPU and memory requests

In Kubernetes or OpenShift, set resources.requests.memory and resources.limits.memory to a higher amount if you anticipate large in-memory DataFrames.

Example:

resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "16Gi"

In Docker single-host setups, pass --memory 16g --cpus 4 to limit the container to 16 GB of memory and 4 CPU cores. Adjust these values to your workload.

GPU considerations

If your algorithm relies on GPUs, make sure the container requests a GPU resource, such as nvidia.com/gpu: 1 in Kubernetes. GPU usage helps with large neural network training, but your code must be optimized for multi-GPU or HPC setups if you exceed the capacity of a single GPU.
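
As a hedged illustration of the code side, the following PyTorch sketch selects whatever GPUs the container exposes and falls back to the CPU; the model definition is a placeholder.

import torch
import torch.nn as nn

# Use the GPU(s) the container was granted, or fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))  # placeholder model
if torch.cuda.device_count() > 1:
    # DataParallel is the simplest multi-GPU wrapper; DistributedDataParallel
    # scales better but needs a process-group setup across workers.
    model = nn.DataParallel(model)
model = model.to(device)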

Development and production containers

Development containers might only need a small memory limit for iterative notebook coding with a sample dataset. Production containers often require more robust resource allocations if the final dataset is large. Plan your HPC node or Docker host capacity accordingly.

Data sampling and splitting with SPL

Review methods by which you can use SPL for data sampling and data splitting.

Use the sample command

You can use the ML-SPL sample command to randomly downsample your data, for example to 10,000 events, which gives you a quick dataset for development or prototyping:

| sample 10000

For final model training, remove or reduce the sample command usage, or use partial sampling as shown in the following example:

| sample partitions=10 seed=42 | where partition_number<7

Partitioning large data

You can partition data using sample partitions=N or a modulo operation on a numeric field.
If your code supports incremental training, you can feed in each partition separately as shown in the following example:

index=your_big_index
| sample partitions=10 seed=42
| where partition_number < 8
| fit MLTKContainer algo=...

Then combine or continue training with other partitions in separate runs.

Data summaries

For extremely large sets of raw data, you can perform an initial stats or timechart command to reduce the data volume as shown in the following example:

index=your_big_index
| stats avg(value) as avg_value, count by some_field
| fit MLTKContainer ...

This approach is useful for time-series or aggregator-based machine learning tasks.

Managing memory-intensive code

Consider the following options when trying to manage memory-intensive machine learning code.

  • In-notebook chunking: If your code reads the entire DataFrame at once, consider chunking inside the notebook. For example, use pandas.read_csv(..., chunksize=100000) if you load data from a CSV file. For multi-million row data, libraries like Dask, Vaex, or Spark in your container can help and handle out-of-core operations.

  • Partial-fit or streaming algorithms: Some scikit-learn or River libraries support partial_fit for incremental learning. If you define partial_fit logic in your notebook, you can stage data chunks one-by-one from the Splunk platform, as shown in the sketch after this list.

  • HPC and distributed approaches: For truly massive data, you can perform distributed training with Spark, PyTorch DDP, or Horovod. DSDL can start the container, but you handle the multi-node distribution. Alternatively, rely on a separate HPC job manager to orchestrate multi-node training, then push the final results back into the Splunk platform.
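
The following is a minimal sketch that combines in-notebook chunking with an incremental scikit-learn estimator. The CSV path, column names, and chunk size are hypothetical; adapt them to the data you stage from the Splunk platform.

import pandas as pd
from sklearn.linear_model import SGDClassifier

CSV_PATH = "/srv/notebooks/data/staged_data.csv"  # hypothetical staged file
CLASSES = [0, 1]  # partial_fit requires the full set of labels up front

model = SGDClassifier()

# Read the file in 100,000-row chunks so only one chunk is held in memory at a time.
for chunk in pd.read_csv(CSV_PATH, chunksize=100000):
    X = chunk.drop(columns=["label"])
    y = chunk["label"]
    model.partial_fit(X, y, classes=CLASSES)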

Managing timeout and resource limits

Review the following options to manage your timeout and resource limits:

  • Max search runtime: By default, the Splunk platform can kill searches that exceed certain CPU or wall-clock time limits. You can increase or remove that limit if your training or inference is known to run long. Be mindful when increasing the default limits so that you don't cause issues on your search head.

  • DSDL container max time: In DSDL, you can configure a maximum model runtime or an idle kill threshold. If you expect a multi-hour HPC training job, increase these timeouts so the job doesn't terminate prematurely.

  • HPC queue or Splunk scheduler: If HPC usage is managed by a queue system such as Slurm, you might want to orchestrate jobs outside of Splunk platform scheduling. Alternatively, you can set extended Splunk search timeouts so HPC tasks can complete properly.

Splunk Observability for large jobs

Review the following options for using Splunk Observability with large, data-intensive jobs:

  • Resource metrics: For large workloads, track container CPU, memory, or GPU usage in Splunk Observability or the _internal logs. If usage spikes or the container hits OOM kills, you'll see it in the container logs or HPC logs.

  • Step-by-step logging: If your training takes hours, consider streaming intermediate logs to Splunk HEC, or writing partial logs to stdout, so that you can see progress, as shown in the sketch after this list. Splunk or Observability-based alerts can notify you if usage patterns deviate from normal, for example if memory climbs unexpectedly.
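
For step-by-step logging, the following is a minimal sketch of sending progress events from the notebook to a Splunk HTTP Event Collector (HEC) endpoint. The URL, token, and sourcetype are placeholders for your own HEC configuration.

import json
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder endpoint
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder token

def log_progress(epoch, loss):
    # Send one structured event per epoch so you can monitor long-running
    # training with a Splunk search while the job is still in progress.
    payload = {
        "event": {"epoch": epoch, "loss": float(loss)},
        "sourcetype": "dsdl:training:progress",
    }
    requests.post(
        HEC_URL,
        headers={"Authorization": "Splunk " + HEC_TOKEN},
        data=json.dumps(payload),
        timeout=5,
    )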

Example: Large dataset workflow

The following is an example of a large dataset workflow:

  1. Data preparation: The following example code summarizes data before sending it to the container.
    index=huge_data
    | search user_type=customer
    | stats count, avg(metric) as avg_metric by user
    | fit MLTKContainer algo=big_notebook into app:BigModel
    
  2. Create the container: For Docker or Kubernetes, choose 16 GB of memory, 4 CPUs, and possibly 1 GPU.

    The code in your notebook uses scikit-learn or PyTorch with chunked data reading as needed.

  3. Long-running jobs: The Splunk platform might only hold the search open for 2 hours. If so, increase the maximum search time or rely on an HPC queue outside of the Splunk platform.

    Container logs or partial metrics appear in _internal or the container logs index.

  4. Save the model: The model is saved as app:BigModel. HPC ephemeral volumes are irrelevant because DSDL syncs the final artifacts back to the Splunk platform.

Troubleshooting performance tuning and large datasets

See the following list of issues you might experience, their likely causes, and how to resolve them:

  • Container hits an OOM kill mid-training.
    Cause: The dataset is too large or the container memory limit is too low.
    Resolution: Increase the memory requests in Kubernetes or the Docker --memory setting. Reduce the data size using SPL, or chunk the data.

  • The Splunk platform kills the search after N minutes.
    Cause: The Splunk platform default search timeout, or the MLTK container maximum run time, is too small.
    Resolution: Adjust the max_search_runtime or the container idle kill threshold in DSDL under Setup.

  • An HPC cluster node runs out of GPU memory.
    Cause: The model or data batch size is too large for the GPU.
    Resolution: Adjust your code to reduce the batch size, use smaller model layers, or move to multi-GPU.

  • You see "RuntimeError: CUDNN_STATUS_ALLOC_FAILED" in the container logs.
    Cause: You are out of memory on the GPU, or there is another resource conflict.
    Resolution: Check the container logs and consider a smaller batch size. You can also re-check HPC job scheduling if multiple GPU tasks are overlapping.

  • Partial data is loaded, but the container never finishes the fit.
    Cause: There might be insufficient filtering or no chunking method in your code, leading to a large data load.
    Resolution: Use SPL to summarize or chunk the data. Consider adding partial_fit logic in the notebook.