Troubleshoot your Splunk Data Stream Processor deployment

Use this information to troubleshoot issues relating to the Splunk Data Stream Processor (DSP) installation and deployment.

Support

To report bugs or receive additional support, do the following:

Ask questions and get answers through community support at Splunk Answers.
If you have a support contract, file a case using the Splunk Support Portal. See Support and Services.
If you have a support contract, contact Splunk Customer Support.
To get professional help with optimizing your Splunk software investment, see Splunk Services.

When contacting Splunk Customer Support, provide the following information:

Information to provide	Notes
Pipeline ID	To view the ID of a pipeline, open the pipeline in DSP, then click the options button and click Update Pipeline Metadata.
Pipeline name	N/A
DSP version	To view your DSP version, in the product UI, click Help & Feedback > About.
DSP diagnostic report	A DSP diagnostic report contains all DSP application logs as well as system and monitoring logs. To generate this report, do the following: Navigate to the working directory of a DSP master node. To identify the master nodes in your DSP cluster, run the `gravity status` command and check the `Cluster nodes` section of the returned information. By default, the name of the working directory is dsp-<version>-linux-amd64, where <version> is the DSP version that you're running. Run the following command: `sudo ./report` The command creates a diagnostic report named dsp-report-<timestamp>.tar.gz in the working directory.
Summary of the problem and any additional relevant information	N/A
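
Before attaching the report to a support case, it can help to confirm that the archive exists and is readable. The following is a minimal sketch, assuming the dsp-report-<timestamp>.tar.gz naming described above. The function name latest_dsp_report is hypothetical and is not part of DSP.

```shell
# Print the newest diagnostic report in a directory (default: current directory).
# Assumes the dsp-report-<timestamp>.tar.gz naming described above.
latest_dsp_report() {
  # ls -t sorts by modification time, newest first.
  ls -t "${1:-.}"/dsp-report-*.tar.gz 2>/dev/null | head -n 1
}

# Example usage from the DSP working directory:
#   tar -tzf "$(latest_dsp_report)" | head
```

If the function prints nothing, no report was found in that directory.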

[ERROR]: cannot allocate memory

DSP on RHEL 7 or CentOS 7 fails with a warning message similar to the following:

Warning FailedCreatePodContainer 5s (x2 over 16s) kubelet, 10.234.0.181 unable to ensure pod container exists: failed to create container for [kubepods burstable poded9bd025-c3e4-4ebb-a5b7-2a7adab9742d] : mkdir /sys/fs/cgroup/memory/kubepods/burstable/poded9bd025-c3e4-4ebb-a5b7-2a7adab9742d: cannot allocate memory

Cause

This warning is caused by a bug in older RHEL 7 and CentOS 7 kernels, such as v3.10.0-1127.19.1.el7, in combination with systemd v231 or earlier, where kernel-memory cgroups are not cleaned up properly. The bug manifests as a memory allocation error when new pods are created. For more information, see the Kubernetes bug report: Kubelet CPU/Memory Usage linearly increases using CronJob.

Solution

Upgrade systemd to v232 or later, or disable kernel memory accounting by setting cgroup.memory=nokmem.

Do the following steps to disable kernel memory accounting:

Find the default kernel. The command returns the path of the default kernel image under /boot.
```
grubby --default-kernel
```
Disable kernel memory accounting by adding cgroup.memory=nokmem to the kernel boot parameters.
grubby --args=cgroup.memory=nokmem --update-kernel /boot/<kernel_version>
Reboot the host.
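
After the reboot, you can confirm that the parameter took effect by inspecting the kernel command line. The following is a minimal sketch. The function name nokmem_enabled is hypothetical, and the check assumes that cgroup.memory=nokmem appears as a standalone parameter rather than combined with other cgroup.memory options.

```shell
# Succeeds if the kernel command line contains cgroup.memory=nokmem.
# The optional argument overrides /proc/cmdline, mainly so the check can be tested.
nokmem_enabled() {
  # Split the command line into one parameter per line, then match the whole line.
  tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qxF 'cgroup.memory=nokmem'
}

# Example usage:
#   nokmem_enabled && echo "kernel memory accounting is disabled"
```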

[ERROR]: waiting for agents to join:

You may see this error while running the installer.

Cause

The DSP installer waits five minutes for nodes to join your cluster. If fewer than three nodes have joined the cluster when the five minutes have elapsed, the installer times out.

Solution

You must remove all nodes from Gravity and start the installation process again.

On the master node, run `sudo gravity leave --force --confirm` to force the node to leave the Gravity cluster.
Delete the .gravity directory:
```
sudo rm -rf gravity/.gravity
```
Make sure that you have all three nodes prepared, and then start the installation process over again.

[ERROR]: The following pre-flight checks failed:

The DSP installer fails to complete because of pre-flight checks.

Cause

The DSP installer runs pre-flight checks to make sure that your system meets the minimum requirements for DSP. If your system does not meet these requirements, the installer quits.

Solution

The installer reports which pre-flight checks failed. Use that information to confirm that your system meets the mandatory hardware and software requirements for DSP. See Hardware and Software Requirements.

[ERROR]: The following pre-flight checks failed: XXGB available space left on /var/data, minimum of 175GB is required

The DSP installer fails to complete because there isn't enough space left on /var/data even if another disk volume or partition is specified with --location.

Cause

The DSP installer runs a pre-flight check to make sure that your system has enough drive space on /var/data, even if you have used --location to install DSP on another drive or partition. If there isn't enough disk space on /var/data, the pre-flight check fails with a disk space error.

Solution

Add a symlink from your intended install location to /var/data. For example, if you want to use --location /data, then add the following symlink.

ln -s /data /var/data
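
Before you create the symlink, you might want to confirm that the intended install location actually has the 175GB that the pre-flight check asks for. The following is a minimal sketch. The function names are hypothetical, and the calculation uses units of 1024^3 bytes, which might differ slightly from how the installer measures space.

```shell
# Print the space available on the filesystem that holds a path, in whole GB.
avail_gb() {
  # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeeds if the path has at least the required space (default: 175 GB).
has_required_space() {
  [ "$(avail_gb "$1")" -ge "${2:-175}" ]
}

# Example usage:
#   has_required_space /data || echo "not enough space on /data"
```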

ImagePullBackOff status shown in Kubernetes after upgrading DSP

After upgrading DSP, Kubernetes reports the status ImagePullBackOff for previously scheduled jobs in the Amazon CloudWatch Metrics, Amazon S3, AWS Metadata, Google Cloud Monitoring, Microsoft 365, and Microsoft Azure Monitor connectors.

Cause

If you do not deactivate all schedules in the Amazon CloudWatch Metrics, Amazon S3, AWS Metadata, Google Cloud Monitoring, Microsoft 365, or Microsoft Azure Monitor connectors before upgrading your DSP deployment, the Kubernetes container image name used by previously scheduled jobs in these connectors is not updated. The scheduled jobs continue to use the previous container image name and stall the next time they run. Kubernetes reports the status ImagePullBackOff for each scheduled job that runs and attempts to use the previous container image name.

Solution

You must use kubectl patch to update the Kubernetes container image names and then delete any scheduled jobs that are stalled.

Run the following script to update the scheduled jobs to use the new Kubernetes container images.

for c in `./scloud.v4 streams list-connections | jq '.items[] | select(.connectorId == "aws-s3" or .connectorId == "aws-cloudwatch-metrics" or .connectorId == "aws-metadata" or .connectorId == "gcp-monitoring-metrics" or .connectorId == "microsoft-365" or .connectorId == "azure-monitor-metrics") .id' | sed -e 's/"//g'` ;
do
kubectl patch cronjob job-trigger-$c -n lsdc -p "{\"spec\": {\"jobTemplate\": {\"spec\": {\"template\": {\"spec\": {\"containers\": [ { \"name\": \"job-trigger\", \"image\": \"leader.telekube.local:5000/lsdc/job_trigger:release-$(cat ./dsp-version)\"}]}}}}}}"
done

Run the following script to delete the stalled scheduled jobs.

for c in `kubectl -n lsdc get pods -o json | jq '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") .metadata.labels."job-name"' | sed -e 's/"//g'` ;
do
kubectl -n lsdc delete jobs $c
done
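
After running both scripts, you can check whether any scheduled job still points at an older image. The following is a minimal sketch. The function name stale_job_images is hypothetical, and it assumes that the expected image tag is release-<version>, as in the patch command above.

```shell
# Read "name<TAB>image" lines on stdin and print those whose image
# does not end with the tag release-<version>.
stale_job_images() {
  awk -F '\t' -v tag=":release-$1" '
    NF && substr($2, length($2) - length(tag) + 1) != tag { print $1 "\t" $2 }'
}

# Example usage, run from the DSP working directory:
#   kubectl get cronjob -n lsdc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.jobTemplate.spec.template.spec.containers[0].image}{"\n"}{end}' \
#     | stale_job_images "$(cat ./dsp-version)"
```

No output means that every listed cron job uses the current image tag.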

The DSP installer fails to complete due to clocks being out of sync

During DSP installation, the console returns the following error message.

Operation failure: servers ip-10-216-29-75 and ip-10-216-29-6 clocks are out of sync: Fri Sep 11 22:23:01.863 UTC and Fri Sep 11 22:23:02.562 UTC respectively, sync the times on servers before install, e.g. using ntp

Cause

The time difference between servers is greater than 300 milliseconds.

Solution

Synchronize the system clocks on each node. For most environments, Network Time Protocol (NTP) is the best approach. Consult the system documentation for the particular operating systems on which you are running the Splunk Data Stream Processor. If you are running DSP on an AWS EC2 environment, see "Setting the time for your Linux instance" in the Amazon Web Services documentation. If you are running DSP on a different environment, see "NTP" in the Debian documentation or the Chrony documentation.
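
To check two nodes against the 300 millisecond limit, you can compare their clocks directly. In the sample error above, the two timestamps differ by 699 milliseconds. The following is a minimal sketch. The function names and the host names node1 and node2 are hypothetical, and the %3N format requires GNU date. Because the ssh round trip adds its own delay, treat the result as an estimate.

```shell
# Print the absolute difference between two epoch timestamps given in milliseconds.
skew_ms() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( 0 - d )); fi
  echo "$d"
}

# Succeeds if the skew exceeds the limit (default: 300 ms, the limit stated above).
skew_too_large() {
  [ "$(skew_ms "$1" "$2")" -gt "${3:-300}" ]
}

# Example usage:
#   skew_too_large "$(ssh node1 date +%s%3N)" "$(ssh node2 date +%s%3N)" \
#     && echo "clocks are out of sync"
```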

Network bridge driver loading issues

Depending on the system configuration, network bridge drivers may not be loaded. If they are not loaded, the install fails at the /health phase. See installation checklist.

Installation failure due to disabled Network Bridge Driver

The installation fails with the following error message:

[ERROR]: failed to execute phase "/health" planet is not running yet: &{degraded [{ 10.216.31.29 master  degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.218 master  degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.252 master  healthy []}]} (planet is not running yet: &{degraded [{ 10.216.31.29 master  degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.218 master  degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.252 master  healthy []}]})

You must do the following on each node:

sysctl -w net.bridge.bridge-nf-call-iptables=1

echo net.bridge.bridge-nf-call-iptables=1 >> /etc/sysctl.d/10-bridge-nf-call-iptables.conf

Then, restart the installation process.
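
Before restarting the installation, you can verify the setting on each node. The following is a minimal sketch. The function name bridge_nf_state is hypothetical. If it reports "missing", the sysctl key does not exist, which usually means that the bridge netfilter kernel module is not loaded. On many kernels that module is named br_netfilter and can be loaded with modprobe, but the module name is an assumption about your kernel.

```shell
# Report the state of net.bridge.bridge-nf-call-iptables.
# The optional argument overrides the /proc/sys root, mainly so the check can be tested.
bridge_nf_state() {
  f="${1:-/proc/sys}/net/bridge/bridge-nf-call-iptables"
  if [ ! -r "$f" ]; then
    echo "missing"    # key is absent: the bridge netfilter module is probably not loaded
  elif [ "$(cat "$f")" = "1" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

# Example usage:
#   bridge_nf_state
```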

Error: "Unable to create inotify watch (no space left on device)"

You might encounter this error when running gravity status on your cluster.

Cause

There are several potential causes for this error. Your CPU might be excessively busy, or you might have run out of space in your container.

Solution

Increase fs.inotify.max_user_watches in /etc/sysctl.d/99-sysctl.conf to 1000000.

On each node, edit /etc/sysctl.d/99-sysctl.conf.
Add the following line to the file:
```
fs.inotify.max_user_watches=1000000
```
Save your changes.
From the command line of each node, run the following command to apply the changes. The sysctl command changes settings only on the host where it runs.
```
sysctl -p /etc/sysctl.d/99-sysctl.conf
```
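
To see how close a node is to the limit, you can count the inotify watches that are currently open and compare the result with the value in /proc/sys/fs/inotify/max_user_watches. The following is a minimal sketch. The function name count_inotify_watches is hypothetical. It relies on the Linux convention that each watch appears as a line beginning with "inotify" in /proc/<pid>/fdinfo/<fd>. Run it as root to include processes owned by other users. The limit applies per user, so the total across all users is only an upper bound.

```shell
# Count inotify watches under a /proc-like root (default: /proc).
count_inotify_watches() {
  # Unreadable or vanished files are skipped silently.
  cat "${1:-/proc}"/[0-9]*/fdinfo/* 2>/dev/null | grep -c '^inotify'
}

# Example usage:
#   echo "watches in use: $(count_inotify_watches)"
#   echo "per-user limit: $(cat /proc/sys/fs/inotify/max_user_watches)"
```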

Failing Flink checkpoints

You see Flink checkpoints failing to complete in time.

Cause

This could be due to known performance limitations with Minio.

Solution

Enable the sysctl configuration net.netfilter.nf_conntrack_tcp_be_liberal. You can do this by creating a configuration file in /etc/sysctl.d. In this example, we'll create a configuration file called 99-dsp.conf, where 99 means that this configuration will be applied last to ensure that it is not overwritten by accident.

On each node, create and edit a /etc/sysctl.d/99-dsp.conf file and enable the net.netfilter.nf_conntrack_tcp_be_liberal setting.
```
echo net.netfilter.nf_conntrack_tcp_be_liberal=1 >> /etc/sysctl.d/99-dsp.conf
```
From the command line of each node, run the following command to apply the changes. The sysctl command changes settings only on the host where it runs.
```
sysctl -p /etc/sysctl.d/99-dsp.conf
```
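
Because the echo command above appends to the file, running it more than once leaves duplicate lines in 99-dsp.conf. The following is a minimal sketch of an append that is safe to repeat. The function name ensure_line is hypothetical.

```shell
# Append a line to a file only if that exact line is not already present.
ensure_line() {
  # -x matches whole lines, -F treats the text as a fixed string, -q suppresses output.
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example usage (requires root):
#   ensure_line /etc/sysctl.d/99-dsp.conf net.netfilter.nf_conntrack_tcp_be_liberal=1
```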

Unable to log in to the Splunk Cloud Services CLI

Logging in to the Splunk Cloud Services CLI results in the following error: failed to get session token: failed to get valid response from csrfToken endpoint: Get "https://<ip_addr>:31000/csrfToken": x509: cannot validate certificate for <ip> because it doesn't contain any IP SANs

Cause

The Splunk Cloud Services CLI configuration file is incorrectly configured.

Solution

Make sure that your Splunk Cloud Services CLI settings are configured correctly, given the particular version of the Splunk Cloud Services CLI that you are using. See Configure the Splunk Cloud Services CLI.

DSP UI times out

The DSP UI appears to find the master node but fails to load.

Cause

The master node's IP address has been changed, and the DSP UI is trying to redirect your browser to a private IP. Such IP reassignments are common with various public cloud providers when servers are stopped.

Solution

Reconfigure the DSP UI redirect URL. See Configure the Data Stream Processor UI redirect URL.

My data is not making it into my pipeline

If data is not making it into your activated pipelines, check to see whether all the ingestion services are running in Kubernetes.

Cause

One of the ingestion services could be down.

Solution

Make sure that all the ingestion services are running. The ingestion services are ingest-hec, ingest-s2s, and splunk-streaming-rest. To list the pods and their status, run the following command:

kubectl get pods -n dsp
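
To narrow the output to the ingestion services, you can filter it for pods that are not running. The following is a minimal sketch. The function name ingest_pods_not_running is hypothetical, and it assumes that each pod name begins with the name of its service and that the third column of the kubectl output is the pod status.

```shell
# Read the output of "kubectl get pods -n dsp --no-headers" on stdin and print
# the name and status of each ingestion pod whose status is not "Running".
ingest_pods_not_running() {
  awk '$1 ~ /^(ingest-hec|ingest-s2s|splunk-streaming-rest)/ && $3 != "Running" { print $1, $3 }'
}

# Example usage:
#   kubectl get pods -n dsp --no-headers | ingest_pods_not_running
```

No output means that every matching pod reports the Running status. It does not confirm that a pod exists for each service.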
