Troubleshoot your Splunk Data Stream Processor deployment
Use this information to troubleshoot issues relating to the Splunk Data Stream Processor (DSP) installation and deployment.
Support
To report bugs or receive additional support, do the following:
- Ask questions and get answers through community support at Splunk Answers.
- If you have a support contract, file a case using the Splunk Support Portal. See Support and Services.
- If you have a support contract, contact Splunk Customer Support.
- To get professional help with optimizing your Splunk software investment, see Splunk Services.
When contacting Splunk Customer Support, provide the following information:
| Information to provide | Notes |
|---|---|
| Pipeline ID | To view the ID of a pipeline, open the pipeline in DSP, then click the options button and click Update Pipeline Metadata. |
| Pipeline name | N/A |
| DSP version | To view your DSP version, in the product UI, click Help & Feedback > About. |
| DSP diagnostic report | A DSP diagnostic report contains all DSP application logs as well as system and monitoring logs. The diagnostic command creates a report named dsp-report-<timestamp>.tar.gz in the working directory. |
| Summary of the problem and any additional relevant information | N/A |
[ERROR]: cannot allocate memory
DSP on RHEL 7 or CentOS 7 fails with a warning message similar to the following:
```
Warning FailedCreatePodContainer 5s (x2 over 16s) kubelet, 10.234.0.181 unable to ensure pod container exists: failed to create container for [kubepods burstable poded9bd025-c3e4-4ebb-a5b7-2a7adab9742d] : mkdir /sys/fs/cgroup/memory/kubepods/burstable/poded9bd025-c3e4-4ebb-a5b7-2a7adab9742d: cannot allocate memory
```
Cause
This warning is caused by a bug in older RHEL 7/CentOS 7 kernels, such as v3.10.0-1127.19.1.el7, in combination with systemd v231 or earlier, where kernel-memory cgroups are not cleaned up properly. This manifests as a memory allocation error when new pods are created. For more information, see the Kubernetes bug report: Kubelet CPU/Memory Usage linearly increases using CronJob.
Solution
Upgrade systemd to v232 or later, or disable kernel memory accounting by setting the cgroup.memory=nokmem kernel boot parameter.
Do the following steps to disable kernel memory accounting:
- Find the default kernel version:
  ```
  grubby --default-kernel
  ```
- Disable kernel memory accounting by adding cgroup.memory=nokmem to the kernel boot parameters:
  ```
  grubby --args=cgroup.memory=nokmem --update-kernel /boot/<kernel_version>
  ```
- Reboot the host.
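After the reboot, you can confirm that the parameter took effect. A quick verification sketch; it only reads the kernel command line and the boot entry you updated above:
```
# The running kernel should list the new boot parameter.
grep -o 'cgroup.memory=nokmem' /proc/cmdline

# Alternatively, inspect the configured boot entry.
grubby --info /boot/<kernel_version>
```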
[ERROR]: waiting for agents to join:
You may see this error while running the installer.
Cause
The DSP installer waits five minutes for nodes to join your cluster. If your cluster does not have the minimum of three nodes by the time those five minutes have elapsed, the installer times out.
Solution
You must remove all nodes from Gravity and start the installation process again.
- On the master node, run the following command to force the node to leave the Gravity cluster:
  ```
  sudo gravity leave --force --confirm
  ```
- Delete the .gravity directory:
  ```
  sudo rm -rf gravity/.gravity
  ```
- Make sure that you have all three nodes prepared, and then start the installation process over again.
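Before reinstalling, you can sanity-check that a node has left the cluster. A minimal sketch, assuming the gravity binary is still present on the node:
```
# Should no longer report an active cluster once the node has left.
sudo gravity status

# Confirm the local state directory is gone.
ls -ld gravity/.gravity 2>/dev/null || echo ".gravity removed"
```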
[ERROR]: The following pre-flight checks failed:
The DSP installer fails to complete because one or more pre-flight checks failed.
Cause
The DSP installer runs pre-flight checks to make sure that your system meets the minimum requirements for DSP. If your system does not meet those requirements, the installer quits.
Solution
The installer reports which pre-flight checks failed. Using that information, double-check that you meet the mandatory hardware and software requirements for DSP. See Hardware and Software Requirements.
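If it is unclear which requirement a node misses, you can inspect the basics from a shell. A rough sketch; compare the output against the official requirements rather than the comments here:
```
# Number of CPU cores on this node.
nproc

# Total memory in GB.
free -g

# Free disk space on /var (and /var/data, if it exists).
df -h /var
```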
[ERROR]: The following pre-flight checks failed: XXGB available space left on /var/data, minimum of 175GB is required
The DSP installer fails to complete because there isn't enough space left on /var/data, even if another disk volume or partition is specified with --location.
Cause
The DSP installer runs a pre-flight check to make sure that your system has enough drive space on /var/data, even if you have used --location to install DSP on another drive or partition. If there isn't enough disk space on /var, the pre-flight check fails with a disk space error.
Solution
Add a symlink from your intended install location to /var/data. For example, if you want to use --location /data, then add the following symlink:
```
ln -s /data /var/data
```
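A short sanity check after creating the link, assuming /data is the volume you pass to --location:
```
# Confirm /var/data is a symlink pointing at /data.
ls -ld /var/data

# Confirm that /var/data now resolves to the intended volume.
df -h /var/data
```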
ImagePullBackOff status shown in Kubernetes after upgrading DSP
After upgrading DSP, Kubernetes reports the status ImagePullBackOff for previously scheduled jobs in the Amazon CloudWatch Metrics, Amazon S3, AWS Metadata, Google Cloud Monitoring, Microsoft 365, and Microsoft Azure Monitor connectors.
Cause
If you do not deactivate all schedules in the Amazon CloudWatch Metrics, Amazon S3, AWS Metadata, Google Cloud Monitoring, Microsoft 365, or Microsoft Azure Monitor connectors before upgrading your DSP deployment, the Kubernetes container image name used by previously scheduled jobs in these connectors is not updated. The scheduled jobs continue to use the previous Kubernetes container image name, and the jobs stall the next time they run. Kubernetes reports this with the status ImagePullBackOff for each scheduled job that runs and attempts to use the previous container image name.
Solution
You must use kubectl patch to update the Kubernetes container image names, and then delete any scheduled jobs that are stalled. A verification sketch follows the steps below.
- Run the following script to update the scheduled jobs to use the new Kubernetes container images:
  ```
  for c in $(./scloud.v4 streams list-connections \
      | jq '.items[] | select(.connectorId == "aws-s3" or .connectorId == "aws-cloudwatch-metrics" or .connectorId == "aws-metadata" or .connectorId == "gcp-monitoring-metrics" or .connectorId == "microsoft-365" or .connectorId == "azure-monitor-metrics") .id' \
      | sed -e 's/"//g'); do
    kubectl patch cronjob job-trigger-$c -n lsdc -p "{\"spec\": {\"jobTemplate\": {\"spec\": {\"template\": {\"spec\": {\"containers\": [ { \"name\": \"job-trigger\", \"image\": \"leader.telekube.local:5000/lsdc/job_trigger:release-$(cat ./dsp-version)\"}]}}}}}}"
  done
  ```
- Run the following script to delete the stalled scheduled jobs:
  ```
  for c in $(kubectl -n lsdc get pods -o json \
      | jq '.items[] | select(.status.containerStatuses[].state.waiting.reason == "ImagePullBackOff") .metadata.labels."job-name"' \
      | sed -e 's/"//g'); do
    kubectl -n lsdc delete jobs $c
  done
  ```
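After both steps, you can verify the result. In this sketch, <connection_id> is a placeholder for one of the connection IDs that the first loop iterates over:
```
# Read back the image of a patched cronjob.
kubectl -n lsdc get cronjob job-trigger-<connection_id> \
  -o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}'

# Should print nothing once all stalled jobs are deleted.
kubectl -n lsdc get pods | grep ImagePullBackOff
```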
The DSP installer fails to complete due to clocks being out of sync
During DSP installation, the console returns the following error message.
```
Operation failure: servers ip-10-216-29-75 and ip-10-216-29-6 clocks are out of sync: Fri Sep 11 22:23:01.863 UTC and Fri Sep 11 22:23:02.562 UTC respectively, sync the times on servers before install, e.g. using ntp
```
Cause
The time difference between servers is greater than 300 milliseconds.
Solution
Synchronize the system clocks on each node. For most environments, Network Time Protocol (NTP) is the best approach. Consult the system documentation for the particular operating systems on which you are running the Splunk Data Stream Processor. If you are running DSP on an AWS EC2 environment, see "Setting the time for your Linux instance" in the Amazon Web Services documentation. If you are running DSP on a different environment, see "NTP" in the Debian documentation or the Chrony documentation.
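For example, on RHEL or CentOS nodes you can typically use chrony. A minimal sketch, assuming the default NTP pool is reachable from your nodes:
```
# Install and start the chrony NTP client.
sudo yum install -y chrony
sudo systemctl enable --now chronyd

# Check the clock offset reported on each node.
chronyc tracking
```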
Network bridge driver loading issues
Depending on the system configuration, network bridge drivers might not be loaded. If they are not loaded, the installation fails at the /health phase. See the installation checklist.
Installation failure due to disabled Network Bridge Driver
The installation fails with the following error message:
```
[ERROR]: failed to execute phase "/health" planet is not running yet: &{degraded [{ 10.216.31.29 master degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.218 master degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.252 master healthy []}]} (planet is not running yet: &{degraded [{ 10.216.31.29 master degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.218 master degraded [kubernetes requires net.bridge.bridge-nf-call-iptables sysctl set to 1, https://www.gravitational.com/docs/faq/#bridge-driver]} { 10.216.31.252 master healthy []}]})
```
You must do the following on each node:
```
sysctl -w net.bridge.bridge-nf-call-iptables=1
echo net.bridge.bridge-nf-call-iptables=1 >> /etc/sysctl.d/10-bridge-nf-call-iptables.conf
```
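If sysctl reports that the net.bridge.bridge-nf-call-iptables key does not exist, the bridge netfilter kernel module is probably not loaded. The following sketch loads it using standard Linux mechanisms; the module and file names are common conventions, not DSP-specific:
```
# Load the bridge netfilter module now.
sudo modprobe br_netfilter

# Load it automatically on boot.
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf

# Confirm the sysctl key exists and is set to 1.
sysctl net.bridge.bridge-nf-call-iptables
```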
Then, restart the installation process.
Error:"Unable to create inotify watch (no space left on device)"
You might encounter this error when running gravity status on your cluster.
Cause
There are several potential causes for this error. Your CPU might be excessively busy, or the container might have exhausted its inotify watches; the "no space left on device" message refers to inotify resources rather than disk space.
Solution
Increase fs.inotify.max_user_watches in /etc/sysctl.d/99-sysctl.conf to 1000000.
- On each node, edit /etc/sysctl.d/99-sysctl.conf and add the following line:
  ```
  fs.inotify.max_user_watches=1000000
  ```
- Save your changes.
- From the command line of your master node, run the following command to deploy the changes:
  ```
  sysctl -p /etc/sysctl.d/99-sysctl.conf
  ```
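You can confirm that the new limit is active on each node:
```
# Should print: fs.inotify.max_user_watches = 1000000
sysctl fs.inotify.max_user_watches
```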
Failing Flink checkpoints
You see Flink checkpoints failing to complete in time.
Cause
This could be due to known performance limitations with MinIO.
Solution
Enable the sysctl setting net.netfilter.nf_conntrack_tcp_be_liberal. You can do this by creating a configuration file in /etc/sysctl.d. In this example, we'll create a configuration file called 99-dsp.conf, where the 99 prefix means that this configuration is applied last, ensuring that it is not overwritten by accident.
- On each node, create a /etc/sysctl.d/99-dsp.conf file and enable the net.netfilter.nf_conntrack_tcp_be_liberal setting:
  ```
  echo net.netfilter.nf_conntrack_tcp_be_liberal=1 >> /etc/sysctl.d/99-dsp.conf
  ```
- From the command line of your master node, run the following command to deploy the changes:
  ```
  sysctl -p /etc/sysctl.d/99-dsp.conf
  ```
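A quick check that the setting is live on a node:
```
# Should print: net.netfilter.nf_conntrack_tcp_be_liberal = 1
sysctl net.netfilter.nf_conntrack_tcp_be_liberal
```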
Unable to log in to the Splunk Cloud Services CLI
Logging in to the Splunk Cloud Services CLI fails with the following error:
```
failed to get session token: failed to get valid response from csrfToken endpoint: Get "https://<ip_addr>:31000/csrfToken": x509: cannot validate certificate for <ip> because it doesn't contain any IP SANs
```
Cause
The Splunk Cloud Services CLI configuration file is incorrectly configured.
Solution
Make sure that your Splunk Cloud Services CLI settings are configured correctly, given the particular version of the Splunk Cloud Services CLI that you are using. See Configure the Splunk Cloud Services CLI.
DSP UI times out
The DSP UI appears to find the master node but fails to load.
Cause
The master node's IP address has been changed, and the DSP UI is trying to redirect your browser to a private IP. Such IP reassignments are common with various public cloud providers when servers are stopped.
Solution
Reconfigure the DSP UI redirect URL. See Configure the Data Stream Processor UI redirect URL.
My data is not making it into my pipeline
If data is not making it into your activated pipelines, check to see whether all the ingestion services are running in Kubernetes.
Cause
One of the ingestion services could be down.
Solution
Make sure that all the ingest services are running. The ingest services are ingest-hec, ingest-s2s, and splunk-streaming-rest.
```
kubectl get pods -n dsp
```
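To narrow the output to just those services, you can filter it; the name fragments below are the three service names listed above:
```
kubectl get pods -n dsp | grep -E 'ingest-hec|ingest-s2s|splunk-streaming-rest'
```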
Use the Splunk App for DSP to monitor your DSP deployment.
This documentation applies to the following versions of Splunk® Data Stream Processor: 1.1.0, 1.2.0, 1.2.1-patch02, 1.2.1, 1.2.2-patch02, 1.2.4, 1.2.5