Splunk® Data Stream Processor

Install and administer the Data Stream Processor

On October 30, 2022, all 1.2.x versions of the Splunk Data Stream Processor will reach their end of support date. See the Splunk Software Support Policy for details. For information about upgrading to a supported version, see the Upgrade the Splunk Data Stream Processor topic.
This documentation does not apply to the most recent version of Splunk® Data Stream Processor. For documentation on the most recent version, go to the latest release.

Data retention policies

Data retention policies are sets of rules that determine how long data remains available for consumption from a message queue. Typically, streamed data is retained in an available state until a specified interval of time has passed or a maximum amount of retained data is exceeded. If a DSP pipeline is down for longer than this data retention period, data loss can occur.
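
As a rough illustration of the last point, the following sketch compares how long a pipeline has been down against a time-based retention period. The function name retention_exceeded and the use of whole minutes are assumptions made for this example. It does not query DSP or any message queue, and it ignores size-based limits. A retention value of -1 is treated as infinite, following the convention used by the Pulsar --time flag described later on this page.

```shell
# Hypothetical helper: report whether an outage outlasts a time-based retention period.
# Both arguments are whole minutes. A retention of -1 is treated as "retain forever".
retention_exceeded() {
  downtime_min=$1
  retention_min=$2
  if [ "$retention_min" -eq -1 ]; then
    echo "no"    # infinite retention: data stays available
  elif [ "$downtime_min" -gt "$retention_min" ]; then
    echo "yes"   # outage is longer than the retention period, so data loss can occur
  else
    echo "no"    # outage fits within the retention period
  fi
}

retention_exceeded 1500 1440   # 25 hours down, 24 hours retained: prints "yes"
retention_exceeded 600 1440    # 10 hours down, 24 hours retained: prints "no"
```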

The retention policies that apply to the data being ingested into DSP pipelines vary depending on the source of the data. Different message queues may be used to handle data from different sources, and as a result, different data retention policies apply.

  • If the data comes from a source that is supported by the Splunk DSP Firehose, or arrives through one of the ingestion methods that are a subset of it (Ingest service, Forwarders service, Collect service, DSP HEC, and Syslog), then the data is subject to the retention policies configured in the Apache Pulsar message bus used by the Splunk DSP Firehose.
  • If the data comes from another source, then the data is subject to the retention policies configured in that data source.

The following table describes how retention policies are determined for each type of data source, and where to find more information about each policy:

Type of data source | What determines the data retention policy | For more information
Data sources supported by the Splunk DSP Firehose, which include Splunk forwarders, Ingest REST API, HTTP clients (through DSP HEC), Syslog, Amazon CloudWatch, Amazon S3, Amazon Web Services (AWS) metadata, Google Cloud Monitoring, Microsoft 365, and Microsoft Azure Monitor | The configuration of the Apache Pulsar message bus in DSP. | See the Splunk DSP Firehose retention policies section on this page.
Amazon Kinesis Data Streams | The configuration of the Kinesis data stream. | Search for "Changing the Data Retention Period" in the Amazon Kinesis Data Streams Developer Guide.
Apache or Confluent Kafka | The configuration of the Kafka topic. If retention policies are not configured on the topic, then default policies on the broker are used instead. | Search for "Topic-Level Configs" in the Apache Kafka documentation, or "Topic Configurations" in the Confluent Kafka documentation.
Apache Pulsar | The configuration of the namespace that the Pulsar topic belongs to. | Search for "Message retention and expiry" in the Apache Pulsar documentation.
Google Cloud Pub/Sub | The configuration of the subscription. | Search for "Managing Subscriptions" in the Google Cloud Pub/Sub documentation.
Microsoft Azure Event Hubs | The configuration of the event hub. | Search for "Azure Event Hubs quotas and limits" and "Create an event hub" in the Event Hubs documentation.

Splunk DSP Firehose retention policies

Starting in DSP 1.1.0, DSP uses Apache Pulsar as the message bus for all data sent to the Splunk DSP Firehose. This means that all data ingestion methods that are a subset of the Splunk DSP Firehose (Ingest service, Forwarders service, Collect service, DSP HEC, Syslog) use Pulsar as the message bus.

By default, all data received by the Splunk DSP Firehose is stored in a Pulsar topic for 24 hours. The oldest data in the topic is deleted first. To adjust the data retention policy, follow the steps in the Set the Splunk DSP Firehose retention policy section.

For more information about Apache Pulsar and its data retention policies, search for "Message retention and expiry" in the Apache Pulsar documentation.

Set the Splunk DSP Firehose retention policy

  1. From the DSP directory of a master node, log in to the Pulsar broker pod.
    $ kubectl exec -it broker-0 -n pulsar -- /bin/bash
  2. Navigate to the pulsar/ directory.
    $ cd /streamlio/pulsar/
  3. Update the retention policy.
    $ ./bin/pulsar-admin namespaces set-retention --time <TIME> --size <SIZE> DSP/default-ingest
    Flag | Description | Examples
    --time | The retention time in minutes, hours, days, or weeks. Set to 0 for no retention and -1 for infinite time retention. Defaults to 24 hours. | 100m, 3h, 2d, 5w
    --size | The retention size limit. Set to 0 for no retention or -1 for infinite size retention. Defaults to 0. | 10M, 16G, 3T
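
Before running the command in step 3, it can help to check the two values. The following is a minimal sketch, not part of DSP or Pulsar: the function name build_set_retention is hypothetical, and the accepted formats are an assumption limited to the forms shown in the examples above (0, -1, or a whole number followed by m, h, d, or w for time and M, G, or T for size). pulsar-admin itself may accept other forms. The function only prints the command and does not run it.

```shell
# Hypothetical helper: validate --time and --size values, then print (not run)
# the set-retention command for the DSP/default-ingest namespace.
build_set_retention() {
  time_value=$1
  size_value=$2
  # Assumed format: 0, -1, or a whole number followed by m, h, d, or w.
  if ! printf '%s\n' "$time_value" | grep -Eq '^(0|-1|[0-9]+[mhdw])$'; then
    echo "invalid --time value: $time_value" >&2
    return 1
  fi
  # Assumed format: 0, -1, or a whole number followed by M, G, or T.
  if ! printf '%s\n' "$size_value" | grep -Eq '^(0|-1|[0-9]+[MGT])$'; then
    echo "invalid --size value: $size_value" >&2
    return 1
  fi
  echo "./bin/pulsar-admin namespaces set-retention --time $time_value --size $size_value DSP/default-ingest"
}

# Prints the command for a 3-hour, 10 MB retention policy.
build_set_retention 3h 10M
```

The printed line can be reviewed and then pasted into the broker pod shell from the pulsar/ directory.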

Get retention policy

You can retrieve the retention policy that is currently set on a namespace. The output is a JSON object with two keys: retentionTimeInMinutes and retentionSizeInMB.

To see the current retention policy:

  1. From the DSP directory of a master node, log in to the Pulsar broker pod.
    $ kubectl exec -it broker-0 -n pulsar -- /bin/bash
  2. Navigate to the pulsar/ directory.
    $ cd /streamlio/pulsar/
  3. Run the following command to see the current retention policy.
    $ ./bin/pulsar-admin namespaces get-retention DSP/default-ingest

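A response containing retentionTimeInMinutes and retentionSizeInMB is returned. The following example shows the default policy: a retention time of 24 hours (1440 minutes) and a retention size of 0.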

{
  "retentionTimeInMinutes" : 1440,
  "retentionSizeInMB" : 0
}
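
To read the retention time in hours rather than minutes, the response can be post-processed. The following is a minimal sketch under stated assumptions: the function name retention_hours is hypothetical, it uses sed instead of a JSON parser, it expects the retentionTimeInMinutes key to appear once as in the sample above, and it assumes that an infinite retention time is reported as -1.

```shell
# Hypothetical helper: read get-retention JSON on stdin and print the
# retention time in whole hours (integer division truncates any remainder).
retention_hours() {
  minutes=$(sed -n 's/.*"retentionTimeInMinutes"[[:space:]]*:[[:space:]]*\(-\{0,1\}[0-9][0-9]*\).*/\1/p')
  if [ -z "$minutes" ]; then
    echo "retentionTimeInMinutes not found" >&2
    return 1
  elif [ "$minutes" -eq -1 ]; then
    echo "infinite"   # assumption: -1 means infinite time retention
  else
    echo $((minutes / 60))
  fi
}

# Sample response copied from this page; prints 24.
printf '%s\n' '{' '  "retentionTimeInMinutes" : 1440,' '  "retentionSizeInMB" : 0' '}' | retention_hours
```
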
Last modified on 26 March, 2021

This documentation applies to the following versions of Splunk® Data Stream Processor: 1.2.0, 1.2.1-patch02

