Make scheduled reports durable to prevent event loss

Individual runs of scheduled reports sometimes encounter errors. Some of these errors cause scheduled search jobs to return incomplete result sets. For example, a search might return incomplete results when an indexer fails to locate a bucket of events. Other errors prevent scheduled report jobs from returning any results at all. For example, you will not see search results when a resource bottleneck causes the search scheduler to skip a run of a scheduled report.

Consider configuring durable search processing for a scheduled report that must return complete search results for each of its scheduled report runs. Durable search processing ensures that the scheduled report does not lose events over time, even when errors occur. It does this by scheduling backfill search jobs to replace the results of failed searches.

Durable search considerations and prerequisites

You can apply durable search processing to any scheduled report that must have complete search results without duplicate events. Durable search processing can be applied to event and metric searches.

The results of searches that use durable search processing can be delayed when one or more Splunk search heads or search peers go into an error condition. The durable search process does not schedule backfill search jobs until the system is restored to normal operation. After normal operation resumes, it runs the backfill jobs and returns complete search results for the original time windows of the scheduled report runs that returned partial results or did not run on their schedule.

The durable search process is designed not to significantly increase the workload on search heads and search peers. It does not allow concurrent jobs to run for the same scheduled report. Backfill jobs are entirely controlled by the search scheduler, so you can manage them through the same limits.conf settings that you currently apply to your ordinary scheduled reports.

Do not apply durable search processing to the following kinds of scheduled reports:

Scheduled reports that are expected to return partial results each time they run.
Scheduled real-time reports (in other words, scheduled reports that use real-time search functionality).
Scheduled reports that must use the continuous scheduling mode. Durable searches must use the real-time scheduling mode.

You can enable durable search for summary indexing searches even though they use the continuous scheduling mode by default. See Set up durable search for a summary index.

For more information about the real-time and continuous scheduling modes, see Configure the priority of scheduled reports.

When distributed search peers do not support durable search

If you use distributed search on Splunk Enterprise, and if you have distributed search peers that run versions of Splunk Enterprise that are lower than 8.0.0, your durable search will fail. This happens because Splunk Enterprise versions lower than 8.0.0 do not support the dispatch.allow_partial_results setting. Durable search needs to be able to guarantee that search results are complete, and it can only do this when search peers can report errors for searches that return partial results.

In addition, you cannot run durable search with search heads that are on a Splunk Enterprise version lower than 8.2.0.
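
The two version requirements above can be checked with a short script before you enable durable search. The following is a minimal sketch, not a Splunk API: the function names and the name-to-version dictionaries are hypothetical, and only the 8.0.0 and 8.2.0 thresholds come from this topic.

```python
# Thresholds taken from this topic; everything else here is illustrative.
MIN_PEER_VERSION = (8, 0, 0)         # search peers lower than 8.0.0 make durable search fail
MIN_SEARCH_HEAD_VERSION = (8, 2, 0)  # search heads lower than 8.2.0 cannot run durable search


def parse_version(text):
    """Turn a version string such as '8.2.0' into a tuple of ints for comparison."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad '8.2' to (8, 2, 0) so tuples compare correctly
    return tuple(parts)


def unsupported_instances(search_heads, search_peers):
    """Return the sorted names of instances that are too old for durable search.

    Both arguments are hypothetical dicts that map an instance name to a version string.
    """
    too_old = [name for name, version in search_heads.items()
               if parse_version(version) < MIN_SEARCH_HEAD_VERSION]
    too_old += [name for name, version in search_peers.items()
                if parse_version(version) < MIN_PEER_VERSION]
    return sorted(too_old)
```

An empty list from unsupported_instances suggests that the deployment meets both version requirements.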

Durable search and workload management

If you use workload management and you want to apply durable search processing to a scheduled search, the best practice is to put the scheduled search into a high-performance workload pool.

If you have a Splunk Cloud Platform deployment, see the Workload Management overview in the Splunk Cloud Platform Admin Manual.
If you have a Splunk Enterprise deployment, see About workload management in Workload management.

How durable search works

The durable search process creates a "durable cursor" that tracks the latest event timestamp for every search job started for a given scheduled report. The search scheduler uses the durable cursor to compute the search time window (earliest time to latest time) when a scheduled report starts a new job.

The durable search process uses the search status audit mechanism to validate whether a given scheduled report job has returned complete search results. When the durable search process determines that a search job has returned partial results or no results, it discards the partial results and schedules a backfill job to rerun the failed search until the missing results are accounted for.

Each scheduled report that uses durable search processing has only one durable cursor. The durable cursor moves forward only, and it moves only when a search job is completely done. The durable cursor cannot be shared between two or more scheduled report jobs. The value of each durable cursor is stored permanently in the KV Store for your deployment.
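
The cursor rules above (one cursor per report, forward movement only, movement only after a job is completely done) can be modeled in a few lines. This is an illustrative sketch only: the DurableCursor class and its method are hypothetical and do not represent how the Splunk software stores or updates the cursor in the KV Store.

```python
class DurableCursor:
    """Hypothetical model of the single durable cursor of one scheduled report.

    Times are epoch seconds.
    """

    def __init__(self, position=0):
        self.position = position

    def advance(self, job_latest_time, job_complete):
        """Move the cursor only for a completed job, and never backward."""
        if job_complete and job_latest_time > self.position:
            self.position = job_latest_time
        return self.position
```

In this model, a job that fails or returns partial results leaves the cursor where it was, so the next job can compute its time window from the last point that is known to be complete.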

Durable search timestamp tracking options

The durable cursor for a scheduled report can track timestamps for either event time (_time) or indexed time (_indextime). When you configure durable search processing for a scheduled report that is running an event search, you choose what kind of timestamp it tracks.

If your scheduled report is running a metric search, you can choose only _time. The metric data points returned by metric searches do not have the _indextime field. If your target data (for example, summary index or data model acceleration) for a nonmetric search does not contain _indextime, you can choose only _time.

Event time marks the time when each event is generated. In production, events can be ingested and indexed late. If you are tracking them by their event timestamp, their lateness puts them out of sequence with other events. This means events that come one or more scheduled report intervals late can appear to be event losses. If you run searches to catch up on the late-arriving events, you might end up with event duplicates.

On the other hand, when the durable search process tracks the indexed timestamps of the events of a scheduled report, it is simply tracking the times that each event is asynchronously indexed. When events are tracked by their indexed time, they never appear to be out of sequence.

If the _time timestamp of the events is not crucial to the results of the search, the best practice is to set the durable search process to track events by their indexed timestamps. This choice is particularly appropriate for streaming searches that return raw events.

If the results of the search depend upon _time values, or if the search uses transforming commands to aggregate results into stats tables, you may be better off setting the durable search process to track events by their event timestamps. If the searched data is constructed with data model acceleration or summarized into a summary index, set the durable search process to track events by the event time, _time.
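
The guidance in this section can be condensed into a small decision helper. The sketch below is one reading of that guidance, not a Splunk function: the function name and its four boolean parameters are hypothetical, and the "may be better off" recommendation for searches that depend on _time is treated here as a firm rule for simplicity.

```python
def choose_track_time_type(is_metric_search, target_has_indextime,
                           depends_on_event_time, uses_summary_or_acceleration):
    """Suggest '_time' or '_indextime' based on the guidance in this section."""
    # Metric data points and some summarized data have no _indextime field.
    if is_metric_search or not target_has_indextime:
        return "_time"
    # Summary indexes, accelerated data models, and searches whose results
    # depend on _time values are tracked by event time.
    if uses_summary_or_acceleration or depends_on_event_time:
        return "_time"
    # Otherwise, for example a streaming search that returns raw events.
    return "_indextime"
```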

Set time lag for late-arriving events

You can try to account for late-arriving events by setting a lag time to delay the search time window of the durable search process. It adjusts the time filter (both the earliest time and latest time) of a search job to match the event latency. For example, if you set the lag time to 60 seconds, both the earliest and latest time are set 60 seconds back.
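
The arithmetic is simple enough to show directly. In this sketch the function name is hypothetical and times are epoch seconds. It only illustrates that the lag shifts both ends of the window by the same amount, so the window length does not change.

```python
def apply_lag(earliest, latest, lag_seconds):
    """Shift both ends of a search time window back by the lag time."""
    return earliest - lag_seconds, latest - lag_seconds


# A five-minute window with a 60-second lag stays five minutes long.
window = apply_lag(1000, 1300, 60)
```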

This setting can apply to both timestamp tracking methods. If you are tracking events by their indexed time, set the lag time to 60 seconds or less to match the delay for writing memory to disk. If you are tracking events by their event time, you can use the following search to find the maximum latency of events and then set the lag time to the value of maxLatency:

index=main | eval latency = _indextime - _time | stats max(latency) AS maxLatency

Setting a lag time delays the receipt of search results. If you cannot tolerate any delay, set the lag time to 0.

Durable search backfill jobs

The durable search process schedules backfill jobs for scheduled report jobs that fail or return partial results. It does this when either of the following conditions is met:

The durable cursor falls behind the current scheduled report start time (sched_time): durable cursor + 2 * cron_interval <= sched_time
The last scheduled report start time falls behind the current clock time: last_sched_time + cron_interval <= NOW
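
The two conditions above translate directly into code. The following sketch assumes that all values are epoch seconds and that cron_interval is the number of seconds between scheduled runs. The function name is hypothetical, and the parameter names mirror the terms used in the conditions.

```python
def backfill_needed(durable_cursor, sched_time, last_sched_time, now, cron_interval):
    """Return True when either backfill condition from this section is met."""
    # Condition 1: the durable cursor falls behind the current scheduled start time.
    cursor_behind = durable_cursor + 2 * cron_interval <= sched_time
    # Condition 2: the last scheduled start time falls behind the current clock time.
    schedule_behind = last_sched_time + cron_interval <= now
    return cursor_behind or schedule_behind
```

For a report that runs every 300 seconds, a cursor at 1000 with a scheduled start time of 1300 does not meet the first condition, because 1000 + 600 is greater than 1300. A cursor at 400 does.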

The backfill search jobs are scheduled for the original time ranges of the missing or incomplete search jobs.

To reduce workload on the system, the durable search process waits until a search job completes successfully for a scheduled report before it schedules a backfill job for missing or partial data in earlier runs of that scheduled report. For example, if three runs in a row of a scheduled report are compromised, the durable search process won't set up a backfill search job until the fourth job completes successfully.
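
The waiting behavior in that example can be traced with a small helper. This is a hypothetical illustration of the rule as described above, not scheduler code. It takes the outcomes of consecutive runs and returns the index of the run after which a backfill job could first be set up.

```python
def backfill_trigger_index(run_succeeded):
    """Return the index of the first successful run that follows a failed run.

    run_succeeded is a list of booleans, one per scheduled run, in order.
    Returns None when no backfill would be set up for this sequence.
    """
    gap_open = False
    for index, succeeded in enumerate(run_succeeded):
        if not succeeded:
            gap_open = True       # a run failed or returned partial results
        elif gap_open:
            return index          # first success after the gap
    return None


# Three compromised runs in a row, then a successful fourth run (index 3).
trigger = backfill_trigger_index([False, False, False, True])
```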

The durable search process does not allow concurrent search jobs to run for the same scheduled report, including backfill jobs. Whether a backfill job is scheduled and dispatched is fully under the control of the search scheduler and whatever workload balancing policies are applied to your deployment. If you find that the durable search process is contributing to spikes in workload, consider adjusting limits.conf settings related to job quota, scheduling, and workload balancing to resolve the situation.

One backfill job or several?

You can choose whether the durable search process backfills a gap with a single search job that covers the entire time range of the gap at once, or with multiple search jobs that run on the same interval as the parent scheduled report.

The option you should select depends on the type of search you are backfilling.

Select the single backfill job option if the scheduled report is a streaming search that returns raw events.
Select the multiple backfill job option if the scheduled report is a transforming search that returns statistical aggregations of events.

If you are not sure which backfill option to apply, select auto. With auto, the durable search process applies the gap backfill method that is appropriate for the report.
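
To make the difference between the two methods concrete, the sketch below computes the time windows that each method would cover for a given gap. It is an illustration under stated assumptions: the function name is hypothetical, times are epoch seconds, and the values time_whole and time_interval are borrowed from the durable.backfill_type setting described later in this topic. The auto value is not modeled because this topic does not describe how it selects a method.

```python
def backfill_windows(gap_earliest, gap_latest, cron_interval, backfill_type):
    """Return the (earliest, latest) windows that backfill jobs would cover."""
    if cron_interval <= 0:
        raise ValueError("cron_interval must be positive")
    if backfill_type == "time_whole":
        # One job covers the entire gap at once.
        return [(gap_earliest, gap_latest)]
    if backfill_type == "time_interval":
        # One job per interval of the parent scheduled report.
        windows = []
        start = gap_earliest
        while start < gap_latest:
            end = min(start + cron_interval, gap_latest)
            windows.append((start, end))
            start = end
        return windows
    raise ValueError("unsupported backfill_type: " + backfill_type)
```

A 900-second gap for a report that runs every 300 seconds yields one window with time_whole and three windows with time_interval.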

How many backfill jobs for long gaps?

Gaps between successful runs of a scheduled report can be long. For example, you might run into a persistent server failure condition that results in a gap that spans days, weeks, or even months of scheduled reports.

If you are running a transforming search and you have set up durable search to fill gaps that represent multiple backfill jobs, you might want to limit the maximum number of backfill jobs that the durable search process can schedule. A reasonable limit for a transforming search is no more than a week's worth of backfill jobs.
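
A week's worth of backfill jobs depends on how often the report runs. The following sketch computes that number from the report interval. The function name is hypothetical, and the interval is assumed to be a whole number of seconds.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def week_of_backfill_jobs(report_interval_seconds):
    """Return how many report intervals fit into one week."""
    if report_interval_seconds <= 0:
        raise ValueError("report interval must be positive")
    return SECONDS_PER_WEEK // report_interval_seconds


# A report that runs every 12 hours has 14 runs in a week.
every_twelve_hours = week_of_backfill_jobs(12 * 60 * 60)
```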

Set up durable search for a summary index

The scheduled reports that populate summary indexes are prime candidates for durable search processing. Summary indexes that have been built by faulty scheduled reports will have missing events. Searches of summary indexes with missing events return inaccurate results.

When you enable summary indexing for a scheduled report, you can also configure durable search processing for the scheduled report.

When you enable summary indexing for a scheduled report, the Splunk software automatically switches its scheduling mode from real-time to continuous. The continuous mode is required for ordinary summary indexing reports. However, if you then configure durable search for that report, the Splunk software switches its scheduling mode back to real-time, because that is the required scheduling mode for scheduled reports that use durable search processing.

For more information about the real-time and continuous scheduling modes, see Configure the priority of scheduled reports.

Prerequisites

This task assumes you have created a scheduled report for the purpose of populating a summary index, have enabled it for summary indexing, and have set the summary indexing settings for it. For more information about these things, see the Knowledge Manager Manual.

Steps

Select Settings > Searches, Reports, and Alerts.
Locate a report that you have created and scheduled for the purpose of populating a summary index. Select Edit > Edit Summary Indexing.
If you have not enabled the report for summary indexing, select Enable summary indexing and fill out the summary indexing settings as appropriate for your summary index.
Select Enable durable search.
Select a timestamp tracking option. Choose Event timestamp only if the search is a metric search, or if it is an event search that must filter events on their original timestamps. Otherwise, select Indexed timestamp.
(Optional) Change the Durable search time lag value from the default of 60 seconds only if the actual maximum search latency for your deployment is not 60 seconds.
(Optional) For a transforming summary indexing report, you can leave Lost result backfill method at the default of Auto. Select Multiple jobs only if you must ensure that the durable search process does not use one backfill job to cover large gaps, as that approach is not recommended for transforming searches.
(Optional) Provide a Max backfill jobs number that is appropriate for the interval of the report. Ideally this number should be limited to a week of backfill searches. For example, if your report runs every 12 hours, you should enter 14 in this field.
If you want the number of backfill jobs to be unlimited, set Max backfill jobs to 0.
Click Save.

Set up durable search for a scheduled search

Durable search processing is enabled by default, but is only used on a scheduled search if you configure durable search settings for that scheduled search. You can set up durable search for any scheduled search or report.

Prerequisites

This task assumes you have created and scheduled the search that you want to set up as a durable search.

Save your report. See Create and edit reports.
Schedule your search. See Schedule reports.

Steps

Select Settings > Searches, Reports, and Alerts.
Locate a search that you have created and scheduled and now intend to make durable. For this search, select Edit > Advanced Edit.
Configure durable search for this search by setting the value of durable.track_time_type to either _time (event time) or _indextime (indexed time). Choose _time only if one of the following is true:
- The search is a metric search.
- The search is an event search that must filter events on their original timestamps.
- The search is an event search that doesn't contain "_indextime" in the target data (summary index or data model acceleration).
Otherwise, select _indextime.
(Optional) Change durable.backfill_type from its default value of auto only in specific situations. Set it to time_interval only if you must ensure that the durable search process does not use one backfill job to cover large gaps, as that method is not recommended for transforming searches. Set it to time_whole if you must ensure that the search process covers large gaps with one single backfill job.
(Optional) Change the durable.lag_time value from the default of 60 seconds only if the actual maximum search latency for your deployment is not 60 seconds.
(Optional) Provide a numeric value for durable.max_backfill_intervals that is appropriate for the interval of the report. Ideally this number should be limited to a week of backfill searches. For example, if your report runs every 12 hours, you should enter 14 in this field.
If you want the number of backfill jobs to be unlimited, set durable.max_backfill_intervals to 0.
Click Save.

To disable durable search for a scheduled search, change the value of durable.track_time_type to none or give durable.track_time_type a blank value.
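
Before you save a set of durable search values, you can check them against the allowed values named in the steps above. The sketch below is a hypothetical helper, not part of the Splunk software. The setting names and allowed values come from this topic, while the requirement that the two numeric settings be non-negative integers is an assumption.

```python
VALID_TRACK_TIME_TYPES = {"_time", "_indextime", "none", ""}
VALID_BACKFILL_TYPES = {"auto", "time_interval", "time_whole"}


def validate_durable_settings(settings):
    """Return a list of problems found in a dict of durable.* settings."""
    problems = []
    if settings.get("durable.track_time_type", "") not in VALID_TRACK_TIME_TYPES:
        problems.append("durable.track_time_type must be _time, _indextime, none, or blank")
    if settings.get("durable.backfill_type", "auto") not in VALID_BACKFILL_TYPES:
        problems.append("durable.backfill_type must be auto, time_interval, or time_whole")
    # Assumption: both numeric settings are non-negative whole numbers.
    for key in ("durable.lag_time", "durable.max_backfill_intervals"):
        if key in settings:
            value = settings[key]
            if not isinstance(value, int) or value < 0:
                problems.append(key + " must be a non-negative integer")
    return problems


def is_durable(settings):
    """Durable search applies only when a timestamp tracking type is set."""
    return settings.get("durable.track_time_type", "") in ("_time", "_indextime")
```

For example, a dict that sets durable.track_time_type to none passes validation, and is_durable returns False for it, which matches the way this topic describes disabling durable search.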
