Splunk® Enterprise

Knowledge Manager Manual

Manage summary index gaps

The accuracy of your summary index searches can be compromised if the summary indexes involved have gaps in their collected data.

Gaps in summary index data can come about for a number of reasons:

  • A summary index initially only contains events from the point that you start data collection: Summary indexes do not contain data from before the collection start date unless you put it there yourself with the backfill script.
  • Splunk deployment outages: If your Splunk deployment goes down for a significant amount of time, there's a good chance you'll get gaps in your summary index data, depending on when the searches that populate the index are scheduled to run.
  • Searches that run longer than their scheduled intervals: If the search that populates the summary index runs longer than the interval you've scheduled it to run on, you're likely to end up with gaps, because Splunk software won't run a scheduled search again while a preceding run of that search is still in progress. For example, if you schedule the index-populating search to run every five minutes, you'll have a gap in the index data collection whenever the search takes more than five minutes to run.
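The effect of these causes can be pictured with a small Python sketch. It is not part of Splunk software; the function name find_gaps and the run history are hypothetical, and it assumes a search that is meant to run on a fixed interval. It reports every stretch where consecutive runs are further apart than that interval.

```python
from datetime import datetime, timedelta


def find_gaps(run_times, interval):
    """Return (start, end) pairs where consecutive runs are further apart than interval."""
    ordered = sorted(run_times)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > interval:
            # Start of the gap is the first missed scheduled time;
            # end of the gap is the time at which collection resumed.
            gaps.append((previous + interval, current))
    return gaps


# Hypothetical run history: a five-minute schedule where the
# 00:10 and 00:15 runs never happened.
runs = [
    datetime(2016, 3, 18, 0, 0),
    datetime(2016, 3, 18, 0, 5),
    datetime(2016, 3, 18, 0, 20),
]
gaps = find_gaps(runs, timedelta(minutes=5))
```

Here gaps holds a single pair that runs from 00:10 to 00:20, which is the range a backfill would need to cover.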

Note: For general information about creating and maintaining summary indexes, see "Use summary indexing for increased reporting efficiency" in the Knowledge Manager Manual.

Use the backfill script to add other data or fill summary index gaps

If you have Splunk Enterprise, you can use the fill_summary_index.py script, which backfills gaps in summary index collection by running the saved searches that populate the summary index as they would have been executed at their regularly scheduled times for a given time range. In other words, even though your new summary index only started collecting data at the start of this week, if necessary you can use fill_summary_index.py to fill the summary index with data from the past month.

In addition, when you run fill_summary_index.py you can specify an App and schedule backfill actions for a list of summary index searches associated with that App, or simply choose to backfill all saved searches associated with the App.

When you enter the fill_summary_index.py commands through the CLI, you must provide the backfill time range by indicating an "earliest time" and "latest time" for the backfill operation. You can indicate the precise times either by using relative time identifiers (such as -3d@d for "3 days ago at midnight") or by using UTC epoch numbers. The script automatically computes the times during this range when the summary index search would have been run.
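As a rough illustration of that computation, the Python sketch below lists the run times for a search on a fixed interval. It is a simplification under stated assumptions: the real script works from the saved search's own schedule, whereas scheduled_times is a hypothetical helper that takes the interval directly and treats the latest time as exclusive.

```python
from datetime import datetime, timedelta


def scheduled_times(earliest, latest, interval):
    """List each time in [earliest, latest) at which a fixed-interval search would run."""
    times = []
    current = earliest
    while current < latest:
        times.append(current)
        current += interval
    return times


# Hypothetical example: a daily search backfilled over a three-day range.
times = scheduled_times(
    datetime(2016, 3, 1), datetime(2016, 3, 4), timedelta(days=1)
)
```

In this example times contains March 1, March 2, and March 3, one entry per run that the backfill would need to reproduce.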

Note: To ensure that the fill_summary_index.py script only executes summary index searches at times that correspond to missing data, you must use -dedup true when you invoke it.
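The idea behind -dedup can be sketched in Python as a filter over scheduled times. This is only a conceptual model, not the script's implementation: times_to_backfill is a hypothetical name, and the times are plain epoch numbers chosen for illustration. Note that it compares scheduled times only and does not inspect individual events.

```python
def times_to_backfill(scheduled, already_summarized, dedup=False):
    """Return the scheduled times for which a backfill search would be run."""
    if not dedup:
        # Without dedup, every scheduled time in the range is run again.
        return list(scheduled)
    present = set(already_summarized)
    # With dedup, skip any scheduled time that already has summary data.
    return [t for t in scheduled if t not in present]


# Hypothetical epoch times for a five-minute schedule.
scheduled = [0, 300, 600, 900]
already_summarized = [0, 900]
missing_only = times_to_backfill(scheduled, already_summarized, dedup=True)
everything = times_to_backfill(scheduled, already_summarized)
```

With dedup enabled, only the times 300 and 600 remain. Without it, all four times are run, which can put duplicate data into the summary index.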

The fill_summary_index.py script requires that you provide the necessary authentication (username and password). If you already have a valid Splunk Enterprise session key when you invoke the script, you can pass it in with the -sk option.

The script is designed to prompt you for any required information that you fail to provide in the command line, including the names of the summary index searches, the authentication information, and the time range.
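The -auth option accepts either <username> or <username>:<password>. A minimal Python sketch of handling such a string follows. The function parse_auth and its ask_password callback are hypothetical and only illustrate the "prompt for what is missing" behavior; they are not taken from the script.

```python
def parse_auth(auth, ask_password):
    """Split an -auth style string into a (username, password) pair."""
    username, separator, password = auth.partition(":")
    if not separator:
        # Only a username was supplied, so ask for the password.
        password = ask_password(username)
    return username, password


# Hypothetical usage with a stand-in prompt function.
full = parse_auth("admin:changeme", ask_password=lambda user: "unused")
prompted = parse_auth("admin", ask_password=lambda user: "typed-by-user")
```

In an interactive tool, ask_password could wrap getpass.getpass from the Python standard library so that the password is not echoed to the terminal.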

Examples of fill_summary_index.py invocation

If this is your situation:

You need to backfill all of the summary index searches for the splunkdotcom App for the past month, but you also need to skip any searches that already have data in the summary index:

Then you'd enter this into the CLI:

./splunk cmd python fill_summary_index.py -app splunkdotcom -name "*" -et -mon@mon -lt @mon -dedup true -auth admin:changeme

If this is your situation:

You need to backfill the my_daily_search summary index search for the past year, running no more than 8 concurrent searches at any given time (to reduce impact on performance while the system collects the backfill data). You do not want the script to skip searches that already have data in the summary index. The my_daily_search summary index search is owned by the "admin" role.

Then you'd enter this into the CLI:

./splunk cmd python fill_summary_index.py -app search -name my_daily_search -et -y -lt now -j 8 -owner admin -auth admin:changeme

Note: You need to specify the -owner option for searches that are owned by a specific user or role.
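If you launch backfills from your own tooling, the invocations above can be assembled programmatically. The Python sketch below builds the argument list using only options shown in this topic. The helper build_backfill_args is hypothetical, it assumes that the order of options does not matter, and it leaves out -auth so that the script prompts for credentials rather than having them appear on the command line.

```python
def build_backfill_args(app, name, earliest, latest,
                        dedup=None, max_jobs=None, owner=None):
    """Assemble a fill_summary_index.py command line as a list of arguments."""
    args = ["./splunk", "cmd", "python", "fill_summary_index.py",
            "-app", app, "-name", name, "-et", earliest, "-lt", latest]
    if max_jobs is not None:
        args += ["-j", str(max_jobs)]
    if owner is not None:
        args += ["-owner", owner]
    if dedup is not None:
        args += ["-dedup", "true" if dedup else "false"]
    return args


# Mirrors the second example above, minus the -auth option.
args = build_backfill_args("search", "my_daily_search", "-y", "now",
                           max_jobs=8, owner="admin")
```

The resulting list can be handed to subprocess.run from the Python standard library, with the working directory set to wherever the ./splunk path resolves on your system.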

What to do if fill_summary_index.py is interrupted while running

If fill_summary_index.py is interrupted, look for a log directory in the app that you are invoking the process from, such as Search. In that directory you should find an empty temp file named fsidx*lock.

Delete this temp file and you should be able to restart fill_summary_index.py.
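The cleanup step can also be scripted. The following Python sketch removes any file matching fsidx*lock from a directory that you pass in. The function name remove_backfill_locks is hypothetical, and the location of the log directory depends on your app, so the path is left as a parameter rather than guessed.

```python
import glob
import os


def remove_backfill_locks(log_dir):
    """Delete files matching fsidx*lock in log_dir and return the removed paths."""
    removed = []
    for path in sorted(glob.glob(os.path.join(log_dir, "fsidx*lock"))):
        os.remove(path)
        removed.append(path)
    return removed
```

Only run a cleanup like this after you have confirmed that no fill_summary_index.py process is still running, because the sketch does not check for one.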

fill_summary_index.py usage and commands

In the CLI, start by entering:

python fill_summary_index.py

...and add the required and optional fields from the table below.

Note: <boolean> options accept the values 1, t, true, or yes for "true" and 0, f, false, or no for "false."
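The accepted <boolean> values can be expressed as a small Python lookup. This is an illustrative sketch rather than the script's own parsing code: parse_boolean_option is a hypothetical name, and treating the values as case-insensitive is an assumption that this topic does not confirm.

```python
TRUE_VALUES = {"1", "t", "true", "yes"}
FALSE_VALUES = {"0", "f", "false", "no"}


def parse_boolean_option(value):
    """Map a documented <boolean> string to True or False."""
    normalized = value.strip().lower()  # Case-insensitivity is an assumption.
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    raise ValueError("not a recognized boolean value: %r" % value)
```

For example, parse_boolean_option("yes") returns True and parse_boolean_option("0") returns False, while any other string raises ValueError.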

Field Value
-et <string> Earliest time (required). Either a UTC time or a relative time string.
-lt <string> Latest time (required). Either a UTC time or a relative time string.
-app <string> The app context to use (defaults to None).
-name <string> Specify a single saved search name. Can specify multiple times to provide multiple names. Use the wildcard symbol ("*") to specify all enabled, scheduled saved searches that have a summary index action.
-names <string> Specify a comma-separated list of saved search names.
-namefile <filename> Specify a file with a list of saved search names, one per line. Lines beginning with a # are considered comments and ignored.
-owner <string> The user context to use (defaults to "None").
-index <string> Identifies the summary index that the saved search populates. If the index is not provided, the backfill script tries to determine it automatically. If this attempt at auto index detection fails, the index defaults to "summary".
-auth <string> The authentication string expects either <username> or <username>:<password>. If only a username is provided, the script requests the password interactively.
-sleep <float> Number of seconds to sleep between each search. Default is 5 seconds.
-j <int> Maximum number of concurrent searches to run (default is 1).
-dedup <boolean> When this option is set to true, the script does not run saved searches for a scheduled timespan if data already exists in the summary index for that timespan. This option is set to false by default.

Note: This option has no connection to the dedup command in the search language. The script does not have the ability to perform event-level data analysis. It cannot determine whether certain events are duplicates of others.
-nolocal <boolean> Specifies that the summary indexes are not on the search head but are on the indexers instead, if you are working with a distributed environment. To be used in conjunction with -dedup.
-showprogress <boolean> When this option is set to true, the script periodically shows the progress of each currently running search that it spawns. If this option is unused, its default is false.
Advanced options: In almost all cases, you should not need to use these.
-trigger <boolean> When this option is set to false, the script runs each search but does not trigger the summary indexing action. If this option is unused its default is true.
-dedupsearch <string> Indicates the search to be used to determine if data corresponding to a particular saved search at a specific scheduled time is present.
-distdedupsearch <string> Same as -dedupsearch except that this is a distributed search string. It does not limit its scope to the search head. It looks for summary data on the indexers as well.
-namefield <string> Indicates the field in the summary index data that contains the name of the saved search that generated that data.
-timefield <string> Indicates the field in the summary index data that contains the scheduled time of the saved search that generated that data.
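The file format that -namefile expects, as described in the table above, is simple enough to model in a few lines of Python. The sketch below is illustrative only: read_search_names is a hypothetical helper, and skipping blank lines is an assumption, since the table only states that lines beginning with # are ignored.

```python
def read_search_names(lines):
    """Collect saved search names from lines, one name per line."""
    names = []
    for line in lines:
        if line.startswith("#"):
            # Lines beginning with # are comments.
            continue
        name = line.strip()
        if not name:
            # Skipping blank lines is an assumption, not documented behavior.
            continue
        names.append(name)
    return names


# Hypothetical file contents.
example_lines = [
    "# daily rollups\n",
    "my_daily_search\n",
    "\n",
    "my_hourly_search\n",
]
names = read_search_names(example_lines)
```

Because the function accepts any iterable of lines, it also works directly on an open file object.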

This documentation applies to the following versions of Splunk® Enterprise: 6.0, 6.0.1, 6.0.2, 6.0.3, 6.0.4, 6.0.5, 6.0.6, 6.0.7, 6.0.8, 6.0.9, 6.0.10, 6.0.11, 6.0.12, 6.0.13, 6.0.14, 6.1, 6.1.1, 6.1.2, 6.1.3, 6.1.4, 6.1.5, 6.1.6, 6.1.7, 6.1.8, 6.1.9, 6.1.10, 6.1.11, 6.1.12, 6.1.13, 6.2.0, 6.2.1, 6.2.2, 6.2.3, 6.2.4, 6.2.5, 6.2.6, 6.2.7, 6.2.8, 6.2.9, 6.2.10, 6.2.11, 6.2.12, 6.2.13, 6.3.0, 6.3.1, 6.3.2, 6.3.3, 6.3.4, 6.3.5, 6.3.6, 6.3.7, 6.3.8, 6.3.9, 6.3.10, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2


Is running fill_summary_index.py the only way to backfill old data? Let's say I've created a monthly Pages report and scheduled it in June to pull data from the prior month. When July arrives, the scheduled report should run over all of June's data. Would it be possible to manually run the collect or si commands in Splunk Web over January 1 to May 31, writing to the same index, to backfill old data for this Pages report? If so, are there any drawbacks to doing it manually as opposed to running the Python script, especially when I'm starting a new summary index and need old data as soon as possible?

March 18, 2016
