Use summary indexing for increased reporting efficiency
Splunk is capable of generating reports of massive amounts of data (100 million events and counting). However, the amount of time it takes to compute such reports is directly proportional to the number of events summarized. Plainly put, it can take a lot of time to search through very large data sets. If you only have to do this on an occasional basis, it may not be an issue. But running such reports on a regular schedule can be impractical, and the problem grows as more users in your organization use Splunk to run similar reports.
Use summary indexing to efficiently report on large volumes of data. With summary indexing, you set up a search that extracts the precise information you want, on a frequent basis. Each time Splunk runs this search it saves the results into a summary index that you designate. You can then run searches and reports on this significantly smaller (and thus seemingly "faster") summary index. And what's more, these reports will be statistically accurate because of the frequency of the index-populating search (for example, if you want to manually run searches that cover the past seven days, you might run them on a summary index that is updated on an hourly basis).
Summary indexing allows the cost of a computationally expensive report to be spread over time. In the example we've been discussing, the hourly search to populate the summary index with the previous hour's worth of data would take a fraction of a minute. Generating the complete report without the benefit of summary indexing would take approximately 168 (7 days * 24 hrs/day) times longer.
Perhaps an even more important advantage of summary indexing is its ability to amortize costs over different reports, as well as for the same report over a different but overlapping time range. The same summary data generated on a Tuesday can be used for a report of the previous 7 days done on the Wednesday, Thursday, or the following Monday. It could also be used for a monthly report that needed the average response size per day.
Summary indexing use cases
Example #1 - Run reports over long time ranges for large datasets more efficiently: Imagine you're using Splunk at a company that indexes tens of millions of events per day. You want to set up a dashboard for your employees that, among other things, displays a report that shows the number of page views and visitors each of your Web sites had over the past 30 days, broken out by site.
You could run this report on your primary data volume, but its runtime would be quite long, because Splunk has to sort through a huge number of events that are totally unrelated to web traffic in order to extract the desired data. But that's not all--the fact that the report is included in a popular dashboard means it'll be run frequently, and this could significantly extend its average runtime, leading to a lot of frustrated users.
But if you use summary indexing, you can set up a saved search that collects website page view and visitor information into a designated summary index on a weekly, daily, or even hourly basis. You'll then run your month-end report on this smaller summary index, and the report should complete far faster than it would otherwise because it is searching on a smaller and better-focused dataset.
Example #2 - Building rolling reports: Say you want to run a report that shows a running count of an aggregated statistic over a long period of time--a running count of downloads of a file from a Web site you manage, for example.
First, schedule a saved search to return the total number of downloads over a specified slice of time. Then, use summary indexing to have Splunk save the results of that search into a summary index. You can then run a report any time you want on the data in the summary index to obtain the latest count of the total number of downloads.
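A minimal sketch of this rolling-count pattern follows. The sourcetype, the uri_path value, and the saved search name "Summary - installer downloads" are hypothetical; substitute whatever identifies the download events in your data. The first search is the scheduled populating search, assumed here to run hourly over the previous whole hour with summary indexing enabled:

```spl
sourcetype=access_combined uri_path="/downloads/installer.zip" earliest=-1h@h latest=@h | stats count AS downloads
```

The second search runs on demand against the summary index and adds up the stored hourly counts:

```spl
index=summary search_name="Summary - installer downloads" | stats sum(downloads) AS total_downloads
```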
For another view, you can watch this Splunk developer video about the theory and practice of summary indexing.
Using the summary indexing reporting commands
If you are new to summary indexing, use the summary indexing reporting commands (sichart, sitimechart, sistats, sitop, and sirare) when you define the search that will populate the summary index. If you use these commands, use the same search string that you eventually run on the summary index, except that the latter search uses the regular reporting commands (chart, timechart, stats, top, and rare).
Note: You do not have to use the si- summary index search commands if you are proficient with the "old-school" way of creating summary-index-populating searches. If you create summary indexes using those methods and they work for you there's no need to update them. In fact, they may be more efficient: there are performance impacts related to the use of the si- commands, because they create slightly larger indexes than the "manual" method does.
In most cases the impact is insignificant, but you may notice a difference if the summary indexes you are creating are themselves fairly large. You may also notice performance issues if you're setting up several searches to report against an index populated by an si- command search.
See the following section if you're interested in designing summary indexes without the help of si- search commands.
Defining index-populating searches without the special commands
In previous versions of Splunk you had to design the searches that populate your summary index very carefully, especially if the search you wanted to run on the finished summary index involved aggregate statistics, because a poorly designed "index-populating" search produces incorrect results. For example, if you wanted to run a search on the finished summary index that gave you average response times broken out by server, you'd want to set up a summary-index-populating search that:
- is scheduled to run on a more frequent basis than the search you plan to run against the summary index.
- samples a larger amount of data than the search you plan to run against the summary index.
- contains additional search commands that ensure that the index-populating search is generating a weighted average (only necessary if you are looking for an average in the first place).
The summary index reporting commands take care of the last two points for you--they automatically determine the adjustments that need to be made so that your summary index is populated with data that does not produce statistically inaccurate results. However, you still should arrange for the summary-index-populating search to run on a more frequent basis than the search that you later run against the summary index.
Interested in setting up summary indexes without the si- commands? Find out about the addinfo, collect, and overlap commands, learn how to devise searches that provide weighted averages, and review an example of summary index configuration via savedsearches.conf in the topic "Configure summary indexes," in this manual.
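As a rough sketch of the weighted-average approach, the populating search can store a count and a sum for each group instead of a precomputed average, so that the reporting search can divide the combined sum by the combined count. The field names response_time and server, the sourcetype, and the saved search name are illustrative assumptions, not names taken from this manual. Populating search, scheduled with summary indexing enabled:

```spl
sourcetype=access_combined | stats count(response_time) AS sample_count, sum(response_time) AS total_response_time BY server
```

Search against the summary index:

```spl
index=summary search_name="Summary - response time by server" | stats sum(total_response_time) AS total_response_time, sum(sample_count) AS sample_count BY server | eval avg_response_time = total_response_time / sample_count
```

Averaging stored per-interval averages instead would give a busy hour and a quiet hour equal weight, which is the error the weighted form avoids.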
Summary indexing reporting command usage example
Let's say you've been running the following search, with a time range of the past year:
eventtype=firewall | top src_ip
This search gives you the top source IPs for the past year, but it takes forever to run because it scans across your entire index each time.
What you need to do is create a summary index that is composed of the top source IPs from the "firewall" event type. You can use the following search to build that summary index. You would schedule it to run on a daily basis, collecting the top src_ip values for only the previous 24 hours each time. The results of each daily search are added to an index named "summary":
eventtype=firewall | sitop src_ip
Note: Summary-index-populating searches are statistically more accurate if you schedule them to run and sample information on a more frequent basis than the searches you plan to run against the finished summary index. So in this example, because we plan to run searches that cover a timespan of a year, we set up a summary-index-populating search that samples information on a daily basis.
Important: When you define summary-index-populating searches, do not pipe other search operators after the main summary indexing reporting command. In other words, don't include additional | eval commands and the like. Save the extra search operators for the searches you run against the summary indexes, not the searches you use to populate them.
Important: The results from a summary-indexing optimized search are stored in a special format that cannot be modified before the final transformation is performed. This means that if you populate a summary index with ... | sistats <args>, the only valid retrieval of the data is: index=<summary> source=<saved search name> | stats <args>. The search against the summary index cannot create or modify fields before the | stats <args> command.
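To illustrate this rule, here is a matched pair of searches, assuming a hypothetical bytes field on the firewall events and a populating search saved under the hypothetical name "Summary - firewall bytes by src_ip". The populating search ends at the sistats command:

```spl
eventtype=firewall | sistats count, avg(bytes) by src_ip
```

The retrieval search repeats the same arguments with stats, and any further processing (here, a sort) comes only after that command:

```spl
index=summary source="Summary - firewall bytes by src_ip" | stats count, avg(bytes) by src_ip | sort - count
```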
Now, let's say you save this search with the name "Summary - firewall top src_ip" (all saved summary-index-populating searches should have names that identify them as such). After your summary index is populated with results, search and report against that summary index using a search that specifies the summary index and the name of the search that you used to populate it. For example, this is the search you would use to get the top source IPs over the past year:
index=summary search_name="Summary - firewall top src_ip" | top src_ip
Because this search specifies the search name, it filters out other data that have been placed in the summary index by other summary indexing searches. This search should run fairly quickly, even if the time range is a year or more.
Note: If you are running a search against a summary index that queries for events with a specific sourcetype value, be aware that you need to use orig_sourcetype instead. So instead of running a search against a summary index like ...|timechart avg(ip) by sourcetype, use ...|timechart avg(ip) by orig_sourcetype.
Why do you have to do this? When events are gathered into a summary index, Splunk changes their sourcetype values to "stash" and moves the original sourcetype values to orig_sourcetype.
Setting up summary index searches in Splunk Web
You can set up summary index searches through the Splunk Web interface. Summary indexing is an alert option for saved, scheduled searches. Once you determine the search that you want to use to populate a summary index, follow these steps:
1. Navigate to Manager > Searches and Reports. Select the name of a previously saved search (or click New to create a new saved search).
2. Under Schedule and alert, select Schedule this search if the search isn't already scheduled. Schedule the search to run on an appropriate interval. Remember that searches that populate summary indexes should run on a fairly frequent basis in order to create statistically accurate final reports. If the search you're running against the summary index is gathering information for the past week, you should have the summary search run on an hourly basis, collecting information for each hour. If you're running searches over the past year's worth of data, you might have the summary index collect data on a daily basis for the past day.
Note: Be sure to schedule the search so that there are no data gaps and overlaps. For more on this see the subtopic on this issue, below.
3. Under Alert, select a Condition value of always.
4. Select an Alert mode of Once per search.
5. Under Summary indexing, select Enable.
6. Select the name of the summary index that the search will populate from the Select the summary index list. The default summary index is named summary. Only indexes that you have permission to write to are listed. You may need to create additional summary indexes if you plan to run a variety of summary index searches. For information about creating new indexes, see "Set up multiple indexes" in the Admin manual. It's a good idea to create indexes that are dedicated to the collection of summary data.
Note: If you enter the name of an index that does not exist, Splunk will run the search on the schedule you've defined, but its data will not get saved to a summary index.
7. (Optional) Under Add fields, you can add field/value pairs to the summary index definition. These key/value pairs will be annotated to each event that gets summary indexed, making it easier to find them with later searches. For example, you could add the name of the saved search populating the summary index (report=summary_firewall_top_src_ip) or the name of the index that the search populates (index=summary), and then search on those terms later.
Note: You can also add field/value pairs to the summary index configuration in savedsearches.conf. For more information, see "Configure summary indexes" in the Knowledge Manager manual.
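For instance, if the populating search from the firewall example were saved with the extra field report=summary_firewall_top_src_ip, a later search could select its results by that field rather than by search name. This sketch assumes the field was added exactly as written in step 7 and that the summary index was populated with sitop:

```spl
index=summary report=summary_firewall_top_src_ip | top src_ip
```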
Schedule the populating search to avoid data gaps and overlaps
To minimize data gaps and overlaps you should be sure to set appropriate intervals and delays in the schedules of searches you use to populate summary indexes.
Gaps in a summary index are periods of time when a summary index fails to index events. Gaps can occur if:
- The summary-index-populating search takes too long to run and runs past the next scheduled run time. For example, if you were to schedule the search that populates the summary to run every 5 minutes when that search typically takes around 7 minutes to run, you would have problems, because the search won't start again while a preceding run is still in progress.
- You have forced the summary-index-populating search to use real-time scheduling. You do this by mistakenly changing the search definition in savedsearches.conf so that the realtime_schedule attribute is set to 1, enabling real-time scheduling. This setting can result in data collection gaps if you are running several searches concurrently. When you define a summary-index-populating scheduled search in Splunk Web by selecting Enable for summary indexing and saving the search, Splunk automatically sets realtime_schedule to 0, to ensure that the search never skips a scheduled run. For more information see "Configure the priority of scheduled searches", in this manual.
- splunkd goes down. If Splunk can't index events, you will have gaps in your summary indexes.
Overlaps are events in a summary index (from the same search) that share the same timestamp. Overlapping events skew reports and statistics created from summary indexes. Overlaps can occur if you set the time range of a saved search to be longer than the frequency of the schedule of the search. In other words, don't arrange for an hourly search to gather data for the past 90 minutes.
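One way to keep the time range equal to the schedule interval is to use snap-to time modifiers in the populating search. The sketch below reuses the firewall example from this topic and assumes the search is scheduled to run once an hour, shortly after the top of the hour. Each run then covers exactly the previous whole hour, so as long as no scheduled run is skipped, consecutive runs neither miss nor repeat a time range:

```spl
eventtype=firewall earliest=-1h@h latest=@h | sitop src_ip
```

By contrast, a range such as earliest=-90m on the same hourly schedule would summarize 30 minutes of every hour twice.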
Note: If you think you have gaps or overlaps in your summary index data, Splunk provides methods of detecting them and either backfilling them (in the case of gaps) or deleting the overlapping events. For more information, see "Manage summary index gaps and overlaps" in the Knowledge Manager manual.
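As a quick check, the overlap command can be run against the summary index. This is a sketch that assumes the default summary index and the saved search name from the firewall example; the topic referenced above describes how to interpret and act on the results:

```spl
index=summary search_name="Summary - firewall top src_ip" | overlap
```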
How summary indexing works
In Splunk Web, summary indexing is an alert option for scheduled saved searches. When you run a saved search with summary indexing turned on, its search results are temporarily stored in a file ($SPLUNK_HOME/var/spool/splunk/<savedsearch_name>_<random-number>.stash). From the file, Splunk uses the addinfo command to add general information about the current search and the fields you specify during configuration to each result. Splunk then indexes the resulting event data in the summary index that you've designated for it (index=summary by default).
Note: Use the addinfo command to add fields containing general information about the current search to the search results going into a summary index. General information added about the search helps you run reports on results you place in a summary index.
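For example, the info_min_time, info_max_time, and info_search_time fields that addinfo adds record the time range and run time of each populating search. Here is a sketch of a coverage report built on them, assuming the saved search name from the firewall example. The output field names are arbitrary:

```spl
index=summary search_name="Summary - firewall top src_ip" | stats min(info_min_time) AS first_window_start, max(info_max_time) AS last_window_end, dc(info_search_time) AS populating_runs | convert ctime(first_window_start) ctime(last_window_end)
```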
Summary indexing of data without timestamps
To set the time for summary index events, Splunk uses the following information, in this order of precedence:
1. The _time value of the event being summarized
2. The earliest (or minimum) time of the search
3. The current system time (in the case of an "all time" search, where no "earliest" value is specified)
In the majority of cases, your events will have timestamps, so the first method of discerning the summary index timestamp holds. But if you are summarizing data that doesn't contain an _time field (such as data from a lookup), the resulting events will have the timestamp of the earliest time of the search.
For example, if you summarize the lookup "asset_table" every night at midnight, and the asset table does not contain an _time column, tonight's summary will have an _time value equal to the earliest time of the search. If you have set the time range of the search to be between -24h and +0s, each summarized event will have an _time value of now()-86400 (that's the start time of the search minus 86,400 seconds, or 24 hours). This means that every event without an _time field value that is found by this summary-index-populating search will be given the exact same _time value: the search's earliest time.
The best practice for summarizing data without a time stamp is to manually create an _time value as part of your search. Following on from the example above:
|inputlookup asset_table | eval _time=now()
Fields added to summary-indexed data by the si- summary indexing commands
When you run searches with the si* commands in order to populate a summary index, Splunk adds a set of special fields to the summary index data that all begin with psrsvd, such as psrsvd_v and so on. When you run a search against the summary index with reporting commands like stats, Splunk uses the psrsvd* fields to calculate results for tables and charts that are statistically correct.
psrsvd stands for "prestats reserved."
Most psrsvd types present information about a specific field in the original (pre-summary indexing) dataset, although some psrsvd types are not scoped to a single field. The general pattern is psrsvd_[type]_[fieldname]. For example, psrsvd_ct_bytes presents count information for the bytes field.
Here's a list of the available psrsvd types:
- ct = count
- gc = group count (the count for a stats "grouping," not scoped to a single field)
- nc = numerical count (number of numerical values)
- ss = sum of squares
- v = version (not scoped to a single field)
- vt = value type (contains the precision of the associated field)
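To see these fields on your own data, you can list them for a single summary-indexed event. This sketch assumes the summary index was populated by the sitop search from the firewall example; the exact set of psrsvd fields returned depends on the si- command and arguments used:

```spl
index=summary search_name="Summary - firewall top src_ip" | head 1 | table psrsvd_*
```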