Increase reporting efficiency with summary indexing
This documentation does not apply to the most recent version of Splunk.
Running a report on a large dataset over a long time span can be quite time consuming. If you only have to do this occasionally, it might not be a big problem for you. But running such reports on a regular schedule would be impractical, and the problem only grows as more users in your organization use Splunk to run similar reports.
Use summary indexing to efficiently report on large volumes of data. With summary indexing, you define a saved search that extracts the precise information that you want to report on. You schedule this search to run periodically over a time interval that is appropriate for your needs--it could be daily, or it could be every ten minutes. Each time Splunk runs the report, it saves the results since the last time the report was run into a summary index that you designate. You can then search and run reports on this smaller summary index instead of working with the much larger dataset.
You can use summary indexing to:
- index aggregate results.
- index running statistics (such as a running total).
- index rare original events into a smaller index for more efficient reporting.
For example, you may want to run a report at the end of each month that displays the number of page views and visitors each of your Web sites had, broken out by site. If you ran this report directly against your primary data volume, it would take a long time because Splunk has to sort through a great many events that have nothing to do with web traffic in order to extract the desired information. When you use summary indexing, the saved search collects the page view and visitor information into a designated summary index for you on a weekly, daily, or even hourly basis. When you run your "month-end" report, the report completes much faster because it searches a much smaller and better focused set of data.
Or, you may want to run a report that shows a running count of a statistic over a long period of time. For example, you may want a running count of downloads of a file from a Web site you manage. Schedule a saved search to return the total number of downloads over a specified slice of time. Use summary indexing to have Splunk save the results into a summary index. You can then run a report any time you want on the data in the summary index to obtain the latest count of the total number of downloads.
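To make the running-count idea concrete, here is a minimal Python sketch. It is not Splunk code: the row layout and the names summary_rows and running_totals are hypothetical, standing in for the small per-interval results that a scheduled search would write into a summary index.

```python
from itertools import accumulate

# Hypothetical per-interval download counts, standing in for the rows a
# scheduled search would save into a summary index (one row per time slice).
summary_rows = [
    {"slice": "2008-05-20T00:00", "downloads": 12},
    {"slice": "2008-05-20T01:00", "downloads": 7},
    {"slice": "2008-05-20T02:00", "downloads": 0},
    {"slice": "2008-05-20T03:00", "downloads": 21},
]

def running_totals(rows):
    """Return (slice, cumulative downloads) pairs in time order."""
    ordered = sorted(rows, key=lambda r: r["slice"])
    totals = accumulate(r["downloads"] for r in ordered)
    return [(r["slice"], total) for r, total in zip(ordered, totals)]

totals = running_totals(summary_rows)
latest_total = totals[-1][1]  # the most recent running count
```

The report only touches the small per-slice rows, never the original events, which is where the speedup comes from.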
For another view into the ideas behind summary indexing, you can watch this Splunk developer video about the theory and practice of summary indexing.
Note: Indexing events in a summary index counts against your license volume. We recommend that you not index more events in your summary indexes than you really need. Consult Splunk support for specific information on license volume impact.
How summary indexing works
In Splunk Web, summary indexing is an alert option for scheduled saved searches. When you run a saved search with summary indexing turned on, its search results are temporarily stored in a file ($SPLUNK_HOME/var/spool/splunk/<savedsearch_name>_<random-number>.stash). From the file, Splunk adds general information about the current search and the fields you specify during configuration (using the addinfo command) to every result and indexes the results as events in a summary index (index=summary by default).
Note: Use the addinfo command to add fields containing general information about the current search to the search results going into a summary index. General information added about the search helps you run reports on results you place in a summary index.
After Splunk indexes results in the summary index, search and report on them by specifying the name of the summary index in your search.
Example:
This search focuses on the "summary" index and returns events from the most common referrers in that index.
* index=summary | top referrer
Configure summary indexing
Configure summary indexing as an alert option for a scheduled saved search via Splunk Web. Once you configure summary indexing for a saved search, you can further configure it via savedsearches.conf.
- Learn how to enable summary indexing for a scheduled search via Splunk Web.
- Learn how to configure summary indexing via savedsearches.conf
Note: You must enable summary indexing via Splunk Web before you can configure it in savedsearches.conf, unless you configure summary indexing manually.
Search commands useful to summary indexing
Summary indexing uses some new search commands behind the scenes to perform its actions.
- addinfo: Summary indexing uses addinfo to add fields containing general information about the current search to the search results going into a summary index. Add | addinfo to any search to see what results will look like if they are indexed into a summary index.
- collect: Summary indexing uses collect to index search results into the summary index. Use | collect to index any search results into another index (using collect command options).
Another useful command is overlap. You can use overlap to find gaps in events or overlapping events in a summary index.
- overlap: Use overlap to identify gaps and overlaps in a summary index. overlap finds events of the same query_id in a summary index with overlapping timestamp values or identifies periods of time where there are missing events.
General guidelines for summary indexing
Note: Currently, indexing events in a summary index counts against your license volume. We recommend that you not index more events in your summary indexes than you really need. Consult Splunk support for specific information on license volume impact.
Use summary indexing to:
- capture rare events in a smaller index for efficient reporting.
- build rolling reports or calculate running totals of aggregated statistics.
When using summary indexing:
- Ensure that aggregated statistics generated from results in a summary index are accurate by indexing statistics taken from the smallest possible time range. For example, if you need to generate hourly/daily/weekly reports, then you want to index hourly reports in the summary index and generate daily and weekly reports from an aggregate of the hourly reports.
- Be sure to set the proper periods and delays for the scheduled searches that populate a summary index, to minimize data gaps and overlaps.
- Modify your reporting searches to use summary index data instead of original (main) index data when possible.
- Use the addinfo search command to see what events will look like if you summary index them.
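The first guideline above (index at the smallest time range, then aggregate upward) can be sketched in Python as follows. This is an illustration only, not Splunk code: the tuple layout and the names hourly and roll_up_daily are hypothetical.

```python
from collections import defaultdict

# Hypothetical hourly summary rows: (day, hour, page_views).
hourly = [
    ("2008-05-20", 0, 100),
    ("2008-05-20", 1, 150),
    ("2008-05-21", 0, 80),
    ("2008-05-21", 1, 20),
]

def roll_up_daily(rows):
    """Sum hourly page views into one total per day."""
    daily = defaultdict(int)
    for day, _hour, views in rows:
        daily[day] += views
    return dict(daily)

daily = roll_up_daily(hourly)       # daily report built from hourly rows
period_total = sum(daily.values())  # longer-range report built from daily totals
```

Because counts and sums are additive, the coarser reports can be derived from the hourly rows without going back to the original events.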
Aggregated statistics
Be careful when building reports made of aggregated statistics. Some aggregating statistical functions (such as distinct count, mode, median, etc.) yield incorrect results when you use them on aggregated statistics. Use one of Splunk's reporting commands to access statistical functions.
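As an illustration of why a function such as distinct count cannot be recombined from aggregates, consider this small Python sketch. The visitor names and variable names are made up for the example.

```python
# Hypothetical visitor IDs seen in two consecutive hours.
hour_1_visitors = {"alice", "bob", "carol"}
hour_2_visitors = {"bob", "carol", "dave"}

# What an hourly summary of distinct counts would store: just the numbers.
hourly_distinct_counts = [len(hour_1_visitors), len(hour_2_visitors)]

# Summing the stored counts double-counts visitors seen in both hours.
naive_daily_distinct = sum(hourly_distinct_counts)

# The true figure needs the underlying values, not the per-hour counts.
true_daily_distinct = len(hour_1_visitors | hour_2_visitors)
```

Here the naive sum reports 6 distinct visitors while only 4 exist, because "bob" and "carol" appear in both hours.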
For example, suppose you want to build hourly/daily/weekly reports of average response times. If you generate the "daily average" by simply averaging the "hourly averages" together, the daily average becomes skewed whenever the "hourly averages" are not based on the same number of events. Get the correct "daily average" by using a weighted average function.
Example:
The following expression calculates the daily average response time correctly (a weighted average) using stats and eval.
| stats sum(hourly_resp_time_sum) as resp_time_sum, sum(hourly_resp_time_count) as resp_time_count | eval daily_average= resp_time_sum/resp_time_count | .....
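The same arithmetic can be checked outside Splunk. The Python sketch below reuses the field names from the expression above, but the sample numbers are invented for illustration; it contrasts the skewed average of averages with the weighted average.

```python
# Hypothetical hourly summary rows: sum of response times and event count.
hourly = [
    {"hourly_resp_time_sum": 100.0, "hourly_resp_time_count": 100},  # avg 1.0
    {"hourly_resp_time_sum": 50.0, "hourly_resp_time_count": 10},    # avg 5.0
]

# Skewed: each hour counts equally, regardless of how many events it had.
hourly_averages = [
    r["hourly_resp_time_sum"] / r["hourly_resp_time_count"] for r in hourly
]
average_of_averages = sum(hourly_averages) / len(hourly_averages)

# Weighted: total response time divided by total number of events.
resp_time_sum = sum(r["hourly_resp_time_sum"] for r in hourly)
resp_time_count = sum(r["hourly_resp_time_count"] for r in hourly)
daily_average = resp_time_sum / resp_time_count
```

With these numbers the average of averages is 3.0, while the weighted daily average is 150/110 (about 1.36), because the busy hour with 100 events dominates.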
Gaps and overlaps
Gaps
Gaps in a summary index are periods of time when a summary index fails to index events. Gaps can occur if:
- splunkd goes down.
- the scheduled saved search (the one being summary indexed) takes too long to run and runs past the next scheduled run time. For example, if a scheduled search is scheduled to run every 5 minutes but the search takes 7 minutes to run, the search won't run again if it's still running from the last time.
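The second cause can be pictured with a small simulation. This Python sketch is not Splunk's scheduler; it simply assumes, as described above, that a scheduled run is skipped when the previous run is still in progress. The function name and parameters are hypothetical.

```python
def simulate_schedule(period, runtime, horizon):
    """Return (started, skipped) scheduled times, in minutes.

    Assumption: a run is skipped if the previous run has not finished
    by the time the next run is due.
    """
    started, skipped = [], []
    busy_until = 0
    for t in range(0, horizon, period):
        if t < busy_until:
            skipped.append(t)      # previous run still going: this slot is a gap
        else:
            started.append(t)
            busy_until = t + runtime
    return started, skipped

# A search scheduled every 5 minutes that takes 7 minutes to run.
started, skipped = simulate_schedule(period=5, runtime=7, horizon=25)
```

In this run every other slot is skipped (minutes 5 and 15), so the summary index would be missing the results for those intervals.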
Overlaps
Overlaps are events in a summary index (from the same search) that share the same timestamp. Overlapping events skew reports and statistics created from summary indexes. Overlaps can occur if you set the time range of a saved search to be longer than the frequency of the schedule of the search, or you run summary indexing manually (using | collect).
Identify gaps and overlaps in data
Identify overlaps and gaps in a summary index using the "Summary Index Gaps and Overlaps" form search (a default saved search in the main Splunk dashboard), or by using the overlap command in your search (add | overlap at the end of the search that produces overlaps).
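To show what "gaps" and "overlaps" mean in terms of time windows, here is a minimal Python sketch. It does not reproduce how the overlap command works internally; it assumes each summary-indexing run covers a (start, end) window in minutes, and the function name is hypothetical.

```python
def find_gaps_and_overlaps(windows):
    """Given (start, end) windows, return (gaps, overlaps) as (start, end) lists."""
    gaps, overlaps = [], []
    if not windows:
        return gaps, overlaps
    ordered = sorted(windows)
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_until:
            # Nothing covers the time between the last window and this one.
            gaps.append((covered_until, start))
        elif start < covered_until:
            # This window begins before the covered range ends.
            overlaps.append((start, min(end, covered_until)))
        covered_until = max(covered_until, end)
    return gaps, overlaps

# Hypothetical run windows: one missing run, then one run that reaches back too far.
windows = [(0, 5), (5, 10), (15, 20), (18, 25)]
gaps, overlaps = find_gaps_and_overlaps(windows)
```

For these windows the sketch reports a gap from minute 10 to 15 and an overlap from minute 18 to 20.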
If you run the form search Summary Index Gaps and Overlaps, specify the time range using the form, or switch to a "text" display where you must specify the following parameters in the search bar (following | overlap):
Either specify:
- StartTime: Time to start searching for missing entries, starttime=mm/dd/yyyy:hh:mm:ss (for example: 05/20/2008:00:00:00).
- EndTime: Time to stop searching for missing entries, endtime=mm/dd/yyyy:hh:mm:ss (for example: 05/22/2008:00:00:00).
or:
- Period: Specify the length of time period to search, period=<integer>[smhd] (for example: 5m).
- SavedSearchName: Specify the name of the saved search to search for missing events with savedsearchname=string (NO wildcards).
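If you build these parameter values in a script, it can help to validate them first. The Python sketch below is only an illustration: the helper names are hypothetical, and it assumes the period suffixes s, m, h, and d stand for seconds, minutes, hours, and days.

```python
import re
from datetime import datetime, timedelta

# Matches the mm/dd/yyyy:hh:mm:ss layout used by starttime and endtime.
TIME_FORMAT = "%m/%d/%Y:%H:%M:%S"

# Assumed meaning of the period suffixes.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_time(value):
    """Parse a starttime/endtime value such as 05/20/2008:00:00:00."""
    return datetime.strptime(value, TIME_FORMAT)

def parse_period(value):
    """Parse a period value such as 5m into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"invalid period: {value!r}")
    return timedelta(seconds=int(match.group(1)) * UNIT_SECONDS[match.group(2)])

start = parse_time("05/20/2008:00:00:00")
end = parse_time("05/22/2008:00:00:00")
period = parse_period("5m")
```

A malformed value raises ValueError before it ever reaches the search bar.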
If you identify a gap, you can run your scheduled saved search over the period of the gap and summary index the results (using | collect). If you identify overlapping events, you can manually delete the overlaps from the summary index by using the search language.
This documentation applies to the following versions of Splunk: 3.3, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4, 3.4.1, 3.4.2, 3.4.3, 3.4.5, 3.4.6, 3.4.8, 3.4.9, 3.4.10, 3.4.11, 3.4.12, 3.4.13, 3.4.14