Use summary indexing for increased reporting efficiency

Use summary indexing to search large volumes of statistical data efficiently. With summary indexing, you set up a frequently running search that extracts the precise statistical information you want. Each time this search runs, its results are saved into a summary index that you designate. You can then run searches and reports on this significantly smaller (and thus faster to search) summary index. What's more, these reports are statistically accurate because the index-populating search runs so frequently. For example, if you want to manually run searches that cover the past seven days, you might run them on a summary index that is updated on an hourly basis.

Summary indexing allows the cost of a computationally expensive report to be spread over time. In the example we've been discussing, the hourly search to populate the summary index with the previous hour's worth of data would take a fraction of a minute. Generating the complete report without the benefit of summary indexing would take approximately 168 (7 days * 24 hrs/day) times longer.
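The arithmetic behind this estimate can be sketched in a few lines of Python. The cost of 20 seconds per hour of data is a hypothetical figure chosen only for illustration, and the sketch assumes that search cost grows linearly with the time range covered.

```python
# Hypothetical figure: seconds needed to summarize one hour of raw data.
SECONDS_PER_HOUR_OF_DATA = 20

HOURS_IN_REPORT = 7 * 24  # a seven-day report covers 168 hourly slices


def search_cost(hours, cost_per_hour):
    """Cost of scanning raw data for the given number of hours (linear model)."""
    return hours * cost_per_hour


# Each scheduled run of the populating search pays for one hour of data.
per_run_cost = search_cost(1, SECONDS_PER_HOUR_OF_DATA)
# A report without summary indexing pays for all 168 hours in one search.
one_shot_cost = search_cost(HOURS_IN_REPORT, SECONDS_PER_HOUR_OF_DATA)

ratio = one_shot_cost / per_run_cost
print(f"one-shot report: {one_shot_cost} s, each hourly run: {per_run_cost} s")
print(f"ratio: {ratio:.0f}x")
```

The total work is similar in both cases. The difference is that summary indexing pays for it in small hourly installments instead of all at once when the report is requested.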

Perhaps an even more important advantage of summary indexing is its ability to amortize costs over different reports, as well as for the same report over a different but overlapping time range. The same summary data generated on a Tuesday can be used for a report of the previous 7 days done on the Wednesday, Thursday, or the following Monday. It could also be used for a monthly report that needed the average response size per day.

Summary indexing volume is not counted against your license, even if you have multiple summary indexes.

All events in a summary index have stash as their default source type. If you use a command like collect to change their source type to anything other than stash, you will incur license usage charges for those events.

Summary indexing use cases

Example #1 - Run reports over long time ranges for large datasets more efficiently: You're using Splunk Enterprise at a company that indexes tens of millions of events--or more--per day. You want to set up a dashboard for your employees that, among other things, displays a report that shows the number of page views and visitors each of your Web sites had over the past 30 days, broken out by site.

You could run this report on your primary data volume, but its runtime would be quite long, because Splunk software has to sort through a huge number of events that are totally unrelated to web traffic in order to extract the desired data. But that's not all--the fact that the report is included in a popular dashboard means it'll be run frequently, and this could significantly extend its average runtime, leading to a lot of frustrated users.

But if you use summary indexing, you can set up a saved search that collects website page view and visitor information into a designated summary index on a weekly, daily, or even hourly basis. You'll then run your month-end report on this smaller summary index, and the report should complete far faster than it would otherwise because it is searching on a smaller and better-focused dataset.

Example #2 - Building rolling reports: Say you want to run a report that shows a running count of an aggregated statistic over a long period of time--a running count of downloads of a file from a Web site you manage, for example.

First, schedule a saved search to return the total number of downloads over a specified slice of time. Then, use summary indexing to save the results of that search into a summary index. You can then run a report any time you want on the data in the summary index to obtain the latest count of the total number of downloads.
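The idea of building a running count from per-slice totals can be sketched outside of Splunk software as follows. The dates and download counts are hypothetical, and the list stands in for the rows that the scheduled search would save to the summary index.

```python
from itertools import accumulate

# Hypothetical per-slice totals, as the scheduled search might save them:
# (start of the time slice, number of downloads in that slice)
summary_rows = [
    ("2024-01-01", 120),
    ("2024-01-02", 95),
    ("2024-01-03", 143),
]


def running_counts(rows):
    """Pair each slice with the cumulative total up to and including it."""
    totals = accumulate(count for _, count in rows)
    return [(day, total) for (day, _), total in zip(rows, totals)]


rolling = running_counts(summary_rows)
for day, total in rolling:
    print(day, total)
```

Because each slice is summarized only once, the report that produces the running count reads a handful of small rows rather than every download event.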

Considerations for summary index searches

Ideally, summary indexes should be populated by searches with transforming commands that return statistical data in table format. In addition, these statistical searches should not include the _raw field in their results.

You can base summary indexes on searches that return events, but getting them to work correctly can be tricky.

If your summary-populating search includes the _raw field in its results, the Splunk software focuses on reparsing the _raw strings and ignores other fields associated with those strings, including _time. Summarized data without _time fields is difficult to search.

Use the summary indexing transforming commands

Use the summary indexing transforming commands to simplify the summary index design process.

Use the summary indexing transforming commands when you define the search that populates the summary index. If you use these commands you can use the same search string that you use for the search that you eventually run on the summary index, with the exception that you use regular transforming commands in the latter search.

You do not have to use the si* summary index search commands if you are proficient with the "old-school" way of creating summary-index-populating searches. If you create summary indexes using those methods and they work for you, there's no need to update them. In fact, they may be more efficient: the si* commands have a performance cost because they create slightly larger indexes than the "manual" method does.

In most cases the impact is insignificant, but you may notice a difference if the summary indexes you are creating are themselves fairly large. You may also notice performance issues if you're setting up several searches to report against an index populated by an si* command search.

See the following section if you're interested in designing summary indexes without the help of si* search commands.

Define index-populating searches without the special commands

In previous versions of Splunk Enterprise, you had to design the searches that populate your summary index very carefully, especially if the search you wanted to run on the finished summary index involved aggregate statistics. The "index-populating" search had to be set up so that searches run against the summary index did not return incorrect results. For example, if you wanted to run a search on the finished summary index that gave you average response times broken out by server, you'd want to set up a summary-index-populating search that:

is scheduled to run on a more frequent basis than the search you plan to run against the summary index.
samples a larger amount of data than the search you plan to run against the summary index.
contains additional search commands that ensure that the index-populating search is generating a weighted average (only necessary if you are looking for an average in the first place).

The summary index transforming commands take care of the last two points for you--they automatically determine the adjustments that need to be made so that your summary index is populated with data that does not produce statistically inaccurate results. However, you still should arrange for the summary-index-populating search to run on a more frequent basis than the search that you later run against the summary index.
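To see why a weighted average matters, consider the following Python sketch. The event counts and response times are hypothetical, and the code only illustrates the arithmetic. It does not show how the summary index transforming commands work internally.

```python
# Hypothetical hourly summaries for one server:
# (number of events in the hour, average response time in milliseconds)
hourly_summaries = [
    (1000, 200.0),  # busy hour with fast responses
    (10, 800.0),    # quiet hour with slow responses
]


def naive_average(summaries):
    """Average of the hourly averages; ignores how many events each hour had."""
    return sum(avg for _, avg in summaries) / len(summaries)


def weighted_average(summaries):
    """Weight each hourly average by the number of events behind it."""
    total_events = sum(count for count, _ in summaries)
    total_time = sum(count * avg for count, avg in summaries)
    return total_time / total_events


naive = naive_average(hourly_summaries)
weighted = weighted_average(hourly_summaries)
print(f"average of averages: {naive:.1f} ms")
print(f"weighted average:    {weighted:.1f} ms")
```

The average of averages treats the 10 slow events as if they carried the same weight as the 1000 fast ones. The weighted average matches what you would get by averaging all 1010 original events directly.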

Interested in setting up summary indexes without the si* commands? Find out about the addinfo, collect, and overlap commands, learn how to devise searches that provide weighted averages, and review an example of summary index configuration via savedsearches.conf in the topic "Configure summary indexes," in this manual.

Summary indexing transforming command usage example

Let's say you've been running the following search, with a time range of the past year:

eventtype=firewall | top src_ip

This search gives you the top source IPs for the past year, but it takes a long time to run because it scans across your entire index each time.

What you need to do is create a summary index that is composed of the top source IPs from the "firewall" event type. You can use the following search to build that summary index. You would schedule it to run on a daily basis, collecting the top src_ip values for only the previous 24 hours each time. The results of each daily search are added to an index named "summary":

eventtype=firewall | sitop src_ip

Summary-index-populating searches are statistically more accurate if you schedule them to run and sample information on a more frequent basis than the searches you plan to run against the finished summary index. So in this example, because we plan to run searches that cover a timespan of a year, we set up a summary-index-populating search that samples information on a daily basis.

When you define summary-index-populating searches, do not pipe other search operators after the main summary indexing transforming command. In other words, don't include additional | eval commands and the like. Save the extra search operators for the searches you run against the summary indexes, not the search you use to populate it.

The results from a summary-indexing optimized search are stored in a special format that cannot be modified before the final transformation is performed. This means that if you populate a summary index with ... | sistats <args>, the only valid retrieval of the data is: index=<summary> source=<saved search name> | stats <args>. The search against the summary index cannot create or modify fields before the | stats <args> command.

Now, let's say you save this search with the name "Summary - firewall top src_ip" (all saved summary-index-populating searches should have names that identify them as such). After your summary index is populated with results, search and report against that summary index using a search that specifies the summary index and the name of the search that you used to populate it. For example, this is the search you would use to get the top source IPs over the past year:

index=summary search_name="Summary - firewall top src_ip" | top src_ip

Because this search specifies the search name, it filters out other data that have been placed in the summary index by other summary indexing searches. This search should run fairly quickly, even if the time range is a year or more.

If you are running a search against a summary index that queries for events with a specific sourcetype value, be aware that you need to use orig_sourcetype instead. So instead of running a search against a summary index like ...|stats avg(ip) by sourcetype, use ...|stats avg(ip) by orig_sourcetype.

Why do you have to do this? When events are gathered into a summary index, their sourcetype values are changed to stash. The Splunk software moves the original sourcetype values to orig_sourcetype.

Set up summary index searches in Splunk Web

In Splunk Web, you can enable summary indexing for scheduled searches and identify the summary indexes that they populate.

Prerequisites

Review the following topics.

Create and edit reports in the Reporting Manual
Schedule reports in the Reporting Manual
Manage summary index gaps
Create custom indexes in the Managing Indexers and Clusters of Indexers manual

Steps

Create and save a report that you want to use to populate a summary index. The search string for the report should use the si* summary index transforming commands.
Select Settings > Searches, Reports, and Alerts.
Locate the report that you just created and select Edit > Edit Schedule for it.
Schedule the report to run on an appropriate interval for its summary index. Make sure that it is scheduled so that it has no data gaps or overlaps.
The summary-index-populating report should have an interval that is smaller than the time range of the searches you plan to run against the summary index. This practice helps to ensure that you get statistically accurate results from the searches that you run against the summary index.
For example, if you plan to run searches against a summary index that return results for the last week, you should populate that summary index with the results of a report that runs on an hourly interval, returning results for the last hour. If you want to run searches against a summary index over the past year of data, arrange for the summary index to collect data on a daily basis for the past day.
Click Next and then click Save.
You do not need to enable schedule actions for reports that populate summary indexes.
Select Edit > Edit Summary Indexing for the report you just scheduled.
Select Enable Summary Indexing.
Select a summary index. The default summary index is named summary. The list only displays indexes to which you have permission to write.
It is a best practice to have summary indexes that are dedicated to different types of data. Consider creating a custom index if your search returns data that does not match the data in the available set of indexes.
(Optional) Use Add Fields to add one or more field/value pairs to the summary index definition.
The Splunk software annotates events added to the summary index by this search with the field/value pairs that you supply. This enables you to search on these events. For example, you could add the name of the report that populates the summary index (report=summary_firewall_top_src_ip) to the events in your summary. Later, if you want to restrict a search of the summary index to events added by this search, you can add report=summary_firewall_top_src_ip to its SPL.

Schedule the populating report to avoid data gaps and overlaps

To minimize data gaps and overlaps, set appropriate intervals and delays in the schedules of the reports you use to populate summary indexes.

Gaps in a summary index are periods of time when a summary index fails to index events. Gaps can occur if:

The summary-index-populating report takes too long to run and runs past the next scheduled run time. For example, if you schedule the report that populates the summary index to run every 5 minutes when that report typically takes around 7 minutes to run, you will have gaps, because the report cannot start a new run while a preceding run is still in progress.
You have forced the summary-index-populating report to use real-time scheduling. This happens if you mistakenly change the report definition in savedsearches.conf so that the realtime_schedule attribute is set to 1, which enables real-time scheduling. This setting can result in data collection gaps if you are concurrently running several reports. When you define a summary-index-populating scheduled report in Splunk Web by selecting Enable for summary indexing and saving the report, realtime_schedule is set to 0 to ensure that the report never skips a scheduled run. For more information see Configure the priority of scheduled reports, in the Reporting Manual.
splunkd goes down. If Splunk Enterprise can't index events, you will have gaps in your summary indexes.

Overlaps are events in a summary index (from the same report) that share the same timestamp. Overlapping events skew reports and statistics created from summary indexes. Overlaps can occur if you set the time range of a report to be longer than the frequency of the schedule of the report. In other words, don't arrange for a report that runs hourly to gather data for the past 90 minutes.
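The relationship between the schedule interval and the report time range can be checked with simple date arithmetic. The following Python sketch is an illustration only: the function names and run times are hypothetical, and it compares each run only with the run that follows it.

```python
from datetime import datetime, timedelta


def find_gaps_and_overlaps(windows):
    """Compare the time window of each run with the window of the next run.

    windows: (earliest, latest) datetime pairs, one per scheduled run.
    Returns (gaps, overlaps), each a list of (start, end) periods.
    """
    gaps, overlaps = [], []
    ordered = sorted(windows)
    for (_, prev_latest), (next_earliest, next_latest) in zip(ordered, ordered[1:]):
        if next_earliest > prev_latest:
            # No run covered the time between the two windows.
            gaps.append((prev_latest, next_earliest))
        elif next_earliest < prev_latest:
            # Both runs summarized this period, so it is counted twice.
            overlaps.append((next_earliest, min(prev_latest, next_latest)))
    return gaps, overlaps


def windows_for(run_times, lookback):
    """Each run covers the period from (run time - lookback) to the run time."""
    return [(run - lookback, run) for run in run_times]


hourly_runs = [datetime(2024, 1, 1, hour) for hour in (10, 11, 12)]

# An hourly report that gathers the past 90 minutes overlaps by 30 minutes per run.
gaps_90, overlaps_90 = find_gaps_and_overlaps(
    windows_for(hourly_runs, timedelta(minutes=90))
)

# If the 11:00 run is skipped, an hourly report with a 60-minute range leaves a gap.
skipped_runs = [hourly_runs[0], hourly_runs[2]]
gaps_skip, overlaps_skip = find_gaps_and_overlaps(
    windows_for(skipped_runs, timedelta(minutes=60))
)

print("overlaps:", overlaps_90)
print("gaps:", gaps_skip)
```

When the time range equals the schedule interval and no run is skipped, both lists come back empty.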

For information about detecting and fixing overlapping data and gaps in data, see "Manage summary index gaps and overlaps" in this manual.

How summary indexing works

In Splunk Web, summary indexing is an alert option for scheduled saved searches. When you run a saved search with summary indexing turned on, its search results are temporarily stored in a file as follows:

$SPLUNK_HOME/var/spool/splunk/<MD5_hash_of_savedsearch_name>_<random-number>.stash_new

MD5 hashes of search names are used to cover situations where the search name is overlong.
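The following Python sketch shows why a hash is convenient here: the digest has the same length no matter how long the search name is. The hexadecimal encoding, the UTF-8 encoding of the name, and the range of the random number are assumptions made for illustration. This topic does not specify how Splunk software formats these parts of the file name.

```python
import hashlib
import random


def stash_file_name(saved_search_name):
    """Build a file name that follows the pattern shown above.

    Assumptions (not confirmed by this topic): the MD5 digest is written
    in hexadecimal and the random number is a non-negative integer.
    """
    digest = hashlib.md5(saved_search_name.encode("utf-8")).hexdigest()
    suffix = random.randint(0, 2**31 - 1)
    return f"{digest}_{suffix}.stash_new"


short_name = stash_file_name("Summary - firewall top src_ip")
long_name = stash_file_name("Summary - " + "very long search name " * 50)

# The digest part is 32 hexadecimal characters in both cases.
print(short_name)
print(long_name)
```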

From the file, Splunk software uses the addinfo command to add general information about the current search and the fields you specify during configuration to each result. Splunk Enterprise then indexes the resulting event data in the summary index that you've designated for it (index=summary by default).

Use the addinfo command to add fields containing general information about the current search to the search results going into a summary index. General information added about the search helps you run reports on results you place in a summary index.

Summary indexing of data without timestamps

To set the time for summary index events, Splunk software uses the following information, in this order of precedence:

The _time value of the event being summarized.
The earliest (or minimum) time of the scheduled search that populates the summary index. For example, if the summary-index-populating search covers the two minutes preceding each launch of its search, its earliest time is -2m.
The current system time (in the case of an "all time" search, where no "earliest" value is specified).

In the majority of cases, your events will have timestamps, so the first method of discerning the summary index timestamp holds. But if you are summarizing data that doesn't contain an _time field (such as data from a lookup), the resulting events will have the timestamp of the earliest time of the summary-index-populating search.

For example, if you summarize the lookup "asset_table" every night at midnight, and the asset table does not contain an _time column, tonight's summary will have an _time value equal to the earliest time of the search. If you have set the time range of the search to be between -24h and +0s, each summarized event will have an _time value of now()-86400 (the start time of the search minus 86,400 seconds, or 24 hours). This means that every event without an _time field value that is found by this summary-index-populating search will be given the exact same _time value: the search's earliest time.
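The order of precedence described above can be expressed as a small Python function. The function and parameter names are hypothetical, times are epoch seconds, and the sketch models only the three rules listed in this section.

```python
import time


def summary_event_time(event_time=None, search_earliest=None, now=None):
    """Pick the timestamp for a summarized event, in order of precedence."""
    if event_time is not None:
        return event_time       # 1. the _time value of the event
    if search_earliest is not None:
        return search_earliest  # 2. the earliest time of the populating search
    # 3. the current system time (an "all time" search has no earliest value)
    return time.time() if now is None else now


# Hypothetical nightly run over a lookup whose rows have no _time column.
search_start = 1_700_000_000             # when the search launches
search_earliest = search_start - 86_400  # time range of -24h to +0s

lookup_rows = [{"asset": "web-01"}, {"asset": "db-01"}]
times = [
    summary_event_time(row.get("_time"), search_earliest, search_start)
    for row in lookup_rows
]
print(times)  # every row gets the same value: the search's earliest time
```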

If you base a summary index on a search that returns events instead of statistics, and if the _raw field exists in those events, the summary indexing process focuses on parsing the _raw fields and ignores the _time fields.

The best practice for summarizing data without a timestamp is to manually create an _time value as part of your search. Following on from the example above:

| inputlookup asset_table | eval _time=now()

Fields added to summary-indexed data by the si* summary indexing commands

Use of these fields and their encoded data by any search commands other than the si* summary indexing commands is unsupported. The format and content of these fields can change at any time without warning.

When you run searches with the si* commands in order to populate a summary index, Splunk software adds a set of special fields to the summary index data that all begin with psrsvd, such as psrsvd_ct_bytes and psrsvd_v and so on. When you run a search against the summary index with transforming commands like chart, timechart, and stats, the psrsvd* fields are used to calculate results for tables and charts that are statistically correct. psrsvd stands for "prestats reserved."

Most psrsvd types present information about a specific field in the original (pre-summary indexing) dataset, although some psrsvd types are not scoped to a single field. The general pattern is psrsvd_[type]_[fieldname]. For example, psrsvd_ct_bytes presents count information for the bytes field.

Here is a list of the available psrsvd types:

ct = count
gc = group count (the count for a stats "grouping," not scoped to a single field)
nc = numerical count (number of numerical values)
nn = minimum numerical value
nx = maximum numerical value
rd = rdigest of values (values and the number of times they appear)
sm = sum
sn = minimum lexicographical value
ss = sum of squares
sx = maximum lexicographical value
v = version (not scoped to a single field)
vm = value map (all distinct values for the field and the number of times they appear)
vt = value type (contains the precision of the associated field)
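To see why partial aggregates such as counts, sums, and sums of squares are enough to produce correct statistics later, consider the following Python sketch. It borrows the type abbreviations from the list above only as attribute names. It does not read or reproduce the actual psrsvd field encoding, which is unsupported outside the si* commands, and the input values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    ct: int    # count of values
    sm: float  # sum of values
    ss: float  # sum of squared values
    nn: float  # minimum numerical value
    nx: float  # maximum numerical value


def summarize(values):
    """Reduce one slice of raw values to its partial aggregates."""
    return Partial(
        ct=len(values),
        sm=float(sum(values)),
        ss=float(sum(v * v for v in values)),
        nn=float(min(values)),
        nx=float(max(values)),
    )


def merge(a, b):
    """Combine two partial aggregates without access to the raw values."""
    return Partial(a.ct + b.ct, a.sm + b.sm, a.ss + b.ss,
                   min(a.nn, b.nn), max(a.nx, b.nx))


def mean(p):
    return p.sm / p.ct


def population_variance(p):
    return p.ss / p.ct - mean(p) ** 2


hour_1 = summarize([1, 2, 3])
hour_2 = summarize([4, 5, 6, 7])
combined = merge(hour_1, hour_2)

print(combined)
print("mean:", mean(combined), "variance:", population_variance(combined))
```

The merged result is identical to summarizing all seven values at once, which is what allows a later search to compute accurate statistics over any combination of summarized time slices.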

Lexicographical order

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
Uppercase letters are sorted before lowercase letters.
Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.
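These rules can be reproduced in Python by sorting strings on their UTF-8 bytes. The sample values other than the numbers 10, 9, 70, and 100 are hypothetical.

```python
def lexicographic_sort(items):
    """Sort strings by the bytes of their UTF-8 encoding."""
    return sorted(items, key=lambda item: item.encode("utf-8"))


# Numbers are compared digit by digit as text, not by numeric value.
numbers = lexicographic_sort(["10", "9", "70", "100"])

# Digits sort before letters, and uppercase sorts before lowercase.
words = lexicographic_sort(["banana", "Apple", "apple", "42"])

# Symbols fall in different places depending on their encoded value.
symbols = lexicographic_sort(["a", "_", "A", "~", "!", "5"])

print(numbers)  # ['10', '100', '70', '9']
print(words)    # ['42', 'Apple', 'apple', 'banana']
print(symbols)  # ['!', '5', 'A', '_', 'a', '~']
```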
