Knowledge Manager Manual

 


Use summary indexing for increased reporting efficiency

Use summary indexing to efficiently report on large volumes of data. With summary indexing, you set up a search that extracts the precise information you want, on a frequent basis. Each time Splunk Enterprise runs this search it saves the results into a summary index that you designate. You can then run searches and reports on this significantly smaller (and thus seemingly "faster") summary index. And what's more, these reports will be statistically accurate because of the frequency of the index-populating search (for example, if you want to manually run searches that cover the past seven days, you might run them on a summary index that is updated on an hourly basis).

Summary indexing allows the cost of a computationally expensive report to be spread over time. In the example we've been discussing, the hourly search to populate the summary index with the previous hour's worth of data would take a fraction of a minute. Generating the complete report without the benefit of summary indexing would take approximately 168 (7 days * 24 hrs/day) times longer.

Perhaps an even more important advantage of summary indexing is its ability to amortize costs over different reports, as well as for the same report over a different but overlapping time range. The same summary data generated on a Tuesday can be used for a report of the previous 7 days done on the Wednesday, Thursday, or the following Monday. It could also be used for a monthly report that needed the average response size per day.
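The reuse described above can be pictured with a small sketch. It is not Splunk code: the daily rows, their values, and the `window_total` helper are all hypothetical, and the only point is that two overlapping 7-day reports read the same stored daily rows instead of recomputing them.

```python
from datetime import date, timedelta

# Hypothetical daily summary rows (made-up values): one row per day,
# holding that day's total response bytes and request count.
start = date(2014, 1, 1)
daily_summary = {
    start + timedelta(days=i): {"bytes": 1000 * (i + 1), "requests": 10}
    for i in range(31)
}

def window_total(summary, end_day, days):
    """Sum 'bytes' over the `days` daily rows immediately before end_day."""
    return sum(
        summary[end_day - timedelta(days=offset)]["bytes"]
        for offset in range(1, days + 1)
    )

# Two overlapping 7-day reports, run on consecutive days, share six rows.
wednesday_report = window_total(daily_summary, date(2014, 1, 15), 7)
thursday_report = window_total(daily_summary, date(2014, 1, 16), 7)

# A monthly report can reuse the same rows for a per-day average response size.
avg_size_per_day = {
    day: row["bytes"] / row["requests"] for day, row in daily_summary.items()
}
```

Each report only touches a handful of small rows, which is where the savings come from.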

Summary indexing use cases

Example #1 - Run reports over long time ranges for large datasets more efficiently: You're using Splunk Enterprise at a company that indexes tens of millions of events--or more--per day. You want to set up a dashboard for your employees that, among other things, displays a report that shows the number of page views and visitors each of your Web sites had over the past 30 days, broken out by site.

You could run this report on your primary data volume, but its runtime would be quite long, because Splunk Enterprise has to sort through a huge number of events that are totally unrelated to web traffic in order to extract the desired data. But that's not all--the fact that the report is included in a popular dashboard means it'll be run frequently, and this could significantly extend its average runtime, leading to a lot of frustrated users.

But if you use summary indexing, you can set up a saved search that collects website page view and visitor information into a designated summary index on a weekly, daily, or even hourly basis. You'll then run your month-end report on this smaller summary index, and the report should complete far faster than it would otherwise because it is searching on a smaller and better-focused dataset.

Example #2 - Building rolling reports: Say you want to run a report that shows a running count of an aggregated statistic over a long period of time--a running count of downloads of a file from a Web site you manage, for example.

First, schedule a saved search to return the total number of downloads over a specified slice of time. Then, use summary indexing to have Splunk Enterprise save the results of that search into a summary index. You can then run a report any time you want on the data in the summary index to obtain the latest count of the total number of downloads.
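As a rough illustration of the rolling-report idea (a sketch with made-up numbers, not Splunk's implementation), each scheduled run contributes one per-slice count, and the running total is just the cumulative sum of the stored counts:

```python
from itertools import accumulate

# Hypothetical download counts, one per scheduled run of the populating search.
downloads_per_slice = [120, 95, 143, 110]

# Running count after each slice; the last entry is the latest total.
running_total = list(accumulate(downloads_per_slice))
latest_total = running_total[-1]
```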

For another view, you can watch this Splunk Enterprise developer video about the theory and practice of summary indexing.

Use the summary indexing reporting commands

If you are new to summary indexing, use the summary indexing reporting commands (sichart, sitimechart, sistats, sitop, and sirare) when you define the search that will populate the summary index. If you use these commands you can use the same search string that you use for the search that you eventually run on the summary index, with the exception that you use regular reporting commands in the latter search.

Note: You do not have to use the si- summary index search commands if you are proficient with the "old-school" way of creating summary-index-populating searches. If you create summary indexes using those methods and they work for you there's no need to update them. In fact, they may be more efficient: there are performance impacts related to the use of the si- commands, because they create slightly larger indexes than the "manual" method does.

In most cases the impact is insignificant, but you may notice a difference if the summary indexes you are creating are themselves fairly large. You may also notice performance issues if you're setting up several searches to report against an index populated by an si- command search.

See the following section if you're interested in designing summary indexes without the help of si- search commands.

Define index-populating searches without the special commands

In previous versions of Splunk Enterprise, you had to be very careful about how you designed the searches that populate your summary index, especially if the search you wanted to run on the finished summary index involved aggregate statistics. You had to set up the "index-populating" search so that it did not produce incorrect results. For example, if you wanted to run a search on the finished summary index that gave you average response times broken out by server, you'd want to set up a summary-index-populating search that:

  • is scheduled to run on a more frequent basis than the search you plan to run against the summary index.
  • samples a larger amount of data than the search you plan to run against the summary index.
  • contains additional search commands that ensure that the index-populating search is generating a weighted average (only necessary if you are looking for an average in the first place).

The summary index reporting commands take care of the last two points for you--they automatically determine the adjustments that need to be made so that your summary index is populated with data that does not produce statistically inaccurate results. However, you still should arrange for the summary-index-populating search to run on a more frequent basis than the search that you later run against the summary index.

Interested in setting up summary indexes without the si- commands? Find out about the addinfo, collect, and overlap commands, learn how to devise searches that provide weighted averages, and review an example of summary index configuration via savedsearches.conf in the topic "Configure summary indexes," in this manual.
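The weighted-average concern mentioned in this section comes down to simple arithmetic. The sketch below uses two hypothetical hourly summary rows (the counts and averages are invented) to show why averaging stored averages gives a misleading result, and why carrying the event count alongside each average fixes it. It does not reproduce what the si- commands store.

```python
# Hypothetical hourly summary rows for one server:
# (event_count, average_response_ms). Values are invented.
hourly_rows = [(10, 100.0), (1000, 200.0)]

# Naive approach: average the hourly averages, ignoring how many
# events each hour represents.
naive_average = sum(avg for _, avg in hourly_rows) / len(hourly_rows)

# Weighted approach: weight each hourly average by its event count.
total_events = sum(count for count, _ in hourly_rows)
weighted_average = (
    sum(count * avg for count, avg in hourly_rows) / total_events
)
```

Here the naive figure is 150.0, while the weighted figure is close to 199, because nearly all events fall in the second hour.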

Summary indexing reporting command usage example

Let's say you've been running the following search, with a time range of the past year:

eventtype=firewall | top src_ip

This search gives you the top source IPs for the past year, but it takes a very long time to run because it scans across your entire index each time.

What you need to do is create a summary index that is composed of the top source IPs from the "firewall" event type. You can use the following search to build that summary index. You would schedule it to run on a daily basis, collecting the top src_ip values for only the previous 24 hours each time. The results of each daily search are added to an index named "summary":

eventtype=firewall | sitop src_ip

Note: Summary-index-populating searches are statistically more accurate if you schedule them to run and sample information on a more frequent basis than the searches you plan to run against the finished summary index. So in this example, because we plan to run searches that cover a timespan of a year, we set up a summary-index-populating search that samples information on a daily basis.

Important: When you define summary-index-populating searches, do not pipe other search operators after the main summary indexing reporting command. In other words, don't include additional | eval commands and the like. Save the extra search operators for the searches you run against the summary indexes, not the search you use to populate it.

Important: The results from a summary-indexing optimized search are stored in a special format that cannot be modified before the final transformation is performed. This means that if you populate a summary index with ... | sistats <args>, the only valid retrieval of the data is: index=<summary> source=<saved search name> | stats <args>. The search against the summary index cannot create or modify fields before the | stats <args> command.

Now, let's say you save this search with the name "Summary - firewall top src_ip" (all saved summary-index-populating searches should have names that identify them as such). After your summary index is populated with results, search and report against that summary index using a search that specifies the summary index and the name of the search that you used to populate it. For example, this is the search you would use to get the top source IPs over the past year:

index=summary search_name="Summary - firewall top src_ip" | top src_ip

Because this search specifies the search name, it filters out other data that have been placed in the summary index by other summary indexing searches. This search should run fairly quickly, even if the time range is a year or more.
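To see why the yearly search over the summary index has so little work to do, consider this conceptual sketch. It assumes, purely for illustration, that each daily run leaves behind per-IP counts; the actual format written by sitop is different and is not modeled here. The IP addresses and counts are made up.

```python
from collections import Counter

# Hypothetical per-day counts of events by source IP (invented values).
daily_counts = [
    Counter({"10.0.0.1": 50, "10.0.0.2": 30}),
    Counter({"10.0.0.2": 40, "10.0.0.3": 35}),
    Counter({"10.0.0.1": 5, "10.0.0.3": 10}),
]

# The long-range report only has to add up the small daily results.
combined = Counter()
for day in daily_counts:
    combined.update(day)

top_two = combined.most_common(2)
```

Note that 10.0.0.2 comes out on top overall even though it was never the busiest source by a wide margin on any single day, which is why the daily results need to carry counts rather than just rankings.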

Note: If you are running a search against a summary index that queries for events with a specific sourcetype value, be aware that you need to use orig_sourcetype instead. So instead of running a search against a summary index like ... | timechart avg(ip) by sourcetype, use ... | timechart avg(ip) by orig_sourcetype.

Why do you have to do this? When events are gathered into a summary index, Splunk Enterprise changes their sourcetype values to "stash" and moves the original sourcetype values to orig_sourcetype.

Set up summary index searches in Splunk Web

You can set up summary index searches through Splunk Web. Summary indexing is an alert option for scheduled reports. Once you determine the report that you want to use to populate a summary index, follow these steps:

1. Navigate to System > Searches and Reports. Select the name of a report (or click New to create a new report).

2. Under Schedule and alert, select Schedule this search if the report isn't already scheduled. Schedule the report to run on an appropriate interval. Remember that searches that populate summary indexes should run on a fairly frequent basis in order to create statistically accurate final reports. If the report you're running against the summary index is gathering information for the past week, you should have the summary report run on an hourly basis, collecting information for each hour. If you're running reports over the past year's worth of data, you might have the summary index collect data on a daily basis for the past day.

Note: Be sure to schedule the report so that there are no data gaps or overlaps. For more on this, see the subtopic on this issue, below.

3. Under Alert, select a Condition value of always.

4. Select an Alert mode of Once per search. This ensures that the alert will be triggered each time the report runs.

5. Under Summary indexing, select Enable.

6. Select the name of the summary index that the report will populate from the Select the summary index list. The default summary index is named summary. Only indexes that you have permission to write to are listed. You may need to create additional summary indexes if you plan to run a variety of summary index reports. For information about creating new indexes, see "Set up multiple indexes" in the Managing Indexers and Clusters manual. It's a good idea to create indexes that are dedicated to the collection of summary data.

Note: If you enter the name of an index that does not exist, Splunk Enterprise will run the report on the schedule you've defined, but its data will not get saved to a summary index.

7. (Optional) Under Add fields, you can add field/value pairs to the summary index definition. These key/value pairs will be annotated to each event that gets summary indexed, making it easier to find them with later searches. For example, you could add the name of the report populating the summary index (report=summary_firewall_top_src_ip) or the name of the index that the report populates (index=summary), and then search on those terms later.

Note: You can also add field/value pairs to the summary index configuration in savedsearches.conf. For more information, see "Configure summary indexes" in the Knowledge Manager manual.

For more information about saving searches as reports and alerts, see: "Create and edit reports" (in the Reporting Manual) and "Create an alert" (in the Alerting Manual).

Schedule the populating report to avoid data gaps and overlaps

To minimize data gaps and overlaps, be sure to set appropriate intervals and delays in the schedules of the reports you use to populate summary indexes.

Gaps in a summary index are periods of time when a summary index fails to index events. Gaps can occur if:

  • The summary-index-populating report takes too long to run and runs past the next scheduled run time. For example, if you schedule the report that populates the summary index to run every 5 minutes when that report typically takes around 7 minutes to run, you will have problems, because the report cannot start a new run while a preceding run is still in progress.
  • You have forced the summary-index-populating report to use real-time scheduling. You do this by mistakenly changing the report definition in savedsearches.conf so that the realtime_schedule attribute is set to 1, enabling real-time scheduling. This setting can result in data collection gaps if you are concurrently running several reports. When you define a summary-index-populating scheduled report in Splunk Web by selecting Enable for summary indexing and saving the report, Splunk Enterprise automatically sets realtime_schedule to 0, to ensure that the report never skips a scheduled run. For more information see "Configure the priority of scheduled reports", in the Reporting Manual.
  • splunkd goes down. If Splunk Enterprise can't index events, you will have gaps in your summary indexes.

Overlaps are events in a summary index (from the same report) that share the same timestamp. Overlapping events skew reports and statistics created from summary indexes. Overlaps can occur if you set the time range of a report to be longer than the frequency of the schedule of the report. In other words, don't arrange for a report that runs hourly to gather data for the past 90 minutes.

Note: If you think you have gaps or overlaps in your summary index data, Splunk Enterprise provides methods of detecting them and either backfilling them (in the case of gaps) or deleting the overlapping events. For more information, see "Manage summary index gaps and overlaps" in this manual.
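The scheduling rules in this section can be expressed as two quick checks. This is only a sketch: the function names are hypothetical, times are in minutes, and the "gap" case for a time range shorter than the schedule interval is simple arithmetic rather than something this topic lists explicitly.

```python
def coverage_issue(interval_minutes, range_minutes):
    """Compare how often a report runs with how far back each run looks."""
    if range_minutes > interval_minutes:
        return "overlap"  # consecutive runs summarize some of the same time
    if range_minutes < interval_minutes:
        return "gap"      # some time between runs is never summarized
    return "ok"

def run_exceeds_interval(interval_minutes, typical_runtime_minutes):
    """True when a run is likely still in progress at the next scheduled time."""
    return typical_runtime_minutes > interval_minutes

# The examples from the text: an hourly report that looks back 90 minutes,
# and a 5-minute schedule for a report that takes about 7 minutes.
hourly_with_90m_range = coverage_issue(60, 90)
five_minute_schedule = run_exceeds_interval(5, 7)
```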

How summary indexing works

In Splunk Web, summary indexing is an alert option for scheduled saved searches. When you run a saved search with summary indexing turned on, its search results are temporarily stored in a file ($SPLUNK_HOME/var/spool/splunk/<savedsearch_name>_<random-number>.stash). From the file, Splunk Enterprise uses the addinfo command to add general information about the current search and the fields you specify during configuration to each result. Splunk Enterprise then indexes the resulting event data in the summary index that you've designated for it (index=summary by default).

Note: Use the addinfo command to add fields containing general information about the current search to the search results going into a summary index. General information added about the search helps you run reports on results you place in a summary index.

Summary indexing of data without timestamps

To set the time for summary index events, Splunk Enterprise uses the following information, in this order of precedence:

1. The _time value of the event being summarized.

2. The earliest (or minimum) time of the scheduled search that populates the summary index. For example, if the summary-index-populating search covers the two minutes preceding each launch of its search, its earliest time is -2m.

3. The current system time (in the case of an "all time" search, where no "earliest" value is specified)

In the majority of cases, your events will have timestamps, so the first method of discerning the summary index timestamp holds. But if you are summarizing data that doesn't contain an _time field (such as data from a lookup), the resulting events will have the timestamp of the earliest time of the summary-index-populating search.

For example, if you summarize the lookup "asset_table" every night at midnight, and the asset table does not contain an _time column, tonight's summary will have an _time value equal to the earliest time of the search. If you have set the time range of the search to be between -24h and +0s, each summarized event will have an _time value of now()-86400 (that's the start time of the search minus 86,400 seconds, or 24 hours). This means that every event without an _time field value that is found by this summary-index-populating search will be given the exact same _time value: the search's earliest time.

The best practice for summarizing data without a time stamp is to manually create an _time value as part of your search. Following on from the example above:

|inputlookup asset_table | eval _time=now()
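The order of precedence described in this section can be summarized as a small function. This is an illustrative sketch only: the function and parameter names are hypothetical, times are epoch seconds, and `None` stands in for a missing value.

```python
def summary_event_time(event_time, search_earliest, system_now):
    """Pick a timestamp for a summary event using the precedence above."""
    if event_time is not None:
        return event_time        # 1. the event's own _time value
    if search_earliest is not None:
        return search_earliest   # 2. earliest time of the populating search
    return system_now            # 3. current system time ("all time" search)

# A lookup row with no _time, summarized by a search covering the last 24 hours.
now = 1_400_000_000
lookup_row_time = summary_event_time(None, now - 86400, now)
```

Every row without its own _time ends up with the same value, which is why setting _time explicitly in the search, as shown above, is the recommended practice.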

Fields added to summary-indexed data by the si- summary indexing commands

When you run searches with the si* commands in order to populate a summary index, Splunk Enterprise adds a set of special fields to the summary index data that all begin with psrsvd, such as psrsvd_ct_bytes and psrsvd_v and so on. When you run a search against the summary index with reporting commands like chart, timechart, and stats, Splunk Enterprise uses the psrsvd* fields to calculate results for tables and charts that are statistically correct. psrsvd stands for "prestats reserved."

Most psrsvd types present information about a specific field in the original (pre-summary indexing) dataset, although some psrsvd types are not scoped to a single field. The general pattern is psrsvd_[type]_[fieldname]. For example, psrsvd_ct_bytes presents count information for the bytes field.

Here's a list of the available psrsvd types:

  • ct = count
  • gc = group count (the count for a stats "grouping," not scoped to a single field)
  • nc = numerical count (number of numerical values)
  • sm = sum
  • ss = sum of squares
  • v = version (not scoped to a single field)
  • vt = value type (contains the precision of the associated field)
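One way to see why a count, a sum, and a sum of squares are useful things to store is that partial results of this kind can be added together and still yield correct overall statistics. The sketch below borrows the ct, sm, and ss names from the list above, but the dictionaries, the values, and the way they are combined are assumptions made for illustration; this topic does not describe how Splunk Enterprise processes these fields internally.

```python
import math

# Hypothetical partial results from two runs of a populating search,
# for a numeric field. Underlying values: [1, 2, 3] and [4, 5].
partials = [
    {"ct": 3, "sm": 6.0, "ss": 14.0},
    {"ct": 2, "sm": 9.0, "ss": 41.0},
]

# Counts, sums, and sums of squares are all additive across runs.
ct = sum(p["ct"] for p in partials)
sm = sum(p["sm"] for p in partials)
ss = sum(p["ss"] for p in partials)

# Overall mean and population variance, recovered without the raw events.
mean = sm / ct
variance = ss / ct - mean ** 2
stdev = math.sqrt(variance)
```

The result matches what you would get from the five original values directly, which an average of the two per-run averages would not.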

This documentation applies to the following versions of Splunk: 6.0, 6.0.1, 6.0.2, 6.0.3.


Comments

I think we should talk about how SI indexes are actually created, in a workflow format. For example (don't quote me, I could be wrong):

1) Create a summary search, schedule it to an index
1a) Schedule the summary search with a shorter time range than the frequency of the search and a larger sample that uses the summary
2) SH creates $SPLUNK_HOME/var/spool/splunk/_.stash temporary
3) The SH is configured somehow to act as forwarder and sinkholes/batch the stash file to the IDXs, which indexes it to the specified summary index
4) User runs a search against the summary and the IDX reads the summary index like any other search

Skawasaki splunk
January 29, 2014
