Splunk Cloud Platform

Knowledge Manager Manual

Design searches that populate summary events indexes

This topic does not apply to summary metrics indexes.

Splunk administrators typically decide to create a summary index when they have a transforming search that tends to complete slowly. This happens because it has to run over a large dataset over a long range of time, often to pick out a small slice of that data.

You fix this not by changing the search, but by changing the source of the data. Instead of running that search over a huge and varied index, you run it over a summary index that contains only those events (or metrics, if you have created a summary metrics index) that are relevant to the search.

If your intent is to create a summary events index, you need to design another search that is identical to the original search, but which replaces the ordinary transforming command in the search (such as stats, chart, or timechart) with a command from the si* family of summary indexing transforming commands: sistats, sichart, sitimechart, sitop, and sirare.

Why use the si* commands? The si* commands perform a bit of extra work to ensure that the summary index returns statistically accurate results for the searches you run against it. If you decide not to use the si* commands, you need to manually calibrate the search to ensure that it is sampling the correct amount of data (and calculating weighted averages, if the search involves averages). For more information about setting up summary indexes the hard way, see Configure summary events indexes.
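To see why averages need weighting when you summarize manually, consider this minimal Python sketch. It is only an illustration of the arithmetic, not of how the si* commands work internally, and the daily values and the `daily_summaries` structure are made up for the example.

```python
# Hypothetical daily summaries: each holds an average and the number of
# events that average was computed from. The values are illustrative only.
daily_summaries = [
    {"avg": 15.0, "count": 2},  # day 1: events with values 10 and 20
    {"avg": 30.0, "count": 1},  # day 2: one event with value 30
]

# Naive approach: average the daily averages, ignoring event counts.
naive_avg = sum(d["avg"] for d in daily_summaries) / len(daily_summaries)

# Weighted approach: weight each daily average by its event count.
total_count = sum(d["count"] for d in daily_summaries)
weighted_avg = sum(d["avg"] * d["count"] for d in daily_summaries) / total_count

print(naive_avg)     # 22.5
print(weighted_avg)  # 20.0, the true average of 10, 20, and 30
```

The naive result is skewed toward the day with fewer events. Only the weighted result matches the average you would get by running the search over the original events.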

For an overview of summary indexing, see Use summary indexing for increased search efficiency.

Example of a search for the purpose of populating a summary events index

Let's say you've been running the following search over a typical events index, with a time range of one year. Furthermore, let's say that this search completes slowly because of the wide time range and because the index is a very large and varied dataset.

eventtype=firewall | top src_ip

You need to create a summary index that is composed of the top source IPs from the "firewall" event type. You can use the following search to build that summary index. You would schedule it to run on a daily basis, collecting the top src_ip values for only the previous 24 hours each time. It adds the results of each daily search to an index named "summary_src_ip".

eventtype=firewall | sitop src_ip

Now, let's say you save this search with the name "Summary - firewall top src_ip" (all saved summary-index-populating searches should have names that identify them as such). After your summary index is populated with results, search and report against that summary index using a search that specifies the summary index and the name of the search that you used to populate it. For example, this is the search you would use to get the top source IPs over the past year:

index=summary_src_ip search_name="Summary - firewall top src_ip" | top src_ip

Because this search specifies the search name, it filters out other data that has been placed in the summary index by other summary indexing searches. This search should complete much faster, even with a one-year time range, because it is searching over a smaller, more focused dataset.

Considerations for summary events index searches

When you create a search that will populate a summary events index with its results, there are a few things you should know.

  • The search should return statistical data in a table format, and the _raw field should not be present in the results.
    If your summary-populating search includes the _raw field in its results, the Splunk software focuses on reparsing the _raw strings and ignores other fields associated with those strings, including _time. Summarized data without _time fields is difficult to search.
    You can base summary event indexes on searches that return events, but getting them to work correctly can be tricky.
  • The search should not have other search operators after the transforming si* command.
    Do not include additional commands such as eval. Save the extra search operators for the searches you plan to run against the summary index.
    The results from a summary-indexing optimized search are stored in a special format that cannot be modified before the final transformation is performed.
    If you populate a summary index with ... | sistats <args>, the only valid retrieval of the data is: index=<summary> source=<saved search name> | stats <args>. The search against the summary index cannot create or modify fields before the | stats <args> command.
  • If you are running a search against a summary index that queries for events with a specific sourcetype value, use orig_sourcetype instead.
    When the Splunk software gathers events into a summary events index, it changes all sourcetype values to stash. The Splunk software moves the original sourcetype values to orig_sourcetype.
    So, instead of running a search against a summary index like ... | stats avg(ip) by sourcetype, use ... | stats avg(ip) by orig_sourcetype.

Fields added to summary-indexed data by the si* summary indexing commands

Use of these fields and their encoded data by any search commands other than the si* summary indexing commands is unsupported. The format and content of these fields can change at any time without warning.

When you run searches with the si* commands in order to populate a summary index, Splunk software adds a set of special fields to the summary index data that all begin with psrsvd, such as psrsvd_ct_bytes and psrsvd_v and so on. When you run a search against the summary index with transforming commands like chart, timechart, and stats, the psrsvd* fields are used to calculate results for tables and charts that are statistically correct. psrsvd stands for "prestats reserved."

Most psrsvd types present information about a specific field in the original (pre-summary indexing) dataset, although some psrsvd types are not scoped to a single field. The general pattern is psrsvd_[type]_[fieldname]. For example, psrsvd_ct_bytes presents count information for the bytes field.

Here is a list of some of the available psrsvd types:

  • ct = count
  • et = earliest time
  • gc = group count (the count for a stats "grouping," not scoped to a single field)
  • lt = latest time
  • nc = numerical count (number of numerical values)
  • nn = minimum numerical value
  • nx = maximum numerical value
  • rd = rdigest of values (values and the number of times they appear)
  • sm = sum
  • sn = minimum lexicographical value
  • ss = sum of squares
  • sx = maximum lexicographical value
  • v = version (not scoped to a single field)
  • vm = value map (all distinct values for the field and the number of times they appear)
  • vt = value type (contains the precision of the associated field)
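As a rough intuition for why partial aggregates such as a count, a sum, and a sum of squares are useful, the following Python sketch shows that they can be added together across batches and still yield an exact mean and variance. This is a general technique, not a description of the actual psrsvd encoding. The dictionary keys only borrow the type abbreviations from the list above, the helper functions are hypothetical, and the code does not read real psrsvd fields, which remains unsupported.

```python
def partial(values):
    """Summarize one batch of numbers as count, sum, and sum of squares."""
    return {"ct": len(values), "sm": sum(values), "ss": sum(v * v for v in values)}


def merge(a, b):
    """Combine two partial summaries by adding their components."""
    return {key: a[key] + b[key] for key in ("ct", "sm", "ss")}


def mean_and_variance(p):
    """Derive the mean and population variance from a merged summary."""
    mean = p["sm"] / p["ct"]
    variance = p["ss"] / p["ct"] - mean * mean
    return mean, variance


# Two hypothetical daily batches, summarized separately and then merged.
day1 = partial([1, 2, 3])
day2 = partial([4, 5])
combined = merge(day1, day2)

print(combined)                     # {'ct': 5, 'sm': 15, 'ss': 55}
print(mean_and_variance(combined))  # (3.0, 2.0)
```

The merged result is the same as computing the statistics over all five values at once, which is the property a summary needs in order to give statistically correct answers over long time ranges.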

Lexicographical order

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

  • Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
  • Uppercase letters are sorted before lowercase letters.
  • Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.
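The rules above can be reproduced with a short Python sketch that sorts strings by their UTF-8 bytes. It illustrates lexicographical order in general and is not Splunk code. The sample values in `mixed` are made up.

```python
def lexicographic(items):
    """Sort strings by their UTF-8 encoded bytes."""
    return sorted(items, key=lambda s: s.encode("utf-8"))


# Numbers are compared character by character, not by numeric value.
numbers = ["10", "9", "70", "100"]
print(lexicographic(numbers))  # ['10', '100', '70', '9']

# Symbols, digits, uppercase letters, and lowercase letters interleave
# according to their encoded values.
mixed = ["apple", "Banana", "42", "#tag", "_internal", "~tmp"]
print(lexicographic(mixed))
# ['#tag', '42', 'Banana', '_internal', 'apple', '~tmp']
```

In this example "#" sorts before the digits, "_" falls between the uppercase and lowercase letters, and "~" sorts after the lowercase letters.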

Summary indexing of data without timestamps

To set the time for summary index events, Splunk software uses the following information, in this order of precedence:

  1. The _time value of the event being summarized.
  2. The earliest (or minimum) time of the scheduled search that populates the summary index. For example, if the summary-index-populating search covers the two minutes preceding each launch of its search, its earliest time is -2m.
  3. The current system time (in the case of an "all time" search, where no "earliest" value is specified).

In the majority of cases, your events will have timestamps, so the first method of discerning the summary index timestamp holds. But if you are summarizing data that doesn't contain an _time field (such as data from a lookup), the resulting events will have the timestamp of the earliest time of the summary-index-populating search.

For example, if you summarize the lookup "asset_table" every night at midnight, and the asset table does not contain an _time column, tonight's summary will have an _time value equal to the earliest time of the search. If you set the time range of the search to be between -24h and +0s, each summarized event will have an _time value of now()-86400: the start time of the search minus 86,400 seconds, or 24 hours. This means that every event without an _time field value that is found by this summary-index-populating search is given the exact same _time value: the search's earliest time.
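The order of precedence can be modeled with a small Python sketch. The function `summary_event_time` and its arguments are hypothetical names used only to illustrate the rule. This is not Splunk's implementation, and the epoch value for `now` is arbitrary.

```python
def summary_event_time(event_time, search_earliest, now):
    """Pick a summary event timestamp using the documented precedence.

    All arguments are epoch seconds. Pass None for a value that is absent.
    """
    if event_time is not None:       # 1. the event's own _time
        return event_time
    if search_earliest is not None:  # 2. the earliest time of the search
        return search_earliest
    return now                       # 3. the current system time


now = 1_700_000_000       # arbitrary example epoch time
earliest = now - 86_400   # an earliest time of -24h

# A lookup row with no _time receives the search's earliest time.
print(summary_event_time(None, earliest, now))  # 1699913600

# An event that has its own _time keeps it.
print(summary_event_time(1_699_950_000, earliest, now))  # 1699950000

# An "all time" search with no earliest value falls back to the current time.
print(summary_event_time(None, None, now))  # 1700000000
```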

If you base a summary events index on a search that returns events instead of statistics, and if the _raw field exists in those events, the summary indexing process focuses on parsing the _raw fields and ignores the _time fields.

The best practice for summarizing events without a timestamp is to have your search add a _time value to each event:

|inputlookup asset_table | eval _time=now()

Last modified on 13 June, 2024

This documentation applies to the following versions of Splunk Cloud Platform: 9.3.2408, 8.2.2112, 8.2.2201, 8.2.2202, 8.2.2203, 9.0.2205, 9.0.2208, 9.0.2209, 9.0.2303, 9.0.2305, 9.1.2308, 9.1.2312, 9.2.2403, 9.2.2406 (latest FedRAMP release)

