Design searches that populate summary events indexes
This topic does not apply to summary metrics indexes.
Splunk administrators typically decide to create a summary index when they have a transforming search that tends to complete slowly. This happens because it has to run over a large dataset over a long range of time, often to pick out a small slice of that data.
You fix this not by changing the search, but by changing the source of the data. Instead of running that search over a huge and varied index, you run it over a summary index that contains only those events (or metrics, if you have created a summary metrics index) that are relevant to the search.
If your intent is to create a summary events index, you need to design another search that is identical to the original search, but which replaces the ordinary transforming command in the search (such as stats, chart, or timechart) with a command from the si* family of summary indexing transforming commands: sistats, sichart, sitimechart, sitop, and sirare.
Why use the si* commands? The si* commands perform a bit of extra work to ensure that the summary index returns statistically accurate results for the searches you run against it. If you decide not to use the si* commands, you need to manually calibrate the search to ensure that it is sampling the correct amount of data (and calculating weighted averages, if the search involves averages). For more information about setting up summary indexes the hard way, see Configure summary events indexes.
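The need for weighted averages can be illustrated outside of Splunk with a small Python sketch. It is not how the si* commands are implemented. It only shows, with made-up daily figures, why averaging per-day averages gives a different answer than weighting each day by its event count. The names daily_summaries, naive_average, and weighted_average are hypothetical.

```python
# Hypothetical daily summaries: (event count, average bytes) for each day.
daily_summaries = [(1000, 200.0), (10, 5000.0), (500, 300.0)]


def naive_average(summaries):
    """Average the daily averages, ignoring how many events each day had."""
    return sum(avg for _, avg in summaries) / len(summaries)


def weighted_average(summaries):
    """Weight each daily average by its event count to recover the true mean."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(count * avg for count, avg in summaries)
    return total_sum / total_count


print(round(naive_average(daily_summaries), 2))     # skewed by the 10-event day
print(round(weighted_average(daily_summaries), 2))  # matches the mean over all events
```

The day with only 10 events dominates the naive result even though it contributes less than 1% of the events, which is the kind of error that manual calibration has to avoid.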
For an overview of summary indexing, see Use summary indexing for increased search efficiency.
Example of a search for the purpose of populating a summary events index
Let's say you've been running the following search over a typical events index, with a time range of one year. Furthermore, let's say that this search completes slowly because of the wide time range and because that index is a very large and varied dataset.
eventtype=firewall | top src_ip
You need to create a summary index that is composed of the top source IPs from the "firewall" event type. You can use the following search to build that summary index. You would schedule it to run on a daily basis, collecting the top src_ip values for only the previous 24 hours each time. It adds the results of each daily search to an index named "summary_src_ip".
eventtype=firewall | sitop src_ip
Now, let's say you save this search with the name "Summary - firewall top src_ip" (all saved summary-index-populating searches should have names that identify them as such). After your summary index is populated with results, search and report against that summary index using a search that specifies the summary index and the name of the search that you used to populate it. For example, this is the search you would use to get the top source IPs over the past year:
index=summary_src_ip search_name="Summary - firewall top src_ip" | top src_ip
Because this search specifies the search name, it filters out other data that has been placed in the summary index by other summary indexing searches. This search should complete much faster, even with a one-year time range, because it is searching over a smaller, more focused dataset.
Considerations for summary events index searches
When you create a search that will populate a summary events index with its results, there are a few things you should know.
- The search should return statistical data in a table format, and the _raw field should not be present in the results.
  - If your summary-populating search includes the _raw field in its results, the Splunk software focuses on reparsing the _raw strings and ignores other fields associated with those strings, including _time. Summarized data without _time fields is difficult to search.
  - You can base summary event indexes on searches that return events, but getting them to work correctly can be tricky.
- The search should not have other search operators after the transforming si* command.
  - Do not include additional commands such as eval. Save the extra search operators for the searches you plan to run against the summary index.
  - The results from a summary-indexing optimized search are stored in a special format that cannot be modified before the final transformation is performed.
  - If you populate a summary index with ... | sistats <args>, the only valid retrieval of the data is: index=<summary> source=<saved search name> | stats <args>. The search against the summary index cannot create or modify fields before the | stats <args> command.
- If you are running a search against a summary index that queries for events with a specific sourcetype value, use orig_sourcetype instead.
  - When the Splunk software gathers events into a summary events index, it changes all sourcetype values to stash. The Splunk software moves the original sourcetype values to orig_sourcetype.
  - So, instead of running a search against a summary index like ...|stats avg(ip) by sourcetype, use ...|stats avg(ip) by orig_sourcetype.
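The sourcetype consideration above can be pictured with a short Python simulation. This is only a sketch of the documented effect (sourcetype becomes stash, and the original value moves to orig_sourcetype), not Splunk code. The function to_summary_event and the sample sourcetype values are hypothetical.

```python
def to_summary_event(event):
    """Return a copy of an event dict as it might look after summary indexing."""
    summarized = dict(event)
    # The original sourcetype value is kept under orig_sourcetype.
    summarized["orig_sourcetype"] = event.get("sourcetype")
    # The sourcetype field itself is overwritten with "stash".
    summarized["sourcetype"] = "stash"
    return summarized


# Hypothetical source events with made-up sourcetype names.
events = [
    {"sourcetype": "firewall_a", "bytes": 120},
    {"sourcetype": "firewall_b", "bytes": 300},
]
summary = [to_summary_event(e) for e in events]

# Filtering on sourcetype finds nothing, because every value is now "stash".
by_sourcetype = [e for e in summary if e["sourcetype"] == "firewall_a"]
# Filtering on orig_sourcetype finds the expected event.
by_orig = [e for e in summary if e["orig_sourcetype"] == "firewall_a"]

print(len(by_sourcetype), len(by_orig))  # 0 1
```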
Fields added to summary-indexed data by the si* summary indexing commands
Use of these fields and their encoded data by any search commands other than the si* summary indexing commands is unsupported. The format and content of these fields can change at any time without warning.
When you run searches with the si*
commands in order to populate a summary index, Splunk software adds a set of special fields to the summary index data that all begin with psrsvd
, such as psrsvd_ct_bytes
and psrsvd_v
and so on. When you run a search against the summary index with transforming commands like chart
, timechart
, and stats
, the psrsvd*
fields are used to calculate results for tables and charts that are statistically correct. psrsvd
stands for "prestats reserved."
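To see why storing intermediate values rather than finished statistics keeps later results correct, consider this minimal Python sketch. It assumes a summary that keeps a count, a sum, and a sum of squares per batch. It is a general statistics illustration, not Splunk's implementation, and the class PartialStats and the function summarize are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Mergeable intermediate values for one batch of numbers."""
    count: int = 0
    total: float = 0.0
    sum_squares: float = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        self.sum_squares += value * value

    def merge(self, other):
        # Partial values from separate batches combine by simple addition.
        return PartialStats(
            self.count + other.count,
            self.total + other.total,
            self.sum_squares + other.sum_squares,
        )

    def mean(self):
        return self.total / self.count

    def sample_stdev(self):
        variance = (self.sum_squares - self.total ** 2 / self.count) / (self.count - 1)
        return math.sqrt(max(variance, 0.0))


def summarize(values):
    stats = PartialStats()
    for v in values:
        stats.add(v)
    return stats


# Two hypothetical daily batches, summarized separately and merged later.
merged = summarize([1, 2, 3]).merge(summarize([4, 5]))
print(merged.mean())          # same mean as over [1, 2, 3, 4, 5]
print(merged.sample_stdev())  # same sample standard deviation as over all five values
```

Because the merged object carries raw counts and sums, the final mean and standard deviation match what a single pass over all of the data would produce.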
Most psrsvd types present information about a specific field in the original (pre-summary indexing) dataset, although some psrsvd types are not scoped to a single field. The general pattern is psrsvd_[type]_[fieldname]. For example, psrsvd_ct_bytes presents count information for the bytes field.
Here is a list of some of the available psrsvd types:
- ct = count
- et = earliest time
- gc = group count (the count for a stats "grouping," not scoped to a single field)
- lt = latest time
- nc = numerical count (number of numerical values)
- nn = minimum numerical value
- nx = maximum numerical value
- rd = rdigest of values (values and the number of times they appear)
- sm = sum
- sn = minimum lexicographical value
- ss = sum of squares
- sx = maximum lexicographical value
- v = version (not scoped to a single field)
- vm = value map (all distinct values for the field and the number of times they appear)
- vt = value type (contains the precision of the associated field)
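The psrsvd_[type]_[fieldname] naming pattern can be read mechanically, as this Python sketch shows. It is offered only to illustrate the naming convention when you inspect summary data. Relying on these fields in your own tooling is unsupported, as noted above. The function parse_psrsvd is a hypothetical helper, and psrsvd_sm_src_ip is a made-up field name used to show a field name that contains an underscore.

```python
def parse_psrsvd(field_name):
    """Split 'psrsvd_[type]_[fieldname]' into (type, fieldname).

    fieldname is None for types that are not scoped to a single field.
    """
    prefix = "psrsvd_"
    if not field_name.startswith(prefix):
        raise ValueError(f"not a psrsvd field: {field_name!r}")
    # Split only once so field names that contain underscores stay intact.
    parts = field_name[len(prefix):].split("_", 1)
    psrsvd_type = parts[0]
    source_field = parts[1] if len(parts) > 1 else None
    return psrsvd_type, source_field


print(parse_psrsvd("psrsvd_ct_bytes"))   # ('ct', 'bytes')
print(parse_psrsvd("psrsvd_v"))          # ('v', None)
print(parse_psrsvd("psrsvd_sm_src_ip"))  # ('sm', 'src_ip')
```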
Lexicographical order
Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.
- Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
- Uppercase letters are sorted before lowercase letters.
- Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.
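The ordering rules above can be reproduced with a short Python sketch that sorts strings by their UTF-8 bytes. It illustrates lexicographical order in general and is not Splunk code. The sample values in mixed are hypothetical.

```python
def lexicographic_sort(values):
    """Sort strings by their UTF-8 encoded bytes."""
    return sorted(values, key=lambda s: s.encode("utf-8"))


# Numbers compare digit by digit, not by magnitude.
print(lexicographic_sort(["10", "9", "70", "100"]))  # ['10', '100', '70', '9']

# Uppercase letters sort before lowercase letters.
print(lexicographic_sort(["b", "A", "a", "B"]))      # ['A', 'B', 'a', 'b']

# Symbols land in different places: '#' sorts before digits, '_' between
# uppercase and lowercase letters.
mixed = ["apple", "Zebra", "42", "_tmp", "#tag"]
print(lexicographic_sort(mixed))  # ['#tag', '42', 'Zebra', '_tmp', 'apple']
```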
Summary indexing of data without timestamps
To set the time for summary index events, Splunk software uses the following information, in this order of precedence:
- The _time value of the event being summarized.
- The earliest (or minimum) time of the scheduled search that populates the summary index. For example, if the summary-index-populating search covers the two minutes preceding each launch of its search, its earliest time is -2m.
- The current system time (in the case of an "all time" search, where no "earliest" value is specified).
In the majority of cases, your events will have timestamps, so the first method of discerning the summary index timestamp holds. But if you are summarizing data that doesn't contain an _time field (such as data from a lookup), the resulting events will have the timestamp of the earliest time of the summary-index-populating search.
For example, if you summarize the lookup "asset_table" every night at midnight, and the asset table does not contain an _time column, tonight's summary will have an _time value equal to the earliest time of the search. If you set the time range of the search to be between -24h and +0s, each summarized event will have an _time value of now()-86400: the start time of the search minus 86,400 seconds, or 24 hours. This means that every event without an _time field value that is found by this summary-index-populating search is given the exact same _time value: the search's earliest time.
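The order of precedence described in this section can be sketched as a small Python function. This is an illustration of the documented fallback order using epoch seconds, not Splunk's implementation. The function summary_event_time and the sample timestamp are hypothetical.

```python
def summary_event_time(event_time, search_earliest, now):
    """Pick a summary event timestamp (epoch seconds) by order of precedence."""
    if event_time is not None:
        # 1. The event's own _time value wins when it exists.
        return event_time
    if search_earliest is not None:
        # 2. Otherwise use the earliest time of the populating search.
        return search_earliest
    # 3. For an "all time" search with no earliest value, use the current time.
    return now


now = 1_700_000_000        # hypothetical launch time of the scheduled search
earliest = now - 86_400    # a -24h earliest time

print(summary_event_time(1_699_990_000, earliest, now))  # event has _time: 1699990000
print(summary_event_time(None, earliest, now))           # lookup row: 1699913600
print(summary_event_time(None, None, now))               # all-time search: 1700000000
```

Every row without an _time value falls through to the second branch, which is why all such rows from one run share the same timestamp.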
If you base a summary events index on a search that returns events instead of statistics, and if the _raw field exists in those events, the summary indexing process focuses on parsing the _raw fields and ignores the _time fields.
The best practice for summarizing events without a timestamp is to have your search add a _time value to each event:
|inputlookup asset_table | eval _time=now()
This documentation applies to the following versions of Splunk® Enterprise: 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.1.13, 8.1.14, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 8.2.10, 8.2.11, 8.2.12, 9.0.0, 9.0.1, 9.0.2, 9.0.3, 9.0.4, 9.0.5, 9.0.6, 9.0.7, 9.0.8, 9.0.9, 9.0.10, 9.1.0, 9.1.1, 9.1.2, 9.1.3, 9.1.4, 9.1.5, 9.1.6, 9.2.0, 9.2.1, 9.2.2, 9.2.3, 9.3.0, 9.3.1