Best practices for implementing Event Analytics in ITSI
Consider the following best practices when configuring Event Analytics in IT Service Intelligence (ITSI).
As general guidance, we recommend that you follow these scale limits when implementing Event Analytics for best performance:
- 20 notable event aggregation policies with time-based conditions
- 20 notable event aggregation policies without time-based conditions
- 10k active episodes
- 10k notable events generated by a correlation search per minute
For more information about time-based action rules, see Activation criteria.
Best practices for implementing Event Analytics for ITSI services and KPIs
For best practices around leveraging ITSI's Event Analytics functionality to translate service and KPI health into notable events and episodes, see About the Content Pack for Monitoring and Alerting. The content pack provides a set of preconfigured correlation searches and notable event aggregation policies which, when enabled, produce meaningful and actionable alerts.
For more information about ITSI Event Analytics, see Overview of Event Analytics in ITSI.
Best practices for implementing Event Analytics for other data sources
The following best practices help you successfully ingest and aggregate third-party alerts in ITSI. For more information, see Ingest third-party alerts into ITSI.
To avoid duplicate events, use the same frequency and time range in correlation searches
When configuring a correlation search, consider using the same value for the search frequency and time range to avoid duplicate events. For example, a search might run every 5 minutes and also look back 5 minutes.
If there's latency in your data and you need to look for events you might have missed, consider expanding the time range. For example, the search could run every minute but look back 5 minutes.
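The difference between the two schedules can be illustrated with a small Python model. This is a sketch for reasoning only, not ITSI code: the function name, the half-open window convention, and the sample times (in seconds) are all assumptions made for the illustration.

```python
def runs_matching_event(event_time, frequency, lookback, horizon):
    """Return the scheduled run times whose window contains event_time.

    Assumption: a run at time `run` searches the half-open window
    [run - lookback, run). All values are in seconds.
    """
    matches = []
    run = frequency
    while run <= horizon:
        if run - lookback <= event_time < run:
            matches.append(run)
        run += frequency
    return matches


# Frequency equals time range: windows do not overlap.
aligned = runs_matching_event(event_time=130, frequency=300, lookback=300, horizon=900)

# Runs every minute but looks back 5 minutes: windows overlap.
overlapping = runs_matching_event(event_time=130, frequency=60, lookback=300, horizon=900)

print(aligned)      # [300]
print(overlapping)  # [180, 240, 300, 360, 420]
```

In this model, the aligned schedule returns the event in exactly one run, while the overlapping schedule returns the same event in five consecutive runs. That overlap is the trade-off you accept when you widen the time range to catch late-arriving data.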
To reduce load on your system, don't use a time range greater than 5 minutes
A calculation window longer than 5 minutes can put significant load on your system, especially if you have a high volume of incoming events. To avoid that extra load, consider reducing the time range to 5 minutes or less. One exception is data that arrives less frequently. For example, if your data comes in every 15 minutes, consider using a 15-minute time range.
Normalize all the important fields in your third-party events
When you're creating correlation searches, don't only normalize on obvious fields that exist in a lot of data sources, like host, severity, event type, message, and so on. It's also important to normalize fields that you know are important in your events. For example, when you're looking at Windows event logs, what do you look at to know if something is good or bad? Normalize those fields as well and use them to build out a common information model.
Perform this normalization process for every data source you have so you can easily identify important fields when creating aggregation policies.
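As a rough illustration of what normalization means, the following Python sketch maps raw events from two sources onto one set of common field names. In ITSI you would do this inside the correlation search itself; the sketch only models the idea. The source names, raw field names, and severity labels are hypothetical examples, not a mapping defined by ITSI.

```python
# Hypothetical per-source mappings from raw field names to common field names.
FIELD_MAP = {
    "nagios": {"host_name": "host", "state": "severity", "plugin_output": "message"},
    "windows": {
        "ComputerName": "host",
        "Type": "severity",
        "Message": "message",
        "EventCode": "event_code",  # a source-specific field that matters for triage
    },
}

# Hypothetical per-source mappings from raw severity values to common labels.
SEVERITY_MAP = {
    "nagios": {"OK": "normal", "WARNING": "medium", "CRITICAL": "critical"},
    "windows": {"Information": "info", "Warning": "medium", "Error": "high"},
}


def normalize(source, raw):
    """Rename known raw fields to common names and map severity to a shared scale."""
    out = {"source": source}
    for raw_name, common_name in FIELD_MAP[source].items():
        if raw_name in raw:
            out[common_name] = raw[raw_name]
    if "severity" in out:
        out["severity"] = SEVERITY_MAP[source].get(out["severity"], "unknown")
    return out


nagios_event = normalize(
    "nagios", {"host_name": "web01", "state": "CRITICAL", "plugin_output": "HTTP timeout"}
)
windows_event = normalize(
    "windows",
    {"ComputerName": "dc01", "Type": "Error", "Message": "Logon failure", "EventCode": 4625},
)
```

After normalization, both events expose host, severity, and message under the same names, and the Windows event also carries the extra field that you decided was important for that source.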
Use universal alerting to onboard external alert sources
Universal alerting simplifies and speeds up the process of onboarding external alert sources (such as Nagios or SolarWinds). Use the universal correlation search that is part of the Content Pack for Monitoring and Alerting to find and onboard external alerts to generate notable events and episodes. For more information, see About Universal Alerting in the Content Pack for ITSI Monitoring and Alerting.
Create one correlation search per data source
If you opt not to use universal alerting, for every third-party data source you're bringing into ITSI, create a single correlation search to normalize those fields and generate notable events. For example, one for SCOM, one for SolarWinds, and so on.
You can delete ITSI services that you no longer want to monitor. However, to keep that action from also disabling or deleting a correlation search that the service is part of, manually remove the service from the correlation search before you delete the service.
Don't create too many aggregation policies
Limit the number of aggregation policies you enable in your environment. Too many aggregation policies create too many groups, which produces an overly granular view of your IT environment. By limiting the number of policies, you create more end-to-end visibility and avoid creating silos of collaboration between groups in your organization. We recommend a maximum of 20 time-based aggregation policies and 20 aggregation policies that are not time-based. A time-based policy is any aggregation policy with breaking criteria or an action rule that includes a condition related to time, such as the number of seconds that an episode has existed. Too many time-based policies can cause performance concerns with the Rules Engine. Make sure to group events according to how those events are related, not based on how people work to resolve those issues.
Only select 5-10 fields for Smart Mode analysis
By default, when selecting fields to analyze for event similarity, Smart Mode selects any fields that have good event coverage. As a best practice, begin by unchecking all the boxes. Then select 5-10 fields that you've normalized in a correlation search.
Selecting 5-10 fields helps you generate episodes of an appropriate size and quantity. If you select fewer than five fields, you give the aggregation policy only a few things to look at when comparing similarity. For example, if the message of two events is somewhat similar and the location is similar, they might be grouped together. This can lead to a small number of very large groups. The opposite is also true. If you select too many fields, the chance that they are all similar is very low. This can lead to a large number of groups containing only one event.
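The effect of the field count on group size can be seen in a deliberately simplified Python model. This sketch does not reproduce Smart Mode, which compares similarity rather than exact equality. It swaps in exact-match grouping on the selected fields, and the events and field names are made up for the illustration.

```python
from collections import Counter

# Made-up, already-normalized events.
events = [
    {"host": "web01", "location": "nyc", "app": "checkout", "message": "timeout"},
    {"host": "web02", "location": "nyc", "app": "checkout", "message": "timeout"},
    {"host": "db01", "location": "nyc", "app": "billing", "message": "disk full"},
    {"host": "db02", "location": "nyc", "app": "billing", "message": "disk full"},
]


def group_sizes(events, fields):
    """Group events whose values match on every selected field; return group sizes."""
    groups = Counter(tuple(event[f] for f in fields) for event in events)
    return sorted(groups.values(), reverse=True)


too_few = group_sizes(events, ["location"])
balanced = group_sizes(events, ["location", "app", "message"])
too_many = group_sizes(events, ["host", "location", "app", "message"])

print(too_few)   # [4]
print(balanced)  # [2, 2]
print(too_many)  # [1, 1, 1, 1]
```

Even in this toy model, comparing on one field collapses everything into a single large group, while comparing on every field, including one that is unique per event, leaves each event in its own group.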
Using fields as tokens when running alert actions
When running an alert action (for example, Send an email or Create/update ServiceNow incident), you can use the following fields from either the episode or the triggering event as token values (for example, $result.description$):
By default, the field priority is set by the macro itsi_notable_event_actions_coalesce_state_values. The macro definition prioritizes fields from the episode over fields from the event that triggered the alert action. Fields that start with the prefix action_temp_ are fields from the event.
eval status=coalesce(status, action_temp_status) | eval owner=coalesce(owner, action_temp_owner) | eval severity=coalesce(severity,action_temp_severity) | eval title=coalesce(title, action_temp_title) | eval description=coalesce(description, action_temp_description) | fields - action_temp_*
If you want to use the event fields instead of the episode, change the order in which they are passed to the coalesce function. For example:
eval status=coalesce(action_temp_status, status) | eval owner=coalesce(action_temp_owner, owner) | eval severity=coalesce(action_temp_severity, severity) | eval title=coalesce(action_temp_title, title) | eval description=coalesce(action_temp_description, description) | fields - action_temp_*
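Because coalesce returns its first non-null argument, the argument order alone decides which source wins. The following Python sketch models that behavior outside of Splunk. The helper function and the sample episode and event values are hypothetical and exist only to show the effect of swapping the order.

```python
def coalesce(*values):
    """Return the first value that is not None, mirroring first-non-null selection."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical values: the episode has no owner, the triggering event does.
episode = {"title": "Checkout latency episode", "severity": "4", "owner": None}
event = {"title": "web01 timeout", "severity": "6", "owner": "oncall"}

fields = ("title", "severity", "owner")

# Default order: episode value first, event value as the fallback.
episode_first = {f: coalesce(episode.get(f), event.get(f)) for f in fields}

# Swapped order: event value first, episode value as the fallback.
event_first = {f: coalesce(event.get(f), episode.get(f)) for f in fields}

print(episode_first)
print(event_first)
```

With the episode first, the token values come from the episode except for owner, which falls back to the event because the episode has none. With the event first, all three values come from the event.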
Avoid manually triggering the event grouping search
Avoid running the itsi_event_grouping search manually. Running this search can trigger duplicate Rules Engine processes, leading to duplicate episodes.
Best practices for implementing Workload Management for ITSI
When implementing admission rules in workload management, exclude ITSI by adding AND NOT (app="SA-ITOA" OR app="itsi") at the end of your predicate condition. This exclusion keeps admission rules from filtering out ITSI searches, so ITSI continues to return accurate search results. For more information about workload management, see Workload Management overview.
This documentation applies to the following versions of Splunk® IT Service Intelligence: 4.11.0, 4.11.1, 4.11.2, 4.11.3, 4.11.4, 4.11.5, 4.11.6, 4.12.0 Cloud only, 4.12.1 Cloud only, 4.12.2 Cloud only, 4.13.0, 4.13.1, 4.13.2, 4.13.3, 4.14.0 Cloud only, 4.14.1 Cloud only, 4.14.2 Cloud only, 4.15.0, 4.15.1, 4.15.2, 4.15.3, 4.16.0 Cloud only