Splunk® IT Service Intelligence

Administer Splunk IT Service Intelligence


Tune notable event grouping in ITSI

Notable event aggregation policies group notable events to organize them in Episode Review. ITSI provides a file called itsi_rules_engine.properties, located at $SPLUNK_HOME/etc/apps/SA-ITOA/default/, where you can tune and customize notable event grouping settings.

Prerequisites

  • Only users with file system access, such as system administrators, can tune notable event grouping using a configuration file.
  • Review the steps in How to edit a configuration file in the Admin Manual.

Never change or copy the configuration files in the default directory. The files in the default directory must remain intact and in their original location.

Steps

  1. Open or create a local itsi_rules_engine.properties file at $SPLUNK_HOME/etc/apps/SA-ITOA/local.
  2. Modify the following settings as necessary to improve notable event grouping on your deployment.
# The period, in seconds, at which to fetch aggregation policies from the KV store.
policy_fetch_period = 45

# The HTTP token name.
token_name = itsi_group_alerts_token

# The HTTP sync token name.
# NOTE: If the sync token name and the HTTP token name are the same, a token
# with async functionality is created.
sync_token_name = itsi_group_alerts_token

# The timeout value for receiving an acknowledgement from HEC.
# When processing a notable event and the action criteria are met, this setting
# ensures that the current event is indexed before executing an action.
http_ack_time_out = 10

# The number of split-by hash keys that can exist for a single aggregation policy that splits events
# by field(s). Split-by hash keys are the possible combinations of values from individual split-by fields.
# For example, if an aggregation policy is split by 'host' and 'severity', it creates separate hash keys 
# for the host-severity combinations of host1 and severity high, host1 and severity low, host2 
# and severity low, etc. If episodes are created for 10000 different host-severity combinations, the limit is reached.
# If you exceed this limit, the hash keys and the episodes associated with them are cleared from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
# If you increase this setting, recalculate the `max_event_in_parent_group` setting and increase it accordingly.
sub_group_limit = 10000
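The split-by hash key count described above grows multiplicatively with the number of distinct values in each split-by field. This is a minimal illustrative sketch (not ITSI code), using hypothetical 'host' and 'severity' values, of how those combinations are counted against `sub_group_limit`:

```python
# Sketch: each distinct combination of split-by field values is one hash key.
from itertools import product

# Hypothetical field values observed in incoming notable events.
hosts = ["host1", "host2"]
severities = ["low", "high"]

# Each (host, severity) pair is a separate split-by hash key.
hash_keys = list(product(hosts, severities))
print(len(hash_keys))  # 4 keys, well under the default sub_group_limit of 10000
```

With 100 hosts and 100 severity values, the same policy would produce 10,000 hash keys and hit the default limit.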

# The number of episodes that can be created for each split-by hash key for an 
# aggregation policy that splits events by field(s). 
# If you exceed this limit, the episodes associated with the hash key are cleared from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
# If you increase this setting, recalculate the `max_event_in_parent_group` setting and increase it accordingly.
max_groups_per_sub_group = 10000

# The maximum number of events that can be contained within a single episode. 
# If you exceed this limit, the episode breaks and a new episode is created. 
# If you increase this setting, recalculate the `max_event_in_parent_group` setting and increase it accordingly.
max_event_in_group = 10000

# The total number of events that can be created by an aggregation policy. 
# If you exceed this limit, ITSI clears all events associated with this aggregation policy from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
# This limit is calculated by multiplying `sub_group_limit` * `max_groups_per_sub_group` * `max_event_in_group`.
max_event_in_parent_group = 1000000000000
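As a quick check of the calculation in the comment above, the default value of `max_event_in_parent_group` is the product of the three lower-level defaults:

```python
# Default values of the three lower-level limits, as listed above.
sub_group_limit = 10000
max_groups_per_sub_group = 10000
max_event_in_group = 10000

# The parent limit is the product of the three.
max_event_in_parent_group = (
    sub_group_limit * max_groups_per_sub_group * max_event_in_group
)
print(max_event_in_parent_group)  # 1000000000000
```

This is why the settings above tell you to recalculate `max_event_in_parent_group` whenever you raise any of the three factors.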

# An ACK token ensures that an event is being indexed before running an action on it.
# However, events are forwarded to the indexer from the search head, which adds another delay.
# This field (in milliseconds) adds an additional delay before running an action on events or groups.
action_execution_delay = 0

# When fetching events to perform actions on an episode, the amount of time, in seconds, to
# subtract from the earliest_time on the search before executing an action.
# This setting helps prevent grouping inaccuracies when events are milliseconds apart.
earliest_time_lag = 300

# The number of minutes in the past to check for grouping of duplicate events.
# For example, if you change this setting to "10", ITSI looks back 10 minutes prior to the current event. If an
# identical event was added to an episode in the last 10 minutes, the current event is ignored and not grouped.
event_grouping_dedup_period = 0
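The dedup behavior described above can be sketched as a lookback-window check. This is a hypothetical illustration, not ITSI's implementation; the fingerprint key and `should_skip` helper are invented for the example:

```python
# Sketch of the dedup window: an event identical to one already grouped
# within the lookback window is skipped.
from datetime import datetime, timedelta

event_grouping_dedup_period = 10  # minutes; the default of 0 disables dedup


def should_skip(event_fingerprint, last_grouped_at, now):
    """Return True if an identical event was grouped within the window."""
    if event_grouping_dedup_period == 0:
        return False
    window_start = now - timedelta(minutes=event_grouping_dedup_period)
    seen = last_grouped_at.get(event_fingerprint)
    return seen is not None and seen >= window_start


now = datetime(2020, 1, 1, 12, 0)
history = {"hostA|disk_full": datetime(2020, 1, 1, 11, 55)}
print(should_skip("hostA|disk_full", history, now))  # True: seen 5 minutes ago
```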

# The delay, in seconds, to batch update episode state. Otherwise, the KV store is accessed too often.
# It is recommended that you do not set this to a value below 20.
group_state_batch_delay = 28

# The event cache expiration limit.
# After this time passes, in seconds, ITSI begins to remove events from the cache.
event_cache_expiry_time = 180

# The maximum number of events the event cache can contain.
# Once the maximum is reached, the cache is cleared.
event_cache_max_entries = 1000000

# Whether to validate the current state of the Rules Engine upon startup. 
validate_rules_engine_state_on_startup = true

# The maximum number of times to restart the Rules Engine if it is not in the correct state upon startup.
max_rt_search_retry_count = 3

# The various types of error messages to check for at the start of a search job. 
# The presence of any of these messages could indicate potential problems.
# If any of these messages are present, the Rules Engine stops. 
exit_condition_message_pattern = (?i).*?nable to distribute to peer.*|.*?nable to distribute to the peer.*|.*?might have returned partial results.*|.*?earch results might be incomplete.*
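You can sanity-check a candidate pattern against sample search-job messages before deploying it. The sketch below copies the default pattern from the setting above and tests it with Python's `re` module; note this is only an approximation, since the Rules Engine may apply its own regex engine semantics:

```python
# Test the default exit_condition_message_pattern against sample messages.
import re

pattern = re.compile(
    r"(?i).*?nable to distribute to peer.*"
    r"|.*?nable to distribute to the peer.*"
    r"|.*?might have returned partial results.*"
    r"|.*?earch results might be incomplete.*"
)

msg = "Unable to distribute to peer named idx01"
print(bool(pattern.match(msg)))          # True: matches the first alternative
print(bool(pattern.match("All good")))   # False: no alternative matches
```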

# The maximum number of times to try a backfill search if any messages are detected that
# match those in the 'exit_condition_message_pattern' setting. These messages are encountered
# when a peer is unavailable or unreachable, which might cause the Rules Engine to miss events. 
backfill_search_retry_count = 3

# The maximum number of times to try any internal search other than a backfill search or a Rules Engine real-time 
# search if any messages are detected that match those in the 'exit_condition_message_pattern' setting. These messages
# are encountered when a peer is unavailable or unreachable, which might cause the Rules Engine to miss data.
search_retry_count = 3

# The amount of time to wait, in milliseconds, before retrying a search job.
search_retryperiod_ms = 500

##########
# ITSI Rules Engine - Resilience Manager Configuration
##########

# A Splunk search to get events that the Rules Engine failed to group.
# Use small time ranges with this search because of the Splunk join command's 50,000-row limit.
grouping_missed_events_search = search `itsi_event_management_index_with_close_events` \
  | join type=left event_id [ search `itsi_event_management_group_index` | table event_id, itsi_group_id \
  | rename itsi_group_id as backfill_group_id ] \
  | where isnull(backfill_group_id) \
  | fields _time, _raw, source, sourcetype

# The frequency, in seconds, to remind each policy executor to check time-based policies on their criteria.
# There is a fixed interval between policy check executions to avoid overlap.
# If the execution of the policy check takes longer than the interval,
# the subsequent execution starts after the prior one completes, plus the provided interval.
policy_rules_check_frequency = 60

# The frequency, in seconds, that the Rules Engine syncs in-memory episode state to the KV store. 
# Otherwise, the KV store is accessed too often.
# This setting reminds each policy executor to batch update all of an episode's information in the KV store.
# There is a fixed interval between group state sync executions to avoid overlap.
policy_group_state_sync_frequency = 28

# The frequency, in seconds, that the resilience manager reprocesses events that were not grouped.
periodic_backfill_frequency = 720

# The sliding time window, in seconds, used by the resilience manager to reprocess
# events that were not grouped.
periodic_backfill_time_window = 3600

# The time gap to Rules Engine real-time search, in seconds, used by the resilience manager
# to reprocess events that were not grouped.
periodic_backfill_to_realtime_gap = 720

# The number of attempts that EventBackfillActor tries before stopping the periodic backfill search.
# Periodic backfill search job wait time is calculated from the job check limit.
# Periodic backfill search job wait time: f(N) = N(N+1)/2, where N = job check limit.
# If N is 15, the job wait time limit is 15(15+1)/2 = 120 seconds.
periodic_backfill_search_job_check_limit = 15
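The wait-time formula in the comment above can be expressed directly; this small sketch (the function name is illustrative, not an ITSI API) reproduces the worked example for the default check limit of 15:

```python
# Wait-time formula from the setting's comment: f(N) = N(N+1)/2,
# where N is periodic_backfill_search_job_check_limit.
def backfill_job_wait_time(job_check_limit: int) -> int:
    """Total seconds waited across all periodic backfill job status checks."""
    return job_check_limit * (job_check_limit + 1) // 2


print(backfill_job_wait_time(15))  # 120 seconds, matching the default above
```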

This documentation applies to the following versions of Splunk® IT Service Intelligence: 4.4.1
