Tune notable event grouping in ITSI
Notable event aggregation policies group notable events to organize them in Episode Review. ITSI provides a file called itsi_rules_engine.properties, located at $SPLUNK_HOME/etc/apps/SA-ITOA/default/, where you can tune and customize notable event grouping settings.
Upgrading overwrites the itsi_rules_engine.properties file, and your changes are not saved. Always save a local copy of this file and re-add your changes and additions after you upgrade.
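To re-add changes after an upgrade, it helps to know which settings in your saved copy differ from the freshly installed file. The following is a minimal sketch of that comparison, assuming the file consists of simple `key = value` lines, `#` comments, and trailing-backslash line continuations, as the settings shown in this topic do. The function names `parse_properties` and `changed_settings` are hypothetical helpers and are not part of ITSI.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping '#' comments and blank lines.

    A value that ends with a backslash continues on the next line.
    This is an illustrative parser, not the one ITSI itself uses.
    """
    settings = {}
    pending_key = None  # key whose value continues on the next line
    for raw in text.splitlines():
        line = raw.strip()
        if pending_key is not None:
            continued = line.endswith("\\")
            piece = line[:-1].rstrip() if continued else line
            settings[pending_key] = (settings[pending_key] + " " + piece).strip()
            if not continued:
                pending_key = None
            continue
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value.endswith("\\"):
            settings[key] = value[:-1].rstrip()
            pending_key = key
        else:
            settings[key] = value
    return settings


def changed_settings(backup_text, upgraded_text):
    """Return {key: (backup_value, upgraded_value)} for every setting in the
    backup whose value differs from, or is missing in, the upgraded file."""
    old = parse_properties(backup_text)
    new = parse_properties(upgraded_text)
    return {k: (v, new.get(k)) for k, v in old.items() if new.get(k) != v}


if __name__ == "__main__":
    backup = "# tuned locally\npolicy_fetch_period = 60\nsub_group_limit = 10000\n"
    upgraded = "policy_fetch_period = 45\nsub_group_limit = 10000\n"
    for key, (mine, shipped) in changed_settings(backup, upgraded).items():
        print(f"{key}: backup={mine} upgraded={shipped}")
```

Each reported key is a candidate to re-apply by hand. Review each value before copying it back, because a new release can change what a setting means.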
- Open the itsi_rules_engine.properties file.
- Modify the following settings as necessary to improve notable event grouping on your deployment.
# The period, in seconds, at which to fetch aggregation policies from the KV store.
policy_fetch_period = 45

# The HTTP token name.
token_name = itsi_group_alerts_token

# The HTTP sync token name.
# NOTE: If the sync token name and the HTTP token name are the same, a token
# with async functionality is created.
sync_token_name = itsi_group_alerts_token

# The timeout value for receiving an acknowledgement from HEC.
# When processing a notable event and the action criteria is met, this setting
# ensures that the current event is indexed before executing an action.
http_ack_time_out = 10

# The number of split-by hash keys that can exist for a single aggregation policy that splits events
# by field(s). Split-by hash keys are the possible combinations of values from individual split-by fields.
# For example, if an aggregation policy is split by 'host' and 'severity', it creates separate hash keys
# for the host-severity combinations of host1 and severity high, host1 and severity low, host2
# and severity low, etc. If episodes are created for 10000 different host-severity combinations, the limit is reached.
# If you exceed this limit, the hash keys and the episodes associated with them are cleared from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
# If you increase this setting, recalculate the `max_event_in_parent_group` setting and increase it accordingly.
sub_group_limit = 10000

# The number of episodes that can be created for each split-by hash key for an
# aggregation policy that splits events by field(s).
# If you exceed this limit, the episodes associated with the hash key are cleared from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
# If you increase this setting, recalculate the `max_event_in_parent_group` setting and increase it accordingly.
max_groups_per_sub_group = 10000

# The maximum number of events that can be contained within a single episode.
# If you exceed this limit, the episode breaks and a new episode is created.
# If you increase this setting, recalculate the `max_event_in_parent_group` setting and increase it accordingly.
max_event_in_group = 10000

# The total number of events that can be created by an aggregation policy.
# If you exceed this limit, ITSI clears all events associated with this aggregation policy from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
# This limit is calculated by multiplying `sub_group_limit` * `max_groups_per_sub_group` * `max_event_in_group`
max_event_in_parent_group = 100000000

# An ACK token ensures that an event is being indexed before running an action on it.
# However, events are forwarded to the indexer from the search head, which adds another delay.
# This field (in milliseconds) adds an additional delay before running an action on events or groups.
action_execution_delay = 0

# When fetching events to perform actions on an episode, the amount of time, in seconds, to
# subtract from the earliest_time on the search before executing an action.
# This setting helps prevent grouping inaccuracies when events are milliseconds apart.
earliest_time_lag = 300

# The number of minutes in the past to check for grouping of duplicate events.
# For example, if you change this setting to "10", ITSI looks back 10 minutes prior to the current event. If an
# identical event was added to an episode in the last 10 minutes, the current event is ignored and not grouped.
event_grouping_dedup_period = 0

# The delay, in seconds, to batch update episode state. Otherwise, the KV store is accessed too often.
# It is recommended that you do not set this to a value below 20.
group_state_batch_delay = 28

# The event cache expiration limit.
# After this time passes, in seconds, ITSI begins to remove events from the cache.
event_cache_expiry_time = 180

# The maximum number of events the event cache can contain.
# Once the maximum is reached, the cache is cleared.
event_cache_max_entries = 1000000

# Whether to validate the current state of the Rules Engine upon startup.
validate_rules_engine_state_on_startup = true

# The maximum number of times to restart the Rules Engine if it is not in the correct state upon startup.
max_rt_search_retry_count = 3

# The various types of error messages to check for at the start of a search job.
# The presence of any of these messages could indicate potential problems.
# If any of these messages are present, the Rules Engine stops.
exit_condition_message_pattern = (?i).*?nable to distribute to peer.*|.*?nable to distribute to the peer.*|.*?might have returned partial results.*|.*?earch results might be incomplete.*

# The maximum number of times to try a backfill search if any messages are detected that
# match those in the 'exit_condition_message_pattern' setting. These messages are encountered
# when a peer is unavailable or unreachable, which might cause the Rules Engine to miss events.
backfill_search_retry_count = 3

# The maximum number of times to try any internal search other than a backfill search or a Rules Engine real-time
# search if any messages are detected that match those in the 'exit_condition_message_pattern' setting. These messages
# are encountered when a peer is unavailable or unreachable, which might cause the Rules Engine to miss data.
search_retry_count = 3

# The amount of time to wait, in milliseconds, before retrying a search job.
search_retryperiod_ms = 500

##########
# ITSI Rules Engine - Resilience Manager Configuration
##########

# A Splunk search to get events that the Rules Engine failed to group.
# Use small timeframes when using this search considering the Splunk join command limitation of 50,000 rows.
grouping_missed_events_search = search `itsi_event_management_index_with_close_events` \
    | join type=left event_id [ search `itsi_event_management_group_index` | table event_id, itsi_group_id \
    | rename itsi_group_id as backfill_group_id ] \
    | where isnull(backfill_group_id) \
    | fields _time, _raw, source, sourcetype

# The frequency, in seconds, to remind each policy executor to check time-based policies on their criteria.
# There is a fixed interval between policy check executions to avoid overlap.
# If the execution of the policy check takes longer than the interval,
# the subsequent execution starts after the prior one completes, plus the provided interval.
policy_rules_check_frequency = 60

# The frequency, in seconds, that the Rules Engine syncs in-memory episode state to the KV store.
# Otherwise, the KV store is accessed too often.
# This setting reminds each policy executor to batch update all of an episode's information in the KV store.
# There is a fixed interval between group state sync executions to avoid overlap.
policy_group_state_sync_frequency = 28

# The frequency, in seconds, that the resilience manager reprocesses events that were not grouped.
periodic_backfill_frequency = 720

# The sliding time window, in seconds, used by the resilience manager to reprocess
# events that were not grouped.
periodic_backfill_time_window = 3600

# The time gap to Rules Engine real-time search, in seconds, used by the resilience manager
# to reprocess events that were not grouped.
periodic_backfill_to_realtime_gap = 720

# The number of attempts that EventBackfillActor tries before stopping the periodic backfill search.
# Periodic backfill search job wait time is calculated from the job check limit.
# Periodic backfill search job wait time: f(T) = N(N+1)/2, where N = job check limit.
# If N is 15, the job wait time limit is 15(15+1)/2 = 120 seconds.
periodic_backfill_search_job_check_limit = 15
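Two of the settings above are derived from formulas given in their comments: `max_event_in_parent_group` is described as the product of `sub_group_limit`, `max_groups_per_sub_group`, and `max_event_in_group`, and the periodic backfill job wait time is N(N+1)/2 seconds for a job check limit of N. The sketch below only reproduces that arithmetic so you can recalculate values before editing the file. The function names are hypothetical and are not part of ITSI.

```python
def parent_group_limit(sub_group_limit, max_groups_per_sub_group, max_event_in_group):
    """Product formula stated in the max_event_in_parent_group comment."""
    return sub_group_limit * max_groups_per_sub_group * max_event_in_group


def backfill_job_wait_seconds(job_check_limit):
    """Wait time N(N+1)/2 stated in the
    periodic_backfill_search_job_check_limit comment."""
    n = job_check_limit
    return n * (n + 1) // 2


if __name__ == "__main__":
    # Values taken from the defaults listed in this topic.
    print(parent_group_limit(10000, 10000, 10000))  # 1000000000000
    print(backfill_job_wait_seconds(15))            # 120
```

With the listed defaults, the product formula gives 1000000000000, which is larger than the listed default of 100000000 for `max_event_in_parent_group`. Check which value applies in your deployment before relying on either figure.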
This documentation applies to the following versions of Splunk® IT Service Intelligence: 4.3.0, 4.3.1