Rules Engine properties reference in ITSI
The IT Service Intelligence (ITSI) Rules Engine is a system for continuously processing notable events. It uses the aggregation policies you configure to group notable events into episodes. It's also the container for action rules that automate episode actions, such as sending an email or pinging a host. For more information about the Rules Engine, see About the ITSI Rules Engine.
Tune notable event grouping settings
ITSI provides a file called itsi_rules_engine.properties, located at $SPLUNK_HOME/etc/apps/SA-ITOA/default/, where you can tune and customize notable event grouping settings.
- Only users with file system access, such as system administrators, can tune notable event grouping using a configuration file.
- Review the steps in How to edit a configuration file in the Admin Manual.
Never change or copy the configuration files in the default directory. The files in the default directory must remain intact and in their original location.
- Open or create a local
- Modify the following settings as necessary to improve notable event grouping on your deployment.
# The period, in seconds, at which to fetch aggregation policies from the KV store.
policy_fetch_period = 45

# The group index name.
index_name = itsi_grouped_alerts

# The HTTP token name.
token_name = itsi_group_alerts_token

# The HTTP sync token name.
# NOTE: If the sync token name and the HTTP token name are the same, a token
# with async functionality is created.
sync_token_name = itsi_group_alerts_token

# The timeout value for receiving an acknowledgement from HEC.
# When processing a notable event and the action criteria are met, this setting
# ensures that the current event is indexed before executing an action.
http_ack_time_out = 10

# The default source.
default_source = itsi_group_alerts

# The default sourcetype.
default_sourcetype = itsi_notable:group

# The number of split-by hash keys that can exist for a single aggregation policy that splits events
# by field(s). Split-by hash keys are the possible combinations of values from individual split-by fields.
# For example, if an aggregation policy is split by 'host' and 'severity', it creates separate hash keys
# for the host-severity combinations of host1 and severity high, host1 and severity low, host2
# and severity low, etc. If episodes are created for 1000000 different host-severity combinations (the default), the limit is reached.
# If you exceed this limit, the oldest hash key and the episodes associated with it are cleared from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
sub_group_limit = 1000000

# The offset used to calculate when to alert that the sub-group limit is approaching the default value of 1000000.
# The Rules Engine creates a message in the "Messages" dropdown in Splunk when the number of sub-groups is greater than or
# equal to the value of the 'sub_group_limit' setting minus the value of the 'subgroup_alert_limit_offset' setting.
# For example, if the sub-group limit is 1000000 and the offset is 500, the Rules Engine sends an alert when the number
# of sub-groups is greater than or equal to 999500 (1000000 - 500).
subgroup_alert_limit_offset = 500

# The number of episodes that can be created for each split-by hash key for an
# aggregation policy that splits events by field(s).
# If you exceed this limit, the episodes associated with the hash key are cleared from memory. The episodes
# are still saved in the KV store, and events are stored in itsi_tracked_alerts and itsi_grouped_alerts indexes.
max_groups_per_sub_group = 50

# The maximum number of events that can be contained within a single episode.
# If you exceed this limit, the episode breaks and a new episode is created.
max_event_in_group = 10000

# An ACK token ensures that an event is being indexed before running an action on it.
# However, events are forwarded to the indexer from the search head, which adds another delay.
# This field (in milliseconds) adds an additional delay before running an action on events or groups.
action_execution_delay = 0

# The number of batch actions pushed to the KV store in one iteration.
# By default, 5000 actions are pushed to the KV store every 'action_batch_flush_period' (100 milliseconds).
action_batch_size = 5000

# The overall connection limit for an HTTP connection pool used to flush actions to the KV store.
# By default, a pool of two consumer threads consumes actions from the queue and pushes them to the KV store.
action_batch_consumer_count = 2

# The time period, in milliseconds, when batch actions are asynchronously flushed to the KV store.
# By default, 5000 actions are pushed to the KV store every batch flush period (100 milliseconds).
action_batch_flush_period = 100

# The number of events sent in a single batch to be indexed in the itsi_grouped_alerts index in one iteration.
# By default, 10000 events are pushed using HEC to the itsi_grouped_alerts index every batch flush period (100 milliseconds).
hec_batch_size = 10000

# The overall connection limit for an HTTP connection pool used to flush events to the itsi_grouped_alerts index using HEC.
# By default, a pool of five consumer threads consumes events from the queue and pushes them to the itsi_grouped_alerts index.
hec_batch_consumer_count = 5

# The time period, in milliseconds, when batch events are asynchronously flushed to the index using HEC.
# By default, 10000 events are pushed to the itsi_grouped_alerts index every 'hec_batch_flush_period' (100 milliseconds).
hec_batch_flush_period = 100

# The frequency, in milliseconds, to remind each policy executor to check time-based policies on their criteria.
# There is a fixed interval between policy check executions to avoid overlap.
# If the execution of the policy check takes longer than the interval,
# the subsequent execution starts after the prior one completes, plus the provided interval.
policy_rules_check_frequency_delay = 60000

# The frequency, in milliseconds, that the Rules Engine syncs in-memory episode state to the KV store.
# Without this interval, the KV store is accessed too often.
# This setting reminds each policy executor to batch update all of an episode's information in the KV store.
# There is a fixed interval between group state sync executions to avoid overlap.
kvstore_update_frequency_delay = 10000

# The frequency, in milliseconds, to remove groups and sub-groups from memory.
# Refreshing removes closed groups from sub-groups and removes sub-groups that have all their groups closed.
# Increase this setting to avoid hitting the 'sub_group_limit' and 'max_groups_per_sub_group' limits.
# Default: 10800000 milliseconds (3 hours)
refresh_subgroups_frequency_delay = 10800000

# When fetching events to perform actions on an episode, the amount of time, in seconds, to
# subtract from the earliest_time on the search before executing an action.
# This setting helps prevent grouping inaccuracies when events are milliseconds apart.
earliest_time_lag = 300

# The number of minutes in the past to check for grouping of duplicate events.
# For example, if you change this setting to "10", ITSI looks back 10 minutes prior to the current event. If an
# identical event was added to an episode in the last 10 minutes, the current event is ignored and not grouped.
event_grouping_dedup_period = 0

# The delay, in seconds, to batch update episode state. Without this delay, the KV store is accessed too often.
# It is recommended that you do not set this to a value below 20.
group_state_batch_delay = 28

# The event cache expiration limit.
# After this time passes, in seconds, ITSI begins to remove processed events from the cache.
event_cache_expiry_time = 300

# The maximum number of events the event cache can contain.
# Once the maximum is reached, the cache is cleared.
event_cache_max_entries = 10000000

# Whether to validate the current state of the Rules Engine upon startup.
validate_rules_engine_state_on_startup = true

# The maximum number of times to restart the Rules Engine if it is not in the correct state upon startup.
max_rt_search_retry_count = 3

# The various types of error messages to check for at the start of a search job.
# The presence of any of these messages could indicate potential problems.
# If any of these messages are present, the Rules Engine stops.
exit_condition_messages_contain = unable to distribute to the peer,might have returned partial results,search results might be incomplete,search results may be incomplete,unable to distribute to peer,search process did not exit cleanly,streamed search execute failed,failed to create a bundles setup with server

# Indicates whether the search job can provide partial results if a search peer fails.
# When set to "false", the search job fails if a search peer that's providing results for the search job fails.
allow_partial_results = false

# The time range, in hours, to look back when running the backfill search upon startup to restore active
# groups into memory. Tune this setting to the amount of time that episodes in your environment tend to
# remain open without receiving new events.
# By default the Rules Engine looks back 2160 hours (90 days).
group_restore_lookback_time = 2160

# The maximum number of times to try a backfill search if any messages are detected that
# match those in the 'exit_condition_messages_contain' setting. These messages are encountered
# when a peer is unavailable or unreachable, which might cause the Rules Engine to miss events.
backfill_search_retry_count = 3

# The maximum number of times to try any internal search other than a backfill search or a Rules Engine real-time
# search if any messages are detected that match those in the 'exit_condition_messages_contain' setting. These messages
# are encountered when a peer is unavailable or unreachable, which might cause the Rules Engine to miss data.
search_retry_count = 3

# The amount of time to wait, in milliseconds, before retrying a search job.
search_retryperiod_ms = 500

# The maximum number of retries for the rolling restart status check.
# The Rules Engine keeps retrying indefinitely when the value is set to '0'.
max_cluster_rolling_restart_retry_count=0

##########
# ITSI Rules Engine - Resilience Manager Configuration
##########

# The frequency, in seconds, that the resilience manager reprocesses events that were not grouped.
periodic_backfill_frequency = 720

# The sliding time window, in seconds, used by the resilience manager to reprocess
# events that were not grouped.
periodic_backfill_time_window = 3600

# The time gap to Rules Engine real-time search, in seconds, used by the resilience manager
# to reprocess events that were not grouped.
periodic_backfill_to_realtime_gap = 720

# The number of attempts that EventBackfillActor tries before stopping the periodic backfill search.
# Periodic backfill search job wait time is calculated from the job check limit.
# Periodic backfill search job wait time: f(T) = N(N+1)/2, where N = job check limit.
# If N is 15, the job wait time limit is 15(15+1)/2 = 120 seconds.
periodic_backfill_search_job_check_limit = 15

# The number of entries per page when paginating Rules Engine searches.
internal_search_page_size = 10000
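Two of the values described in the settings are derived by simple arithmetic: the sub-group alert threshold ('sub_group_limit' minus 'subgroup_alert_limit_offset') and the periodic backfill job wait-time limit (N(N+1)/2). The following Python sketch reproduces both calculations from a small excerpt of the file so that you can check the effect of a change before you apply it. It is an illustration only and is not part of ITSI. The helper names parse_properties, subgroup_alert_threshold, and backfill_job_wait_seconds are hypothetical, and the parser assumes plain 'key = value' lines with '#' comments.

```python
# Illustrative sketch only. The helper names below are hypothetical and are
# not part of ITSI. The sample values are the defaults listed above.

SAMPLE = """
# Excerpt of itsi_rules_engine.properties
sub_group_limit = 1000000
subgroup_alert_limit_offset = 500
periodic_backfill_search_job_check_limit = 15
max_cluster_rolling_restart_retry_count=0
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            settings[key.strip()] = value.strip()
    return settings


def subgroup_alert_threshold(settings):
    """Return sub_group_limit minus subgroup_alert_limit_offset."""
    limit = int(settings["sub_group_limit"])
    offset = int(settings["subgroup_alert_limit_offset"])
    return limit - offset


def backfill_job_wait_seconds(settings):
    """Return N(N+1)/2, where N is periodic_backfill_search_job_check_limit."""
    n = int(settings["periodic_backfill_search_job_check_limit"])
    return n * (n + 1) // 2


settings = parse_properties(SAMPLE)
print(subgroup_alert_threshold(settings))   # 999500
print(backfill_job_wait_seconds(settings))  # 120
```

With the default values, the sketch prints 999500 and 120, which match the worked examples in the setting descriptions. To test a different value, edit the excerpt in SAMPLE and run the script again.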
This documentation applies to the following versions of Splunk® IT Service Intelligence: 4.8.0 Cloud only, 4.8.1 Cloud only, 4.9.0, 4.9.1, 4.9.2, 4.9.3, 4.9.4, 4.9.5, 4.9.6, 4.10.0 Cloud only, 4.10.1 Cloud only, 4.10.2 Cloud only, 4.10.3 Cloud only, 4.10.4 Cloud only, 4.11.0, 4.11.1, 4.11.2, 4.11.3, 4.11.4, 4.11.5, 4.11.6, 4.12.0 Cloud only, 4.12.1 Cloud only, 4.13.0, 4.13.1, 4.14.0 Cloud only