Troubleshoot the Rules Engine and event grouping in ITSI

Here are some common issues with notable event grouping in IT Service Intelligence (ITSI) and how to resolve them.

Events aren't being grouped into episodes

Even though notable events are coming into ITSI, aggregation policies aren't grouping them into episodes.

Solution

1. First, go to Configuration > Notable Event Aggregation Policies within ITSI and make sure aggregation policies are created and enabled.

2. Make sure the Rules Engine is running. Go to Activity > Jobs and change the App context to All. Search for itsi_event_grouping and make sure the status says Running. If it's not running, go to Settings > Searches, reports, and alerts and change the app context to All. Search for the itsi_event_grouping search again and check that it's enabled. If not, enable it.

3. Make sure you have the correct Java version. See Java requirements in the Install and Upgrade manual.

4. Make sure the itsi_grouped_alerts index exists and contains data. The index is deployed in the SA-IndexCreation directory.

5. If the Rules Engine is running, run the following search to check for error messages that might indicate what's causing the problem:

index=_internal source=*rules_engine* log_level=ERROR

6. If the previous steps don't reveal the cause, change the log level to debug in $SPLUNK_HOME/etc/apps/SA-ITOA/default/log4j_rules_engine.xml, then finalize the running itsi_event_grouping search so that a new one picks up the updated log level:

<root level="debug">
     <!-- <appender-ref ref="Console" /> To console -->
     <appender-ref ref="RollingFile" /> <!-- And to a rotated file -->
</root>

7. If the problem persists, file a ticket with Splunk Support.
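To confirm that the log level edit in step 6 is in place, you can read the level back from the file. The following is a minimal sketch, assuming the file is well-formed XML with a single root logger element. The sample document and the function name root_log_level are illustrative and not part of ITSI; the real file's surrounding elements may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the <root> element mirrors the excerpt above.
SAMPLE = """<Configuration>
  <Loggers>
    <root level="debug">
      <appender-ref ref="RollingFile" />
    </root>
  </Loggers>
</Configuration>"""


def root_log_level(xml_text):
    """Return the level attribute of the first <root> element, or None."""
    document = ET.fromstring(xml_text)
    for element in document.iter():
        # Compare case-insensitively, since the tag might be written as Root.
        if element.tag.lower() == "root":
            return element.get("level")
    return None


print(root_log_level(SAMPLE))
```

To check the real file, pass the contents of log4j_rules_engine.xml to root_log_level instead of SAMPLE.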

The Rules Engine search is failing

You see the following error message:

Exception while parsing metadata JSON: Unexpected character in string: '\0A'

Solution

Update your Java version to 1.8 or higher. See Java requirements in the Install and Upgrade manual.
Set the JAVA_HOME environment variable to the new Java version.
Restart your Splunk software.
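To check which Java version is installed, you can inspect the text that java -version prints (it writes to standard error). The following is a minimal sketch that parses that text and compares it against 1.8. The function names are illustrative, and the sketch assumes the common quoted version format, such as "1.8.0_281" or "11.0.2".

```python
import re


def parse_java_version(version_output):
    """Extract (major, minor) from the first quoted version string."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    return int(match.group(1)), int(match.group(2) or 0)


def meets_java_8(version_output):
    """True if the reported version is 1.8 or higher."""
    # Legacy scheme: "1.8.0_281" is Java 8. Newer scheme: "11.0.2" is Java 11.
    return parse_java_version(version_output) >= (1, 8)


print(meets_java_8('java version "1.7.0_80"'))
print(meets_java_8('openjdk version "11.0.2" 2019-01-15'))
```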

Events are being grouped by the default aggregation policy instead of a custom policy

Even though notable events are being ingested into ITSI, they aren't being grouped by any of the custom aggregation policies you've created.

Solution

Events are only grouped by the default aggregation policy when they don't match the filtering criteria of any existing policies. Check the filtering criteria of the policies you've created. For more information, see Configure episode filtering and breaking criteria in ITSI.

Events are added to new episodes instead of to existing active episodes

The sub-group limit (sub_group_limit) has been exceeded, so the hash keys and the episodes associated with them are cleared from memory. Any new events with the old hash key create new sub-groups, which causes new episodes to be created.

Solution

Create a local copy of itsi_rules_engine.properties at $SPLUNK_HOME/etc/apps/SA-ITOA/local.
Increase the sub_group_limit setting accordingly.

For more information about sub-groups and other sizing parameters, see Tune episode and aggregation policy sizing parameters in ITSI.

Java process not starting when the Rules Engine search is executed

The Java process isn't starting when the itsirulesengine command is executed and no itsi_rules_engine.log files are generated.

Solution

If you see the following error message:

Invalid message received from external search command during setup, see search.log

verify that /opt/splunk/etc/apps/SA-ITOA/bin/itsirulesengine has execute permissions. Run the following command from /opt/splunk/etc/apps/SA-ITOA/bin to see if you get a permission denied error:

/opt/splunk/etc/apps/SA-ITOA/bin/itsirulesengine -J-Xmx2048M -Dlog4j.configurationFile=../default/log4j_rules_engine.xml -DitsiRulesEngine.configurationFile=../default/itsi_rules_engine.properties -Dfile.encoding=UTF-8 -Dconfig.file=../default/akka_application.conf

If you get a permission denied error, add execute permission to the file.

Make sure no duplicate JAR files are present under /opt/splunk/etc/apps/SA-ITOA/lib/java/event_management/libs and that all the JAR files have execute permission for the Splunk user.
Make sure your Java installation isn't a 32-bit JRE/JDK. For more information, see Java requirements in the Install and Upgrade Manual.
Make sure rtsearch, the real-time search capability, isn't disabled in the [role_admin] stanza in $SPLUNK_HOME/etc/apps/itsi/default/authorize.conf.
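The permission and duplicate JAR checks above can be scripted. The following is a minimal sketch for a POSIX system, run as the Splunk user. The function names are illustrative, and the definition of "duplicate" (the same artifact name with different version suffixes) is an assumption, as is the "<artifact>-<version>.jar" naming pattern.

```python
import os
import re
from collections import defaultdict


def is_executable(path):
    """True if the current user may execute the file at path."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_duplicate_jars(jar_names):
    """Group JAR names that differ only by a trailing version suffix."""
    groups = defaultdict(list)
    for name in jar_names:
        # Assumption: names look like "<artifact>-<version>.jar".
        base = re.sub(r"-\d[\w.\-]*\.jar$", "", name)
        groups[base].append(name)
    return {base: sorted(names) for base, names in groups.items() if len(names) > 1}


# Hypothetical file names, for illustration only.
print(find_duplicate_jars(["example-core-2.8.jar", "example-core-2.11.0.jar", "other-1.0.jar"]))
```

To check a real installation, pass the itsirulesengine path to is_executable and the output of os.listdir on the libs directory to find_duplicate_jars.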

"Connection Refused" error

You see the following error message:

error=Connection refused java.lang.RuntimeException: Connection refused

Solution

This error could mean that splunkd wasn't set up properly. Restart the Rules Engine:

Within Splunk Web, go to Settings > Searches, reports, and alerts.
In the App dropdown, select All.
Use the filter to locate the itsi_event_grouping search.
Click Actions > Disable.
Wait about 10 seconds, then re-enable it.

If the restart doesn't solve the problem, check your network permissions.
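One way to start checking the network is a plain TCP connection test from the search head. The following is a minimal sketch. The function name can_connect is illustrative, and the host and port you test are assumptions: 8089 is the default splunkd management port, but this topic doesn't state which endpoint the Rules Engine connects to.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Assumed target: splunkd management port on the local machine.
print(can_connect("127.0.0.1", 8089))
```

A False result only shows that the port can't be reached from where the script runs. It doesn't identify whether a firewall, a stopped service, or a wrong address is the cause.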

Notable event KV store collections are growing very large

This issue occurs because the indexed realtime search returns events over and over from buckets that use tsidx reduction.

Solution

Disable tsidx reduction on the itsi_tracked_alerts and itsi_summary indexes and rebuild all old buckets on these indexes. For more information, see Reduce tsidx disk usage in the Managing Indexers and Clusters of Indexers manual.

Upon upgrade, the Rules Engine search command fails

You see the following error message:

Error occurred during initialization of VM

Solution

This issue occurs because 32-bit Java can't run the Rules Engine with the new memory settings introduced in version 4.3.x.

Open or create a local copy of commands.conf at $SPLUNK_HOME/etc/apps/SA-ITOA/local.

Add the following stanza:

[itsirulesengine]
 command.arg.1=-J-Xmx1024M
 # reduced to 1024MB for 32 bit JDK/JRE

Restart the Rules Engine, either by disabling and reenabling the itsi_event_grouping search, or by restarting your Splunk software.
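To decide whether this override applies to you, you can check whether the installed JVM is 32-bit. The following is a heuristic sketch, assuming a HotSpot or OpenJDK build, whose 64-bit variants print "64-Bit" in the output of java -version. The function names and sample output are illustrative; other JVM vendors may format this text differently.

```python
def looks_64_bit(version_output):
    """Heuristic: 64-bit HotSpot and OpenJDK builds print "64-Bit" in the VM line."""
    return "64-bit" in version_output.lower()


def suggested_heap_argument(version_output):
    """Return the reduced heap argument from the stanza above for a 32-bit JVM."""
    if looks_64_bit(version_output):
        return None  # No override needed.
    return "-J-Xmx1024M"


# Illustrative output of java -version for a 32-bit JVM.
SAMPLE_32_BIT = (
    'java version "1.8.0_281"\n'
    "Java(TM) SE Runtime Environment (build 1.8.0_281-b09)\n"
    "Java HotSpot(TM) Client VM (build 25.281-b09, mixed mode)\n"
)
print(suggested_heap_argument(SAMPLE_32_BIT))
```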

ITSI is constantly using 100% CPU because multiple java.exe processes are running

This issue occurs because ITSI Event Analytics is incompatible with Splunk Enterprise versions 7.2.4 - 7.2.10.

Solution

Perform the workaround in SPL-155648.

Notable Event Aggregation Policy filter doesn't work with special characters

Rules Engine errors occur if the filter criteria contain special characters like (, ), /, \, or *.

Solution

Include a \ character before each of the special characters.
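If you generate filter values programmatically, the escaping can be done in one pass. The following is a minimal sketch. The function name is illustrative, and the character set covers only the five characters listed above, which this topic presents as examples rather than a complete list.

```python
# The characters named in this topic; the full set might be larger.
SPECIAL_CHARACTERS = set("()/\\*")


def escape_filter_value(value):
    """Prefix each special character in value with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in value)


print(escape_filter_value("/var/log (prod)*"))
```

Apply the function once to the raw value. Running it again on already escaped text doubles the backslashes.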

Some search peers aren't contacted when realtime search runs

The realtime search itsi_event_grouping misses some events. If there is a delay in grouping, backfilling, or missing events, it could be because of this issue.

Solution

Compare the number of active peers against the number of hosts identified using the steps below. If the number of hosts decreases over time, then it's likely that not all peers are being contacted. This data is correlated in time against the volume of skipped events.

Run the following search to determine the number of peers participating in the itsi_event_grouping search over time:
index="_introspection" sourcetype=splunk_resource_usage data.search_props.label=itsi_event_grouping (host=<idx1> OR host=<idx2>...) | timechart dc(host) useother=false
Determine how many indexers are active by selecting Settings then Indexer Cluster. If the count is more than the count of indexers from the results of the above search, proceed to the next step.
Perform a rolling restart on the stack, or restart the itsi_event_grouping search on the ITSI search head. After the search has been terminated, the scheduler will quickly restart it.
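The comparison in these steps can be automated once you export the timechart results. The following is a minimal sketch, assuming each row is a (time, distinct host count) pair. The function name and the sample rows are hypothetical.

```python
def buckets_missing_peers(timechart_rows, active_indexers):
    """Return the (time, host_count) rows where fewer hosts took part than are active."""
    return [(time, count) for time, count in timechart_rows if count < active_indexers]


# Hypothetical export of the timechart dc(host) results.
rows = [("10:00", 4), ("10:05", 4), ("10:10", 3), ("10:15", 2)]
print(buckets_missing_peers(rows, active_indexers=4))
```

Any rows returned mark time ranges where some peers likely weren't contacted, which you can line up against the times of skipped events.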

There are indexing delays

Indexing delays can have multiple causes, and there are different methods to check for them. If there is a delay in grouping, backfilling, or missing events, it could be because of this issue.

Solution

Step 1: Detect indexing delays

You can detect index delays using one of the following methods:

Method 1: Check difference between _indextime and _time of the event

Run the following search to check indexing delays:
index=<indexname> | eval delay=_indextime-_time | stats min(delay) avg(delay) max(delay)

For the itsi_tracked_alerts index, it's not possible to check indexing delays unless is_use_event_time is enabled for itsi_event_generator in SA-ITOA/local/alert_actions.conf. If it's enabled, the _time and org_time fields of the events retain the org_time of the event; otherwise, these fields are replaced by the _indextime.

If is_use_event_time is enabled, decide what _indextime should be based on how often the correlation search runs, and when the event was originally indexed.
Check for delays between _indextime and _time. A delay longer than 30 seconds indicates a problem.
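The same calculation can be run on exported events. The following is a minimal sketch, assuming each event is a dictionary with numeric _time and _indextime values in epoch seconds. The function name and the sample events are hypothetical; the 30-second threshold comes from the step above.

```python
from statistics import mean


def delay_stats(events, threshold_seconds=30):
    """Summarize _indextime - _time and count delays above the threshold."""
    delays = [event["_indextime"] - event["_time"] for event in events]
    return {
        "min": min(delays),
        "avg": mean(delays),
        "max": max(delays),
        "over_threshold": sum(delay > threshold_seconds for delay in delays),
    }


# Hypothetical events with epoch-second timestamps.
sample_events = [
    {"_time": 100, "_indextime": 105},
    {"_time": 200, "_indextime": 245},
    {"_time": 300, "_indextime": 310},
]
print(delay_stats(sample_events))
```

In this sample, one of the three events was indexed more than 30 seconds after its event time.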

Method 2: Check CPU raw data write usage

You can look at the Splunk Monitoring Console to check the CPU raw data write usage. If CPU raw data write usage is higher than the time Splunk spends processing events, then there are indexing delays.

Select Indexing in Monitoring Console.
Select Indexing Performance > Indexing Performance: Instance.
Check the Cumulative CPU seconds spent per indexer processor activity panel on the dashboard, which breaks down the utilization metrics for raw data write and the indexing service on an indexer.

Step 2: Tune the index time-realtime offset

By default, indexed realtime searches use a 60-second offset. You can tune this offset for specific searches, but a larger offset adds latency to the realtime results. Adjust the offset until you have fewer backfilled events:

On the ITSI search head, select Settings > Searches, reports, and alerts and open the itsi_event_grouping search in the SA-ITOA app.
Select the Advanced edit setting. Update the value of the dispatch.indexedRealtimeOffset setting to a value greater than 60 seconds.
Disable, then re-enable the saved search itsi_event_grouping to force the search to restart.
Use the Event Analytics Monitoring dashboard to compare the new settings to the previous settings using the Backfilled events percentage panel. For more information, see Event Analytics Monitoring dashboard.

You might have to try several values, such as increasing from 60 to 90 and then to 120, until the number of backfilled events drops.
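Because a larger offset adds latency, it helps to record the backfilled events percentage you observe at each offset and keep the smallest offset that meets your goal. The following is a minimal sketch of that bookkeeping. The function name, the target percentage, and the sample observations are all hypothetical; this topic doesn't prescribe a target.

```python
def smallest_acceptable_offset(observations, target_percentage):
    """Return the smallest offset whose backfilled percentage meets the target.

    observations maps an offset in seconds to the backfilled events
    percentage observed with that offset. Returns None if none qualifies.
    """
    acceptable = [
        offset
        for offset, percentage in observations.items()
        if percentage <= target_percentage
    ]
    return min(acceptable) if acceptable else None


# Hypothetical readings from the Backfilled events percentage panel.
observed = {60: 12.0, 90: 4.5, 120: 0.8, 150: 0.7}
print(smallest_acceptable_offset(observed, target_percentage=1.0))
```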

Search peers won't connect and cause a delay in grouping events

If there is a delay in grouping, backfilling, or missing events, it could be because of this issue.

Solution

Use the following search to identify whether there are issues with the search peers and whether the results match the timeline of missing or delayed events:

index="_internal" source="*splunkd*" "unable to distribute to the peer" OR "might have returned partial results" OR "search results might be incomplete" OR "search results may be incomplete" OR "unable to distribute to peer" OR "search process did not exit cleanly" OR "streamed search execute failed" OR "failed to create a bundles setup with server"

If issues continue, contact Splunk Support.
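If you only have a copy of splunkd.log rather than search access, you can scan it for the same phrases. The following is a minimal sketch. The function name and the sample log lines are hypothetical; the phrases are the ones from the search above, matched case-insensitively.

```python
PEER_ERROR_PHRASES = (
    "unable to distribute to the peer",
    "might have returned partial results",
    "search results might be incomplete",
    "search results may be incomplete",
    "unable to distribute to peer",
    "search process did not exit cleanly",
    "streamed search execute failed",
    "failed to create a bundles setup with server",
)


def peer_error_lines(lines):
    """Return the lines that contain any of the peer error phrases."""
    return [
        line
        for line in lines
        if any(phrase in line.lower() for phrase in PEER_ERROR_PHRASES)
    ]


# Hypothetical log lines, for illustration only.
sample_log = [
    "WARN example: Unable to distribute to peer named idx1",
    "INFO example: search completed",
]
print(peer_error_lines(sample_log))
```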

Rules Engine periodic backfill search stops working due to auto-scaling

In a Splunk Cloud Platform deployment, the number of indexers can automatically scale due to higher ingest and search workloads. During this auto-scaling event, indexers are unavailable and the backfill search can't run on the indexers. When this occurs, you will see the following warning message:

The Rules Engine Periodic Backfill search has exhausted retry limits due to peer unavailability. The search will miss all remaining data from the peer. Please check the health of the Indexers.

Solution

This is only a warning that goes away once the auto-scaling event completes. Wait for the event to complete and then check on the health of the indexers.

Alert action fails and spath command truncates required fields

If an action such as Create ServiceNow incident fails, the event might be too large, in which case the spath command truncates the required fields and causes alert actions to fail.

Solution

In $SPLUNK_HOME/etc/apps/itsi/local/limits.conf add the following lines:

[spath]
extraction_cutoff = <character_limit>
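To choose a value for the limit, it can help to measure how long your largest events are. The following is a minimal sketch, assuming you have exported the raw text of some affected events. The function names and sample data are hypothetical, and length is measured in characters to match the placeholder above.

```python
def longest_event_length(raw_events):
    """Return the character length of the longest raw event, or 0 if none."""
    return max((len(event) for event in raw_events), default=0)


def exceeds_cutoff(raw_events, extraction_cutoff):
    """True if at least one event is longer than the configured cutoff."""
    return longest_event_length(raw_events) > extraction_cutoff


# Hypothetical raw events of 4,000 and 12,000 characters.
sample_raw_events = ["a" * 4000, "b" * 12000]
print(longest_event_length(sample_raw_events))
```

If exceeds_cutoff returns True for your current setting, the limit needs to be at least as large as the longest event for spath to see every field.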

Delay in episode action execution

When the itsi_notable_event_actions_queue holds a large number of actions, episode actions are executed with a delay.

Solution

Once the itsi_notable_event_actions_queue reaches 10k actions or more, increase the queue consumer count by navigating to Settings > Data inputs > IT Service Intelligence Actions Queue Consumer.

To manually increase the queue consumer counts in a search head cluster environment, make a local copy of the inputs.conf file at $SPLUNK_HOME/etc/apps/SA-ITOA/local/inputs.conf. Edit the [itsi_notable_event_actions_queue_consumer://particular consumer] stanza and set the disabled flag to 0 for the particular consumer.

Once the itsi_notable_event_actions_queue has exceeded 100k actions, check if there are issues with the queue consumers or any particular actions, and check if any unnecessary actions are being added to the queue. If there are unnecessary actions, clear the itsi_notable_event_actions_queue KV store. For more information, see Clear notable events.

You can also create a new queue consumer when your current consumers are handling too many episode actions by navigating to Settings then Data inputs then IT Service Intelligence Actions Queue Consumer, and selecting New. Fill in the fields to create a new consumer.
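The thresholds in this section can be expressed as a simple decision rule, for example in a monitoring script. The following is a minimal sketch. The function name and the returned wording are illustrative; only the 10k and 100k thresholds come from the text above.

```python
def queue_recommendation(queue_size):
    """Map the number of queued actions to the guidance in this section."""
    if queue_size > 100_000:
        return "check consumers and actions; clear unnecessary actions"
    if queue_size >= 10_000:
        return "increase the queue consumer count"
    return "no action needed"


for size in (2_500, 10_000, 150_000):
    print(size, queue_recommendation(size))
```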
