Health Monitor status code reference
Use the status codes provided by Splunk UBA in the Health Monitor dashboard to help troubleshoot issues in your deployment. See Monitor the health of your Splunk UBA deployment for information about the dashboard.
Health Monitor indicators do not stop any processes from running. For example, if a rule's execution time exceeds a threshold and a warning is generated, the rule's execution is not interrupted.
Analytics Writer Service (ANL_WRT)
Splunk UBA displays several analytics writer (ANL_WRT) status codes to help you troubleshoot any issues with the analytics writer services.
ANL_WRT-1-AnalyticsTopic
Status | WARN |
Error message | Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> |
What is happening | Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
ANL_WRT-3-AnalyticsTopic
Status | WARN |
Error message | Instance has not consumed events for <x mins/hrs>. |
What is happening | The instance has not consumed any events in the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls (see the sketch after this entry). You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
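The following Python sketch shows, in illustrative form only, how a check like ANL_WRT-3 can be expressed: the topic's timeLag is taken as min(latestRecordTimes) over the partitions that still have eventLag > 0, and a value that stays constant for warnPollCount consecutive polls indicates a stalled consumer. The data structures and function names are hypothetical, not Splunk UBA's implementation.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PartitionStats:
        latest_record_time: int  # epoch millis of the newest record consumed from the partition
        event_lag: int           # events on the partition that have not been consumed yet

    def topic_time_lag(partitions: List[PartitionStats]) -> Optional[int]:
        """timeLag as described above: min(latestRecordTimes) across partitions with eventLag > 0."""
        lagging = [p.latest_record_time for p in partitions if p.event_lag > 0]
        return min(lagging) if lagging else None  # None means the consumer is fully caught up

    def is_stalled(time_lag_history: List[Optional[int]], warn_poll_count: int) -> bool:
        """WARN when the time lag has stayed constant for warn_poll_count consecutive polls."""
        recent = time_lag_history[-warn_poll_count:]
        return (len(recent) == warn_poll_count
                and recent[0] is not None
                and len(set(recent)) == 1)

    # Example: the same timeLag value was observed on the last three polls.
    print(is_stalled([1700000000000, 1700000000000, 1700000000000], warn_poll_count=3))  # True

The same logic applies to the other consumer-lag indicators in this reference (SML-3, ETL-3, and the IR-3 codes), which differ only in the topic being monitored.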
ANL_WRT-1-IRTopic
Status | WARN |
Error message | Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> |
What is happening | Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
ANL_WRT-3-IRTopic
Status | WARN |
Error message | Instance has not consumed events for <x mins/hrs>. |
What is happening | The instance has not consumed any events in the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
Data Sources (DS)
Splunk UBA displays several data source (DS) status codes to help you troubleshoot any issues with the data source services.
DS-1
Status | ERROR |
Error message | EPS of one/more Data Sources was 0 for over a day. |
What is happening | Zero events per second were processed for one or more low-frequency data sources for more than 5 days, or for one or more high-frequency data sources for more than 1 day. Each data source is categorized as low-frequency or high-frequency during a training period of 10 days, during which it is polled once per hour (240 polls). A sketch of this check follows this entry. You can customize the training period and the thresholds for when this error is triggered. |
What to do | Restart data source(s) in BAD state, making sure events are sent to Splunk UBA. Contact Splunk Support if the status does not return to OK. |
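The Python sketch below illustrates the DS-1 logic described above, under stated assumptions: the 240-poll training window and the 1-day and 5-day zero-EPS limits come from this entry, while the rule used to decide low versus high frequency (here, reporting events in at least half of the training polls) is a placeholder, not Splunk UBA's actual criterion.

    from typing import List

    TRAINING_POLLS = 240          # 10-day training period, polled once per hour
    HIGH_FREQ_ZERO_LIMIT_H = 24   # DS-1: high-frequency sources, zero EPS for more than 1 day
    LOW_FREQ_ZERO_LIMIT_H = 120   # DS-1: low-frequency sources, zero EPS for more than 5 days

    def categorize(training_eps: List[float]) -> str:
        """Placeholder rule: high-frequency if events arrived in at least half of the training polls."""
        window = training_eps[:TRAINING_POLLS]
        active_polls = sum(1 for eps in window if eps > 0)
        return "high" if active_polls >= len(window) / 2 else "low"

    def ds1_error(category: str, hours_at_zero_eps: int) -> bool:
        limit = HIGH_FREQ_ZERO_LIMIT_H if category == "high" else LOW_FREQ_ZERO_LIMIT_H
        return hours_at_zero_eps > limit

    # Example: a source that was active in most training polls, then silent for 30 hours.
    print(ds1_error(categorize([10.0] * 200 + [0.0] * 40), hours_at_zero_eps=30))  # True

DS-2 applies the same categorization with warning limits of 6 hours for high-frequency sources and 3 days for low-frequency sources.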
DS-2
Status | WARN |
Error message | EPS of one/more Data Sources was 0 for over 6 hours. |
What is happening | Zero events per second were processed for one or more low-frequency data sources for more than 3 days, or for one or more high-frequency data sources for more than 6 hours. Each data source is categorized as low-frequency or high-frequency during a training period of 10 days, during which it is polled once per hour (240 polls). You can customize the training period and the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 18 hours. If the status does not go back to OK or keeps fluctuating, restart data source(s) in WARN state. Check that events are sent to Splunk UBA. |
DS-3
Status | ERROR |
Error message | Number of skipped events is more than the configured threshold. |
What is happening | The number of skipped events (for example, Unknown, EpochTooLow, EpochTooHigh, EpochDefault, EventBlacklisted, EventFiltered, EventHasNoEntities, IpAddressOutOfRange, AllUsersUnknown, AD3DigitGroupEventCode, PANIgnoredSesssionEvent, DuplicateEvent, EventNotRelevant) exceeds the configured threshold. A sketch of this check follows this entry. You can customize the threshold for when this error is triggered. |
What to do | In Splunk UBA, go to Manage > Data Sources. Under each data source, determine why the events are skipped. Contact Splunk Support if the reason is neither UI event filters nor HR data. |
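As a rough illustration of the DS-3 check, the sketch below tallies skipped events by reason and compares the total against a configured threshold. The reason names come from this entry; the threshold value and function shape are assumptions, not Splunk UBA code.

    from collections import Counter
    from typing import Iterable, Tuple

    SKIP_REASONS = {
        "Unknown", "EpochTooLow", "EpochTooHigh", "EpochDefault",
        "EventBlacklisted", "EventFiltered", "EventHasNoEntities",
        "IpAddressOutOfRange", "AllUsersUnknown", "AD3DigitGroupEventCode",
        "PANIgnoredSesssionEvent", "DuplicateEvent", "EventNotRelevant",
    }

    def skipped_summary(reasons: Iterable[str], threshold: int) -> Tuple[Counter, bool]:
        """Count skipped events per reason and report whether the total exceeds the threshold."""
        counts = Counter(r for r in reasons if r in SKIP_REASONS)
        return counts, sum(counts.values()) > threshold

    # Example: one poll of a data source, with a hypothetical threshold of 1000 skipped events.
    counts, over_threshold = skipped_summary(
        ["EventFiltered"] * 700 + ["DuplicateEvent"] * 400, threshold=1000)
    print(over_threshold)  # True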
DS-5
Status | ERROR |
Error message | No events were processed for one or more data formats. |
What is happening | For each data format, the average frequency of events stored in Impala per poll is calculated over a default training period of 10 polls. You can customize this number of polls using the ubaMonitor.DS.eventsPerFormat.trainingPeriod property. A sketch of this check follows this entry. |
What to do | Verify that the Splunk platform is ingesting data. If data is being ingested:
If data is not being ingested by the Splunk platform, fix data on-boarding on the Splunk platform. |
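The sketch below illustrates one way the DS-5 baseline could work: an average events-per-poll figure is built per data format over the training window (10 polls by default, configurable with ubaMonitor.DS.eventsPerFormat.trainingPeriod), and a format is flagged when it previously produced events but the latest poll stored none. The exact trigger rule Splunk UBA uses is not documented here, so treat this as an assumption.

    from typing import Dict, List

    TRAINING_PERIOD = 10  # polls; configurable with ubaMonitor.DS.eventsPerFormat.trainingPeriod

    def flag_silent_formats(history: Dict[str, List[int]]) -> List[str]:
        """history maps a data format to the number of events stored in Impala per poll, oldest first."""
        flagged = []
        for data_format, counts in history.items():
            if len(counts) <= TRAINING_PERIOD:
                continue  # still in the training period
            baseline_avg = sum(counts[:TRAINING_PERIOD]) / TRAINING_PERIOD
            if baseline_avg > 0 and counts[-1] == 0:
                flagged.append(data_format)
        return flagged

    # Example: the firewall format stops producing events after its training period.
    print(flag_silent_formats({"firewall": [50] * 10 + [0], "dns": [20] * 11}))  # ['firewall']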
DS-6
Status | WARN |
Error message | HR data ingestion has warnings. |
What is happening | Possible causes: |
What to do | Verify that the HR data source is configured properly and is ingesting events by going to the Data Sources page in Splunk UBA. See Validate HR data configuration before adding other data sources. |
DS-7
Status | BAD |
Error message | HR data ingestion has errors. |
What is happening | The amount of time since the latest execution of any of the jobs is more than ubaMonitor.DS.HR.poll.bad.threshold. The default threshold is 72 hours. A sketch of this check follows this entry. |
What to do | Restart the HR data source. If the error persists, contact Splunk Support. |
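A minimal sketch of the DS-7 staleness check, using the documented ubaMonitor.DS.HR.poll.bad.threshold default of 72 hours; the job names and timestamps are placeholders. DS-9 applies the same check to asset ingestion jobs with ubaMonitor.DS.asset.poll.bad.threshold.

    from datetime import datetime, timedelta
    from typing import Dict

    BAD_THRESHOLD = timedelta(hours=72)  # ubaMonitor.DS.HR.poll.bad.threshold default

    def hr_ingestion_is_bad(last_runs: Dict[str, datetime], now: datetime) -> bool:
        """BAD when the time since the latest execution of any job exceeds the threshold."""
        return any(now - last_run > BAD_THRESHOLD for last_run in last_runs.values())

    # Example: an HR import job that last ran 96 hours ago.
    print(hr_ingestion_is_bad({"hr-import": datetime(2023, 1, 6, 12, 0)},
                              now=datetime(2023, 1, 10, 12, 0)))  # True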
DS-8
Status | WARN |
Error message | Asset data ingestion has warnings. |
What is happening | Possible causes: |
What to do | Verify that the assets data source is configured properly and is ingesting events by going to the Data Sources page in Splunk UBA. |
DS-9
Status | BAD |
Error message | Asset data ingestion has errors. |
What is happening | The amount of time since the latest execution of any of the jobs is more than ubaMonitor.DS.asset.poll.bad.threshold. The default threshold is 72 hours. |
What to do | Restart the assets data source. If the error persists, contact Splunk Support. |
ENUM_MISMATCH_WARN
Status | WARN |
Error message | enum mismatch beyond warn threshold. |
What is happening | The ratio of bad events over total events is between 0.1 and 0.2. A sketch of this check, covering both the WARN and BAD thresholds, follows this entry. |
What to do | Stop the affected data source and make sure Splunk UBA is able to understand enum fields. Take one of two actions:
For more information, see Monitor the quality of data sent from the Splunk platform in Get Data into Splunk User Behavior Analytics. |
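The sketch below bands a data source's enum-mismatch ratio into OK, WARN, and BAD using the 0.1 and 0.2 boundaries described for ENUM_MISMATCH_WARN and ENUM_MISMATCH_BAD. How Splunk UBA treats the exact boundary values is an assumption here.

    def enum_mismatch_status(bad_events: int, total_events: int) -> str:
        """OK/WARN/BAD based on the ratio of enum-mismatched events to total events."""
        if total_events == 0:
            return "OK"
        ratio = bad_events / total_events
        if ratio > 0.2:
            return "BAD"      # ENUM_MISMATCH_BAD: ratio greater than 0.2
        if ratio >= 0.1:
            return "WARN"     # ENUM_MISMATCH_WARN: ratio between 0.1 and 0.2
        return "OK"

    print(enum_mismatch_status(150, 1000))  # WARN (ratio 0.15)
    print(enum_mismatch_status(300, 1000))  # BAD  (ratio 0.30)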
ENUM_MISMATCH_BAD
Status | BAD |
Error message | enum mismatch beyond error threshold. |
What is happening | The ratio of bad events over total events is greater than 0.2. |
What to do | Stop the affected data source and make sure Splunk UBA is able to understand enum fields. Take one of two actions:
For more information, see Monitor the quality of data sent from the Splunk platform in Get Data into Splunk User Behavior Analytics. |
DS_LAGGING_WARN
Status | WARN |
Error message | Data source micro-batch search lag is over the threshold. |
What is happening | Data source services monitor the Splunk data source ingestion search lag, including Kafka data ingestion. The lag is defined as the duration between the search submission time and the search's latest time. If the lag exceeds 3600 seconds, a warning is displayed (see the sketch after this entry). See Configure Kafka data ingestion in the Splunk UBA Kafka Ingestion App manual. You can configure this threshold by adding or editing the corresponding property. |
What to do | Stop the affected data source and try to split it into multiple sources to keep each data source's EPS small. |
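The following sketch expresses the DS_LAGGING_WARN definition above: the lag is the gap between the search submission time and the search's latest time, compared against the 3600-second warning threshold. The function and timestamps are illustrative only.

    from datetime import datetime

    WARN_THRESHOLD_SECONDS = 3600

    def search_lag_seconds(submitted_at: datetime, search_latest_time: datetime) -> float:
        """Lag as defined above: duration between search submission time and the search's latest time."""
        return (submitted_at - search_latest_time).total_seconds()

    # Example: a micro-batch search submitted at 12:00 whose latest time is 10:30.
    lag = search_lag_seconds(datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 1, 10, 30))
    print(lag, lag > WARN_THRESHOLD_SECONDS)  # 5400.0 True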
DS_SEARCH_STATUS_BAD
Status | BAD |
Error message | Data source micro-batch search has returned an error. |
What is happening | Data source services monitor the Splunk data source ingestion search status, including Kafka data ingestion. This indicator tracks whether the search issued by the data source is healthy. An alert is triggered when the search returns an error more than ubaMonitor.DS.search.highFreq.bad.pollCount times in a row. The default is 3 times in a row. A sketch of this check follows this entry. You can configure this threshold by adding or editing this property. |
What to do | Stop the affected data source and check on the Splunk platform's job inspector for more debug information. |
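A small sketch of the DS_SEARCH_STATUS_BAD condition: count consecutive polls in which the micro-batch search returned an error and alert once the streak passes ubaMonitor.DS.search.highFreq.bad.pollCount (default 3). The "more than ... times in a row" boundary is taken literally from the wording above and might differ in the product.

    from typing import List

    BAD_POLL_COUNT = 3  # ubaMonitor.DS.search.highFreq.bad.pollCount default

    def search_status_is_bad(poll_errors: List[bool], bad_poll_count: int = BAD_POLL_COUNT) -> bool:
        """poll_errors holds one entry per poll: True if the micro-batch search returned an error."""
        streak = 0
        for errored in poll_errors:
            streak = streak + 1 if errored else 0
            if streak > bad_poll_count:  # literal reading of "more than ... times in a row"
                return True
        return False

    print(search_status_is_bad([False, True, True, True, True]))  # True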
Kafka Broker (KAFKA)
Splunk UBA displays one Kafka status code to help you troubleshoot any issues with the Kafka broker.
KAFKA-1
Status | ERROR |
Error message | Kafka topics are not receiving events. |
What is happening | There might be an issue in kafka-server that prevents events from being written to their topics. This error is triggered when data is being ingested but the indicator item values are null (because of exceptions from KafkaJMXClient) for badPollCount number of polls. Data is considered to be ingested when the difference between eventsIngested by the connectors at the first poll of the window (the last pollCount polls) and at the latest poll is greater than eventsIngestedDiffThreshold. A sketch of this check follows this entry. You can customize the thresholds for when this error is triggered. |
What to do | Monitor for an hour, and if the status does not return to OK, restart kafka-server. If Kafka topics still do not receive events, contact Splunk Support. |
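The sketch below models the KAFKA-1 trigger as described: the indicator values (from KafkaJMXClient) are null for badPollCount polls while the connectors' eventsIngested counter keeps growing by more than eventsIngestedDiffThreshold over the same window. All names are placeholders, not Splunk UBA code.

    from typing import List, Optional

    def kafka_topics_in_error(indicator_values: List[Optional[int]],
                              events_ingested: List[int],
                              bad_poll_count: int,
                              events_ingested_diff_threshold: int) -> bool:
        """ERROR when indicator values are null for bad_poll_count polls while ingestion continues."""
        window = indicator_values[-bad_poll_count:]
        all_null = len(window) == bad_poll_count and all(v is None for v in window)
        ingesting = (events_ingested[-1] - events_ingested[-bad_poll_count]
                     > events_ingested_diff_threshold)
        return all_null and ingesting

    # Example: three polls with null indicator values while connectors keep ingesting events.
    print(kafka_topics_in_error([None, None, None], [1000, 5000, 9000],
                                bad_poll_count=3, events_ingested_diff_threshold=100))  # True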
Offline Rule Executor (ORE)
Splunk UBA displays several offline rule executor (ORE) status codes to help you troubleshoot any issues with the offline rule executor services.
ORE-1
Status | WARN |
Error message | One or more rules have failed to run. |
What is happening | One or more custom threats, anomaly action rules, or anomaly rules failed to run in the current session. You can customize the thresholds for when this warning is triggered. |
What to do | Monitor the number of failed runs compared to the total number of runs for the rule over the following days. If the number of failed runs increases or remains steady, contact Splunk Support. There is no need to contact Splunk Support if the number of failed runs decreases. |
ORE-2
Status | ERROR |
Error message | One or more rules consistently failed to run. |
What is happening | One or more custom threats, anomaly action rules, or anomaly rules have consistently failed to run in the current session. This error is triggered when more than 20 percent of the executions of at least one rule have failed in the current session. You can customize the thresholds for when this error is triggered. |
What to do | Contact Splunk Support. |
ORE-3
Status | WARN |
Error message | Average execution time per rule |
What is happening | At least one custom threat, anomaly action rule, or anomaly rule has exceeded the warning threshold. |
What to do | If the rules in WARN status are executing without failures, edit the ubaMonitor.ore.ruleDuration.warn.duration property to increase the time threshold before a warning is generated. The default is 30 minutes. After setting the property:
Click on the link in the Average Execution Time per Rule on the Data Quality Indicators page to check the rule execution times. |
ORE-4
Status | WARN |
Error message | Threat revalidation is slower than normal. |
What is happening | During the current Realtime Rule Executor session, revalidating threats based on changes to anomalies, such as new scores, deleted anomalies, or suppressed anomalies, is taking longer than a specified time threshold. You can customize the thresholds for when this warning is triggered. |
What to do | Contact Splunk Support. |
Realtime Rule Executor (RRE)
Splunk UBA displays several realtime rule executor (RRE) status codes to help you troubleshoot any issues with the realtime rule executor services.
RRE-1
Status | WARN |
Error message | New anomalies are being processed slowly and the processing speed is slowing down. |
What is happening | The average events per second for processing new anomalies is slowing. A high number of suppressed anomalies affects the processing of new anomalies. This warning is triggered when the 10-minute average EPS keeps dropping (or increases by less than 0.2) for the last pollCount polls and is below the configurable WARN threshold in all of those polls (see the sketch after this entry). If no threshold is specified, this status is disabled. You can customize the thresholds for when this warning is triggered. |
What to do | Restart the Realtime Rule Executor service. If the warning continues to appear, check the number of suppressed anomalies on the Data Quality Metrics page on the Health Monitor dashboard. If the number of suppressed anomalies is in the millions, delete some of the suppressed anomalies. See Delete anomalies in Splunk UBA. |
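The following sketch illustrates the RRE-1/RRE-2 trend test: the 10-minute average EPS either drops or rises by less than 0.2 between consecutive polls over the last pollCount polls, while staying below the configured threshold; with no threshold configured the check is disabled. The window handling here is an assumption.

    from typing import List, Optional

    def anomaly_eps_degraded(eps_history: List[float],
                             poll_count: int,
                             warn_threshold: Optional[float]) -> bool:
        """eps_history holds the 10-minute average EPS observed at each poll, oldest first."""
        if warn_threshold is None or len(eps_history) < poll_count + 1:
            return False  # status disabled, or not enough polls yet
        window = eps_history[-(poll_count + 1):]
        stagnant = all(b - a < 0.2 for a, b in zip(window, window[1:]))  # dropping, or rising by < 0.2
        below = all(eps < warn_threshold for eps in window[1:])          # below threshold in all polls
        return stagnant and below

    print(anomaly_eps_degraded([5.0, 4.0, 3.9, 3.9], poll_count=3, warn_threshold=4.5))  # True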
RRE-2
Status | ERROR |
Error message | New anomalies are no longer being processed. |
What is happening | The processing speed of new anomalies is low enough that no anomalies were processed in the last 10-minute period. This error is triggered when the 10-minute average EPS keeps dropping (or increases by less than 0.2) for the last pollCount polls and is below the configurable BAD threshold in all of those polls. If no threshold is specified, this status is disabled. You can customize the thresholds for when this error is triggered. |
What to do | Restart the Realtime Rule Executor service.
If the error continues to appear, check the number of suppressed anomalies on the Data Quality Metrics page on the Health Monitor dashboard. If the number of suppressed anomalies is in the millions, delete some of the suppressed anomalies. See Delete anomalies in Splunk UBA. |
RRE-3
Status | WARN |
Error message | Minor anomaly loss has been detected. |
What is happening | Kafka has started to drop events from NewAnomalyTopic. This warning is triggered when the percentage of missed events is between from% (inclusive) and to% (not inclusive) for the last pollCount polls (see the sketch after this entry). You can customize the thresholds for when this warning is triggered. |
What to do | Restart the Realtime Rule Executor. The indicator's status resets and is recalculated in the next hour. If the status returns to WARN, contact Splunk Support. |
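The sketch below bands the percentage of events dropped from a Kafka topic into OK, WARN, and ERROR, matching the from%/to%/limit% description used by RRE-3 and RRE-4 and by the similar OCS-5, OCS-6, OCS-8, and OCS-9 indicators. In the product the condition must also hold for the last pollCount polls; that repetition is omitted here for brevity.

    def dropped_event_status(missed_pct: float,
                             from_pct: float, to_pct: float, limit_pct: float) -> str:
        """Band the percentage of missed events into OK, WARN (RRE-3), or ERROR (RRE-4)."""
        if missed_pct >= limit_pct:          # limit% is inclusive
            return "ERROR"
        if from_pct <= missed_pct < to_pct:  # from% inclusive, to% exclusive
            return "WARN"
        return "OK"

    print(dropped_event_status(3.0, from_pct=1.0, to_pct=5.0, limit_pct=5.0))  # WARN
    print(dropped_event_status(7.5, from_pct=1.0, to_pct=5.0, limit_pct=5.0))  # ERROR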
RRE-4
Status | ERROR |
Error message | A significant percentage of anomalies is being dropped by Kafka. |
What is happening | Kafka is dropping a significant number of events from NewAnomalyTopic. This error is triggered when the percentage of missed events is at or above limit% for the last pollCount polls. You can customize the thresholds for when this error is triggered. |
What to do | Restart the Realtime Rule Executor. The indicator's status resets and is recalculated in the next hour. If the ERROR status returns, contact Splunk Support. |
PostgreSQL (PSQL)
Splunk UBA displays one PostgreSQL (PSQL) status code to help you troubleshoot any issues with the PostgreSQL service.
PSQL-1
Status | WARN |
Error message | The number of suppressed anomalies is too high. |
What is happening | A high volume of suppressed anomalies has led to slow system performance. This warning is triggered when the number of suppressed anomalies in PostgreSQL surpasses a configurable threshold. If no threshold is specified, this status is disabled. You can customize the thresholds for when this warning is triggered. |
What to do | Reduce the number of suppressed anomalies to improve performance. Delete anomalies from the anomalies trash. See Delete anomalies in Splunk UBA. |
Offline Models (OML)
Splunk UBA displays several offline model (OML) status codes to help you troubleshoot any issues with the offline model services.
OML-1
Status | WARN |
Error message | One or more models have not executed for (x) hours. |
What is happening | One or more models have not executed successfully in the past 2 days. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 48 hours. If the OK status does not return during that time, contact Splunk Support. |
OML-2
Status | ERROR |
Error message | One or more models have not executed for the past 3 days. |
What is happening | One or more models have not executed successfully in the specified interval of time. You can customize the thresholds for when this error is triggered. |
What to do | Contact Splunk Support. |
Output Connector Service (OCS)
Splunk UBA displays several output connector service (OCS) status codes to help you troubleshoot any issues with the output connector services.
OCS-4
Status | ERROR |
Error message | Percentage of email failures is more than the configured threshold. |
What is happening | The percentage of email failures exceeds the configured threshold. You can customize the thresholds for when this error is triggered. |
What to do | In Splunk UBA, go to Manage > Output Connectors and verify that the Email Connector is correctly configured. |
OCS-5
Status | WARN |
Error message | Minor anomaly loss has been detected. |
What is happening | Kafka is dropping events from AnomalyTopic. This warning is triggered when the percentage of missed anomalies is between from% (inclusive) and to% (not inclusive) for the last pollCount polls. You can customize the thresholds for when this warning is triggered. |
What to do | Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the status returns to WARN, contact Splunk Support. |
OCS-6
Status | ERROR |
Error message | A significant percentage of anomalies is being dropped from Kafka. |
What is happening | Kafka is dropping a significant percentage of events from AnomalyTopic. This error is triggered when the percentage of missed anomalies is at or above limit% for the last pollCount polls. You can customize the thresholds for when this error is triggered. |
What to do | Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the ERROR status returns, contact Splunk Support. |
OCS-8
Status | WARN |
Error message | Minor event loss has been detected. |
What is happening | Kafka is dropping events from OutputConnectorTopic. This warning is triggered when the percentage of missed events is between from% (inclusive) and to% (not inclusive) for the last pollCount polls. You can customize the thresholds for when this warning is triggered. |
What to do | Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the status returns to WARN, contact Splunk Support. |
OCS-9
Status | ERROR |
Error message | A significant percentage of events are dropping from Kafka. |
What is happening | Kafka is dropping a significant percentage of events from OutputConnectorTopic. This error is triggered when the percentage of missed events is at or above limit% for the last pollCount polls. You can customize the thresholds for when this error is triggered. |
What to do | Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the ERROR status returns, contact Splunk Support. |
OCS-11
Status | ERROR |
Error message | The latest threat feed has been halted. |
What is happening | Splunk UBA has not been able to send threats to Splunk Enterprise Security for the past hour. This error is triggered when the same batch of threats has failed to be sent to Splunk ES for the last badPollCount polls. You can customize the thresholds for when this error is triggered. |
What to do | Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the status returns to ERROR, contact Splunk Support. |
Streaming Models (SML)
Splunk UBA displays several streaming model (SML) status codes to help you troubleshoot any issues with the streaming model services.
SML-1
Status | WARN |
Error message | Time lag of the consumer exceeded x% of topic's retention(<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon. |
What is happening | Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when either of two conditions is met, including the consumer's time lag exceeding a configured percentage of the topic's retention period. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
SML-3
Status | WARN |
Error message | Instance has not consumed events for <x mins/hrs>. |
What is happening | The instance has not consumed any events in the specified interval of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
Threat Computation Task (TC)
Splunk UBA displays several threat computation (TC) status codes to help you troubleshoot any issues with the threat computation task.
TC-1
Status | WARN |
Error message | Threat computation is taking more than a certain number of minutes to complete. |
What is happening | The last time that the threat computation process ran, it took more than duration minutes to compute threats in Splunk UBA. High system load can contribute to a longer threat computation time. You can customize the thresholds for when this warning is triggered. |
What to do | Monitor this indicator for the next day or two. If it continues to appear, contact Splunk Support. |
TC-2
Status | WARN |
Error message | Graph computation and threat calculation is taking more than a certain number of minutes to complete. |
What is happening | Graph computation is slowing the process of threat computation. High system load can contribute to a longer graph computation time. This warning is triggered when the graph computation part of the latest invocation of the task took more than graphDuration minutes to complete. You can customize the thresholds for when this warning is triggered. |
What to do | Monitor this indicator for the next day or two. If it continues to appear, contact Splunk Support. |
TC-3
Status | WARN |
Error message | The most recent threat computation failed. |
What is happening | The most recent process to compute new threats failed to run. |
What to do | The task runs hourly. If threat computation fails regularly or often, contact Splunk Support. |
UBA ETL Service (ETL)
Splunk UBA displays several ETL status codes to help you troubleshoot any issues with the ETL services.
ETL-1-RawDataTopic
Status | WARN |
Error message | Indicator is in WARN state for one of two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> |
What is happening | Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when timeLag > warnThreshold % of the topic's retention for a specific number of polls (pollcount). You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
ETL-3-RawDataTopic
Status | WARN |
Error message | Instance has not consumed events for <x mins/hrs>. |
What is happening | The instance has not consumed any events during the specified period of time. This warning is triggered when the timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warn.pollcount number of polls. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
UBA Identity Service (IR)
Splunk UBA displays several identity resolution (IR) status codes to help you troubleshoot any issues with the IR services.
IR-1-PreIREventTopic
Status | WARN |
Error message | Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> |
What is happening | Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when timeLag > warnThreshold % of the topic's retention for warn.pollcount number of polls. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
IR-3-PreIREventTopic
Status | WARN |
Error message | Instance has not consumed events for <x mins/hrs>. |
What is happening | The instance has not consumed any events during the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
IR-1-IRTopic
Status | WARN |
Error message | Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> |
What is happening | Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when timeLag > warnThreshold % of the topic's retention for warnPollCount number of polls. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |
IR-3-IRTopic
Status | WARN |
Error message | Instance has not consumed events for <x mins/hrs>. |
What is happening | The instance has not consumed any events during the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for badPollCount number of polls. You can customize the thresholds for when this warning is triggered. |
What to do | Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support. |