Health Monitor status code reference

Use the status codes provided by Splunk UBA in the Health Monitor dashboard to help troubleshoot issues in your deployment. For more information about the dashboard, see Monitor the health of your Splunk UBA deployment .

Health monitor indicators do not stop any processes from running. For example, if a rule's execution time has exceeded a threshold and a warning is generated, its execution is not interrupted.

Analytics Writer Service (ANL_WRT)

Splunk UBA displays several analytics writer (ANL_WRT) status codes to help you troubleshoot any issues with the analytics writer services.

ANL_WRT-1-AnalyticsTopic

Status	WARN
Error message	Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> 2. TimeLag of one/more instances is x% of topic's retention <topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.
What is happening	Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: warnThreshold: `ubaMonitor.ANL_WRT.AnalyticsTopic.timeLag.warn.threshold` (default is 80 percent) warnPollCount: `ubaMonitor.ANL_WRT.AnalyticsTopic.timeLag.warn.pollcount` (default is 6 polls) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ANL_WRT-3-AnalyticsTopic

Status	WARN
Error message	Instance has not consumed events for <x mins/hrs>.
What is happening	The instance has not consumed any events in the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.ANL_WRT.AnalyticsTopic.instance.timeLag.warn.pollcount` property to customize the threshold for when a warning is triggered. The default is 12 polls. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ANL_WRT-1-IRTopic

Status	WARN
Error message	Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> 2. TimeLag of one/more instances is x% of topic's retention <topicRetention>) for <x mins/hrs>. Consumer may start missing events soon
What is happening	Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: warnThreshold: `ubaMonitor.ANL_WRT.IRTopic.timeLag.warn.threshold` (default is 80 percent) warnPollCount: `ubaMonitor.ANL_WRT.IRTopic.timeLag.warn.pollcount` (default is 6 polls) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ANL_WRT-3-IRTopic

Status	WARN
Error message	Instance has not consumed events for <x mins/hrs>.
What is happening	The instance has not consumed any events in the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.ANL_WRT.IRTopic.instance.timeLag.warn.pollcount` property to customize the threshold for when a warning is triggered. The default is 12 polls. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

Data Sources (DS)

Splunk UBA displays several data source (DS) status codes to help you troubleshoot any issues with the data source services.

DS-1

Status	ERROR
Error message	EPS of one/more Data Sources was 0 for over a day.
What is happening	Zero events per second were processed for one or more low-frequency data sources for more than 5 days, or high-frequency data sources for more than 1 day. Each data source is categorized as low-frequency or high-frequency during a training period of 10 days, where it is polled once per hour (240 polls). Configure this threshold using the `ubaMonitor.DS.datasourceEPS.training.pollCount` property. If the EPS is below dsEpsFreqThreshold consistently during the training period, the data source is categorized as a low-frequency data source. If not, the data source is categorized as a high-frequency data source. After the training period, if the EPS of a high-freguency data source is 0 for a specific number of polls (highFreqBadPollCount), this error is generated. After the training period, if the EPS of low-frequency data source is 0 for a specific number of polls (lowFreqBadPollCount), this error is generated. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: dsEpsFreqThreshold: `ubaMonitor.DS.datasourceEPS.freq.threshold` (default is 50) highFreqBadPollCount: `ubaMonitor.DS.datasourceEPS.highFreq.bad.pollCount` (default is 24 polls, or 1 day) lowFreqBadPollCount: `ubaMonitor.DS.datasourceEPS.lowFreq.bad.pollCount` (default is 120 polls, or 5 days) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Restart data source(s) in BAD state, making sure events are sent to Splunk UBA. Contact Splunk Support if the status does not return to OK.

DS-2

Status	WARN
Error message	EPS of one/more Data Sources was 0 for over 6 hours.
What is happening	Zero events per second were processed for one or more low-frequency data sources for more than 3 days or high-frequency data sources for more than 6 hours. Each data source is categorized as low-frequency or high-frequency during a training period of 10 days, where it is polled once per hour (240 polls). Configure this threshold using the `ubaMonitor.DS.datasourceEPS.training.pollCount` property. If the EPS is below dsEpsFreqThreshold consistently during the training period, the data source is categorized as a low-frequency data source. If not, the data source is categorized as a high-frequency data source. After the training period, if the EPS of a high-freguency data source is 0 for a specific number of polls (highFreqWarnPollCount), generate this error. After the training period, if the EPS of low-frequency data source is 0 for a specific number of polls (lowFreqWarnPollCount), generate this error. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: dsEpsFreqThreshold: `ubaMonitor.DS.datasourceEPS.freq.threshold` (default is 50) highFreqWarnPollCount: `ubaMonitor.DS.datasourceEPS.highFreq.warn.pollCount` (default is 6 polls, or 6 hours) lowFreqWarnPollCount: `ubaMonitor.DS.datasourceEPS.lowFreq.warn.pollCount` (default is 72 polls, or 3 days) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 18 hours. If the status does not go back to OK or keeps fluctuating, restart data source(s) in WARN state. Check that events are sent to Splunk UBA.

DS-3

Status	ERROR
Error message	Number of skipped events is more than the configured threshold.
What is happening	The number of skipped events (for example, Unknown, EpochTooLow, EpochTooHigh, EpochDefault, EventFiltered, EventHasNoEntities, IpAddressOutOfRange, AllUsersUnknown, AD3DigitGroupEventCode, PANIgnoredSesssionEvent, DuplicateEvent, EventNotRelevant) exceeds the configured threshold. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.DS.skippedEvents.bad.threshold` property to customize the threshold for which an error is triggered. The default configured threshold is 75. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	In Splunk UBA, go to Manage > Data Sources. Under each data source, determine why the events are skipped. Contact Splunk Support if the reason is neither UI event filters nor HR data.

DS-5

Status	ERROR
Error message	No events were processed for one or more data formats.
What is happening	For each data format, the average frequency of events stored in Impala per poll is calculated in the default training period of 10 polls. You can customize this number of polls using the `ubaMonitor.DS.eventsPerFormat.trainingPeriod` property. If the frequency is above 1000, the data format is categorized as high frequency. Otherwise, it is categorized as low frequency. You can customize this threshold using the `ubaMonitor.DS.eventsPerFormat.freq.threshold` property. After the training period, if number of events for any high-frequency data format is 0, generate this error. After the training period, if number of events for any low frequency data format is 0 for `ubaMonitor.DS.eventsPerFormat.lowFreq.NOK` number of consecutive polls, generate this error. The default is 7.
What to do	Verify that the Splunk platform is ingesting data. If data is being ingested: Restart the data source for that format. Contact Splunk Support if no events are being processed for the data format after data source restart. If data is not being ingested by the Splunk platform, fix data on-boarding on the Splunk platform.

DS-6

Status	WARN
Error message	HR data ingestion has warnings.
What is happening	Possible causes: The HR data source is not defined. Ingestion of HR data has not been executed for more than the configured `ubaMonitor.DS.HR.poll.warn.threshold`. The default threshold is 48 hours. The last HR data job took longer than the configured `ubaMonitor.DS.HR.poll.max.duration`. The default threshold is 3 hours. The percentage of events that failed to be parsed by any of the jobs is equal or greater than `ubaMonitor.DS.HR.poll.max.failed.perc`. The default threshold is 0.05, or 5 percent.
What to do	Verify that the HR data source is configured properly and is ingesting events by going to the Data Sources page in Splunk UBA. See Validate HR data configuration before adding other data sources. If events are being properly ingested, restart the HR data source. If the warning persists, contact Splunk Support. If events are not being ingested, fix the HR data source ingestion on the Splunk platform.

DS-7

Status	BAD
Error message	HR data ingestion has errors.
What is happening	The amount of time since the latest execution of any of the jobs is more than `ubaMonitor.DS.HR.poll.bad.threshold`. The default threshold is 72 hours.
What to do	Restart the HR data source. If the warning persists, contact Splunk Support.

DS-8

Status	WARN
Error message	Asset data ingestion has warnings.
What is happening	Possible causes: Ingestion of assets data has not been executed for more than the configured `ubaMonitor.asset.HR.poll.warn.threshold`. The default threshold is 48 hours. The last asset data job took longer than the configured `ubaMonitor.DS.asset.poll.max.duration`. The default threshold is 3 hours. The percentage of events that failed to be parsed by any of the jobs is equal or greater than `ubaMonitor.DS.asset.poll.max.failed.perc`. The default threshold is 0.05, or 5 percent.
What to do	Verify that the assets data source is configured properly and is ingesting events by visiting Data Sources page in Splunk UBA. If events are being properly ingested, restart the assets data source. If the warning persists, contact Splunk Support. If events are not being ingested, fix the assets data source ingestion on the Splunk platform.

DS-9

Status	BAD
Error message	Asset data ingestion has errors.
What is happening	The amount of time since the latest execution of any of the jobs is more than `ubaMonitor.DS.asset.poll.bad.threshold`. The default threshold is 72 hours.
What to do	Restart the assets data source. If the warning persists, contact Splunk Support.

ENUM_MISMATCH_WARN

Status	WARN
Error message	enum mismatch beyond warn threshold.
What is happening	The ratio of bad events over total events is between 0.1 and 0.2.
What to do	Stop the affected data source and make sure Splunk UBA is able to understand enum fields. Take one of two actions: Modify the SPL to make sure values in enum fields match what is expected in `normalize.rules` file. Update `normalize.rules` to enable Splunk UBA to understand incoming data. For more information, see Monitor the quality of data sent from the Splunk platform in Get Data into Splunk User Behavior Analytics.

ENUM_MISMATCH_BAD

Status	BAD
Error message	enum mismatch beyond error threshold.
What is happening	The ratio of bad events over total events is greater than 0.2.
What to do	On the primary node, copy `/opt/caspida/conf/normalize.rules` to `/etc/caspida/local/conf`. Make modifications to this copy and not to the original file. Stop the affected data source and make sure Splunk UBA is able to understand enum fields. Take one of two actions: Modify SPL to make sure values in enum fields match what is expected in `normalize.rules` file. Update `normalize.rules` to enable Splunk UBA to understand incoming data. After making your changes and saving them, run the following: sync /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf/ For more information, see Monitor the quality of data sent from the Splunk platform in Get Data into Splunk User Behavior Analytics'

DS_LAGGING_WARN

Status	WARN
Error message	Data source micro-batch search lag is over the threshold.
What is happening	Data source services monitor the Splunk data source ingestion search lag, including Kafka data ingestion. The lag is defined as the duration between search submission time and the search's latest time. If lag is beyond 3600 seconds, warning message is displayed. See Configure Kafka data ingestion in the Splunk UBA Kafka Ingestion App manual. Configure this threshold by adding or editing the `splunk.kafka.ingestion.search.max.lag.seconds` property.
What to do	Stop the affected data source and try to split it into multiple sources to keep each data source's EPS small.

DS_SEARCH_STATUS_BAD

Status	BAD
Error message	Data source micro-batch search has returned an error.
What is happening	Data source services monitor the Splunk data source ingestion search status, including Kafka data ingestion. This indicator tracks if the search issued by the data source is healthy. An alert is triggered when the number of times the search returns any error is more than `ubaMonitor.DS.search.highFreq.bad.pollCount` times in a row. The default is 3 times in a row. Configure this threshold by adding or editing the `ubaMonitor.DS.search.highFreq.bad.pollCount` property.
What to do	Stop the affected data source and check on the Splunk platform's job inspector for more debug information.

Kafka Broker (KAFKA)

Splunk UBA displays one Kafka status code to help you troubleshoot any issues with the Kafka broker.

KAFKA-1

Status	ERROR
Error message	Kafka topics are not receiving events.
What is happening	There may be an issue in `kafka-server` resulting in events not writing to their topics. This error is triggered when data is being ingested but the indicator item values are null (when there are exceptions from KafkaJMXClient) for badPollCount number of polls. To check if data is being ingested, check if the difference between eventsIngested by connectors during first poll required for status change(last pollCount #) and the latest poll is greater than eventsIngestedDiffThreshold. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: badPollCount: `ubaMonitor.kafka.bytesIn.bad.pollCount` (default is 3 polls) eventsIngestedDiffThreshold: `ubaMonitor.kafka.bytesIn.eventsIngested.diff.threshold` (default is 100) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Monitor for an hour, and if the status does not return to OK, restart `kafka-server`. If Kafka topics continue to not receive events, contact Splunk Support.

Offline Rule Executor (ORE)

Splunk UBA displays several offline rule executor (ORE) status codes to help you troubleshoot any issues with the offline rule executor services.

ORE-1

Status	WARN
Error message	One or more rules have failed to run.
What is happening	One or more custom threats, anomaly action rules, or anomaly rules failed to run in the current session. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.ore.rulesFailures.warn.ruleCount` property to to set the number of rules that must fail to run before a warning is generated. The default is 1. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Monitor the number of failed runs compared to the total number of runs for the rule over the following days. If the number of failed runs increases or remains steady, contact Splunk Support. There is no need to contact Splunk Support if the number of failed runs decreases.

ORE-2

Status	ERROR
Error message	One or more rules consistently failed to run.
What is happening	One or more custom threats, anomaly action rules, or anomaly rules have consistently failed to run in the current session. This error is triggered when more than 20 percent of the executions of at least 1 rule have failed in the current session. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.ore.rulesFailures.bad.rulePerc` property to customize the threshold for when an error is triggered. The default is 20 percent. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-offlineruleexec restart
What to do	Contact Splunk Support.

ORE-3

Status	WARN
Error message	Average execution time per rule
What is happening	At least one custom threat, anomaly action rule, or anomaly rule has exceeded the warning threshold.
What to do	If the rules in WARN status are executing without failures, edit the `ubaMonitor.ore.ruleDuration.warn.duration` property to increase the time threshold before a warning is generated. The default is 30 minutes. After setting the property: Sync the cluster. /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf/ Restart the offline rule executor. sudo service caspida-offlineruleexec restart Click on the link in the Average Execution Time per Rule on the Data Quality Indicators page to check the rule execution times.

ORE-4

Status	WARN
Error message	Threat revalidation is slower than normal.
What is happening	During the current Realtime Rule Executor session, it is taking longer than a specified threshold of time to revalidate threats based on changes to anomalies, such as new scores, deleted anomalies, or suppressed anomalies. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.rre.threatReval.warn.duration` property to customize the threshold for when a warning is triggered. The default is 20 minutes. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-offlineruleexec restart
What to do	Contact Splunk Support.

Realtime Rule Executor (RRE)

Splunk UBA displays several realtime rule executor (RRE) status codes to help you troubleshoot any issues with the realtime rule executor services.

RRE-1

Status	WARN
Error message	New anomalies are being processed slowly and the processing speed is slowing down.
What is happening	The average events per second for processing new anomalies is slowing. High numbers of suppressed anomalies affects processing new anomalies. This warning is triggered when the 10 minutes average EPS keeps dropping (or increasing by less than 0.2) for the last number of (pollCount) polls and in all polls it is below the WARN configurable threshold. If no threshold is specified, this status is disabled. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: threshold: `ubaMonitor.RRE.movingAvgNewAnomEPS.warn` (initialized with 1) pollCount: `ubaMonitor.RRE.movingAvgNewAnomEPS.warn.polls` (default is 12) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-realtimeruleexec restart
What to do	Restart the Realtime Rule Executor service. On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed. Log in to that server as the caspida user using SSH. Run the following commands: sudo service caspida-realtimeruleexec stop sudo service caspida-realtimeruleexec start If the error continues to appear, check the number of suppressed anomalies on the Data Quality Metrics page on the Health Monitor dashboard. If the number of suppressed anomalies is in the millions, delete some of the suppressed anomalies. See Delete anomalies in Splunk UBA.

RRE-2

Status	ERROR
Error message	New anomalies are no longer being processed.
What is happening	The processing speed of new anomalies is low enough that no anomalies were processed in the last 10 minute period. This error is triggered when the 10 minutes average EPS keeps dropping (or increasing by less than 0.2) for the last number of (pollCount) polls and in all polls it is below the BAD configurable threshold. If no threshold is specified, this status is disabled. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: threshold: `ubaMonitor.RRE.movingAvgNewAnomEPS.bad` (initialized with 0.1) pollCount: `ubaMonitor.RRE.movingAvgNewAnomEPS.bad.polls` (default is 12) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-realtimeruleexec restart
What to do	Restart the Realtime Rule Executor service. On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed. Log in to that server as the caspida user using SSH. Run the following commands: sudo service caspida-realtimeruleexec stop sudo service caspida-realtimeruleexec start If the error continues to appear, check the number of suppressed anomalies on the Data Quality Metrics page on the Health Monitor dashboard. If the number of suppressed anomalies is in the millions, delete some of the suppressed anomalies. See Delete anomalies in Splunk UBA.

RRE-3

Status	WARN
Error message	Minor anomaly loss has been detected.
What is happening	Kafka has started to drop events from NewAnomalyTopic. This warning is triggered when the percentage of missed events is between from% (inclusive) and to% (not inclusive) for the last number of (pollCount) polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: from: `ubaMonitor.RRE.missed.warn.threshold` (default is 1) to: `ubaMonitor.RRE.missed.bad.threshold` (default is 5) pollCount: `ubaMonitor.RRE.missed.warn.polls` (default is 4) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-realtimeruleexec restart
What to do	Restart the Realtime Rule Executor. The indicator's status will reset and its status recalculated in the next hour. If the status returns to WARN, contact Splunk Support. To restart the Realtime Rule Executor, perform the following steps: On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed. Log in to that server as the caspida user using SSH. Run the following commands: sudo service caspida-realtimeruleexec stop sudo service caspida-realtimeruleexec start

RRE-4

Status	ERROR
Error message	A significant percentage of anomalies is being dropped by Kafka.
What is happening	Kafka is dropping a significant number of events from NewAnomalyTopic. This error is triggered when the percentage of missed events exceeds the limit% (inclusive) for the last number of (pollCount) polls. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: limit: `ubaMonitor.RRE.missed.bad.threshold` (default is 5) pollCount: `ubaMonitor.RRE.missed.bad.polls` (default is 4) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-realtimeruleexec restart
What to do	Restart the Realtime Rule Executor. The indicator's status will reset and its status recalculated in the next hour. If the status returns to WARN, contact Splunk Support. To restart the Realtime Rule Executor, perform the following steps: On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed. Log in to that server as the caspida user using SSH. Run the following commands: sudo service caspida-realtimeruleexec stop sudo service caspida-realtimeruleexec start

PostgreSQL (PSQL)

Splunk UBA displays one PostgreSQL (PSQL) status code to help you troubleshoot any issues with the PostgreSQL service.

PSQL-1

Status	WARN
Error message	The number of suppressed anomalies is too high.
What is happening	A high volume of suppressed anomalies has led to slow system performance. This warning is triggered when the number of suppressed anomalies in PostgreSQL surpassed a configurable threshold. If no threshold is specified, this status is disabled. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.postgres.supprAnom.warn` property to customize the threshold for when a warning is triggered. The default is 5 million. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Reduce the number of suppressed anomalies to improve performance. Delete anomalies from the anomalies trash. See Delete anomalies in Splunk UBA.

Offline Models (OML)

Splunk UBA displays several offline model (OML) status codes to help you troubleshoot any issues with the offline model services.

OML-1

Status	WARN
Error message	One or more models have not executed for (x) hours.
What is happening	One or more models have not executed successfully in the past 2 days. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: pollTime: `ubaMonitor.OML.exec.time.poll.hour` (default is 22, or 10:00PM) warn pollCount: `ubaMonitor.OML.exec.time.warn.poll.count` (default is 1 poll) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 48 hours. If the OK status does not return during that time, contact Splunk Support.

OML-2

Status	ERROR
Error message	One or more models have not executed for the past 3 days.
What is happening	One or more models have not executed successfully in the specified interval of time. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: pollTime: `ubaMonitor.OML.exec.time.poll.hour` (default is 22, or 10:00PM) bad pollCount: `ubaMonitor.OML.exec.time.bad.poll.count` (default is 3 polls) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Contact Splunk Support.

Output Connector Service (OCS)

Splunk UBA displays several output connector service (OCS) status codes to help you troubleshoot any issues with the output connector services.

OCS-4

Status	ERROR
Error message	Percentage of email failures is more than the configured threshold.
What is happening	The percentage of email failures exceeds the configured threshold. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.OCS.emailFailures.bad.pollcount` property to customize the percentage of email failures before an error is triggered. The default is 80 percent. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-outputconnector restart
What to do	In Splunk UBA, go to Manage > Output Connectors and verify that the Email Connector is correctly configured.

OCS-5

Status	WARN
Error message	Minor anomaly loss has been detected.
What is happening	Kafka is dropping events from AnomalyTopic. This warning is triggered when the percentage of missed anomalies is between from% (inclusive) and to% (not inclusive) for the last number of (pollCount) polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: from: `ubaMonitor.OCS.anomalies.missed.warn.threshold` (default is 1) to: `ubaMonitor.OCS.anomalies.missed.bad.threshold` (default is 5) pollCount: `ubaMonitor.OCS.anomalies.missed.warn.polls` (default is 4) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-outputconnector restart
What to do	Restart the Output Connector Server. The indicator's status will reset and its status will recalculate in the next hour. If the status returns to WARN, contact Splunk Support. To restart the Output Connector Server: Find the node on which it runs from the Cluster Services page. Log in to this node as caspida user and run the following commands: sudo service caspida-outputconnector stop sudo service caspida-outputconnector start

OCS-6

Status	ERROR
Error message	A significant percentage of anomalies is being dropped from Kafka.
What is happening	Kafka is dropping a significant percentage of events from AnomalyTopic. This error is triggered when the percentage of missed anomalies exceeds limit% (inclusive) for the last number of (pollCount) polls. You can perform the following tasks to customize the thresholds for when the error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: limit: `ubaMonitor.OCS.anomalies.missed.bad.threshold` (default is 5) pollCount: `ubaMonitor.OCS.anomalies.missed.bad.polls` (default is 4) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-outputconnector restart
What to do	Restart the Output Connector Server. The indicator's status will reset and its status will recalculate in the next hour. If the status returns to WARN, contact Splunk Support. To restart the Output Connector Server: Find the node on which it runs from the Cluster Services page. Log in to this node as caspida user and execute: sudo service caspida-outputconnector stop sudo service caspida-outputconnector start

OCS-8

Status	WARN
Error message	Minor event loss has been detected.
What is happening	Kafka is dropping events from OutputConnectorTopic. This warning is triggered when the percentage of missed events is between from% (inclusive) and to% (not inclusive) for the last number of (pollCount) polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: from: `ubaMonitor.OCS.events.missed.warn.threshold` (default is 1) to: `ubaMonitor.OCS.events.missed.bad.threshold` (default is 5) pollCount: `ubaMonitor.OCS.events.missed.warn.polls` (default is 4) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-outputconnector restart
What to do	Restart the Output Connector Server. The indicator's status will be reset and in the next hour, its status will be recalculated. If the status goes back to WARN again, contact Splunk Support. To restart the Output Connector Server: Find the node on which it runs from the Cluster Services page. Log in to this node as caspida user and execute: sudo service caspida-outputconnector stop sudo service caspida-outputconnector start

OCS-9

Status	ERROR
Error message	A significant percentage of events are dropping from Kafka.
What is happening	Kafka is dropping a significant percentage of events from OutputConnectorTopic. This error is triggered when the percentage of missed events exceeds limit% (inclusive) for the last number of (pollCount) polls. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: limit: `ubaMonitor.OCS.events.missed.bad.threshold` (default is 5) pollCount: `ubaMonitor.OCS.events.missed.bad.polls` (default is 4) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-outputconnector restart
What to do	Restart the Output Connector Server. The indicator's status will reset and its status will recalculate in the next hour. If the status returns to WARN, contact Splunk Support. To restart the Output Connector Server: Find the node on which it runs from the Cluster Services page. Log in to this node as caspida user and execute: sudo service caspida-outputconnector stop sudo service caspida-outputconnector start

OCS-11

Status	ERROR
Error message	The latest threat feed has been halted.
What is happening	Splunk UBA has not been able to send threats to Splunk for the past hour. This error is triggered when the same batch of threats has failed to be sent to Splunk ES for the last number of (badPollCount) polls. You can perform the following tasks to customize the thresholds for when an error is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.OCS.threats.stuck.bad.polls` property to customize the thresholds to trigger the error. The default is 12 polls. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-outputconnector restart
What to do	Verify that Splunk ES is up and running. Restart the Output Connector Server: Find the node on which the Output Connector Server is running from the Cluster Services page. Log in to this node as caspida user and execute: sudo service caspida-outputconnector stop sudo service caspida-outputconnector start The indicator's status will reset and its status will recalculate in the next hour. If the status returns to ERROR, contact Splunk Support.

Streaming Models (SML)

Splunk UBA displays several streaming model (SML) status codes to help you troubleshoot any issues with the streaming model services.

SML-1

Status	WARN
Error message	Time lag of the consumer exceeded x% of topic's retention(<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.
What is happening	Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when either of the following conditions are met: One or more of instance time lag indicators on the topic is WARN. The `timeLag` is greater than `warnThreshold` % of the topic's retention for `warnPollCount` number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: warnThreshold: `ubaMonitor.SML.<topic>.timeLag.warn.threshold` (default is 80 percent) warnPollCount: `ubaMonitor.SML.<topic>.timeLag.warn.pollcount` (default is 6 polls) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

SML-3

Status	WARN
Error message	Instance has not consumed events for <x mins/hrs>.
What is happening	The instance has not consumed any events in the specified interval of time. This warning is triggered when instance's `timeLag` on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for `warnPollCount` number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.SML.<firstModelInGroup>.<topic>.instance.timeLag.warn.pollcount` property to customize the threshold for when a warning is triggered. The default is 24 polls. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

Threat Computation Task (TC)

Splunk UBA displays several threat computation (TC) status codes to help you troubleshoot any issues with the threat computation task.

TC-1

Status	WARN
Error message	Threat computation is taking more than a certain number of minutes to complete.
What is happening	The last time that the threat computation process ran, it took more than `duration` number of minutes to compute threats in Splunk UBA. High system loads can contribute to a longer time period for threat computation. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.threatComputation.duration.warn` property to customize the threshold for when a warning is triggered. The default is 120 minutes. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf The threat computation task is run as part of `caspida-jobmanager` and no restart is required for property changes to take effect.
What to do	Monitor this indicator for the next day or two. If it continues to appear, contact Splunk Support.

TC-2

Status	WARN
Error message	Graph computation and threat calculation is taking more than a certain number of minutes to complete.
What is happening	Graph computation is slowing the process of threat computation. High system load can contribute to a longer time period for graph computation of threats. This warning is triggered when the graph computation part of the latest invocation of the task took more than `graphDuration` minutes to complete. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.threatComputation.graphDuration.warn` property to customize the threshold for when a warning is triggered. The default is 90 minutes. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf The threat computation task is run as part of `caspida-jobmanager` and no restart is required for property changes to take effect.
What to do	Monitor this indicator for the next day or two. If it continues to appear, contact Splunk Support.

TC-3

Status	WARN
Error message	The most recent threat computation failed.
What is happening	The most recent process to compute new threats failed to run.
What to do	The task runs hourly. If threat computation fails regularly or often, contact Splunk Support.

UBA ETL Service (ETL)

Splunk UBA displays several ETL status codes to help you troubleshoot any issues with the ETL services.

ETL-1-RawDataTopic

Status	WARN
Error message	Indicator is in WARN state for one of two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> 2. TimeLag of one/more instances is x% of topic's retention <topicRetention>) for <x mins/hrs>. Consumer may start missing events soon
What is happening	Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when timeLag > warnThreshold % of topic's retention for a specific number of polls (pollcount). You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: warnThreshold: `ubaMonitor.ETL.RawDataTopic.timeLag.warn.threshold` (default is 80 percent) warnPollCount: `ubaMonitor.ETL.RawDataTopic.timeLag.warn.pollcount` (default is 6 polls) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ETL-3-RawDataTopic

Status	WARN
Error message	Instance has not consumed events for <x mins/hrs>.
What is happening	The instance has not consumed any events during the specified period of time. This warning is triggered when the timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warn.pollcount number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.ETL.RawDataTopic.instance.timeLag.warn.pollcount` property to customize the threshold for when a warning is triggered. The default is 12 polls. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

UBA Identity Service (IR)

Splunk UBA displays several identity resolution (IR) status codes to help you troubleshoot any issues with the IR services.

IR-1-PreIREventTopic

Status	WARN
Error message	Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> 2. TimeLag of one/more instances is x% of topic's retention <topicRetention>) for <x mins/hrs>. Consumer may start missing events soon
What is happening	Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when the timeLag > warnThreshold % of topic's retention for warn.pollcount number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: warnThreshold: `ubaMonitor.IR.PreIREventTopic.timeLag.warn.threshold` (default is 80 percent) warnPollCount: `ubaMonitor.IR.PreIREventTopic.timeLag.warn.pollcount` (default is 6 polls) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

IR-3-PreIREventTopic

Status	WARN
Error message	Instance has not consumed events for <x mins/hrs>.
What is happening	The instance has not consumed any events during the specified period of time. This warning is triggered when instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.IR.PreIREventTopic.instance.timeLag.warn.pollcount` property to customize the threshold for when a warning is triggered. The default is 12 polls. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

IR-1-IRTopic

Status	WARN
Error message	Indicator is in WARN state because of one of the two reasons: 1. One/more of the instances have not consumed events since <x mins/hrs> 2. TimeLag of one/more instances is x% of topic's retention <topicRetention>) for <x mins/hrs>. Consumer may start missing events soon
What is happening	Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when timeLag > warnThreshold % of topic's retention for warnPollCount number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the following properties: warnThreshold: `ubaMonitor.IR.RawDataTopic.timeLag.warn.threshold` (default is 80 percent) warnPollCount: `ubaMonitor.IR.RawDataTopic.timeLag.warn.pollcount` (default is 6 polls) Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

IR-3-IRTopic

Status	WARN
Error message	Instance has not consumed events for <x mins/hrs>.
What is happening	The instance has not consumed any events during the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for badPollCount number of polls. You can perform the following tasks to customize the thresholds for when a warning is triggered: Edit `/etc/caspida/local/conf/uba-site.properties` and add or edit the `ubaMonitor.IR.RawDataTopic.instance.timeLag.bad.pollcount` property to customize the threshold for when a warning is triggered. The default is 12 polls. Synchronize the cluster in distributed Splunk UBA deployments: /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf Restart the required services by running the following command on the management node: sudo service caspida-sysmon restart
What to do	Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

Related answers from Splunk Community

Health Monitor status code reference

Analytics Writer Service (ANL_WRT)

ANL_WRT-1-AnalyticsTopic

ANL_WRT-3-AnalyticsTopic

ANL_WRT-1-IRTopic

ANL_WRT-3-IRTopic

Data Sources (DS)

DS-1

DS-2

DS-3

DS-5

DS-6

DS-7

DS-8

DS-9

ENUM_MISMATCH_WARN

ENUM_MISMATCH_BAD

DS_LAGGING_WARN

DS_SEARCH_STATUS_BAD

Kafka Broker (KAFKA)

KAFKA-1

Offline Rule Executor (ORE)

ORE-1

ORE-2

ORE-3

ORE-4

Realtime Rule Executor (RRE)

RRE-1

RRE-2

RRE-3

RRE-4

PostgreSQL (PSQL)

PSQL-1

Offline Models (OML)

OML-1

OML-2

Output Connector Service (OCS)

OCS-4

OCS-5

OCS-6

OCS-8

OCS-9

OCS-11

Streaming Models (SML)

SML-1

SML-3

Threat Computation Task (TC)

TC-1

TC-2

TC-3

UBA ETL Service (ETL)

ETL-1-RawDataTopic

ETL-3-RawDataTopic

UBA Identity Service (IR)

IR-1-PreIREventTopic

IR-3-PreIREventTopic

IR-1-IRTopic

IR-3-IRTopic

Comments

Health Monitor status code reference

Was this topic useful?