Splunk® User Behavior Analytics

Administer Splunk User Behavior Analytics


Health Monitor status code reference

Use the status codes provided by Splunk UBA in the Health Monitor dashboard to help troubleshoot issues in your deployment. See Monitor the health of your Splunk UBA deployment for information about the dashboard.

Health monitor indicators do not stop any processes from running. For example, if a rule's execution time has exceeded a threshold and a warning is generated, its execution is not interrupted.

Analytics Writer Service (ANL_WRT)

Splunk UBA displays several analytics writer (ANL_WRT) status codes to help you troubleshoot any issues with the analytics writer services.

ANL_WRT-1-AnalyticsTopic

Status WARN
Error message Indicator is in WARN state because of one of the two reasons:

1. One/more of the instances have not consumed events since <x mins/hrs>
2. TimeLag of one/more instances is x% of topic's retention (<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.

What is happening Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • warnThreshold: ubaMonitor.ANL_WRT.AnalyticsTopic.timeLag.warn.threshold (default is 80 percent)
    • warnPollCount: ubaMonitor.ANL_WRT.AnalyticsTopic.timeLag.warn.pollcount (default is 6 polls)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
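For reference, the entries added in step 1 might look like the following in uba-site.properties. The values shown are the documented defaults, and expressing the threshold as a whole-number percentage is an assumption:
    # Warn when timeLag exceeds 80% of the topic's retention for 6 consecutive polls
    ubaMonitor.ANL_WRT.AnalyticsTopic.timeLag.warn.threshold=80
    ubaMonitor.ANL_WRT.AnalyticsTopic.timeLag.warn.pollcount=6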
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ANL_WRT-3-AnalyticsTopic

Status WARN
Error message Instance has not consumed events for <x mins/hrs>.
What is happening The instance has not consumed any events in the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.ANL_WRT.AnalyticsTopic.instance.timeLag.warn.pollcount property to customize the threshold for when a warning is triggered. The default is 12 polls.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ANL_WRT-1-IRTopic

Status WARN
Error message Indicator is in WARN state because of one of the two reasons:

1. One/more of the instances have not consumed events since <x mins/hrs>
2. TimeLag of one/more instances is x% of topic's retention (<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.

What is happening Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • warnThreshold: ubaMonitor.ANL_WRT.IRTopic.timeLag.warn.threshold (default is 80 percent)
    • warnPollCount: ubaMonitor.ANL_WRT.IRTopic.timeLag.warn.pollcount (default is 6 polls)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ANL_WRT-3-IRTopic

Status WARN
Error message Instance has not consumed events for <x mins/hrs>.
What is happening The instance has not consumed any events in the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.ANL_WRT.IRTopic.instance.timeLag.warn.pollcount property to customize the threshold for when a warning is triggered. The default is 12 polls.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

Data Sources (DS)

Splunk UBA displays several data source (DS) status codes to help you troubleshoot any issues with the data source services.

DS-1

Status ERROR
Error message EPS of one/more Data Sources was 0 for over a day.
What is happening Zero events per second were processed for one or more low-frequency data sources for more than 5 days, or high-frequency data sources for more than 1 day.

Each data source is categorized as low-frequency or high-frequency during a training period of 10 days, where it is polled once per hour (240 polls). Configure this threshold using the ubaMonitor.DS.datasourceEPS.training.pollCount property.

  • If the EPS is below dsEpsFreqThreshold consistently during the training period, the data source is categorized as a low-frequency data source. If not, the data source is categorized as a high-frequency data source.
  • After the training period, if the EPS of a high-frequency data source is 0 for a specific number of polls (highFreqBadPollCount), this error is generated.
  • After the training period, if the EPS of a low-frequency data source is 0 for a specific number of polls (lowFreqBadPollCount), this error is generated.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • dsEpsFreqThreshold: ubaMonitor.DS.datasourceEPS.freq.threshold (default is 50)
    • highFreqBadPollCount: ubaMonitor.DS.datasourceEPS.highFreq.bad.pollCount (default is 24 polls, or 1 day)
    • lowFreqBadPollCount: ubaMonitor.DS.datasourceEPS.lowFreq.bad.pollCount (default is 120 polls, or 5 days)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
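For reference, a uba-site.properties snippet using the documented defaults might look like the following:
    # Classify data sources with EPS below 50 as low frequency (default)
    ubaMonitor.DS.datasourceEPS.freq.threshold=50
    # Raise DS-1 after 24 zero-EPS polls (1 day) for high-frequency sources
    ubaMonitor.DS.datasourceEPS.highFreq.bad.pollCount=24
    # Raise DS-1 after 120 zero-EPS polls (5 days) for low-frequency sources
    ubaMonitor.DS.datasourceEPS.lowFreq.bad.pollCount=120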
What to do Restart the data source(s) in the BAD state and make sure events are being sent to Splunk UBA. Contact Splunk Support if the status does not return to OK.

DS-2

Status WARN
Error message EPS of one/more Data Sources was 0 for over 6 hours.
What is happening Zero events per second were processed for one or more low-frequency data sources for more than 3 days or high-frequency data sources for more than 6 hours.

Each data source is categorized as low-frequency or high-frequency during a training period of 10 days, where it is polled once per hour (240 polls). Configure this threshold using the ubaMonitor.DS.datasourceEPS.training.pollCount property.

  • If the EPS is below dsEpsFreqThreshold consistently during the training period, the data source is categorized as a low-frequency data source. If not, the data source is categorized as a high-frequency data source.
  • After the training period, if the EPS of a high-frequency data source is 0 for a specific number of polls (highFreqWarnPollCount), this warning is generated.
  • After the training period, if the EPS of a low-frequency data source is 0 for a specific number of polls (lowFreqWarnPollCount), this warning is generated.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • dsEpsFreqThreshold: ubaMonitor.DS.datasourceEPS.freq.threshold (default is 50)
    • highFreqWarnPollCount: ubaMonitor.DS.datasourceEPS.highFreq.warn.pollCount (default is 6 polls, or 6 hours)
    • lowFreqWarnPollCount: ubaMonitor.DS.datasourceEPS.lowFreq.warn.pollCount (default is 72 polls, or 3 days)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 18 hours. If the status does not go back to OK or keeps fluctuating, restart the data source(s) in the WARN state and check that events are being sent to Splunk UBA.

DS-3

Status ERROR
Error message Number of skipped events is more than the configured threshold.
What is happening The number of skipped events (for example, Unknown, EpochTooLow, EpochTooHigh, EpochDefault, EventBlacklisted, EventFiltered, EventHasNoEntities, IpAddressOutOfRange, AllUsersUnknown, AD3DigitGroupEventCode, PANIgnoredSesssionEvent, DuplicateEvent, EventNotRelevant) exceeds the configured threshold.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.DS.skippedEvents.bad.threshold property to customize the threshold for when an error is triggered. The default threshold is 75.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
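As an illustration, raising the threshold from the default of 75 to a hypothetical value of 90 would look like the following in uba-site.properties:
    # Hypothetical example: raise DS-3 only when skipped events exceed 90 (default is 75)
    ubaMonitor.DS.skippedEvents.bad.threshold=90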
What to do In Splunk UBA, go to Manage > Data Sources. Under each data source, determine why the events are skipped. Contact Splunk Support if the reason is neither UI event filters nor HR data.

DS-5

Status ERROR
Error message No events were processed for one or more data formats.
What is happening For each data format, the average frequency of events stored in Impala per poll is calculated over a default training period of 10 polls. You can customize this number of polls using the ubaMonitor.DS.eventsPerFormat.trainingPeriod property.
  • If the frequency is above 1000, the data format is categorized as high frequency. Otherwise, it is categorized as low frequency. You can customize this threshold using the ubaMonitor.DS.eventsPerFormat.freq.threshold property.
  • After the training period, if the number of events for any high-frequency data format is 0, this error is generated.
  • After the training period, if the number of events for any low-frequency data format is 0 for ubaMonitor.DS.eventsPerFormat.lowFreq.NOK consecutive polls, this error is generated. The default is 7 polls.
What to do Verify that the Splunk platform is ingesting data. If data is being ingested:
  • Restart the data source for that format.
  • Contact Splunk Support if no events are being processed for the data format after restarting the data source.

If data is not being ingested by the Splunk platform, fix the data onboarding on the Splunk platform.

DS-6

Status WARN
Error message HR data ingestion has warnings.
What is happening Possible causes:
  • The HR data source is not defined.
  • Ingestion of HR data has not been executed for more than the configured ubaMonitor.DS.HR.poll.warn.threshold. The default threshold is 48 hours.
  • The last HR data job took longer than the configured ubaMonitor.DS.HR.poll.max.duration. The default threshold is 3 hours.
  • The percentage of events that failed to be parsed by any of the jobs is equal to or greater than ubaMonitor.DS.HR.poll.max.failed.perc. The default threshold is 0.05, or 5 percent.
What to do Verify that the HR data source is configured properly and is ingesting events by going to the Data Sources page in Splunk UBA. See Validate HR data configuration before adding other data sources.
  • If events are being properly ingested, restart the HR data source. If the warning persists, contact Splunk Support.
  • If events are not being ingested, fix the HR data source ingestion on the Splunk platform.

DS-7

Status BAD
Error message HR data ingestion has errors.
What is happening The amount of time since the latest execution of any of the jobs is more than ubaMonitor.DS.HR.poll.bad.threshold. The default threshold is 72 hours.
What to do Restart the HR data source. If the error persists, contact Splunk Support.

DS-8

Status WARN
Error message Asset data ingestion has warnings.
What is happening Possible causes:
  • Ingestion of asset data has not been executed for more than the configured ubaMonitor.asset.HR.poll.warn.threshold. The default threshold is 48 hours.
  • The last asset data job took longer than the configured ubaMonitor.DS.asset.poll.max.duration. The default threshold is 3 hours.
  • The percentage of events that failed to be parsed by any of the jobs is equal to or greater than ubaMonitor.DS.asset.poll.max.failed.perc. The default threshold is 0.05, or 5 percent.
What to do Verify that the assets data source is configured properly and is ingesting events by going to the Data Sources page in Splunk UBA.
  • If events are being properly ingested, restart the assets data source. If the warning persists, contact Splunk Support.
  • If events are not being ingested, fix the assets data source ingestion on the Splunk platform.

DS-9

Status BAD
Error message Asset data ingestion has errors.
What is happening The amount of time since the latest execution of any of the jobs is more than ubaMonitor.DS.asset.poll.bad.threshold. The default threshold is 72 hours.
What to do Restart the assets data source. If the error persists, contact Splunk Support.

ENUM_MISMATCH_WARN

Status WARN
Error message enum mismatch beyond warn threshold.
What is happening The ratio of bad events to total events is between 0.1 and 0.2.
What to do Stop the affected data source and make sure Splunk UBA is able to understand enum fields. Take one of two actions:
  • Modify the SPL to make sure values in enum fields match what is expected in the normalize.rules file.
  • Update normalize.rules to enable Splunk UBA to understand incoming data.

For more information, see Monitor the quality of data sent from the Splunk platform in Get Data into Splunk User Behavior Analytics.

ENUM_MISMATCH_BAD

Status BAD
Error message enum mismatch beyond error threshold.
What is happening The ratio of bad events to total events is greater than 0.2.
What to do Stop the affected data source and make sure Splunk UBA is able to understand enum fields. Take one of two actions:
  • Modify the SPL to make sure values in enum fields match what is expected in the normalize.rules file.
  • Update normalize.rules to enable Splunk UBA to understand incoming data.

For more information, see Monitor the quality of data sent from the Splunk platform in Get Data into Splunk User Behavior Analytics.

DS_LAGGING_WARN

Status WARN
Error message Data source micro-batch search lag is over the threshold.
What is happening Data source services monitor the Splunk data source ingestion search lag, including Kafka data ingestion. The lag is defined as the duration between the search submission time and the search's latest time. If the lag exceeds 3600 seconds, a warning message is displayed.

See Configure Kafka data ingestion in the Splunk UBA Kafka Ingestion App manual.

Configure this threshold by adding or editing the splunk.kafka.ingestion.search.max.lag.seconds property.
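For example, relaxing the threshold from the default of 3600 seconds to a hypothetical 7200 seconds would look like the following in uba-site.properties:
    # Hypothetical example: allow up to 2 hours of micro-batch search lag before warning
    splunk.kafka.ingestion.search.max.lag.seconds=7200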

What to do Stop the affected data source and try to split it into multiple sources to keep each data source's EPS small.

DS_SEARCH_STATUS_BAD

Status BAD
Error message Data source micro-batch search has returned an error.
What is happening Data source services monitor the Splunk data source ingestion search status, including Kafka data ingestion. This indicator tracks whether the search issued by the data source is healthy. An alert is triggered when the search returns an error more than ubaMonitor.DS.search.highFreq.bad.pollCount times in a row. The default is 3 times in a row.

Configure this threshold by adding or editing the ubaMonitor.DS.search.highFreq.bad.pollCount property.

What to do Stop the affected data source and check the Job Inspector on the Splunk platform for more debugging information.

Kafka Broker (KAFKA)

Splunk UBA displays one Kafka status code to help you troubleshoot any issues with the Kafka broker.

KAFKA-1

Status ERROR
Error message Kafka topics are not receiving events.
What is happening There may be an issue in kafka-server that prevents events from being written to their topics. This error is triggered when data is being ingested but the indicator item values are null (when there are exceptions from KafkaJMXClient) for badPollCount number of polls. Data is considered to be ingested when the difference between the eventsIngested count reported by the connectors at the first poll required for the status change (last pollCount polls) and at the latest poll is greater than eventsIngestedDiffThreshold.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • badPollCount: ubaMonitor.kafka.bytesIn.bad.pollCount (default is 3 polls)
    • eventsIngestedDiffThreshold: ubaMonitor.kafka.bytesIn.eventsIngested.diff.threshold (default is 100)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
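For reference, the entries added in step 1 might look like the following in uba-site.properties, using the documented defaults:
    # Number of polls with null indicator values before KAFKA-1 is raised
    ubaMonitor.kafka.bytesIn.bad.pollCount=3
    # Minimum difference in eventsIngested that counts as active ingestion
    ubaMonitor.kafka.bytesIn.eventsIngested.diff.threshold=100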
What to do Monitor for an hour, and if the status does not return to OK, restart kafka-server. If Kafka topics continue to not receive events, contact Splunk Support.

Offline Rule Executor (ORE)

Splunk UBA displays several offline rule executor (ORE) status codes to help you troubleshoot any issues with the offline rule executor services.

ORE-1

Status WARN
Error message One or more rules have failed to run.
What is happening One or more custom threats, anomaly action rules, or anomaly rules failed to run in the current session.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.ore.rulesFailures.warn.ruleCount property to set the number of rules that must fail to run before a warning is generated. The default is 1.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
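For example, requiring two failed rules before the warning fires (the default is 1) might look like the following in uba-site.properties:
    # Hypothetical example: warn only after 2 or more rules fail in the current session
    ubaMonitor.ore.rulesFailures.warn.ruleCount=2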
What to do Monitor the number of failed runs compared to the total number of runs for the rule over the following days. If the number of failed runs increases or remains steady, contact Splunk Support. There is no need to contact Splunk Support if the number of failed runs decreases.

ORE-2

Status ERROR
Error message One or more rules consistently failed to run.
What is happening One or more custom threats, anomaly action rules, or anomaly rules have consistently failed to run in the current session. This error is triggered when more than 20 percent of the executions of at least 1 rule have failed in the current session.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.ore.rulesFailures.bad.rulePerc property to customize the threshold for when an error is triggered. The default is 20 percent.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-offlineruleexec restart
What to do Contact Splunk Support.

ORE-3

Status WARN
Error message Average execution time per rule
What is happening At least one custom threat, anomaly action rule, or anomaly rule has exceeded the warning threshold for average execution time.
What to do If the rules in WARN status are executing without failures, edit the ubaMonitor.ore.ruleDuration.warn.duration property to increase the time threshold before a warning is generated. The default is 30 minutes. After setting the property:
  1. Sync the cluster.
    /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf/
  2. Restart the offline rule executor.
    sudo service caspida-offlineruleexec restart

Click the link in Average Execution Time per Rule on the Data Quality Indicators page to check the rule execution times.

ORE-4

Status WARN
Error message Threat revalidation is slower than normal.
What is happening During the current Realtime Rule Executor session, it is taking longer than a specified threshold of time to revalidate threats based on changes to anomalies, such as new scores, deleted anomalies, or suppressed anomalies.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.rre.threatReval.warn.duration property to customize the threshold for when a warning is triggered. The default is 20 minutes.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-offlineruleexec restart
What to do Contact Splunk Support.

Realtime Rule Executor (RRE)

Splunk UBA displays several realtime rule executor (RRE) status codes to help you troubleshoot any issues with the realtime rule executor services.

RRE-1

Status WARN
Error message New anomalies are being processed slowly and the processing speed is slowing down.
What is happening The average events per second (EPS) for processing new anomalies is decreasing. A high number of suppressed anomalies affects the processing of new anomalies. This warning is triggered when the 10-minute average EPS keeps dropping (or increases by less than 0.2) for the last pollCount polls and is below the configurable WARN threshold in all of those polls. If no threshold is specified, this status is disabled.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • threshold: ubaMonitor.RRE.movingAvgNewAnomEPS.warn (initialized with 1)
    • pollCount: ubaMonitor.RRE.movingAvgNewAnomEPS.warn.polls (default is 12)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-realtimeruleexec restart
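For reference, the entries added in step 1 might look like the following in uba-site.properties, using the documented initial values:
    # Warn when the 10-minute average EPS stays below 1 and keeps dropping for 12 polls
    ubaMonitor.RRE.movingAvgNewAnomEPS.warn=1
    ubaMonitor.RRE.movingAvgNewAnomEPS.warn.polls=12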
What to do Restart the Realtime Rule Executor service.
  1. On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed.
  2. Log in to that server as the caspida user using SSH.
  3. Run the following commands:
    sudo service caspida-realtimeruleexec stop
    sudo service caspida-realtimeruleexec start

If the error continues to appear, check the number of suppressed anomalies on the Data Quality Metrics page on the Health Monitor dashboard. If the number of suppressed anomalies is in the millions, delete some of the suppressed anomalies. See Delete anomalies in Splunk UBA.

RRE-2

Status ERROR
Error message New anomalies are no longer being processed.
What is happening The processing speed of new anomalies is low enough that no anomalies were processed in the last 10-minute period. This error is triggered when the 10-minute average EPS keeps dropping (or increases by less than 0.2) for the last pollCount polls and is below the configurable BAD threshold in all of those polls. If no threshold is specified, this status is disabled.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • threshold: ubaMonitor.RRE.movingAvgNewAnomEPS.bad (initialized with 0.1)
    • pollCount: ubaMonitor.RRE.movingAvgNewAnomEPS.bad.polls (default is 12)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-realtimeruleexec restart
What to do Restart the Realtime Rule Executor service.
  1. On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed.
  2. Log in to that server as the caspida user using SSH.
  3. Run the following commands:
    sudo service caspida-realtimeruleexec stop
    sudo service caspida-realtimeruleexec start

If the error continues to appear, check the number of suppressed anomalies on the Data Quality Metrics page on the Health Monitor dashboard. If the number of suppressed anomalies is in the millions, delete some of the suppressed anomalies. See Delete anomalies in Splunk UBA.

RRE-3

Status WARN
Error message Minor anomaly loss has been detected.
What is happening Kafka has started to drop events from NewAnomalyTopic. This warning is triggered when the percentage of missed events is between from% (inclusive) and to% (exclusive) for the last pollCount polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • from: ubaMonitor.RRE.missed.warn.threshold (default is 1)
    • to: ubaMonitor.RRE.missed.bad.threshold (default is 5)
    • pollCount: ubaMonitor.RRE.missed.warn.polls (default is 4)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-realtimeruleexec restart
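For reference, the entries added in step 1 might look like the following in uba-site.properties. The values shown are the documented defaults, and expressing the percentages as whole numbers is an assumption:
    # Warn when between 1% and 5% of events are missed for 4 consecutive polls
    ubaMonitor.RRE.missed.warn.threshold=1
    ubaMonitor.RRE.missed.bad.threshold=5
    ubaMonitor.RRE.missed.warn.polls=4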
What to do Restart the Realtime Rule Executor. The indicator's status resets and is recalculated in the next hour. If the status returns to WARN, contact Splunk Support.

To restart the Realtime Rule Executor, perform the following steps:

  1. On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed.
  2. Log in to that server as the caspida user using SSH.
  3. Run the following commands:
    sudo service caspida-realtimeruleexec stop
    sudo service caspida-realtimeruleexec start

RRE-4

Status ERROR
Error message A significant percentage of anomalies is being dropped by Kafka.
What is happening Kafka is dropping a significant number of events from NewAnomalyTopic. This error is triggered when the percentage of missed events is at or above limit% for the last pollCount polls.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • limit: ubaMonitor.RRE.missed.bad.threshold (default is 5)
    • pollCount: ubaMonitor.RRE.missed.bad.polls (default is 4)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-realtimeruleexec restart
What to do Restart the Realtime Rule Executor. The indicator's status resets and is recalculated in the next hour. If the status returns to ERROR, contact Splunk Support.

To restart the Realtime Rule Executor, perform the following steps:

  1. On the Services page of the Health Monitor dashboard, locate the server node with the Realtime Rule Executor service installed.
  2. Log in to that server as the caspida user using SSH.
  3. Run the following commands:
    sudo service caspida-realtimeruleexec stop
    sudo service caspida-realtimeruleexec start

PostgreSQL (PSQL)

Splunk UBA displays one PostgreSQL (PSQL) status code to help you troubleshoot any issues with the PostgreSQL service.

PSQL-1

Status WARN
Error message The number of suppressed anomalies is too high.
What is happening A high volume of suppressed anomalies has led to slow system performance. This warning is triggered when the number of suppressed anomalies in PostgreSQL surpasses a configurable threshold. If no threshold is specified, this status is disabled.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.postgres.supprAnom.warn property to customize the threshold for when a warning is triggered. The default is 5 million.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
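For example, lowering the warning threshold from the default of 5 million to a hypothetical 3 million suppressed anomalies might look like the following; the plain-integer value format is an assumption:
    # Hypothetical example: warn once suppressed anomalies exceed 3 million
    ubaMonitor.postgres.supprAnom.warn=3000000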
What to do Reduce the number of suppressed anomalies to improve performance. Delete anomalies from the anomalies trash. See Delete anomalies in Splunk UBA.

Offline Models (OML)

Splunk UBA displays several offline model (OML) status codes to help you troubleshoot any issues with the offline model services.

OML-1

Status WARN
Error message One or more models have not executed for (x) hours.
What is happening One or more models have not executed successfully in the past 2 days.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • pollTime: ubaMonitor.OML.exec.time.poll.hour (default is 22, or 10:00 PM)
    • warn pollCount: ubaMonitor.OML.exec.time.warn.poll.count (default is 1 poll)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
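For reference, the entries added in step 1 might look like the following in uba-site.properties, using the documented defaults:
    # Poll model execution status at hour 22 (10:00 PM) and warn after 1 missed poll
    ubaMonitor.OML.exec.time.poll.hour=22
    ubaMonitor.OML.exec.time.warn.poll.count=1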
What to do Keep monitoring for the next 48 hours. If the OK status does not return during that time, contact Splunk Support.

OML-2

Status ERROR
Error message One or more models have not executed for the past 3 days.
What is happening One or more models have not executed successfully in the specified interval of time.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • pollTime: ubaMonitor.OML.exec.time.poll.hour (default is 22, or 10:00 PM)
    • bad pollCount: ubaMonitor.OML.exec.time.bad.poll.count (default is 3 polls)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Contact Splunk Support.

Output Connector Service (OCS)

Splunk UBA displays several output connector service (OCS) status codes to help you troubleshoot any issues with the output connector services.

OCS-4

Status ERROR
Error message Percentage of email failures is more than the configured threshold.
What is happening The percentage of email failures exceeds the configured threshold.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.OCS.emailFailures.bad.pollcount property to customize the percentage of email failures before an error is triggered. The default is 80 percent.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-outputconnector restart
What to do In Splunk UBA, go to Manage > Output Connectors and verify that the Email Connector is correctly configured.

OCS-5

Status WARN
Error message Minor anomaly loss has been detected.
What is happening Kafka is dropping events from AnomalyTopic. This warning is triggered when the percentage of missed anomalies is between from% (inclusive) and to% (exclusive) for the last pollCount polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • from: ubaMonitor.OCS.anomalies.missed.warn.threshold (default is 1)
    • to: ubaMonitor.OCS.anomalies.missed.bad.threshold (default is 5)
    • pollCount: ubaMonitor.OCS.anomalies.missed.warn.polls (default is 4)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-outputconnector restart
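For reference, the entries added in step 1 might look like the following in uba-site.properties. The values shown are the documented defaults, and expressing the percentages as whole numbers is an assumption:
    # Warn when between 1% and 5% of anomalies are missed for 4 consecutive polls
    ubaMonitor.OCS.anomalies.missed.warn.threshold=1
    ubaMonitor.OCS.anomalies.missed.bad.threshold=5
    ubaMonitor.OCS.anomalies.missed.warn.polls=4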
What to do Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the status returns to WARN, contact Splunk Support.

To restart the Output Connector Server:

  1. Find the node on which it runs from the Cluster Services page.
  2. Log in to this node as the caspida user and run the following commands:
    sudo service caspida-outputconnector stop
    sudo service caspida-outputconnector start
    

OCS-6

Status ERROR
Error message A significant percentage of anomalies is being dropped from Kafka.
What is happening Kafka is dropping a significant percentage of events from AnomalyTopic. This error is triggered when the percentage of missed anomalies is at or above limit% for the last pollCount polls.

You can perform the following tasks to customize the thresholds for when the error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • limit: ubaMonitor.OCS.anomalies.missed.bad.threshold (default is 5)
    • pollCount: ubaMonitor.OCS.anomalies.missed.bad.polls (default is 4)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-outputconnector restart
What to do Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the status returns to ERROR, contact Splunk Support.

To restart the Output Connector Server:

  1. Find the node on which it runs from the Cluster Services page.
  2. Log in to this node as the caspida user and run the following commands:
    sudo service caspida-outputconnector stop
    sudo service caspida-outputconnector start
    

OCS-8

Status WARN
Error message Minor event loss has been detected.
What is happening Kafka is dropping events from OutputConnectorTopic. This warning is triggered when the percentage of missed events is between from% (inclusive) and to% (exclusive) for the last pollCount polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • from: ubaMonitor.OCS.events.missed.warn.threshold (default is 1)
    • to: ubaMonitor.OCS.events.missed.bad.threshold (default is 5)
    • pollCount: ubaMonitor.OCS.events.missed.warn.polls (default is 4)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-outputconnector restart
What to do Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the status returns to WARN, contact Splunk Support.

To restart the Output Connector Server:

  1. Find the node on which it runs from the Cluster Services page.
  2. Log in to this node as the caspida user and run the following commands:
    sudo service caspida-outputconnector stop
    sudo service caspida-outputconnector start
    

OCS-9

Status ERROR
Error message A significant percentage of events are dropping from Kafka.
What is happening Kafka is dropping a significant percentage of events from OutputConnectorTopic. This error is triggered when the percentage of missed events is at or above limit% for the last pollCount polls.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • limit: ubaMonitor.OCS.events.missed.bad.threshold (default is 5)
    • pollCount: ubaMonitor.OCS.events.missed.bad.polls (default is 4)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-outputconnector restart
What to do Restart the Output Connector Server. The indicator's status resets and is recalculated in the next hour. If the status returns to ERROR, contact Splunk Support.

To restart the Output Connector Server:

  1. Find the node on which it runs from the Cluster Services page.
  2. Log in to this node as the caspida user and run the following commands:
    sudo service caspida-outputconnector stop
    sudo service caspida-outputconnector start
    

OCS-11

Status ERROR
Error message The latest threat feed has been halted.
What is happening Splunk UBA has not been able to send threats to Splunk for the past hour. This error is triggered when the same batch of threats has failed to be sent to Splunk ES for the last badPollCount polls.

You can perform the following tasks to customize the thresholds for when an error is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.OCS.threats.stuck.bad.polls property to customize the thresholds to trigger the error. The default is 12 polls.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-outputconnector restart
What to do
  1. Verify that Splunk ES is up and running.
  2. Restart the Output Connector Server:
    1. Find the node on which the Output Connector Server is running from the Cluster Services page.
    2. Log in to this node as the caspida user and run the following commands:
      sudo service caspida-outputconnector stop
      sudo service caspida-outputconnector start
      

The indicator's status resets and is recalculated in the next hour. If the status returns to ERROR, contact Splunk Support.

Streaming Models (SML)

Splunk UBA displays several streaming model (SML) status codes to help you troubleshoot any issues with the streaming model services.

SML-1

Status WARN
Error message Time lag of the consumer exceeded x% of topic's retention (<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.
What is happening Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when either of the following conditions is met:
  • One or more of the instance time lag indicators on the topic is in the WARN state.
  • The timeLag is greater than warnThreshold % of the topic's retention for warnPollCount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • warnThreshold: ubaMonitor.SML.<topic>.timeLag.warn.threshold (default is 80 percent)
    • warnPollCount: ubaMonitor.SML.<topic>.timeLag.warn.pollcount (default is 6 polls)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
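Because the property names include a <topic> placeholder, the actual keys depend on the topic named by the indicator. As a purely hypothetical example, for a topic called ExampleTopic the entries might look like the following, using the documented defaults:
    # Hypothetical topic name; substitute the topic shown by the indicator
    ubaMonitor.SML.ExampleTopic.timeLag.warn.threshold=80
    ubaMonitor.SML.ExampleTopic.timeLag.warn.pollcount=6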
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

SML-3

Status WARN
Error message Instance has not consumed events for <x mins/hrs>.
What is happening The instance has not consumed any events in the specified interval of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.SML.<firstModelInGroup>.<topic>.instance.timeLag.warn.pollcount property to customize the threshold for when a warning is triggered. The default is 24 polls.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

Threat Computation Task (TC)

Splunk UBA displays several threat computation (TC) status codes to help you troubleshoot any issues with the threat computation task.

TC-1

Status WARN
Error message Threat computation is taking more than a certain number of minutes to complete.
What is happening The last time that the threat computation process ran, it took more than duration minutes to compute threats in Splunk UBA. High system load can contribute to longer threat computation times.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.threatComputation.duration.warn property to customize the threshold for when a warning is triggered. The default is 120 minutes.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf

The threat computation task is run as part of caspida-jobmanager and no restart is required for property changes to take effect.
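For example, raising the warning threshold from the default of 120 minutes to a hypothetical 180 minutes would look like the following, assuming the value is specified in minutes:
    # Hypothetical example: warn only when threat computation takes longer than 180 minutes
    ubaMonitor.threatComputation.duration.warn=180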

What to do Monitor this indicator for the next day or two. If it continues to appear, contact Splunk Support.

TC-2

Status WARN
Error message Graph computation and threat calculation is taking more than a certain number of minutes to complete.
What is happening Graph computation is slowing the process of threat computation. High system load can contribute to a longer time period for graph computation of threats. This warning is triggered when the graph computation part of the latest invocation of the task took more than graphDuration minutes to complete.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.threatComputation.graphDuration.warn property to customize the threshold for when a warning is triggered. The default is 90 minutes.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf

The threat computation task is run as part of caspida-jobmanager and no restart is required for property changes to take effect.

What to do Monitor this indicator for the next day or two. If it continues to appear, contact Splunk Support.

TC-3

Status WARN
Error message The most recent threat computation failed.
What is happening The most recent process to compute new threats failed to run.
What to do The task runs hourly. If threat computation fails regularly or often, contact Splunk Support.

UBA ETL Service (ETL)

Splunk UBA displays several ETL status codes to help you troubleshoot any issues with the ETL services.

ETL-1-RawDataTopic

Status WARN
Error message Indicator is in WARN state for one of two reasons:

1. One/more of the instances have not consumed events since <x mins/hrs>
2. TimeLag of one/more instances is x% of topic's retention (<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.

What is happening Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when timeLag > warnThreshold % of topic's retention for a specific number of polls (pollcount).

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • warnThreshold: ubaMonitor.ETL.RawDataTopic.timeLag.warn.threshold (default is 80 percent)
    • warnPollCount: ubaMonitor.ETL.RawDataTopic.timeLag.warn.pollcount (default is 6 polls)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

ETL-3-RawDataTopic

Status WARN
Error message Instance has not consumed events for <x mins/hrs>.
What is happening The instance has not consumed any events during the specified period of time. This warning is triggered when the timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warn.pollcount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.ETL.RawDataTopic.instance.timeLag.warn.pollcount property to customize the threshold for when a warning is triggered. The default is 12 polls.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

UBA Identity Service (IR)

Splunk UBA displays several identity resolution (IR) status codes to help you troubleshoot any issues with the IR services.

IR-1-PreIREventTopic

Status WARN
Error message Indicator is in WARN state because of one of the two reasons:

1. One/more of the instances have not consumed events since <x mins/hrs>
2. TimeLag of one/more instances is x% of topic's retention (<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.

What is happening Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when the timeLag > warnThreshold % of topic's retention for warn.pollcount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • warnThreshold: ubaMonitor.IR.PreIREventTopic.timeLag.warn.threshold (default is 80 percent)
    • warnPollCount: ubaMonitor.IR.PreIREventTopic.timeLag.warn.pollcount (default is 6 polls)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

IR-3-PreIREventTopic

Status WARN
Error message Instance has not consumed events for <x mins/hrs>.
What is happening The instance has not consumed any events during the specified period of time.

This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for warnPollCount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.IR.PreIREventTopic.instance.timeLag.warn.pollcount property to customize the threshold for when a warning is triggered. The default is 12 polls.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

IR-1-IRTopic

Status WARN
Error message Indicator is in WARN state because of one of the two reasons:

1. One/more of the instances have not consumed events since <x mins/hrs>
2. TimeLag of one/more instances is x% of topic's retention (<topicRetention>) for <x mins/hrs>. Consumer may start missing events soon.

What is happening Consumers in Splunk UBA consume events from topics in the Kafka queue. When an event has stayed on its topic for the configured retention period, it is dropped to make room for newer events. This warning is triggered when timeLag > warnThreshold % of topic's retention for warnPollCount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the following properties:
    • warnThreshold: ubaMonitor.IR.RawDataTopic.timeLag.warn.threshold (default is 80 percent)
    • warnPollCount: ubaMonitor.IR.RawDataTopic.timeLag.warn.pollcount (default is 6 polls)
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.

IR-3-IRTopic

Status WARN
Error message Instance has not consumed events for <x mins/hrs>.
What is happening The instance has not consumed any events during the specified period of time. This warning is triggered when the instance's timeLag on the topic (min(latestRecordTimes) across its partitions with eventLag > 0) has remained constant for badPollCount number of polls.

You can perform the following tasks to customize the thresholds for when a warning is triggered:

  1. Edit /etc/caspida/local/conf/uba-site.properties and add or edit the ubaMonitor.IR.RawDataTopic.instance.timeLag.bad.pollcount property to customize the threshold for when a warning is triggered. The default is 12 polls.
  2. Synchronize the cluster in distributed Splunk UBA deployments:
    /opt/caspida/bin/Caspida sync-cluster  /etc/caspida/local/conf
  3. Restart the required services by running the following command on the management node:
    sudo service caspida-sysmon restart
What to do Keep monitoring for the next 4 hours. If the OK status does not return during that time, contact Splunk Support.