Monitor the health of your Splunk UBA deployment
The Health Monitor dashboard helps you assess the health of your Splunk UBA deployment and verify the quality of data added to Splunk UBA. You can open the Health Monitor by selecting System > Health Monitor.
You can also use the Health Monitor to review system errors. System errors appear as messages in the menu.
- The bell icon appears when there are messages. Click the bell icon to view messages.
- Click a message or error to open the Health Monitor dashboard.
You can have system errors emailed to you. See Monitor system health with the health check script.
Splunk UBA maintains historical information about the health of your Splunk UBA deployment. You can download the information when you collect diagnostic data from the System Monitor and System Resources Monitor modules for your Splunk UBA deployment. See Collect diagnostic data from your Splunk UBA deployment.
You can create a custom SSH banner message that displays when users log in. Do not configure this SSH banner in /etc/ssh/ssh_config, as it can impact the Health Monitor. Set the custom banner in /etc/ssh/sshd_config for optimal compatibility.
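As a rough way to confirm that no banner is set in the client configuration, the following Python sketch scans configuration text for active Banner lines. The function name find_banner_directives is hypothetical and not part of Splunk UBA, and the sketch assumes the usual OpenSSH syntax of a keyword followed by whitespace or an equals sign.

```python
import re


def find_banner_directives(config_text: str) -> list[str]:
    """Return the active (non-comment) lines that set a Banner directive."""
    hits = []
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        # OpenSSH keywords are case-insensitive and may be followed by
        # whitespace or by a single "=" before the argument.
        if re.match(r"banner(\s+|\s*=\s*)\S", stripped, flags=re.IGNORECASE):
            hits.append(stripped)
    return hits


# Example with inline text; in practice you might pass the contents of
# /etc/ssh/ssh_config instead.
sample = "# Banner /etc/old\nHost *\n  Banner /etc/issue.net\n"
print(find_banner_directives(sample))
```

An empty result for /etc/ssh/ssh_config suggests that the banner is not configured there.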
Enable test mode for specific health indicators
You can enable test mode for health indicators that are not helpful or that are not producing relevant system monitoring information. The test mode status replaces the OK, BAD, or WARN status for health status indicators.
- Log in to the Splunk UBA management server as the caspida user using SSH.
- Open the /etc/caspida/local/conf/uba-site.properties file in an editor.
- Change the ubaMonitor.<module_id>.<indicator_id>.mutable parameter from true to false to enable test mode. For example: ubaMonitor.pipeline.reToOCSNewAnomalyLag.mutable=false
- Save your file.
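If you prefer to script the change instead of editing the file by hand, the following Python sketch shows one way to set a key in properties-style text. The helper set_property is hypothetical, and the sketch assumes simple key=value lines with # comments. It operates on a string, so you decide how to read, back up, and write the actual file.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key set to value, appending it if absent."""
    lines = text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comment lines untouched.
        if stripped.startswith("#"):
            continue
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = new_line  # replace the first matching entry
            break
    else:
        # No existing entry was found, so add one at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


before = "# local overrides\nubaMonitor.pipeline.reToOCSNewAnomalyLag.mutable=true\n"
after = set_property(before, "ubaMonitor.pipeline.reToOCSNewAnomalyLag.mutable", "false")
print(after)
```

The example key is the one used in the steps above. Any other module and indicator IDs must come from your own deployment.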
View system health
To view system health information, perform the following steps:
- Select System > Health Monitor.
- Click System, if it is not already selected.
This page displays the IP addresses, host names, and deployment types of the server nodes in your environment. This view is mostly informational and displays errors when CPU usage and disk usage are higher than 90%. Click a row in the table to learn more about a specific server.
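To spot-check a server against the same 90% figure outside of the dashboard, you could use a sketch like the following. It only illustrates the threshold comparison for disk usage and is not how Splunk UBA computes its indicators. The function names and the default path are assumptions.

```python
import shutil

THRESHOLD_PERCENT = 90.0  # the 90% figure mentioned above


def percent_used(used: int, total: int) -> float:
    """Return used capacity as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def exceeds_threshold(used: int, total: int,
                      threshold: float = THRESHOLD_PERCENT) -> bool:
    """Return True when usage is strictly higher than the threshold."""
    return percent_used(used, total) > threshold


def disk_exceeds_threshold(path: str = "/") -> bool:
    """Check the file system that contains path against the threshold."""
    usage = shutil.disk_usage(path)
    return exceeds_threshold(usage.used, usage.total)
```

For example, disk_exceeds_threshold("/var") reports whether the file system that holds /var is more than 90% full.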
View services health
To view services health information, select System > Cluster.
Splunk UBA relies on several services and processes to create anomalies and threats, process events, identify user and device associations, and more. Monitor the health of these services and processes on this page.
In a distributed system, the host IP shows for each service so that you can find errors related to a specific host.
Service name | Service process | Description |
---|---|---|
Analytics Aggregator Service | analyticsaggregator | This service aggregates and compresses data written by Analytics Writer service. |
Analytics View Builder Service | analyticsviewbuilder | This service periodically updates materialized views in the analytics database. Splunk UBA uses these views to render dashboards. |
Analytics Writer Service | analyticswriter | This service writes new aggregated events and models output to the analytics database. |
Anomaly Aggregation Task | anomalyaggregationmodel | This model pre-processes new anomalies by enhancing the entities in the anomalies, then storing the anomalies in the database. |
Docker | docker | This service builds the containerized services, such as the streaming models, data ingestion, and identity resolution. |
Hadoop | hadoop-hdfs-namenode | This service keeps track of where data is stored in the Hadoop Distributed File System. |
Hadoop | hadoop-hdfs-datanode | This service stores data in the Hadoop Distributed File System. |
Hadoop | hadoop-hdfs-secondarynamenode | This dedicated node in the HDFS cluster takes checkpoints of the file system metadata present on the hadoop-hdfs-namenode. It is not a backup namenode. |
Hive Metastore | hive-metastore | This service stores the metadata for Hive/Impala tables and partitions in a relational database, and provides clients access to this information using the metastore service API. |
Impala | impala-catalog | This internal service is used by the Apache Impala database engine. |
Impala | impala-server | This service is the Apache Impala database server. |
Impala | impala-state-store | This internal service is used by the Apache Impala database engine. |
Job Agent | caspida-jobagent | This service manages all jobs in a multi node environment. This service is not applicable for single-node environments. |
Job Manager | caspida-jobmanager | This service runs and manages jobs for Splunk UBA. |
Kafka | kafka-server | This service acts as the message bus for Splunk UBA. |
Kubelet | kubelet | This Kubernetes component makes sure that the containers are running in pods. |
Offline Rule Executor | caspida-offlineruleexec | This service runs custom threats and anomaly rules. |
Output Connector Server | caspida-outputconnector | This service is the outbound connection to external data sources such as Splunk Enterprise Security, Email, or ServiceNow. |
PostgreSQL | postgresql | This service stores the Splunk UBA system metadata. |
Realtime Rule Executor | caspida-realtimeruleexec | This service runs anomaly action rules on generated anomalies. |
Redis | redis-server | This service is the in-memory store that caches model metadata and system-wide configuration parameters. |
Spark | spark-history | This service monitors the Apache Spark history server. |
Spark | spark-master | This service monitors the Apache Spark master server. |
Spark | spark-server | This service submits Spark jobs to the Spark backend. |
Spark | spark-worker | This service monitors the Apache Spark workers running in the cluster. |
System Monitor | caspida-sysmon | This service monitors the Splunk UBA system. |
Splunk | splunkd | This service monitors the status of the Splunk forwarder when enabled. A Splunk forwarder is needed to send data from Splunk UBA to the Splunk UBA Monitoring App. |
Time Series DB | influxdb | This service stores time series data. |
UBA ETL Service | etl | This service parses events and runs identity resolution for devices and users from IR cache. It also runs all active decorators such as geolocation, threat intel, and entity validation. |
UBA Identity Resolver Service | identityresolver | This service processes events to build IR data. |
UBA Streaming Models devicetopic-modelgroupxx | devicetopic-modelgroupxx | This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons. |
UBA Streaming Models domaintopic-modelgroupxx | domaintopic-modelgroupxx | This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons. |
UBA Streaming Models eventtopic-modelgroupxx | eventtopic-modelgroupxx | This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons. |
UBA UI | caspida-ui | This service is the Splunk UBA web interface. |
Unusual Per Day Activity Time Model | uthourperusermodel-xx | This model detects unusual activity time during a day by a user based on their normal access profile. |
Zookeeper | zookeeper-server | This service synchronizes services and manages global configurations. |
View modules health
To view the health status of Splunk UBA modules, perform the following steps:
- Select System > Health Monitor.
- Click Modules.
Review the status of various modules that make up the Splunk UBA product. Determine what to do if you see error codes that appear on the Modules health dashboard.
Module name | Indicator | Description |
---|---|---|
Analytics Aggregator Service | Last Activity Check | This service checks for the last activity to determine if the service is working as expected. |
Analytics View Builder Service | Last Activity Check | This service checks for the last activity to determine if the service is working as expected. |
Analytics Writer Service | Time Lag on AnalyticsTopic | This service shows the time lag of the AnalyticsTopic. |
Analytics Writer Service | Time Lag on IRTopic | This service shows the time lag of the IRTopic. |
Analytics Writer Service | EPS on AnalyticsTopic | This service shows the number of events processed per second (EPS) by the service on AnalyticsTopic. |
Analytics Writer Service | EPS on IRTopic | This service shows the number of events processed per second by the service on IRTopic. |
Analytics Writer Service | Event Lag on AnalyticsTopic | This service shows the number of events waiting to be ingested on AnalyticsTopic. |
Analytics Writer Service | Event Lag on IRTopic | This service shows the number of events waiting to be ingested on IRTopic. |
Analytics Writer Service | Events dropped from Kafka on AnalyticsTopic | This service shows the percentage of events dropped from Kafka on AnalyticsTopic. |
Analytics Writer Service | Events dropped from Kafka on IRTopic | This service shows the percentage of events dropped from Kafka on IRTopic. |
Anomaly Aggregation Task | Time Lag on AnomalyTopic | This service shows the time lag of the AnomalyTopic. |
Anomaly Aggregation Task | EPS on AnomalyTopic | This service shows the number of events processed per second by the service on AnomalyTopic. |
Anomaly Aggregation Task | Event Lag on AnomalyTopic | This service shows the number of events waiting to be ingested on AnomalyTopic. |
Anomaly Aggregation Task | Events dropped from Kafka on AnomalyTopic | This service shows the percentage of events dropped from Kafka on AnomalyTopic. |
Data Source | Assets data retrieval time | This service shows the last time assets data was retrieved. |
Data Source | Data Source Processing | This service shows the EPS of data sources in the Processing state. |
Data Source | Events Count by Data Format | This service shows the number of events processed by data format. |
Data Source | HR data retrieval time | This service shows the last time HR data was retrieved. |
Data Source | Overall EPS of all Datasources | This service shows the aggregated EPS of all data sources in the Processing state. |
Data Source | Percentage of Events skipped | This service shows the percentage of events skipped. |
Data Source | Splunk Data Source Lag | This service monitors the data ingestion search lag for all Splunk data sources, including Kafka data ingestion. The lag is defined as the duration between the search submission time and the search's latest time. If the lag exceeds 3600 seconds, a warning message is displayed. |
Data Source | Splunk Data Source Search Status Check | This service monitors the health of data ingestion into Splunk UBA by tracking errors in the Splunk data source searches, including Kafka data ingestion. |
Kafka Broker | | Kafka Broker indicators are only available by adding |
Kafka Broker | All topics bytes in | The number of bytes per second being received by the Kafka broker. |
Kafka Broker | All topics bytes out | The number of bytes per second being read by the consumers. |
Kafka Broker | Request handler idle ratio | The percentage of time that the brokers' request handlers are idle. |
Kafka Broker | Request process time - 99th Percentile | The time in milliseconds that it takes for the brokers to fully process 99% of requests per request type. Click on the View <number> Values link to see more information. |
Kafka Broker | Request process time - Average | The average time in milliseconds that it takes for the brokers to fully process requests per request type. Click on the View <number> Values link to see more information. |
Kafka Broker | Topic bytes in | The rate in bytes per second of the message traffic each topic is receiving from producing clients. Click on the View <number> Values link to see more information. |
Kafka Broker | Topic bytes out | The rate in bytes per second of the message traffic consumed by clients of each topic. Click on the View <number> Values link to see more information. |
Kafka Broker | Topic partition count | The number of partitions per topic. Click on the View <number> Values link to see more information. |
Kafka Broker | Topic size on disk | The size each topic occupies on disk. Click on the View <number> Values link to see more information. |
Kafka Broker | Total size on disk | The total size of all the Kafka topics. |
Model Store | | Model Store indicators are only available by adding |
Model Store | Average Deserialization Delay | Average deserialization delay for each model. Click on the View <number> Values link to see more information. |
Model Store | Average Size of Stored Models | Average size of each stored model. Click on the View <number> Values link to see more information. |
Model Store | Maximum Size of Stored Models | Maximum size for each model. Click on the View <number> Values link to see more information. |
Model Store | Number of Models Stored | Number of models stored. Click on the View <number> Values link to see more information. |
Offline Models | | All Offline Models indicators except for Last Execution Time per Model are only available by adding |
Offline Models | Completed Stages | Number of completed stages for each offline model in the latest execution. Click on the View <number> Values link to see more information. |
Offline Models | Completed Tasks | Number of completed tasks for each offline model in the latest execution. Click on the View <number> Values link to see more information. |
Offline Models | Disk Bytes Spilled | Number of disk bytes spilled by each offline model in the latest execution. Data that does not fit in the memory is "spilled" to the disk. Click on the View <number> Values link to see more information. |
Offline Models | Execution Duration | Amount of time it took for each offline model to run. Click on the View <number> Values link to see more information. |
Offline Models | Failed Stages | Number of failed stages for each offline model in the latest execution. Click on the View <number> Values link to see more information. |
Offline Models | Failed Tasks | Number of failed tasks for each offline model in the latest execution. Click on the View <number> Values link to see more information. |
Offline Models | Last Execution Time Per Model | The last time each offline model was run. Click on the View <number> Values link to see more information. |
Offline Models | Longest Stage Duration | The longest stage duration for each offline model during its last execution. Click on the View <number> Values link to see more information. |
Offline Models | Shuffle Read Bytes | Number of shuffle read bytes for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Shuffle Read Records | Number of shuffle read records for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Shuffle Write Bytes | Number of shuffle write bytes for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Shuffle Write Records | Number of shuffle write records for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Skipped Stages | Number of stages skipped for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Skipped Tasks | Number of tasks skipped for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Total Jobs | Total number of jobs for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Total Stages | Total number of stages for each offline model. Click on the View <number> Values link to see more information. |
Offline Models | Total Tasks | Total number of tasks for each offline model. Click on the View <number> Values link to see more information. |
Offline Rule Executor | Threat Revalidation | This service checks whether the average time to revalidate threats since the last system restart is within the normal range. |
Output Connector Server | Anomalies dropped from Kafka | This service shows the percentage of anomalies dropped from Kafka. |
Output Connector Server | Anomalies Time Lag | This service shows the time lag of the Output Connector Server on the anomaly input queue. |
Output Connector Server | Audit and control events dropped from Kafka | This service shows the percentage of audit and control events dropped from Kafka. |
Output Connector Server | Email Failure Percentage | This service shows the percentage of email attempts that failed. This indicator is only available by adding |
Output Connector Server | Events Time Lag | This service shows the time lag of the Output Connector Server on the events input queue. |
Output Connector Server | Sending Threats to Enterprise Security is halted | This service monitors whether or not Splunk UBA is able to send threats to Splunk ES. |
PostgreSQL | Number of Suppressed Anomalies | This service shows the total number of anomalies in the system which have been suppressed either manually or by anomaly action rules. |
Realtime Rule Executor | Anomalies dropped from Kafka | This service shows the percentage of anomalies that were dropped from Kafka. |
Realtime Rule Executor | Time Lag | This service shows the time lag of the anomalies being processed by Kafka. |
Realtime Rule Executor | Rate of Anomaly Generation | This service checks whether the average number of anomalies processed per second over the last 10 minutes is within the normal range. This indicator is only available by adding |
Threat Computation Task | Graph-based Threat Computation | This service shows OK if graph-based threat computation is running in a timely manner. |
Threat Computation Task | Threat Computation | This service shows OK if overall threat computation is completing successfully. |
Threat Computation Task | Threat Computation Duration | This service shows OK if overall threat computation is completing in the expected amount of time. |
UBA ETL Service | Time Lag on RawDataTopic | This service shows the time lag of the RawDataTopic. |
UBA ETL Service | EPS on RawDataTopic | This service shows the EPS by the service on RawDataTopic. |
UBA ETL Service | Event Lag on RawDataTopic | This service shows the number of events waiting to be ingested on RawDataTopic. |
UBA ETL Service | Events dropped from Kafka on RawDataTopic | This service shows the percentage of events dropped from Kafka on RawDataTopic. |
UBA ETL Service | Latest Event Time on RawData Topic | This service monitors the time of the last event processed on the RawData topic. This indicator is only available by adding |
UBA ETL Service | Time Difference on RawData Topic | This service monitors the maximum time difference among the latest processed events of each of the Splunk UBA raw ETL parsers. This indicator is only available by adding |
UBA Identity Resolver Service | Time Lag on IRTopic | This service shows the time lag of the IRTopic. |
UBA Identity Resolver Service | Time Lag on PreIREventTopic | This service shows the time lag of the PreIREventTopic. |
UBA Identity Resolver Service | EPS on IRTopic | This service shows the EPS of the service on IRTopic. |
UBA Identity Resolver Service | EPS on PreIREventTopic | This service shows the EPS of the service on PreIREventTopic. |
UBA Identity Resolver Service | Event Lag on PreIREventTopic | This service shows the number of events waiting to be ingested on PreIREventTopic. |
UBA Identity Resolver Service | Events dropped from Kafka on IRTopic | This service shows the percentage of events dropped from Kafka on IRTopic. |
UBA Identity Resolver Service | Events dropped from Kafka on PreIREventTopic | This service shows the percentage of events dropped from Kafka on PreIREventTopic. |
UBA Pipeline | NewAnomalyTopic | This service shows the status of the NewAnomalyTopic. |
UBA Streaming Models devicetopic-modelgroupnn | Time Lag on DeviceTopic | This service shows the time lag of the DeviceTopic. |
UBA Streaming Models devicetopic-modelgroupnn | EPS on DeviceTopic | This service shows the EPS of the service on DeviceTopic. |
UBA Streaming Models devicetopic-modelgroupnn | Event Lag on DeviceTopic | This service shows the number of events waiting to be ingested on DeviceTopic. |
UBA Streaming Models devicetopic-modelgroupnn | Events dropped from Kafka on DeviceTopic | This service shows the percentage of events dropped from Kafka on DeviceTopic. |
UBA Streaming Models domaintopic-modelgroupnn | Time Lag on DomainTopic | This service shows the time lag of the DomainTopic. |
UBA Streaming Models domaintopic-modelgroupnn | EPS on DomainTopic | This service shows the EPS of the service on DomainTopic. |
UBA Streaming Models domaintopic-modelgroupnn | Event Lag on DomainTopic | This service shows the number of events waiting to be ingested on DomainTopic. |
UBA Streaming Models domaintopic-modelgroupnn | Events dropped from Kafka on DomainTopic | This service shows the percentage of events dropped from Kafka on DomainTopic. |
UBA Streaming Models eventtopic-modelgroupnn | Time Lag on EventTopic | This service shows the time lag of the EventTopic. |
UBA Streaming Models eventtopic-modelgroupnn | EPS on EventTopic | This service shows the EPS of the service on EventTopic. |
UBA Streaming Models eventtopic-modelgroupnn | Event Lag on EventTopic | This service shows the number of events waiting to be ingested on EventTopic. |
UBA Streaming Models eventtopic-modelgroupnn | Events dropped from Kafka on EventTopic | This service shows the percentage of events dropped from Kafka on EventTopic. |
Unusual Per Day Activity Time Model | Time Lag on EventTopic | This service shows the time lag of the EventTopic. |
Unusual Per Day Activity Time Model | EPS on EventTopic | This service shows the EPS of the service on EventTopic. |
Unusual Per Day Activity Time Model | Event Lag on EventTopic | This service shows the number of events waiting to be ingested on EventTopic. |
Unusual Per Day Activity Time Model | Events dropped from Kafka on EventTopic | This service shows the percentage of events dropped from Kafka on EventTopic. |
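The Splunk Data Source Lag indicator in the table above is defined as the duration between the search submission time and the search's latest time, with a warning beyond 3600 seconds. The following Python sketch restates that arithmetic. The function names are illustrative only and do not correspond to any Splunk UBA API.

```python
from datetime import datetime, timedelta, timezone

LAG_WARNING_SECONDS = 3600  # warning threshold stated in the table above


def search_lag_seconds(submitted_at: datetime, latest_time: datetime) -> float:
    """Return the lag between search submission and the search's latest time."""
    return (submitted_at - latest_time).total_seconds()


def lag_needs_warning(submitted_at: datetime, latest_time: datetime) -> bool:
    """Return True when the lag is beyond the warning threshold."""
    return search_lag_seconds(submitted_at, latest_time) > LAG_WARNING_SECONDS


# A search submitted at 12:00 UTC whose latest time is 10:30 UTC lags by 90 minutes.
submitted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
latest = submitted - timedelta(minutes=90)
print(search_lag_seconds(submitted, latest), lag_needs_warning(submitted, latest))
```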
View data quality metrics
To view data quality information, perform the following steps:
- Select System > Health Monitor.
- Click Data Quality.
Review metrics about the quality of data in your system. If system issues affect the quality of data, errors appear on this page.
Module | Indicator | Description |
---|---|---|
Data Source | Data Source EPS on Splunk | This service shows the average number of events processed per second by each data source on Splunk in the last hour. |
Data Source | Percentage of Events dropped by EventFilters | This service shows the percentage of events dropped by EventFilters on the UI. |
Data Source | Percentage of Events with no entity | This service shows the percentage of events that have no entity. |
Data Source | Percentage of Events with no Relevant Data | This service shows the percentage of events that have no relevant data. |
Data Source | Splunk Direct Data Source Enum Check | This service monitors the Splunk Direct input enum field data quality and tracks the mismatch rate (percentage) in each data source. |
Offline Rule Executor | Average Execution Time Per Rule | This service shows the average execution time of each custom threat, anomaly action rule, or anomaly rule. |
Offline Rule Executor | Last Execution End Time per Rule | This service shows the last time each custom threat, anomaly action rule, or anomaly rule finished running. |
Offline Rule Executor | Last Execution Failure per Rule | This service shows the last time at which a custom threat, anomaly action rule, or anomaly rule failed to run. |
Offline Rule Executor | Last Execution Start Time per Rule | This service shows the start time of the most recent run of each custom threat, anomaly action rule, or anomaly rule. |
Offline Rule Executor | Number of Execution Failures per Rule | This service shows the number of consecutive times each rule failed to run, or 0 if no failures have occurred. |
Offline Rule Executor | Number of Executions per Rule | This service shows the total number of times that each custom threat, anomaly action rule, or anomaly rule has attempted to run. Both successful and failed attempts are counted. |
Output Connector Server | Number of Threats Sent to Output Connector | This service shows the total number of threats sent to the output connector for forwarding to Splunk Enterprise Security or other external destinations in the time since Splunk UBA was last restarted. |
Output Connector Server | Total New Anomalies | This service shows the number of new anomalies received by the output connector server in this session. You can compare this number to the number of anomalies in the receiving system, such as Splunk Enterprise Security, to determine if all anomalies are successfully being processed by the system to which Splunk UBA is sending anomalies. |
PostgreSQL | Number of Inactive Anomalies | This service shows the total number of anomalies currently processing in the system. Contact Splunk Support if the number is consistently above several thousand or the number continues to increase. |
PostgreSQL | Number of Suppressed Anomalies | This service shows the total number of anomalies in the system that have been suppressed manually or by anomaly action rules. |
Realtime Rule Executor | Average New Anomalies Completed | This service shows the average number of anomalies processed per second since the last restart of the Realtime Rule Executor. |
Realtime Rule Executor | Dropped New Anomalies | This service shows the total number of dropped anomalies that were not duplicates since the last restart of the Realtime Rule Executor. |
Realtime Rule Executor | Duplicate New Anomalies | This service shows the total number of duplicate anomalies since the last restart of the Realtime Rule Executor. |
Realtime Rule Executor | New Anomalies Received | This service shows the total number of new anomalies created by Splunk UBA since the last restart of the Realtime Rule Executor. |
Realtime Rule Executor | New Anomalies Completed | This service shows the total number of new anomalies processed since the last restart of the Realtime Rule Executor. |
Realtime Rule Executor | New Anomalies in Process | This service shows the number of new anomalies currently being processed by Splunk UBA. |
Realtime Rule Executor | Number of Active Anomalies | This service shows the total number of anomalies activated by the Realtime Rule Executor since its last restart. Activated anomalies are anomalies that were not suppressed or permanently deleted by anomaly action rules. |
Threat Computation Task | Threat Computation Start Time | This service shows the last time the threat computation task was started. |
Monitor system health with the health check script
The health check script captures the state of a running system and highlights areas of concern such as event processing lags, system slowness, and errors in services like Apache Kafka.
You can schedule the script to run regularly as a cron job and email the output as an attachment. See Configure email alerts to your Splunk UBA deployment administrators.
The uba_health_check.sh script is stored in the /opt/caspida/bin/utils directory of Splunk UBA. Log in as the caspida user on the management server using SSH to run the script.
Output from the script is saved in a plain text file in the /var/log/caspida/check/ directory with a file name that includes the host name of the server and the time stamp. You can also download the script output from Splunk UBA. Select the Scripts module. See Collect diagnostic data from your Splunk UBA deployment.
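Because each run writes a new file, it can be handy to locate the most recent output programmatically. The following Python sketch picks the newest file in a directory by modification time. The helper latest_output_file is hypothetical, and because the exact file naming pattern is not specified here, the sketch does not parse file names.

```python
from pathlib import Path
from typing import Optional


def latest_output_file(directory: str) -> Optional[Path]:
    """Return the most recently modified regular file in directory, if any."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if not files:
        return None
    # Use the modification time rather than guessing at the name format.
    return max(files, key=lambda p: p.stat().st_mtime)
```

For example, latest_output_file("/var/log/caspida/check/") returns the newest health check output on a server where the script has run, or None if the directory is empty.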
Monitor server health with SNMP
You can use an SNMP monitoring tool to track statistics related to CPU usage, memory, and disk utilization on any server that has Splunk UBA installed.
This documentation applies to the following versions of Splunk® User Behavior Analytics: 5.0.3, 5.0.4, 5.0.4.1, 5.0.5, 5.0.5.1, 5.1.0, 5.1.0.1, 5.2.0, 5.2.1