Monitor the health of your Splunk UBA deployment

The Health Monitor dashboard helps you assess the health of your Splunk UBA deployment and verify the quality of data added to Splunk UBA. You can open the Health Monitor by selecting System > Health Monitor.

You can also use the Health Monitor to review system errors. System errors appear as messages in the menu.

The bell icon appears when there are messages. Click the bell icon to view messages.
Click a message or error to open the Health Monitor dashboard.

You can have system errors emailed to you. See Monitor system health with the health check script.

Splunk UBA maintains historical information about the health of your Splunk UBA deployment. You can download the information when you collect diagnostic data from the System Monitor and System Resources Monitor modules for your Splunk UBA deployment. See Collect diagnostic data from your Splunk UBA deployment.

Enable test mode for specific health indicators

You can enable test mode for health indicators that are not helpful or that are not producing relevant system monitoring information. The test mode status replaces the OK, BAD, or WARN status for health status indicators.

Log in to the Splunk UBA management server as the caspida user using SSH.
Open the /etc/caspida/local/conf/uba-site.properties file in an editor.
Change the ubaMonitor.<module_id>.<indicator_id>.mutable parameter from true to false to enable test mode. For example:
ubaMonitor.pipeline.reToOCSNewAnomalyLag.mutable=false
Save your file.
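
The edit in the steps above can also be scripted. The following is a minimal sketch, assuming the file uses plain key=value lines with # comments. The set_property helper and the sample content are illustrative and are not part of Splunk UBA. The sketch works on text in memory, so you can review the result before writing it back to uba-site.properties.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key set to value, appending it if absent."""
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            continue
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Example that uses the indicator named in the steps above.
KEY = "ubaMonitor.pipeline.reToOCSNewAnomalyLag.mutable"
sample = "# local overrides (hypothetical content)\n" + KEY + "=true\n"
updated = set_property(sample, KEY, "false")
print(updated)
```

If the key is not yet present in the text, the helper appends it instead of failing.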

View system health

To view system health information, perform the following steps:

Select System > Health Monitor.
Click System, if it is not already selected.

This page displays the IP addresses, host names, and deployment types of the server nodes in your environment. This view is mostly informational and displays errors when CPU usage and disk usage are higher than 90%. Click a row in the table to learn more about a specific server.
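
To illustrate the 90% rule described above, the following sketch flags nodes whose CPU or disk usage is higher than the threshold. It is not how Splunk UBA implements the check. The node records, host names, and field names are hypothetical sample data.

```python
THRESHOLD_PERCENT = 90.0  # The page shows errors above this usage level.


def nodes_over_threshold(nodes, threshold=THRESHOLD_PERCENT):
    """Return (host, [metrics]) pairs for nodes with usage above the threshold."""
    flagged = []
    for node in nodes:
        reasons = [
            metric
            for metric in ("cpu_percent", "disk_percent")
            if node[metric] > threshold
        ]
        if reasons:
            flagged.append((node["host"], reasons))
    return flagged


# Hypothetical sample data for three server nodes.
sample_nodes = [
    {"host": "uba-node-1", "cpu_percent": 42.0, "disk_percent": 95.5},
    {"host": "uba-node-2", "cpu_percent": 90.0, "disk_percent": 61.0},
    {"host": "uba-node-3", "cpu_percent": 97.2, "disk_percent": 91.0},
]
flagged = nodes_over_threshold(sample_nodes)
print(flagged)
```

A value of exactly 90% is not flagged in this sketch, because the text says "higher than 90%".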

View services health

To view services health information, select System > Cluster.

Splunk UBA relies on several services and processes to create anomalies and threats, process events, identify user and device associations, and more. Monitor the health of these services and processes on this page.

In a distributed system, the host IP shows for each service so that you can find errors related to a specific host.

Service name	Service process	Description
Analytics Aggregator Service	analyticsaggregator	This service aggregates and compresses data written by Analytics Writer service.
Analytics View Builder Service	analyticsviewbuilder	This service periodically updates materialized views in the analytics database. Splunk UBA uses these views to render dashboards.
Analytics Writer Service	analyticswriter	This service writes new aggregated events and models output to the analytics database.
Anomaly Aggregation Task	anomalyaggregationmodel	This model pre-processes new anomalies by enhancing the entities in the anomalies, then storing the anomalies in the database.
Docker	docker	This service builds the containerized services, such as the streaming models, data ingestion, and identity resolution.
Hadoop	hadoop-hdfs-namenode	This service keeps track of where data is stored in the Hadoop Distributed File System.
	hadoop-hdfs-datanode	This service stores data in the Hadoop Distributed File System.
	hadoop-hdfs-secondarynamenode	This dedicated node in the HDFS cluster takes checkpoints of the file system metadata present on the hadoop-hdfs-namenode. It is not a backup namenode.
Hive Metastore	hive-metastore	This service stores the metadata for Hive/Impala tables and partitions in a relational database, and provides clients access to this information using the metastore service API.
Impala	impala-catalog	This internal service is used by the Apache Impala database engine.
	impala-server	This service is the Apache Impala database server.
	impala-state-store	This internal service is used by the Apache Impala database engine.
Job Agent	caspida-jobagent	This service manages all jobs in a multi-node environment. This service is not applicable for single-node environments.
Job Manager	caspida-jobmanager	This service runs and manages jobs for Splunk UBA.
Kafka	kafka-server	This service acts as the message bus for Splunk UBA.
Kubelet	kubelet	This Kubernetes component makes sure that the containers are running in pods.
Offline Rule Executor	caspida-offlineruleexec	This service runs custom threats and anomaly rules.
Output Connector Server	caspida-outputconnector	This service is the outbound connection to external data sources such as Splunk Enterprise Security, Email, or ServiceNow.
PostgreSQL	postgresql	This service stores the Splunk UBA system metadata.
Realtime Rule Executor	caspida-realtimeruleexec	This service runs anomaly action rules on generated anomalies.
Redis	redis-server	This service is the in-memory store that caches model metadata and system-wide configuration parameters.
Spark	spark-history	This service monitors the Apache Spark history server.
	spark-master	This service monitors the Apache Spark master server.
	spark-server	This service submits Spark jobs to the Spark backend.
	spark-worker	This service monitors the Apache Spark workers running in the cluster.
System Monitor	caspida-sysmon	This service monitors the Splunk UBA system.
Splunk	splunkd	This service monitors the status of the Splunk forwarder when enabled. A Splunk forwarder is needed to send data from Splunk UBA to the Splunk UBA Monitoring App.
Time Series DB	influxdb	This service stores time series data.
UBA ETL Service	etl	This service parses events and runs identity resolution for devices and users from IR cache. It also runs all active decorators such as geolocation, threat intel, and entity validation.
UBA Identity Resolver Service	identityresolver	This service processes events to build IR data.
UBA Streaming Models devicetopic-modelgroupxx	devicetopic-modelgroupxx	This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons.
UBA Streaming Models domaintopic-modelgroupxx	domaintopic-modelgroupxx	This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons.
UBA Streaming Models eventtopic-modelgroupxx	eventtopic-modelgroupxx	This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons.
UBA UI	caspida-ui	This service is the Splunk UBA web interface.
Unusual Per Day Activity Time Model	uthourperusermodel-xx	This model detects unusual activity time during a day by a user, based on that user's normal access profile.
Zookeeper	zookeeper-server	This service synchronizes services and manages global configurations.

View modules health

To view the health status of Splunk UBA modules, perform the following steps:

Select System > Health Monitor.
Click Modules.

Review the status of various modules that make up the Splunk UBA product. Determine what to do if you see error codes that appear on the Modules health dashboard.

Module name	Indicator	Description
Analytics Aggregator Service	Last Activity Check	This service checks for the last activity to determine if the service is working as expected.
Analytics View Builder Service	Last Activity Check	This service checks for the last activity to determine if the service is working as expected.
Analytics Writer Service	Time Lag on AnalyticsTopic	This service shows the time lag of the AnalyticsTopic.
	Time Lag on IRTopic	This service shows the time lag of the IRTopic.
	EPS on AnalyticsTopic	This service shows the number of events processed per second (EPS) by the service on AnalyticsTopic.
	EPS on IRTopic	This service shows the number of events processed per second by the service on IRTopic.
	Event Lag on AnalyticsTopic	This service shows the number of events waiting to be ingested on AnalyticsTopic.
	Event Lag on IRTopic	This service shows the number of events waiting to be ingested on IRTopic.
	Events dropped from Kafka on AnalyticsTopic	This service shows the percentage of events dropped from Kafka on AnalyticsTopic.
	Events dropped from Kafka on IRTopic	This service shows the percentage of events dropped from Kafka on IRTopic.
Anomaly Aggregation Task	Time Lag on AnomalyTopic	This service shows the time lag of the AnomalyTopic.
	EPS on AnomalyTopic	This service shows the number of events processed per second by the service on AnomalyTopic.
	Event Lag on AnomalyTopic	This service shows the number of events waiting to be ingested on AnomalyTopic.
	Events dropped from Kafka on AnomalyTopic	This service shows the percentage of events dropped from Kafka on AnomalyTopic.
Data Source	Assets data retrieval time	This service shows the last time assets data was retrieved.
	Data Source Processing	This service shows the EPS of data sources in the Processing state.
	Events Count by Data Format	This service shows the number of events processed by data format.
	HR data retrieval time	This service shows the last time HR data was retrieved.
	Overall EPS of all Datasources	This service shows the aggregated EPS of all data sources in the Processing state.
	Percentage of Events skipped	This service shows the percentage of events skipped.
	Splunk Data Source Lag	This service monitors the data ingestion search lag for all Splunk data sources, including Kafka data ingestion. The lag is defined as the duration between the search submission time and the search's latest time. If the lag exceeds 3600 seconds, a warning message is displayed.
	Splunk Data Source Search Status Check	This service monitors the health of data ingestion into Splunk UBA by tracking errors in the Splunk data source searches, including Kafka data ingestion.
Kafka Broker	All topics bytes in	The number of bytes per second being received by the Kafka broker.
	All topics bytes out	The number of bytes per second being read by the consumers.
	Request handler idle ratio	The percentage of time that the brokers' request handlers are idle.
	Request process time - 99th Percentile	The time in milliseconds that it takes for the brokers to fully process 99% of requests per request type. Click on the View <number> Values link to see more information.
	Request process time - Average	The average time in milliseconds that it takes for the brokers to fully process requests per request type. Click on the View <number> Values link to see more information.
	Topic bytes in	The rate in bytes per second of the message traffic each topic is receiving from producing clients. Click on the View <number> Values link to see more information.
	Topic bytes out	The rate in bytes per second of the message traffic consumed by clients of each topic. Click on the View <number> Values link to see more information.
	Topic partition count	The number of partitions per topic. Click on the View <number> Values link to see more information.
	Topic size on disk	The size each topic occupies on disk. Click on the View <number> Values link to see more information.
	Total size on disk	The total size of all the Kafka topics.
Model Store	Average Deserialization Delay	Average deserialization delay for each model. Click on the View <number> Values link to see more information.
	Average Size of Stored Models	Average size of each stored model. Click on the View <number> Values link to see more information.
	Maximum Size of Stored Models	Maximum size for each model. Click on the View <number> Values link to see more information.
	Number of Models Stored	Number of models stored. Click on the View <number> Values link to see more information.
Offline Models	Completed Stages	Number of completed stages for each offline model in the latest execution. Click on the View <number> Values link to see more information.
	Completed Tasks	Number of completed tasks for each offline model in the latest execution. Click on the View <number> Values link to see more information.
	Disk Bytes Spilled	Number of disk bytes spilled by each offline model in the latest execution. Data that does not fit in the memory is "spilled" to the disk. Click on the View <number> Values link to see more information.
	Execution Duration	Amount of time it took for each offline model to run. Click on the View <number> Values link to see more information.
	Failed Stages	Number of failed stages for each offline model in the latest execution. Click on the View <number> Values link to see more information.
	Failed Tasks	Number of failed tasks for each offline model in the latest execution. Click on the View <number> Values link to see more information.
	Last Execution Time Per Model	The last time each offline model was run. Click on the View <number> Values link to see more information.
	Longest Stage Duration	The longest stage duration for each offline model during its last execution. Click on the View <number> Values link to see more information.
	Shuffle Read Bytes	Number of shuffle read bytes for each offline model. Click on the View <number> Values link to see more information.
	Shuffle Read Records	Number of shuffle read records for each offline model. Click on the View <number> Values link to see more information.
	Shuffle Write Bytes	Number of shuffle write bytes for each offline model. Click on the View <number> Values link to see more information.
	Shuffle Write Records	Number of shuffle write records for each offline model. Click on the View <number> Values link to see more information.
	Skipped Stages	Number of stages skipped for each offline model. Click on the View <number> Values link to see more information.
	Skipped Tasks	Number of tasks skipped for each offline model. Click on the View <number> Values link to see more information.
	Total Jobs	Total number of jobs for each offline model. Click on the View <number> Values link to see more information.
	Total Stages	Total number of stages for each offline model. Click on the View <number> Values link to see more information.
	Total Tasks	Total number of tasks for each offline model. Click on the View <number> Values link to see more information.
Offline Rule Executor	Threat Revalidation	This service checks whether the average time to revalidate threats since the last system restart is within the normal range.
Output Connector Server	Anomalies dropped from Kafka	This service shows the percentage of anomalies dropped from Kafka.
	Anomalies Time Lag	This service shows the time lag of the Output Connector Server on the anomaly input queue.
	Audit and control events dropped from Kafka	This service shows the percentage of audit and control events dropped from Kafka.
	Email Failure Percentage	This service shows the percentage of email attempts that failed.
	Events Time Lag	This service shows the time lag of the Output Connector Server on the events input queue.
	Sending Threats to SplunkES is halted	This service monitors whether or not Splunk UBA is able to send threats to Splunk ES.
PostgreSQL	Number of Suppressed Anomalies	This service shows the total number of anomalies in the system which have been suppressed either manually or by anomaly action rules.
Realtime Rule Executor	Anomalies dropped from Kafka	This service shows the percentage of anomalies that were dropped from Kafka.
Realtime Rule Executor	Time Lag	This service shows the time lag of the anomalies being processed by Kafka.
Threat Computation Task	Graph-based Threat Computation	This service shows OK if graph-based threat computation is running in a timely manner.
	Threat Computation	This service shows OK if overall threat computation is completing successfully.
	Threat Computation Duration	This service shows OK if overall threat computation is completing in the expected amount of time.
UBA ETL Service	Time Lag on RawDataTopic	This service shows the time lag of the RawDataTopic.
	EPS on RawDataTopic	This service shows the EPS of the service on RawDataTopic.
	Event Lag on RawDataTopic	This service shows the number of events waiting to be ingested on RawDataTopic.
	Events dropped from Kafka on RawDataTopic	This service shows the percentage of events dropped from Kafka on RawDataTopic.
UBA Identity Resolver Service	Time Lag on IRTopic	This service shows the time lag of the IRTopic.
	Time Lag on PreIREventTopic	This service shows the time lag of the PreIREventTopic.
	EPS on IRTopic	This service shows the EPS of the service on IRTopic.
	EPS on PreIREventTopic	This service shows the EPS of the service on PreIREventTopic.
	Event Lag on PreIREventTopic	This service shows the number of events waiting to be ingested on PreIREventTopic.
	Events dropped from Kafka on IRTopic	This service shows the percentage of events dropped from Kafka on IRTopic.
	Events dropped from Kafka on PreIREventTopic	This service shows the percentage of events dropped from Kafka on PreIREventTopic.
UBA Pipeline	NewAnomalyTopic	This service shows the status of the NewAnomalyTopic.
UBA Streaming Models devicetopic-modelgroupnn	Time Lag on DeviceTopic	This service shows the time lag of the DeviceTopic.
	EPS on DeviceTopic	This service shows the EPS of the service on DeviceTopic.
	Event Lag on DeviceTopic	This service shows the number of events waiting to be ingested on DeviceTopic.
	Events dropped from Kafka on DeviceTopic	This service shows the percentage of events dropped from Kafka on DeviceTopic.
UBA Streaming Models domaintopic-modelgroupnn	Time Lag on DomainTopic	This service shows the time lag of the DomainTopic.
	EPS on DomainTopic	This service shows the EPS of the service on DomainTopic.
	Event Lag on DomainTopic	This service shows the number of events waiting to be ingested on DomainTopic.
	Events dropped from Kafka on DomainTopic	This service shows the percentage of events dropped from Kafka on DomainTopic.
UBA Streaming Models eventtopic-modelgroupnn	Time Lag on EventTopic	This service shows the time lag of the EventTopic.
	EPS on EventTopic	This service shows the EPS of the service on EventTopic.
	Event Lag on EventTopic	This service shows the number of events waiting to be ingested on EventTopic.
	Events dropped from Kafka on EventTopic	This service shows the percentage of events dropped from Kafka on EventTopic.
Unusual Per Day Activity Time Model	Time Lag on EventTopic	This service shows the time lag of the EventTopic.
	EPS on EventTopic	This service shows the EPS of the service on EventTopic.
	Event Lag on EventTopic	This service shows the number of events waiting to be ingested on EventTopic.
	Events dropped from Kafka on EventTopic	This service shows the percentage of events dropped from Kafka on EventTopic.
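
The Splunk Data Source Lag indicator in the table above defines lag as the duration between the search submission time and the search's latest time, with a warning beyond 3600 seconds. The following sketch works through that arithmetic. The function names and sample timestamps are hypothetical, and labeling the result WARN or OK is an assumption made for illustration.

```python
from datetime import datetime, timezone

WARN_LAG_SECONDS = 3600  # Threshold stated in the indicator description.


def search_lag_seconds(submitted_at: datetime, latest_time: datetime) -> float:
    """Lag is the search submission time minus the search's latest time."""
    return (submitted_at - latest_time).total_seconds()


def lag_status(submitted_at, latest_time, warn_after=WARN_LAG_SECONDS) -> str:
    """Return "WARN" when the lag is beyond the threshold, otherwise "OK"."""
    lag = search_lag_seconds(submitted_at, latest_time)
    return "WARN" if lag > warn_after else "OK"


# Hypothetical timestamps: a search submitted at 12:00 UTC.
submitted = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
latest_stale = datetime(2024, 1, 1, 10, 30, 0, tzinfo=timezone.utc)
latest_fresh = datetime(2024, 1, 1, 11, 30, 0, tzinfo=timezone.utc)

print(search_lag_seconds(submitted, latest_stale), lag_status(submitted, latest_stale))
print(search_lag_seconds(submitted, latest_fresh), lag_status(submitted, latest_fresh))
```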

View data quality metrics

To view data quality information, perform the following steps:

Select System > Health Monitor.
Click Data Quality.

Review metrics about the quality of data in your system. If system issues affect the quality of data, errors appear on this page.

Module	Indicator	Description
Data Source	Data Source EPS on Splunk	This service shows the average number of events processed per second by each data source on Splunk in the last hour.
	Percentage of Events dropped by EventFilters	This service shows the percentage of events dropped by EventFilters on the UI.
	Percentage of Events with no entity	This service shows the percentage of events that have no entity.
	Percentage of Events with no Relevant Data	This service shows the percentage of events that have no relevant data.
	Splunk Direct Data Source Enum Check	This service monitors the Splunk Direct input enum field data quality and tracks the mismatch rate (percentage) in each data source.
Offline Rule Executor	Average Execution Time Per Rule	This service shows the average execution time of each custom threat, anomaly action rule, or anomaly rule.
	Last Execution End Time per Rule	This service shows the last time each custom threat, anomaly action rule, or anomaly rule finished running.
	Last Execution Failure per Rule	This service shows the last time at which a custom threat, anomaly action rule, or anomaly rule failed to run.
	Last Execution Start Time per Rule	This service shows the start time of the most recent run of each custom threat, anomaly action rule, or anomaly rule.
	Number of Execution Failures per Rule	This service shows the number of consecutive times each rule failed to run, or 0 if no failures have occurred.
	Number of Executions per Rule	This service shows the total number of times that each custom threat, anomaly action rule, or anomaly rule has attempted to run. Both successful and failed attempts are counted.
Output Connector Server	Number of Threats Sent to Output Connector	This service shows the total number of threats sent to the output connector for forwarding to Splunk Enterprise Security or other external destinations in the time since Splunk UBA was last restarted.
Output Connector Server	Total New Anomalies	This service shows the number of new anomalies received by the output connector server in this session. You can compare this number to the number of anomalies in the receiving system, such as Splunk Enterprise Security, to determine if all anomalies are successfully being processed by the system to which Splunk UBA is sending anomalies.
PostgreSQL	Number of Inactive Anomalies	This service shows the total number of anomalies currently processing in the system. Contact Splunk Support if the number is consistently above several thousand or the number continues to increase.
PostgreSQL	Number of Suppressed Anomalies	This service shows the total number of anomalies in the system that have been suppressed manually or by anomaly action rules.
Realtime Rule Executor	Average New Anomalies Completed	This service shows the average number of anomalies processed per second since the last restart of the Realtime Rule Executor.
	Dropped New Anomalies	This service shows the total number of dropped anomalies that were not duplicates since the last restart of the Realtime Rule Executor.
	Duplicate New Anomalies	This service shows the total number of duplicate anomalies since the last restart of the Realtime Rule Executor.
	New Anomalies Received	This service shows the total number of new anomalies created by Splunk UBA since the last restart of the Realtime Rule Executor.
	New Anomalies Completed	This service shows the total number of new anomalies processed since the last restart of the Realtime Rule Executor.
	New Anomalies in Process	This service shows the number of new anomalies currently being processed by Splunk UBA.
	Number of Active Anomalies	This service shows the total number of anomalies activated by the Realtime Rule Executor since the last restart of the Realtime Rule Executor. Activated anomalies are anomalies that were not suppressed or permanently deleted by anomaly action rules.
Threat Computation Task	Threat Computation Start Time	This service shows the last time the threat computation task was started.
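
The Number of Execution Failures per Rule indicator in the table above counts consecutive failures. The following sketch shows one reading of that definition, assuming the count covers the most recent unbroken run of failures and resets to 0 after a successful run. The function name and the run histories are hypothetical.

```python
def consecutive_failures(outcomes):
    """Count the failures at the end of a run history.

    outcomes is a list of booleans, oldest first, where True means the
    rule ran successfully and False means it failed to run.
    """
    count = 0
    for succeeded in reversed(outcomes):
        if succeeded:
            break
        count += 1
    return count


# Hypothetical run histories for three rules.
print(consecutive_failures([True, False, False]))
print(consecutive_failures([False, True]))
print(consecutive_failures([]))
```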

Monitor system health with the health check script

The health check script captures the state of a running system and highlights areas of concern such as event processing lags, system slowness, and errors in services like Apache Kafka.

You can schedule the script to run regularly as a cron job and email the output as an attachment. See Configure email alerts to your Splunk UBA deployment administrators.

The uba_health_check.sh script is stored in the /opt/caspida/bin/utils directory of Splunk UBA. Log in as the caspida user on the management server using SSH to run the script.

Output from the script is saved in a plain text file in the /var/log/caspida/check/ directory with a file name that includes the host name of the server and the time stamp. You can also download the script output from Splunk UBA. Select the Scripts module. See Collect diagnostic data from your Splunk UBA deployment.
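
Because each run writes a new file to /var/log/caspida/check/, it can help to locate the most recent output. The following is a minimal sketch that picks the newest file by modification time. It does not parse the file name, because the exact naming pattern is not described here. The latest_output_file helper is hypothetical and is not shipped with Splunk UBA.

```python
from pathlib import Path

CHECK_DIR = Path("/var/log/caspida/check")  # Output directory named above.


def latest_output_file(directory):
    """Return the most recently modified file in directory, or None if empty."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    try:
        newest = latest_output_file(CHECK_DIR)
    except OSError as err:
        # The directory is missing or unreadable on this host.
        print(f"Cannot read {CHECK_DIR}: {err}")
    else:
        print(newest if newest else "No output files found")
```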

Monitor server health with SNMP

You can use an SNMP monitoring tool to track statistics related to CPU usage, memory, and disk utilization on any server that has Splunk UBA installed.
