Configure warm standby in Splunk UBA
Warm standby is a beta feature and must be implemented with the assistance of Splunk Support.
Configure a warm standby failover solution for Splunk UBA. The standby Splunk UBA system runs simultaneously with the primary Splunk UBA system in read-only mode. When there is a failure on the primary system, you can manually fail over to the backup system. The standby Splunk UBA system has a web UI that displays mirrored data from the primary system in read-only mode.
Critical data stored in the Postgres database, such as the threats and anomalies generated by various rules and models, is synchronized in real-time.
All other data is synchronized every four hours by default. A checkpoint is created for each sync. When there is a failover, the standby Splunk UBA system begins to replay data ingestion from the last available checkpoint. The Splunk platform may retain new events for Splunk UBA to consume, but the search performance may be different depending on the time between the last checkpoint and when the backup system begins ingesting events. For example, some events in the Splunk platform may have been moved to the cold bucket, which negatively affects the search performance. In addition, since the ingestion lag in Splunk UBA is configurable per data source, some raw events with time stamps beyond the configured lag are excluded from the replay.
Additional data loss may occur if, at the time of the failover, there are events in the data pipeline that have not yet been consumed by Splunk UBA and therefore are not persisted in Splunk UBA. These events are lost and cannot be recovered during a failover operation.
Requirements to set up warm standby for Splunk UBA
Verify that the following requirements are met in preparation for configuring warm standby for Splunk UBA:
- The backup Splunk UBA system must be configured separately from the primary system and must meet all of the same system requirements. Verify that the backup system meets all of the requirements in the table:
| Backup system requirement | Description |
| --- | --- |
| Same number of nodes | The backup system must have the same number of nodes as the primary system. See Plan and scale your Splunk UBA deployment in Install and Upgrade Splunk User Behavior Analytics. |
| Same hardware requirements | All nodes in the backup system must meet the minimum hardware requirements for all Splunk UBA servers, including allocating enough space on the master node if you are configuring incremental backups. See Hardware requirements in Install and Upgrade Splunk User Behavior Analytics. |
| Same SSH keys | The backup system must use the same SSH keys as the primary system. Copy the SSH keys from the existing Splunk UBA deployment to all servers in the standby deployment. See Install Splunk User Behavior Analytics in Install and Upgrade Splunk User Behavior Analytics and follow the instructions for your deployment and operating system. |
| Set up passwordless SSH | Each node in the backup and primary clusters must have passwordless SSH access to every other node in either cluster. See Install Splunk User Behavior Analytics in Install and Upgrade Splunk User Behavior Analytics and follow the instructions for your deployment and operating system. |
| Set up separate certificates | The backup system must have its own certificates, set up separately from the primary system. See Request and add a new certificate to Splunk UBA in Install and Upgrade Splunk User Behavior Analytics for the Splunk UBA web interface certificate. If you send anomalies and threats from Splunk UBA to Splunk Enterprise Security (ES) using an output connector, see Connect Splunk UBA to the Splunk platform using SSL in the Splunk Add-on for Splunk UBA manual to set up the Splunk ES certificate in Splunk UBA. |
| /etc/hosts file configuration | The /etc/hosts file on each node in both the backup and primary systems must contain the hostnames of all other nodes in both the backup and primary systems. See Configure host name lookups and DNS in Install and Upgrade Splunk User Behavior Analytics. |
- The backup system must have the same ports open as the primary system. See Network requirements in Install Splunk User Behavior Analytics. The following ports must be open behind the firewall in both the primary and standby clusters (a connectivity-check sketch follows this list):
- Port 8020 on the management node (node 1) in all deployment sizes.
- Port 5432 on the database node in all deployment sizes. For deployments of 1 to 10 nodes, this is node 1. In 20-node deployments, this is node 2.
- Port 22 on all nodes in all deployment sizes must be open for scp and SSH to work.
- Port 50010 must be open on all the data nodes. This table identifies the data nodes per deployment:
| Deployment size | Data nodes |
| --- | --- |
| 1 node | Node 1 |
| 3 nodes | Node 3 |
| 5 nodes | Nodes 4 and 5 |
| 7 nodes | Nodes 4, 5, 6, and 7 |
| 10 nodes | Nodes 6, 7, 8, 9, and 10 |
| 20 nodes | Nodes 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 |
- The Splunk Enterprise deployment where Splunk UBA pulls data from must also be highly available. This is required for Splunk UBA to re-ingest data from Splunk Enterprise. See Use clusters for high availability and ease of management in the Splunk Enterprise Distributed Deployment Manual.
- The raw events on Splunk Enterprise must be available for Splunk UBA to consume. If the Splunk Enterprise deployment is unable to retain raw events for Splunk UBA to re-ingest, the replay cannot be fully performed.
- If the Splunk UBA primary and standby deployments are across multiple sites, the standby Splunk UBA deployment must have its own Splunk Enterprise deployment equivalent to the primary site in order to provide equivalent ingestion throughput.
- Splunk UBA warm standby requires Python 3.
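Before you begin the setup, you can spot-check that these ports are reachable across the clusters. The following is a minimal sketch, not part of the official procedure, assuming the nc (netcat) utility is available and using s1, s2, and s3 as placeholder host names for the nodes of the opposite cluster. Some ports apply only to specific nodes (for example, 8020 on the management node and 50010 on the data nodes), so a refused connection on other nodes is not necessarily a problem.
# Hypothetical spot check: run from a node in one cluster against the nodes of the other cluster.
for host in s1 s2 s3; do
  for port in 22 8020 5432 50010; do
    # -z scans without sending data, -v prints the result, -w 3 sets a 3-second timeout.
    nc -zv -w 3 "$host" "$port"
  done
done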
Set up the backup Splunk UBA deployment for warm standby
After meeting the requirements, perform the following tasks to deploy and set up a secondary Splunk UBA system as the warm standby system:
- (Optional) If the standby system has existing data, clean up the standby cluster before setup:
/opt/caspida/bin/CaspidaCleanup
- Stop Caspida on both the primary and standby systems by running the following command on the master node of each:
/opt/caspida/bin/Caspida stop
- Add the following deployment properties to
/opt/caspida/conf/deployment/caspida-deployment.conf
on both the primary and standby systems:
- On the primary cluster, uncomment
caspida.cluster.replication.nodes
and add the standby cluster nodes. For example, for a 3-node standby deployment with hosts s1, s2, and s3, add:
caspida.cluster.replication.nodes=s1,s2,s3
In AWS environments, add the private IP addresses of each node.
- On the standby cluster, uncomment
caspida.cluster.replication.nodes
and add the primary cluster nodes. For example, for a 3-node primary deployment with hosts p1, p2, and p3, add:
caspida.cluster.replication.nodes=p1,p2,p3
In AWS environments, add the private IP addresses of each node.
The host names or IP addresses of the nodes on the primary and standby clusters do not need to be the same, as long as they are all defined in
caspida-deployment.conf
as shown in this example.
- Run
sync-cluster
on the management node in both the primary and standby clusters:
/opt/caspida/bin/Caspida sync-cluster
- Allow traffic across the primary and standby clusters:
- Set up passwordless SSH communication across all nodes of the primary and standby clusters. See Setup passwordless communication between the UBA nodes in Install and Upgrade Splunk User Behavior Analytics.
- Set up firewalls by running the following commands on the master node of both the primary and standby clusters:
/opt/caspida/bin/Caspida disablefirewall-cluster
/opt/caspida/bin/Caspida setupfirewall-cluster
/opt/caspida/bin/Caspida enablefirewall-cluster
- Register and enable replication.
- On both the primary and standby clusters, add the following properties into
/etc/caspida/local/conf/uba-site.properties
:
replication.enabled=true
replication.primary.host=<master node of primary cluster>
replication.standby.host=<master node of standby cluster>
- On the management node of the primary cluster, run:
/opt/caspida/bin/replication/setup standby -m primary
If the same node has been registered before, a reset is necessary. Run the command again with the reset option:
/opt/caspida/bin/replication/setup standby -m primary -r
- On the management node of the standby cluster, run:
/opt/caspida/bin/replication/setup standby -m standby
- In the primary cluster, enable the replication system job by adding the
ReplicationCoordinator
property into the
/etc/caspida/local/conf/caspida-jobs.json
file. The
ReplicationCoordinator
must be set to
true
. Below is a sample of the file before adding the property:
/**
 * Copyright 2014 - Splunk Inc., All rights reserved.
 * This is Caspida proprietary and confidential material and its use
 * is subject to license terms.
 */
{
  "systemJobs": [
    {
      // "name"         : "ThreatComputation",
      // "cronExpr"     : "0 0 0/1 * * ?",
      // "jobArguments" : { "env:CASPIDA_JVM_OPTS" : "-Xmx4096M" }
    }
  ]
}
After adding the property, the file should look like this:
/**
 * Copyright 2014 - Splunk Inc., All rights reserved.
 * This is Caspida proprietary and confidential material and its use
 * is subject to license terms.
 */
{
  "systemJobs": [
    {
      // "name"         : "ThreatComputation",
      // "cronExpr"     : "0 0 0/1 * * ?",
      // "jobArguments" : { "env:CASPIDA_JVM_OPTS" : "-Xmx4096M" }
    },
    {
      "name"    : "ReplicationCoordinator",
      "enabled" : true
    }
  ]
}
- In both the primary and standby clusters, run
sync-cluster
to synchronize:
/opt/caspida/bin/Caspida sync-cluster
- If the standby cluster is running the RHEL, CentOS, or Oracle Linux operating system, run the following command to create a directory on each node in the cluster:
sudo mkdir -m a=rwx /var/vcap/sys/run/caspida
- Start Caspida in both the primary and standby clusters by running the following command on the management node:
/opt/caspida/bin/Caspida start
The initial full data sync is triggered automatically when the next scheduled job starts, as defined by ReplicationCoordinator
in /opt/caspida/conf/jobconf/caspida-jobs.json
.
You can verify your setup and that the initial sync has started by viewing the table in the Postgres database that tracks the status of the sync between the primary and standby systems:
- Log in to the Postgres node in your deployment, such as node 2 in a 20-node cluster. See Where services run in Splunk UBA for information to help you locate the Postgres node in other deployments.
- Run the following command:
psql -d caspidadb -c 'select * from replication'
After the setup is completed, the status of the standby system is Inactive. After the first sync cycle is completed, the status is Active. If the initial sync fails, Splunk UBA retries the sync every four hours. After four failures, the status of the standby system is Dead and replication is not attempted again until the issue is resolved.
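To watch the status change over time, you can re-run the query on an interval. The following one-liner is a convenience sketch, assuming the watch utility is installed on the Postgres node; the 60-second interval is arbitrary:
# Re-run the replication status query every 60 seconds; press Ctrl+C to stop.
watch -n 60 "psql -d caspidadb -c 'select * from replication'"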
During the sync, the following Splunk UBA components are synchronized:
- Datastores in HDFS (with some metadata in Postgres)
- Redis
- InfluxDB
These components are synchronized every four hours. To trigger a full sync right away, use the following command on the master node of the primary system:
curl -X POST -k -H "Authorization: Bearer $(grep '^\s*jobmanager.restServer.auth.user.token=' /opt/caspida/conf/uba-default.properties | cut -d'=' -f2)" https://localhost:9002/jobs/trigger?name=ReplicationCoordinator
View the /var/log/caspida/replication/replication.log
file on the master node of the primary system for additional information about the progress and status of the sync.
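For example, to follow the log while a sync is in progress, you can tail it on the master node of the primary system:
# Stream new replication log entries as they are written; press Ctrl+C to stop.
tail -f /var/log/caspida/replication/replication.log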
Failover to a backup Splunk UBA system
Perform the following tasks to fail over Splunk UBA to the backup system. Make sure the backup system has been properly configured for warm standby. See Set up the backup Splunk UBA deployment for warm standby for instructions.
- Log in to the management node on the backup Splunk UBA system.
- Run the
failover
command:
/opt/caspida/bin/replication/failover
This command promotes the backup system to be the active Splunk UBA system.
- Check and verify that the
uiServer.host
property in the
/etc/caspida/local/conf/uba-site.properties
file in the standby system matches the setting in the primary system. Depending on whether there is a proxy or DNS server between Splunk UBA and Splunk ES, this property may be changed during the failover operation. See Specify the host name of your Splunk UBA server in Install and Configure Splunk User Behavior Analytics for instructions.
- If needed, edit the data sources to point to a Splunk search head with a different host name than before:
- In Splunk UBA, select Manage > Data Sources.
- Edit the data source for which you need to change the host name.
- Change the URL to have the name or IP address of the new host.
- Navigate through the wizard and change any other information as desired.
- Click OK. A new job for this data source will be started.
- If needed, edit the Splunk ES output connector to update the URL:
- In Splunk UBA, select Manage > Output Connectors.
- Click the Splunk ES output connector and update the URL.
- Click OK. This will automatically trigger a one-time sync with Splunk ES.
After the failover, the backup Splunk UBA system will be running as an independent system without HA/DR configured. After you restore the original Splunk UBA deployment that went down, you have the following options. For example, suppose system A went down and you perform a failover on system B, which is now the only system running Splunk UBA:
- Restore system A and configure it as the backup system. Leave system B as the primary system running Splunk UBA. See Set up the backup Splunk UBA deployment for warm standby for information about how to set up system A as the backup system, causing the data in system B to be synchronized to system A.
- Bring system A back online and configure it as the backup system, then failover from system B to system A so that system A is the primary. Perform the following tasks:
- See Set up the backup Splunk UBA deployment for warm standby for information about how to set up system A as the backup system, causing the data in system B to be synchronized to system A.
- Failover to system A.
By default, Splunk UBA can go back four hours to ingest data from data sources that are stopped and restarted. If the amount of time between when the primary system goes down and when the failover to the backup system occurs is greater than four hours, adjust the connector.splunk.max.backtrace.time.in.hour
property in the /etc/caspida/local/conf/uba-site.properties
file. Perform the following tasks:
- Edit the
/etc/caspida/local/conf/uba-site.properties
file.
- Add or edit the
connector.splunk.max.backtrace.time.in.hour
property. For example, if the primary system went down at 11PM on Friday and the failover was performed at 8AM on Monday, set the property to 57 hours or more to ingest data from the time that the primary system went down.
- Synchronize the cluster in distributed deployments:
/opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf
See Time-based search for more information about configuring this property.
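As an illustration of the scenario above, the resulting entry in the /etc/caspida/local/conf/uba-site.properties file would look like the following. The 57-hour value is only an example; size it to cover your own outage window:
# Allow Splunk UBA to go back up to 57 hours when re-ingesting data after a failover.
connector.splunk.max.backtrace.time.in.hour=57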
Configure the active system to stop syncing with a backup system
If the backup Splunk UBA system fails, perform the following tasks to stop the active system from trying to synchronize with the standby system:
- Log in to the master node of the active Splunk UBA cluster as caspida.
- Stop Splunk UBA and all services:
/opt/caspida/bin/Caspida stop
- Edit
/etc/caspida/local/conf/uba-site.properties
and change the
replication.enabled
property to false:
replication.enabled=false
- Synchronize the cluster:
/opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf
- Start Splunk UBA:
/opt/caspida/bin/Caspida start
You can continue to use automated incremental backups for your Splunk UBA cluster. See Backup and restore Splunk UBA using automated incremental backups.