Splunk® User Behavior Analytics

Administer Splunk User Behavior Analytics




Set up the standby Splunk UBA system

The standby Splunk UBA system should run in read-only mode. Do not start any Splunk UBA services in the standby system. If you do, the PostgreSQL logs can fill up and negatively affect performance. See Clean up the standby system if you accidentally started Splunk UBA services.
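
If you are unsure whether services were started on a standby node, a generic process listing can help. This is a minimal sketch using standard Linux commands, not a documented Splunk UBA utility; processes started by replication activity from the primary may appear briefly, but long-running Caspida services on the standby are a sign that something was started:

# List Caspida-related processes on this standby node.
ps -ef | grep -i caspida | grep -v grep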

After meeting the requirements described in Requirements to set up warm standby for Splunk UBA, perform the following tasks to deploy and set up a secondary Splunk UBA system as the read-only warm standby system:

  1. (Optional) If the standby system has existing data, run the following command to clean up the system:
    /opt/caspida/bin/CaspidaCleanup
  2. Run the following command on the management node of both the primary and standby systems:
    /opt/caspida/bin/Caspida stop
  3. Add the following deployment properties to /opt/caspida/conf/deployment/caspida-deployment.conf on both the primary and standby systems:
    1. On the primary system, uncomment caspida.cluster.replication.nodes and add the standby system nodes. For example, for a 3-node standby system with hosts s1, s2, and s3, add:
      caspida.cluster.replication.nodes=s1,s2,s3
      In AWS environments, add the private IP addresses of each node.
    2. On the standby system, uncomment caspida.cluster.replication.nodes and add the primary system nodes. For example, for a 3-node primary system with hosts p1, p2, and p3, add:
      caspida.cluster.replication.nodes=p1,p2,p3

      In AWS environments, add the private IP addresses of each node.

      The host names or IP addresses of the nodes on the primary and standby systems do not need to be the same, as long as they are all defined in caspida-deployment.conf as shown in this example.

    3. Run sync-cluster on the management node on both the primary and standby systems:
      /opt/caspida/bin/Caspida sync-cluster
  4. Allow traffic across the primary and standby systems:
    1. Set up inter-cluster passwordless SSH communication across all nodes of the primary and standby systems. See Setup passwordless communication between the UBA nodes in Install and Upgrade Splunk User Behavior Analytics. A sketch for verifying the passwordless connections appears after this procedure.
    2. Set up firewalls by running the following commands on the management node on both the primary and standby systems:
      /opt/caspida/bin/Caspida disablefirewall-cluster
      /opt/caspida/bin/Caspida setupfirewall-cluster
      /opt/caspida/bin/Caspida enablefirewall-cluster
      
  5. Register and enable replication.
    1. On both primary and standby systems, add the following properties into /etc/caspida/local/conf/uba-site.properties on the management node. If the replication.enabled property already exists, make sure it is set to true.
      replication.enabled=true
      replication.primary.host=<management node of primary cluster>
      replication.standby.host=<management node of standby cluster>
      
    2. In the primary cluster, enable the replication system job by adding a ReplicationCoordinator entry to the /etc/caspida/local/conf/caspida-jobs.json file on the management node, with "enabled" set to true. Below is a sample of the file before adding the entry:
      /**
       * Copyright 2014 - Splunk Inc., All rights reserved.
       * This is Caspida proprietary and confidential material and its use
       * is subject to license terms.
       */
      {
        "systemJobs": [
          {
            // "name" : "ThreatComputation",
            // "cronExpr"   : "0 0 0/1 * * ?",
            // "jobArguments" : { "env:CASPIDA_JVM_OPTS" :  "-Xmx4096M" }
          }
        ]
      } 
      

      After adding the property, the file should look like this:

      /**
       * Copyright 2014 - Splunk Inc., All rights reserved.
       * This is Caspida proprietary and confidential material and its use
       * is subject to license terms.
       */
      {
        "systemJobs": [
          {
            // "name" : "ThreatComputation",
            // "cronExpr"   : "0 0 0/1 * * ?",
            // "jobArguments" : { "env:CASPIDA_JVM_OPTS" :  "-Xmx4096M" }
          },
          {
            "name"         : "ReplicationCoordinator",
            "enabled"      : true
          }
        ]
      } 
      
    3. On both the primary and standby systems, run sync-cluster on the management node to synchronize the configuration changes:
      /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf/
  6. On the management node of the primary system, run the following command:
    /opt/caspida/bin/replication/setup -d standby -m primary

    If the same node has been registered before, a reset is necessary. Run the command again with the reset option:

    /opt/caspida/bin/replication/setup -d standby -m primary -r
  7. On the management node of the standby system, run the following command:
    /opt/caspida/bin/replication/setup -d standby -m standby
  8. If the standby system is running RHEL, CentOS, or Oracle Linux operating systems, run the following command to create a directory on each node in the cluster:
    sudo mkdir -m a=rwx /var/vcap/sys/run/caspida
  9. Start Splunk UBA on the primary system by running the following command on the management node:
    /opt/caspida/bin/Caspida start
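
As noted in step 4, the following is a minimal sketch for verifying the passwordless SSH connections from the management node of the primary system. The host names p1 through p3 and s1 through s3 are the example placeholders used earlier; replace them with your own node names:

# Run as the caspida user on the primary management node.
# BatchMode makes ssh fail immediately instead of prompting for a password.
for node in p1 p2 p3 s1 s2 s3; do
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname \
    && echo "$node: OK" || echo "$node: passwordless SSH failed"
done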

After you perform these steps, the primary and standby systems are not synchronized until the first scheduled sync cycle runs 4 hours later. To synchronize both systems right away, use a curl command. See Synchronize the primary and standby systems on-demand for instructions.

Verify that the standby system is set up correctly

You can verify your setup by viewing the table in the Postgres database that tracks the status of the sync between the primary and standby systems.

In deployments smaller than 20 nodes, run the following command on the management node. In 20-node clusters, run the command on node 2:

psql -d caspidadb -c 'select * from replication'

Below is an example output for a 3-node cluster:

caspida@ubanode001$ psql -d caspidadb -c "select * from replication"
 id |   host     |  type   |  status  |                message                 |          modtime           | cycleid | cyclestatus 
----+------------+---------+----------+----------------------------------------+----------------------------+---------+-------------
 25 | ubanode01  | Standby | Inactive |                                        | 2021-09-01 23:29:01.572365 | 0       | 
 26 | ubanode001 | Primary | Active   | nodes:ubanode001,ubanode002,ubanode003 | 2021-09-01 23:29:01.636719 | 0000000 | 
(2 rows)

caspida@ubanode001$ 

In this example, the primary system ubanode001 has a status of Active, and the standby system ubanode01 has a status of Inactive. The Inactive status means that a sync between the primary and standby systems has not occurred yet.

Synchronize the primary and standby systems on-demand

To trigger a full sync right away, use the following command on the management node of the primary system:

curl -X POST -k -H "Authorization: Bearer $(grep '^\s*jobmanager.restServer.auth.user.token=' /opt/caspida/conf/uba-default.properties | cut -d'=' -f2)"  https://localhost:9002/jobs/trigger?name=ReplicationCoordinator
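
To confirm that the trigger request was accepted, you can ask curl to print the HTTP response code. This is a sketch using standard curl options with the same token lookup as the command above; a response code in the 200 range generally indicates that the request reached the job manager:

curl -s -o /dev/null -w '%{http_code}\n' -X POST -k -H "Authorization: Bearer $(grep '^\s*jobmanager.restServer.auth.user.token=' /opt/caspida/conf/uba-default.properties | cut -d'=' -f2)" "https://localhost:9002/jobs/trigger?name=ReplicationCoordinator"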

View the /var/log/caspida/replication/replication.log file on the management node of the primary system for additional information about the progress and status of the sync.
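
For example, you can follow the log while the sync runs, or scan it for recent errors. These are standard shell commands, not Splunk UBA utilities:

# Follow replication progress in real time (press Ctrl+C to stop):
tail -f /var/log/caspida/replication/replication.log

# Show the most recent lines that mention errors or failures:
grep -iE 'error|fail' /var/log/caspida/replication/replication.log | tail -20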

Verify that the primary and standby systems are synchronized

You can verify your setup and that the initial sync has started by viewing the table in the Postgres database that tracks the status of the sync between the primary and standby systems.

In deployments smaller than 20 nodes, run the following command on the management node. In 20-node clusters, run the command on node 2:

psql -d caspidadb -c 'select * from replication'

Below is an example output for a 3-node cluster:

caspida@ubanode001$ psql -d caspidadb -c 'select * from replication'
 id |   host     |  type   | status |               message                   |          modtime           | cycleid | cyclestatus
----+------------+---------+--------+-----------------------------------------+----------------------------+---------+-------------
 25 | ubanode01  | Standby | Active |                                         | 2021-04-04 04:10:57.191609 | 0000002 |
 26 | ubanode001 | Primary | Active | nodes:ubanode001,ubanode002,ubanode003  | 2021-04-04 04:10:57.212676 | 0000002 | 
(2 rows)

After the setup is completed, the status of the standby system is Inactive. After the first sync cycle is completed, the status is Active. If the initial sync fails, Splunk UBA retries the sync every four hours. After four failures, the status of the standby system changes to Dead and replication is not attempted again until the issue is resolved.
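
While waiting for a sync cycle to complete, you can poll the replication table instead of querying it manually. This sketch assumes the same psql access shown above and simply repeats the query every five minutes:

# Re-run the status query every 300 seconds until the standby shows Active.
watch -n 300 'psql -d caspidadb -c "select host, type, status, cycleid from replication"'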

Manual verification

To manually verify that the datastores, such as Postgres, HDFS, IR cache, InfluxDB, and Redis, are synchronized, run the following health check script on the management node of both the primary and standby clusters:

/opt/caspida/bin/utils/uba_health_check.sh

Then, verify the following datastore statistics:

Statistic | Path                                       | Action
----------+--------------------------------------------+-----------------------------------------------------
Postgres  | health check > backend stats > POSTGRESQL  | Check for similar stats between standby and primary
HDFS      | health check > semiaggr_s partitions       | Check for similar stats between standby and primary
IR cache  | n/a                                        | Covered by verifying HDFS parity
InfluxDB  | health check > backend stats > INFLUXDB    | Check for similar stats between standby and primary
Redis     | health check > backend stats > REDIS       | Check for similar stats between standby and primary

Minor discrepancies between primary and standby stats are expected due to active ingest on the primary and the four-hour sync period.
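
One way to compare the two systems is to capture the health check output on each management node and review the reports side by side. The file names and the scp step below are illustrative, and the script's output format can vary between versions, so expect rough parity in the statistics listed above rather than an exact match:

# On the primary management node:
/opt/caspida/bin/utils/uba_health_check.sh > /tmp/uba_health_primary.txt

# On the standby management node:
/opt/caspida/bin/utils/uba_health_check.sh > /tmp/uba_health_standby.txt

# Copy the standby report to the primary (replace standby-mgmt with your standby
# management node), then compare the reports side by side:
scp standby-mgmt:/tmp/uba_health_standby.txt /tmp/
diff -y /tmp/uba_health_primary.txt /tmp/uba_health_standby.txt | less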
