Splunk® User Behavior Analytics

Administer Splunk User Behavior Analytics

How Splunk UBA synchronizes the primary and standby systems

Critical data stored in the Postgres database, such as the threats and anomalies generated by Splunk UBA rules and models, is synchronized in real time.

All other data, such as the data stored in HDFS, some metadata in Postgres, and the data in Redis and InfluxDB, is synchronized every four hours. A checkpoint is created for each sync. When there is a failover, the standby Splunk UBA system begins to replay data ingestion from the last available checkpoint. The Splunk platform might retain new events for Splunk UBA to consume, but search performance can vary depending on the time between the last checkpoint and when the standby system begins ingesting events. For example, some events in the Splunk platform might have been moved to the cold bucket, which negatively affects search performance. In addition, because the ingestion lag in Splunk UBA is configurable per data source, some raw events with time stamps beyond the configured lag are excluded from the replay.

Additional data loss can occur if, at the time of the failover, events in the data pipeline have not yet been consumed and persisted by Splunk UBA. These events are lost and can't be recovered during a failover operation.

The initial full data transfer is triggered automatically when the next scheduled job starts, as defined by ReplicationCoordinator in /etc/caspida/local/conf/caspida-jobs.json.
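
To confirm when the ReplicationCoordinator job runs, you can inspect the job definition on the management node of the primary system. The following is a minimal sketch using standard tools; the exact JSON structure around the entry can differ in your deployment:

    grep -A 5 "ReplicationCoordinator" /etc/caspida/local/conf/caspida-jobs.json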

How Splunk UBA synchronizes conf files between primary and standby systems during replication

All configurations under /etc/caspida/local/conf on the primary system, which contains asset configuration files, custom models, internal IP and geolocation information, and competitor domains, are continuously synced to the management node of the standby system under the /var/vcap/sys/run/caspida/conf directory, which is created during Warm Standby setup in Splunk UBA.

When the failover is performed, the standby system transfers all the files from /var/vcap/sys/run/caspida/conf to /etc/caspida/local/conf on its management node. After failover, when you sync the cluster to configure warm standby again, the system copies all the conf files from /etc/caspida/local/conf to the same location on all the cluster nodes.

Splunk UBA copies and updates the conf files from the primary system (/etc/caspida/local/conf) to the standby system (/var/vcap/sys/run/caspida/conf) and does not delete existing files that were synced to the standby system earlier.
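
As an optional check, you can list or compare the synced configuration files on the standby management node to see what has been received from the primary system. This is a minimal sketch using standard commands, assuming the directories described above:

    ls -lR /var/vcap/sys/run/caspida/conf
    diff -rq /var/vcap/sys/run/caspida/conf /etc/caspida/local/conf

Before a failover, differences between the two directories are expected, because the staged files are copied into /etc/caspida/local/conf only when the failover is performed.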

Resolve synchronization failures between the primary and standby systems

If the synchronization between the primary and standby systems fails, perform the following tasks to resume the synchronization operations:

  1. Run the following commands on the management node in the standby system:
    /opt/caspida/bin/CaspidaCleanup 
    /opt/caspida/bin/Caspida stop
    
  2. Run the following command on the management node in the primary system:
    /opt/caspida/bin/replication/setup standby -m primary -r
  3. Run the following command on the management node in the standby system:
    /opt/caspida/bin/replication/setup standby -m standby -r
  4. Wait until an entry in the replication table shows up on the standby system. In all deployments smaller than 20 nodes, run the following command on the management node to check. In 20-node clusters, run the command on node 2:
    psql -d caspidadb -c 'select * from replication'
  5. Run the following curl command on the management node in the primary system to initiate a full sync:
    curl -X POST -k -H "Authorization: Bearer $(grep '^\s*jobmanager.restServer.auth.user.token=' /opt/caspida/conf/uba-default.properties | cut -d'=' -f2)" https://localhost:9002/jobs/trigger?name=ReplicationCoordinator 
    
  6. Monitor the replication log in both the primary and standby systems and verify that the full sync cycle is triggered and completes without errors. Run the following command on the management node of each system:
    tail -f /var/log/caspida/replication/replication.log
  7. Check the status of the replication table in both the primary and standby systems. After the replication finishes, both the primary and standby systems show an Active status and their cycles are synchronized. In all deployments smaller than 20 nodes, run the following command on the management node of both systems to check. In 20-node clusters, run the command on node 2 of both systems. See the sketch after this procedure for one way to poll this check from the shell:
    psql -d caspidadb -c 'select * from replication'
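
The checks in steps 4 and 7 can also be scripted. The following is a minimal sketch that wraps the psql query shown above and polls the replication table until at least one row is returned; it is not part of the product and only repeats the query from this procedure:

    # Poll the replication table every 30 seconds until an entry appears.
    while [ -z "$(psql -d caspidadb -t -A -c 'select * from replication')" ]; do
      echo "Waiting for a replication entry..."
      sleep 30
    done
    # Show the full table once an entry exists.
    psql -d caspidadb -c 'select * from replication'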


