Backup and restore Splunk UBA using automated incremental backups
Automatic incremental backup and restore is a beta feature and must be implemented with the assistance of Splunk Support.
Attach an additional disk to the Splunk UBA management node in your deployment and configure automated incremental backups.
- Periodic incremental backups are performed without stopping Splunk UBA. You can configure the frequency of these backups by editing the cron job in /opt/caspida/conf/jobconf/caspida-jobs.json.
- A weekly full backup is performed without stopping Splunk UBA. You can configure the frequency of these backups using the backup.filesystem.full.interval property.
You can use incremental backup and restore as an HA/DR solution that is less resource-intensive than the warm standby solution described in Configure warm standby in Splunk UBA.
Configure incremental backups in Splunk UBA
Perform the following steps to configure incremental backups of your Splunk UBA deployment:
- On the Splunk UBA management node, attach an additional disk dedicated for filesystem backup. For example, mount a device on a local directory on the Splunk UBA management node.
In 20-node clusters, Postgres services run on node 2 instead of node 1. You need two additional disks, one on node 1 and a second on node 2 for the Postgres services, or a shared storage device that can be accessed by both nodes. Make sure that the backup folder on both nodes is the same. By default, the backups are written to the /backup folder.
- Stop Splunk UBA:
/opt/caspida/bin/Caspida stop
- Create a dedicated directory on the management node and change directory permissions so that backup files can be written into the directory. If warm standby is also configured, perform these tasks on the management node in the primary cluster.
sudo mkdir /backup
sudo chmod 777 /backup
- Mount the dedicated device on the backup directory. For example, a new 5TB hard drive mounted on the backup directory:
caspida@node1:~$ df -h /dev/sdc
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        5.0T    1G  4.9T   1% /backup
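Note that a mount created this way does not persist across reboots. To make it permanent, add an entry for the device to /etc/fstab. The line below is a sketch that assumes the device is /dev/sdc formatted as ext4; adjust the device name and filesystem type to match your environment:

```
/dev/sdc    /backup    ext4    defaults    0 2
```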
- Add the following properties to /etc/caspida/local/conf/uba-site.properties:
backup.filesystem.enabled=true
backup.filesystem.directory.restore=/backup
- Synchronize the configuration across the cluster:
/opt/caspida/bin/Caspida sync-cluster
- Register filesystem backup:
/opt/caspida/bin/replication/setup filesystem
If the same host has been registered before, run the command again with the reset flag:
/opt/caspida/bin/replication/setup filesystem -r
- Enable Postgres archiving.
- Create the directory where archives will be stored. For example, /backup/wal_archive:
mkdir /backup/wal_archive
chown postgres:postgres /backup/wal_archive
- Create a file called archiving.conf on the PostgreSQL node (node 2 for 20-node deployments, node 1 for all other deployments). On RHEL, Oracle Linux, and CentOS systems:
cd /var/vcap/store/pgsql/10/data/conf.d/
sudo mv archiving.conf.sample archiving.conf
On Ubuntu systems:
cd /etc/postgresql/10/main/conf.d/
sudo mv archiving.conf.sample archiving.conf
If your archive directory is not /backup/wal_archive, edit archiving.conf to change the archive directory.
- Restart PostgreSQL services on the master node:
/opt/caspida/bin/Caspida stop-postgres
/opt/caspida/bin/Caspida start-postgres
- On the management node:
- In the primary cluster, enable the replication system job by adding the ReplicationCoordinator property to the /etc/caspida/local/conf/caspida-jobs.json file. Below is a sample of the file before adding the property:
/**
 * Copyright 2014 - Splunk Inc., All rights reserved.
 * This is Caspida proprietary and confidential material and its use
 * is subject to license terms.
 */
{
  "systemJobs": [
    {
      // "name"         : "ThreatComputation",
      // "cronExpr"     : "0 0 0/1 * * ?",
      // "jobArguments" : { "env:CASPIDA_JVM_OPTS" : "-Xmx4096M" }
    }
  ]
}
After adding the property, the file should look like this:
/**
 * Copyright 2014 - Splunk Inc., All rights reserved.
 * This is Caspida proprietary and confidential material and its use
 * is subject to license terms.
 */
{
  "systemJobs": [
    {
      // "name"         : "ThreatComputation",
      // "cronExpr"     : "0 0 0/1 * * ?",
      // "jobArguments" : { "env:CASPIDA_JVM_OPTS" : "-Xmx4096M" }
    },
    {
      "name"    : "ReplicationCoordinator",
      "enabled" : true
    }
  ]
}
- Run the following command to synchronize the cluster:
/opt/caspida/bin/Caspida sync-cluster
- Start Splunk UBA:
/opt/caspida/bin/Caspida start
An initial full backup is triggered automatically when the next scheduled job starts, as defined by the ReplicationCoordinator property in the /opt/caspida/conf/jobconf/caspida-jobs.json file. After the initial full backup, a series of incremental backups is performed until the next scheduled full backup. By default, Splunk UBA performs a full backup every 7 days. To change this interval, perform the following tasks:
- Log in to the Splunk UBA master node as the caspida user.
- Edit the backup.filesystem.full.interval property in /etc/caspida/local/conf/uba-site.properties.
- Synchronize the cluster:
/opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf
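The edit-and-sync workflow can be scripted. The set_property helper below is a hypothetical convenience, not something shipped with Splunk UBA, and the example interval value is an assumption; confirm the property's value format with Splunk Support before relying on it.

```shell
# set_property FILE KEY VALUE: idempotently set KEY=VALUE in a properties file.
# Hypothetical helper, not part of Splunk UBA.
set_property() {
  file="$1"; key="$2"; value="$3"
  if grep -q "^${key}=" "$file"; then
    # Property already present: replace its value in place.
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    # Property absent: append it.
    echo "${key}=${value}" >> "$file"
  fi
}

# Example usage (value format for the interval is an assumption):
# set_property /etc/caspida/local/conf/uba-site.properties \
#   backup.filesystem.full.interval 7
# /opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf
```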
Perform periodic cleanup of the backup files
When a new full backup is completed, it is located in the caspida directory. All existing directories in the caspida directory are moved to the delete directory. You can safely remove all content in the delete directory to minimize the number of files retained on the system, while preserving the ability to recover to the latest checkpoint.
In the following example, it is safe to remove all backup directories 0000021 to 0000038 in /backup/delete/, while keeping 1000039 to 0000045 in /backup/caspida/. The 1000039 folder contains a full backup, while all the other directories starting with zero contain incremental backups.
caspida@node1:~$ ls -t /backup/caspida/ /backup/delete/
/backup/caspida/:
0000045  0000044  0000043  0000042  0000041  0000040  1000039
/backup/delete/:
0000038  0000036  1000034  0000032  0000030  0000028  0000026  0000024  0000022  1000020
0000037  0000035  0000033  0000031  0000029  0000027  0000025  0000023  0000021
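The cleanup itself can be scripted. The purge_deleted helper below is a hypothetical sketch, not part of Splunk UBA; it empties the delete directory under a given backup root, defaulting to the /backup folder used in this guide. Run it only after confirming a new full backup exists under /backup/caspida/.

```shell
# purge_deleted: remove everything under <backup root>/delete.
# Hypothetical helper; the default root matches the /backup folder above.
purge_deleted() {
  root="${1:-/backup}"
  # ${root:?} guards against rm -rf expanding to an empty path.
  rm -rf "${root:?}/delete"/*
}
```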
Restore Splunk UBA from incremental backups
To restore Splunk UBA from online incremental backup files, at least one base backup directory containing a full backup must exist.
- A base directory has a sequence number starting with 1.
- An incremental directory has a sequence number starting with 0.
For example, if a filesystem backup is set up with BACKUP_HOME as /backup/caspida, the following is a valid base directory with three incremental directories:
caspida@node1:~$ du -sh /backup/caspida/*
1.5G    /backup/caspida/0000124
1.5G    /backup/caspida/0000125
1.4G    /backup/caspida/0000126
35G     /backup/caspida/1000123
The base directory 1000123 contains 35GB of backup files, while the incremental directories 0000124, 0000125, and 0000126 each contain only around 1.5GB.
A restore can be performed in the following scenarios:
- From the base directory with all incremental directories. Using our example, this includes all of 1000123, 0000124, 0000125, and 0000126, so Splunk UBA is restored to the latest checkpoint.
- From the base directory with a contiguous sequence of incremental directories. Using our example, we can use 1000123, 0000124, and 0000125. The 1000123 and 0000125 directories cannot be used without 0000124, because that would skip a sequence number.
- From the base directory only, such as 1000123 in our example.
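Before starting a restore, you can sanity-check that the directories you plan to use form a contiguous sequence. The check_sequence function below is a hypothetical sketch based only on the naming convention described above (a leading 1 for base directories, a leading 0 for incrementals, followed by the sequence number); it inspects directory names, not their contents.

```shell
# seq_of: print the numeric sequence number encoded in a directory name.
seq_of() {
  # Drop the leading type digit (1 = base, 0 = incremental) and leading zeros.
  s="${1#?}"
  s=$(echo "$s" | sed 's/^0*//')
  echo "${s:-0}"
}

# check_sequence BACKUP_HOME: verify that the incremental directories that
# follow the newest base (1xxxxxx) directory have consecutive sequence numbers.
check_sequence() {
  home="${1:?usage: check_sequence BACKUP_HOME}"
  base=$(ls "$home" | grep '^1' | sort | tail -n 1)
  if [ -z "$base" ]; then
    echo "no base (1xxxxxx) directory found under $home"
    return 1
  fi
  prev=$(seq_of "$base")
  for d in $(ls "$home" | grep '^0' | sort); do
    n=$(seq_of "$d")
    [ "$n" -le "$prev" ] && continue   # predates the newest base; ignore
    if [ "$n" -ne $((prev + 1)) ]; then
      echo "gap in sequence: expected $((prev + 1)), found $n"
      return 1
    fi
    prev=$n
  done
  echo "contiguous from $base through sequence $prev"
}
```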
Steps to restore
This example shows how to restore from the base directory 1000123 with all of the incremental directories 0000124, 0000125, and 0000126.
- Get the server ready to restore (either the original server or a separate one). If there is any existing data, clean it up:
/opt/caspida/bin/CaspidaCleanup
- Stop all services:
/opt/caspida/bin/Caspida stop-all
- Restore Postgres.
- On the Postgres node (node 2 in 20-node deployments, node 1 in all other deployments), clean up any existing data. Run the following command on RHEL, OEL, or CentOS systems:
sudo rm -rf /var/lib/pgsql/10/main/*
On Ubuntu systems, run the following command:
sudo rm -rf /var/lib/postgresql/10/main/*
- Copy all content under <base directory>/postgres/base to the Postgres node. For example, if you are copying from a different server on a RHEL, OEL, or CentOS system, use the following command:
scp -r caspida@ubap1:<BACKUP_HOME>/1000123/postgres/base/* /var/lib/pgsql/10/main
On Ubuntu systems, use the following command:
scp -r caspida@ubap1:<BACKUP_HOME>/1000123/postgres/base/* /var/lib/postgresql/10/main
- (Skip this step if you are restoring from a base backup directory only, without any incrementals.) Remove unnecessary WAL files. On RHEL, OEL, or CentOS systems, run the following command:
rm -rf /var/lib/pgsql/10/main/pg_wal/*
On Ubuntu systems, run the following command:
rm -rf /var/lib/postgresql/10/main/pg_wal/*
Make sure the system has access to the Postgres WAL archive directory. Modify the /var/lib/pgsql/10/main/recovery.conf (on RHEL, OEL, and CentOS systems) or /var/lib/postgresql/10/main/recovery.conf (on Ubuntu systems) file: remove all contents in the file, then add the following properties:
restore_command = 'cp <WAL directory>/%f "%p"'
recovery_target_time = '<recovery timestamp>'
recovery_target_action = 'promote'
Where <WAL directory> is the directory with all Postgres WAL files, and <recovery timestamp> is the timestamp in the backup file <BACKUP_HOME>/0000126/postgres/recovery_target_time.
For example, the recovery.conf file looks like this:
restore_command = 'cp /backup/wal_archive/%f "%p"'
recovery_target_time = '2019-09-16 12:36:03'
recovery_target_action = 'promote'
- Change ownership of the backup files. On RHEL, OEL, or CentOS systems, run the following command:
chown -R postgres:postgres /var/lib/pgsql/10/main
On Ubuntu systems, run the following command:
chown -R postgres:postgres /var/lib/postgresql/10/main
- Start the Postgres service by running the following command on the master node:
/opt/caspida/bin/Caspida start-postgres
Monitor the Postgres logs under /var/log/postgres, which show the recovery process. Once the recovery completes, query Postgres to verify that the data is recovered. For example, run the following command from the Postgres CLI:
psql -d caspidadb -c 'SELECT * FROM dbinfo'
- After the Postgres service is updated, run the following command on the node where Postgres is running. On RHEL, OEL, or CentOS systems, run the following command:
sudo -u postgres /usr/lib/pgsql/10/bin/pg_ctl promote -D /var/lib/pgsql/10/main
On Ubuntu systems, run the following command:
sudo -u postgres /usr/lib/postgresql/10/bin/pg_ctl promote -D /var/lib/postgresql/10/main
View your /etc/caspida/local/conf/caspida-deployment.conf file to see where Postgres is running in your deployment.
- Restore Redis. Redis backups are full backups, even for incremental Splunk UBA backups. You can restore Redis from any backup directory, such as the most recent incremental backup directory. In our example, we restore Redis from the 0000126 incremental backup directory. The Redis backup file name ends with the node number. Be sure to restore each backup file on the corresponding node. For example, in a 5-node cluster, the Redis files must be restored on nodes 4 and 5. Assuming the backup files are on node 1, run the following command on node 4 to restore Redis:
scp caspida@node1:<BACKUP_HOME>/0000126/redis/redis-server.rdb.4 /var/vcap/store/redis/redis-server.rdb
Similarly, run the following command on node 5:
scp caspida@node1:<BACKUP_HOME>/0000126/redis/redis-server.rdb.5 /var/vcap/store/redis/redis-server.rdb
View your /etc/caspida/local/conf/caspida-deployment.conf file to see where Redis is running in your deployment.
- Restore InfluxDB. Similar to Redis, InfluxDB backups are full backups. You can restore InfluxDB from the most recent backup directory. In this example, InfluxDB is restored from the 0000126 incremental backup directory. On the management node, which hosts InfluxDB, start InfluxDB, clean it up, and restore from the backup files:
sudo service influxdb start
influx -execute "DROP DATABASE caspida"
influx -execute "DROP DATABASE ubaMonitor"
influxd restore -portable <BACKUP_HOME>/0000126/influx
- Restore HDFS. To restore HDFS, first restore the base data, then the incremental data in contiguous sequence. In our example, we first restore from 1000123, then 0000124, 0000125, and 0000126.
- Start the necessary services. On the management node, run the following command:
/opt/caspida/bin/Caspida start-all --no-caspida
- Restore HDFS from the base backup directory:
nohup hadoop fs -copyFromLocal <BACKUP_HOME>/1000123/hdfs/caspida /user &
Restoring HDFS can take a long time. Check the process ID to see if the restore is complete. For example, if the PID is 111222, check it by using the following command:
ps 111222
- Repeat the previous step for the incremental backup directories 0000124, 0000125, and 0000126. If you have a large number of incremental backup directories, you can write a script containing all the commands, then run the script.
nohup hadoop fs -copyFromLocal <BACKUP_HOME>/0000124/hdfs/caspida /user &
nohup hadoop fs -copyFromLocal <BACKUP_HOME>/0000125/hdfs/caspida /user &
nohup hadoop fs -copyFromLocal <BACKUP_HOME>/0000126/hdfs/caspida /user &
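Such a script can be sketched as a loop. The restore_incrementals helper below is hypothetical, not part of Splunk UBA; it runs the copy for every incremental directory in ascending sequence order. Unlike the backgrounded commands above, it runs the copies one at a time, so ordering is guaranteed. COPY_CMD is a convenience hook for dry runs (for example, COPY_CMD=echo) and otherwise defaults to the hadoop command shown above.

```shell
# restore_incrementals BACKUP_HOME: copy each incremental (0xxxxxx) directory's
# HDFS data into /user, in ascending sequence order. Hypothetical helper.
restore_incrementals() {
  home="${1:?usage: restore_incrementals BACKUP_HOME}"
  # COPY_CMD can be overridden (e.g. COPY_CMD=echo) to dry-run the loop.
  : "${COPY_CMD:=hadoop fs -copyFromLocal}"
  for d in $(ls "$home" | grep '^0' | sort); do
    $COPY_CMD "$home/$d/hdfs/caspida" /user || return 1
  done
}
```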
- Change owner in HDFS:
sudo -u hdfs hdfs dfs -chown -R impala:caspida /user/caspida/analytics
sudo -u hdfs hdfs dfs -chown -R mapred:hadoop /user/history
sudo -u hdfs hdfs dfs -chown -R impala:impala /user/hive
sudo -u hdfs hdfs dfs -chown -R yarn:yarn /user/yarn
- If the server you are restoring to is different from the one where the backup was taken, run the following commands to update the metadata. Note that the host is node1 in the deployment file.
hive --service metatool -updateLocation hdfs://<RESTORE_HOST>:8020 hdfs://<BACKUP_HOST>:8020
impala-shell -q "INVALIDATE METADATA"
- Restore your rules and customized configurations from the latest backup directory:
- Restore the configurations:
cp -pr <BACKUP_HOME>/0000126/conf/* /etc/caspida/local/conf/
- Restore the rules:
rm -Rf /opt/caspida/conf/rules/*
cp -prf <BACKUP_HOME>/0000126/rule/* /opt/caspida/conf/rules/
- Start the server:
/opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf
/opt/caspida/bin/CaspidaCleanup container-grouping
/opt/caspida/bin/Caspida start
Check the Splunk UBA web UI to make sure the server is operational.
- If the servers for backup and restore are different, perform the following tasks:
- Update the data source metadata:
curl -X PUT -Ssk -v -H "Authorization: Bearer $(grep '^\s*jobmanager.restServer.auth.user.token=' /opt/caspida/conf/uba-default.properties | cut -d'=' -f2)" https://localhost:9002/datasources/moveDS?name=<DS_NAME>
Replace <DS_NAME> with the data source name displayed in Splunk UBA.
- Trigger a one-time sync with Splunk ES. If your Splunk ES host did not change, run the following command:
curl -X POST 'https://localhost:9002/jobs/trigger?name=EntityScoreUpdateExecutor' -H "Authorization: Bearer $(grep '^\s*jobmanager.restServer.auth.user.token=' /opt/caspida/conf/uba-default.properties | cut -d'=' -f2)" -H 'Content-Type: application/json' -d '{"schedule": false}' -k
If you are pointing to a different Splunk ES host, edit the host in Splunk UBA to automatically trigger a one-time sync.
This documentation applies to the following versions of Splunk® User Behavior Analytics: 5.0.0, 5.0.1, 5.0.2, 5.0.3