Troubleshooting - Disaster Recovery
This article presents the procedure to recover the VeridiumID server.
The procedure covers the recovery of Zookeeper, Cassandra and other configuration files, based on backups.
Backups
Configuration backups
Starting with VeridiumID release 3.2.1, all nodes take a weekly backup (via cron jobs) of the important configuration files present on disk, using the backup_configs.sh script.
To run the script manually, use the following command as root user on any node:
bash /etc/veridiumid/scripts/backup_configs.sh /etc/veridiumid/scripts/backup_configs.conf
All backups are archived and stored in the following path: /opt/veridiumid/backup/all_configs
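To verify that the weekly backups are being created, you can list the backup directory mentioned above:
ls -lh /opt/veridiumid/backup/all_configs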
Database backups
Database backups are taken weekly (via cron jobs) using the Cassandra backup script.
To trigger a Cassandra backup manually, use the following command as root user on all persistence nodes:
bash /opt/veridiumid/backup/cassandra/cassandra_backup.sh -c=/opt/veridiumid/backup/cassandra/cassandra_backup.conf
The backup will be present in the following path: /opt/veridiumid/backup/cassandra
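To list the available database backups and identify the most recent one (backup directories are named by date and time, for example 2022-05-04_12-51, as shown in the restore steps below), you can run:
ls -1t /opt/veridiumid/backup/cassandra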
Recovery
To recover from existing backups, follow the procedure below:
1. Pre-requisites
1.1 Check for backups
Make sure that the configuration backup archives and the database backup directories are present on the nodes where the recovery will take place. If they are not present, copy them from the previous deployment or from the third-party storage solution.
On the new environment, create a transition file containing the old IP addresses and the new ones that will replace them in the configuration files.
The transition file will look similar to the following:
OLD_IP1:NEW_IP1
OLD_IP2:NEW_IP2
for example:
10.2.3.1:10.3.4.1
10.2.3.2:10.3.4.2
10.2.3.3:10.3.4.3
If the configuration revert is performed on the same nodes as the backup, or on nodes that have the same internal IP addresses as the old deployment, the transition file is not required.
According to the transition file, the archive from each old node must be copied to the corresponding new node.
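For example, the transition file can be created with a heredoc; the path /tmp/transition.txt and the IP addresses below are illustrative:
# the path /tmp/transition.txt is illustrative
cat > /tmp/transition.txt << 'EOF'
10.2.3.1:10.3.4.1
10.2.3.2:10.3.4.2
10.2.3.3:10.3.4.3
EOF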
1.2 Check if Python 3 is installed
Make sure that Python 3 is installed on all nodes.
To check whether Python 3 is installed, run the following command:
python3 --version
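If Python 3 is not installed, on RPM-based systems it can usually be installed with yum (the package name python3 is an assumption and may differ per distribution):
# package name may differ depending on the distribution
yum -y install python3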
1.3 Running recommendations
Depending on the size of the database, some commands may take a long time. We recommend running the Cassandra restore commands under screen so that the terminal session does not expire during the process.
To check if you have screen installed use the following command:
screen --help
To install screen use the following command as root:
yum -y install screen
To run commands under screen and detach use the following command format:
screen -d -m COMMAND
Where COMMAND is the Linux command you wish to run under screen.
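For example, to run the Cassandra restore from step 7 detached under a named session and reattach to it later (the session name restore is illustrative):
screen -S restore -d -m python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --restore  # session name "restore" is illustrative
screen -ls
screen -r restore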
2. Stop all services
Connect to all nodes of the deployment and stop services using the following command as root user:
ver_stop
3. Revert the local configurations
Run the following command on all nodes as root user to revert the local configurations for all services:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE -t PATH_TO_TRANSITION_FILE
Where:
PATH_TO_BACKUP_ARCHIVE is the full path to the archive copied to the node (the backup is created by the backup_configs.sh script, which runs weekly on each server).
PATH_TO_TRANSITION_FILE is the full path to the transition file created during step 1.1.
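For example, assuming the archive was copied to /opt/veridiumid/backup/all_configs/node1_configs.tgz and the transition file was created as /tmp/transition.txt (both names are illustrative):
# the archive name and transition file path are illustrative
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b /opt/veridiumid/backup/all_configs/node1_configs.tgz -t /tmp/transition.txt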
If running on the same nodes, or on nodes that have the same IP addresses, the following command must be run instead of the one above:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE
4. Revert the Zookeeper JSON files
Connect to all persistence nodes and start the Zookeeper service using the following command as root user:
service ver_zookeeper start
Connect to a webapp node and run the following command as root user:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE -t PATH_TO_TRANSITION_FILE -j
or, if a transition file is not used:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE -j
Where:
PATH_TO_BACKUP_ARCHIVE is the full path to the archive copied to the node
PATH_TO_TRANSITION_FILE is the full path to the transition file created during step 1.1
5. Start the Cassandra service
Connect to all persistence nodes and start the Cassandra service using the following command as root user:
service ver_cassandra start
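Before recreating the keyspace, you can verify that Cassandra is up and that the persistence nodes appear in the cluster using nodetool (the same nodetool path is used in section 10 below):
/opt/veridiumid/cassandra/bin/nodetool status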
6. Recreate the Cassandra Keyspace
Run the following command as root user on a single persistence node:
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --create --transition PATH_TO_TRANSITION_FILE
Where:
PATH_TO_BACKUP is the full path of the backup directory mentioned in step 1.1, for example: /opt/veridiumid/backup/cassandra/2022-05-04_12-51
PATH_TO_TRANSITION_FILE is the full path of the transition file created during step 1.1
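For example, using the backup directory above and a transition file created at /tmp/transition.txt (the transition file path is illustrative):
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup /opt/veridiumid/backup/cassandra/2022-05-04_12-51 --create --transition /tmp/transition.txt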
7. Restore the data from the backup
Run the following command as root user on all persistence nodes (this command can be run on all nodes in parallel):
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --restore
Where PATH_TO_BACKUP is the full path of the backup directory mentioned in step 1.1, for example: /opt/veridiumid/backup/cassandra/2022-05-04_12-51
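As recommended in section 1.3, the restore can take a long time on a large database; one way to run it detached under screen (using the example backup path above) is:
screen -d -m python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup /opt/veridiumid/backup/cassandra/2022-05-04_12-51 --restore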
8. Rebuild indexes
Restart the Cassandra service (which was automatically stopped during the previous step) using the following command as root user:
service ver_cassandra start
Run the following command as root user on all persistence nodes (this command can be run on all nodes in parallel) to rebuild the indexes:
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --index
Where PATH_TO_BACKUP is the full path of the backup directory mentioned in step 1.1, for example: /opt/veridiumid/backup/cassandra/2022-05-04_12-51
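Index rebuilds run as background tasks in Cassandra, so their progress can be monitored with nodetool (the same command is used in section 10 below):
/opt/veridiumid/cassandra/bin/nodetool compactionstats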
9. Start all services
Connect to all nodes and run the following command as root user to start all services:
ver_start
10. How to restore the Cassandra DB when only one backup is available (with backups from multiple nodes, the restore is faster because the nodes no longer need to be synchronized)
### scenario: there is one full backup on node1 in DC1; DC1 also contains node2 and node3, and DC2 contains node1, node2 and node3
## leave cassandra running only on the node where the backup exists - let's name it SeedNode; stop cassandra on all other nodes
## from SeedNode, remove all the other nodes from the cluster:
/opt/veridiumid/cassandra/bin/nodetool status
/opt/veridiumid/cassandra/bin/nodetool removenode 69704347-f7ac-4108-9d9d-d9ba4b325b0e
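## (the Host ID above is an example; use the Host ID values shown in the nodetool status output for each node to be removed)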
## after removing the other nodes, only this node (the one where cassandra is running) should remain in the cluster
## recreate the schema, if necessary, only on SeedNode
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --create
## restore from backup, only on SeedNode
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --restore
## start cassandra on this node and wait until the Host ID column is populated for all hosts in the nodetool status output
service ver_cassandra start
### do the following procedure on the remaining nodes, one node at a time:
## remove the old cassandra data and commit logs
rm -fr /opt/veridiumid/cassandra/data/*; rm -fr /opt/veridiumid/cassandra/commitlog/*
## start node
service ver_cassandra start
## follow the startup process and the cluster join:
/opt/veridiumid/cassandra/bin/nodetool describecluster
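## optionally, the join can also be watched with:
/opt/veridiumid/cassandra/bin/nodetool status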
## run data synchronization between the nodes
bash /opt/veridiumid/cassandra/conf/cassandra_maintenance.sh -c /opt/veridiumid/cassandra/conf/maintenance.conf
## see the status by running
/opt/veridiumid/cassandra/bin/nodetool compactionstats
## run compaction for each added node, in order to get correct information.
bash /opt/veridiumid/cassandra/conf/cassandra_maintenance.sh -c /opt/veridiumid/cassandra/conf/maintenance.conf -k
## the services will be functional after the backup has been restored on 2 servers; please continue to restore the other nodes
## also, in the secondary datacenter, you can restore as follows, by running the following on each node:
rm -fr /opt/veridiumid/cassandra/data/*; rm -fr /opt/veridiumid/cassandra/commitlog/*
service ver_cassandra start
/opt/veridiumid/cassandra/bin/nodetool rebuild -dc dc1
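## (dc1 above is the name of the source datacenter to stream the data from; adjust it to match the datacenter name shown by nodetool status)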