Troubleshooting - Disaster Recovery
This article presents the procedure to recover the VeridiumID server.
The procedure covers the recovery of Zookeeper, Cassandra and other configuration files, based on backups.
Backups
Configuration backups
Starting with VeridiumID release 3.2.1, all nodes take a weekly backup (via cron jobs) of the important configuration files present on disk, using the backup_configs.sh script.
To run the script manually, use the following command as root user on any node:
bash /etc/veridiumid/scripts/backup_configs.sh /etc/veridiumid/scripts/backup_configs.conf
All backups are archived and stored in the following path: /opt/veridiumid/backup/all_configs
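To verify that the weekly backups are being created, you can list the backup directory mentioned above:
ls -lh /opt/veridiumid/backup/all_configs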
Database backups
Database backups are taken weekly (via cron jobs) using the Cassandra backup script.
To trigger a Cassandra backup manually, use the following command as root user on all persistence nodes:
bash /opt/veridiumid/backup/cassandra/cassandra_backup.sh -c=/opt/veridiumid/backup/cassandra/cassandra_backup.conf
The backup will be present in the following path: /opt/veridiumid/backup/cassandra
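To list the available database backups and identify the most recent one (backup directories are named by date and time, for example 2022-05-04_12-51, as shown in the restore steps below), you can run:
ls -1t /opt/veridiumid/backup/cassandra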
Recovery
To recover from existing backups, follow the procedure below:
1. Pre-requisites
1.1 Check for backups
Make sure that the configuration backup archives and the database backup directories are present on the nodes where the recovery will take place. If they are not present, copy them from the previous deployment or from the third-party storage solution.
On the new environment, create a transition file containing the old IP addresses and the new ones that will replace them in the configuration files.
The transition file will look similar to the following:
OLD_IP1:NEW_IP1
OLD_IP2:NEW_IP2
for example:
10.2.3.1:10.3.4.1
10.2.3.2:10.3.4.2
10.2.3.3:10.3.4.3
If the configuration revert is performed on the same nodes as the backup, or on nodes that have the same internal IP addresses as the old deployment, the transition file is not required.
According to the transition file, the archive from each old node must be copied to the corresponding new node.
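For example, the transition file can be created with a heredoc; the path /tmp/transition.txt and the IP addresses below are illustrative:
# the path /tmp/transition.txt is illustrative
cat > /tmp/transition.txt << 'EOF'
10.2.3.1:10.3.4.1
10.2.3.2:10.3.4.2
10.2.3.3:10.3.4.3
EOF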
1.2 Check if Python 3 is installed
Make sure that Python 3 is installed on all nodes.
To check whether Python 3 is installed, run the following command:
python3 --version
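If Python 3 is not installed, on RPM-based systems it can usually be installed with yum (the package name python3 is an assumption and may differ per distribution):
# package name may differ depending on the distribution
yum -y install python3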
1.3 Running recommendations
Depending on the size of the database, some commands may take a long time. We recommend running the Cassandra restore commands under screen so that the terminal session does not expire during the process.
To check if you have screen installed use the following command:
screen --help
To install screen use the following command as root:
yum -y install screen
To run commands under screen and detach use the following command format:
screen -d -m COMMAND
Where COMMAND is the Linux command you wish to run under screen.
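For example, to run the Cassandra restore from step 7 detached under a named session and reattach to it later (the session name restore is illustrative):
screen -S restore -d -m python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --restore  # session name "restore" is illustrative
screen -ls
screen -r restore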
2. Stop all services
Connect to all nodes of the deployment and stop services using the following command as root user:
ver_stop
3. Revert the local configurations
Run the following command on all nodes as root user to revert the local configurations for all services:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE -t PATH_TO_TRANSITION_FILE
Where:
PATH_TO_BACKUP_ARCHIVE is the full path to the archive copied to the node (the backup is created by the backup_configs.sh script, which runs weekly on each server).
PATH_TO_TRANSITION_FILE is the full path to the transition file created during step 1.1.
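For example, assuming the archive was copied to /opt/veridiumid/backup/all_configs/node1_configs.tgz and the transition file was created as /tmp/transition.txt (both names are illustrative):
# the archive name and transition file path are illustrative
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b /opt/veridiumid/backup/all_configs/node1_configs.tgz -t /tmp/transition.txt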
If running on the same nodes, or on nodes that have the same IP addresses, the following command must be run instead of the one above:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE
4. Revert the Zookeeper JSON files
Connect to all persistence nodes and start the Zookeeper service using the following command as root user:
service ver_zookeeper start
Connect to a webapp node and run the following command as root user:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE -t PATH_TO_TRANSITION_FILE -j
or, if a transition file is not used:
bash /etc/veridiumid/scripts/config_revert.sh -c /etc/veridiumid/scripts/backup_configs.conf -b PATH_TO_BACKUP_ARCHIVE -j
Where:
PATH_TO_BACKUP_ARCHIVE is the full path to the archive copied to the node
PATH_TO_TRANSITION_FILE is the full path to the transition file created during step 1.1
5. Start the Cassandra service
Connect to all persistence nodes and start the Cassandra service using the following command as root user:
service ver_cassandra start
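Before recreating the keyspace, you can verify that Cassandra is up and that the persistence nodes appear in the cluster using nodetool (the same nodetool path is used in section 10 below):
/opt/veridiumid/cassandra/bin/nodetool status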
6. Recreate the Cassandra Keyspace
Run the following command as root user on a single persistence node:
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --create --transition PATH_TO_TRANSITION_FILE
Where:
PATH_TO_BACKUP is the full path of the backup directory mentioned in step 1.1, for example: /opt/veridiumid/backup/cassandra/2022-05-04_12-51
PATH_TO_TRANSITION_FILE is the full path of the transition file created during step 1.1
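For example, using the backup directory above and a transition file created at /tmp/transition.txt (the transition file path is illustrative):
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup /opt/veridiumid/backup/cassandra/2022-05-04_12-51 --create --transition /tmp/transition.txt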
7. Restore the data from the backup
Run the following command as root user on all persistence nodes (this command can be run on all nodes in parallel):
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --restore
Where PATH_TO_BACKUP is the full path of the backup directory mentioned in step 1.1, for example: /opt/veridiumid/backup/cassandra/2022-05-04_12-51
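As recommended in section 1.3, the restore can take a long time on a large database; one way to run it detached under screen (using the example backup path above) is:
screen -d -m python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup /opt/veridiumid/backup/cassandra/2022-05-04_12-51 --restore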
8. Rebuild indexes
Restart the Cassandra service (which was automatically stopped during the previous step) using the following command as root user:
service ver_cassandra start
Run the following command as root user on all persistence nodes (this command can be run on all nodes in parallel) to rebuild the indexes:
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --index
Where PATH_TO_BACKUP is the full path of the backup directory mentioned in step 1.1, for example: /opt/veridiumid/backup/cassandra/2022-05-04_12-51
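Index rebuilds run as background tasks in Cassandra, so their progress can be monitored with nodetool (the same command is used in section 10 below):
/opt/veridiumid/cassandra/bin/nodetool compactionstats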
9. Start all services
Connect to all nodes and run the following command as root user to start all services:
ver_start
10. How to restore the Cassandra DB when only one backup is available (with backups from multiple nodes, the restore is faster because the nodes no longer need to be synchronized)
### scenario: there is one full backup on node1 in DC1; DC1 also contains node2 and node3, and DC2 contains node1, node2 and node3
## leave cassandra running only on the node where the backup exists - let's name it SeedNode; stop cassandra on all other nodes
## from SeedNode, remove all the other nodes from the cluster:
/opt/veridiumid/cassandra/bin/nodetool status
/opt/veridiumid/cassandra/bin/nodetool removenode 69704347-f7ac-4108-9d9d-d9ba4b325b0e
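## (the Host ID above is an example; use the Host ID values shown in the nodetool status output for each node to be removed)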
## after removing the other nodes, only this node (the one where cassandra is running) should remain in the cluster
## recreate the schema, if necessary, only on SeedNode
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --create
## restore from backup, only on SeedNode
python3 /opt/veridiumid/backup/cassandra/cassandra-restore.py --debug --backup PATH_TO_BACKUP --restore
## start cassandra on this node and wait until the Host ID column is populated for all hosts in the nodetool status output
service ver_cassandra start
### do the following procedure on the remaining nodes, one node at a time:
## remove the old cassandra data and commit logs
rm -fr /opt/veridiumid/cassandra/data/*; rm -fr /opt/veridiumid/cassandra/commitlog/*
## start node
service ver_cassandra start
## follow the startup process and the cluster join:
/opt/veridiumid/cassandra/bin/nodetool describecluster
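## optionally, the join can also be watched with:
/opt/veridiumid/cassandra/bin/nodetool status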
## run data synchronization between the nodes
bash /opt/veridiumid/cassandra/conf/cassandra_maintenance.sh -c /opt/veridiumid/cassandra/conf/maintenance.conf
## see the status by running
/opt/veridiumid/cassandra/bin/nodetool compactionstats
## run compaction for each added node, in order to get correct information.
bash /opt/veridiumid/cassandra/conf/cassandra_maintenance.sh -c /opt/veridiumid/cassandra/conf/maintenance.conf -k
## the services will be functional after the backup has been restored on 2 servers; please continue to restore the other nodes
## also, in the secondary datacenter, you can restore as follows, by running the following on each node:
rm -fr /opt/veridiumid/cassandra/data/*; rm -fr /opt/veridiumid/cassandra/commitlog/*
service ver_cassandra start
/opt/veridiumid/cassandra/bin/nodetool rebuild -dc dc1
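## (dc1 above is the name of the source datacenter to stream the data from; adjust it to match the datacenter name shown by nodetool status)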