[Containers] Disaster Recovery Procedures for Zookeeper and Elasticsearch Clusters

This document provides recovery steps to restore full functionality and prevent data loss in the event of a permanent datacenter outage.

Because Cassandra is deployed in a cross-datacenter replication (CDCR) setup, the permanent loss of one datacenter does not affect functionality. Therefore, this document focuses solely on the recovery procedures for Zookeeper and Elasticsearch.

Restore Zookeeper

# Step 1: Uninstall the existing Zookeeper release
helm delete $ZOOKEEPER_RELEASE_NAME

# Step 2: Force-delete any stuck pods in the failed datacenter
kubectl delete pod $ZOOKEEPER_POD_NAME --grace-period=0 --force

# Step 3: Edit the values file so that all pods can run in the healthy datacenter
# Remove the topologySpreadConstraints from the values file; otherwise the scheduler may still try to place pods in the failed datacenter.
vi values/zookeeper-values.yaml

# Step 4: If the Zookeeper operator pod is also stuck in a "Terminating" state, force-delete it
kubectl delete pod $ZOOKEEPER_OPERATOR_POD --grace-period=0 --force

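# Step 5: List and delete the Zookeeper PersistentVolumeClaims (PVCs)
# Deleting the PVCs discards the existing Zookeeper data; it is restored from backup in Step 7.
oc get pvc | grep data-zookeeper
# Repeat the delete for each PVC listed above
oc delete pvc "$PVC_NAME"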

# Step 6: Reinstall the Zookeeper cluster
# Reinstall Zookeeper using the modified values file with a timeout of 60 minutes.
helm upgrade --install --timeout 60m -f values/zookeeper-values.yaml $ZOOKEEPER_RELEASE_NAME helm/zookeeper-0.2.15.tgz

# Step 7: Restore the most recent backup
# a. Copy the backup file to the maintenance pod (place it under /tmp)
oc cp "$BACKUP_FILE" "$MAINTENANCE_POD_NAME:/tmp/"
# b. Run the restore script inside the maintenance pod
#    basename strips any local directory from $BACKUP_FILE so the path matches the copy under /tmp
oc exec "$MAINTENANCE_POD_NAME" -- bash /scripts/restore-zookeeper.sh "/tmp/$(basename "$BACKUP_FILE")"
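
Steps 5 and 6 above can be partly scripted. The following is a minimal sketch, assuming the PVC names start with data-zookeeper (as the grep in Step 5 suggests) and that the Zookeeper pods carry a label such as app=zookeeper. The function names and the label selector are hypothetical and not part of the original procedure; adjust them to the actual deployment.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the default label selector are assumptions.

# Reads resource names on stdin (as printed by `oc get pvc -o name`) and
# keeps only those whose PVC name starts with "data-zookeeper".
filter_zookeeper_pvcs() {
  grep -E '^(persistentvolumeclaim/)?data-zookeeper' || true
}

# Deletes every Zookeeper PVC in the current project (Step 5).
# This discards the volume contents, so run it only when a backup exists.
delete_zookeeper_pvcs() {
  oc get pvc -o name | filter_zookeeper_pvcs | while read -r pvc; do
    oc delete "$pvc"
  done
}

# Blocks until the reinstalled pods report Ready (after Step 6).
# $1: label selector (hypothetical default), $2: timeout.
wait_for_zookeeper() {
  oc wait --for=condition=Ready pod -l "${1:-app=zookeeper}" --timeout="${2:-600s}"
}
```

With these helpers sourced, delete_zookeeper_pvcs would replace the manual list-and-delete in Step 5, and wait_for_zookeeper could be run after Step 6 so that the restore in Step 7 only starts once the pods are Ready.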

Restore Elasticsearch

# Step 1: Uninstall the existing Elasticsearch release
helm delete $ELASTICSEARCH_RELEASE_NAME

# Step 2: Force-delete any stuck pods in the failed datacenter
kubectl delete pod $ELASTICSEARCH_POD_NAME --grace-period=0 --force

# Step 3: Edit the values file so that all pods can run in the healthy datacenter
# Remove the topologySpreadConstraints from the values file; otherwise the scheduler may still try to place pods in the failed datacenter.
vi values/elasticsearch-values.yaml

# Step 4: Reinstall the Elasticsearch cluster
# Reinstall Elasticsearch using the modified values file.
helm upgrade --install -f values/elasticsearch-values.yaml $ELASTICSEARCH_RELEASE_NAME helm/elasticsearch-0.2.2.tgz

# Note: The Elasticsearch PVCs are not deleted, so the reinstalled pods should mount the old PersistentVolumes and the existing data should be available.

# Step 5 (Optional): Restore from backup if data is missing
# a. Copy the backup file to the maintenance pod (place it under /tmp)
oc cp "$BACKUP_FILE" "$MAINTENANCE_POD_NAME:/tmp/"
# b. Run the restore script inside the maintenance pod
#    basename strips any local directory from $BACKUP_FILE so the path matches the copy under /tmp
oc exec "$MAINTENANCE_POD_NAME" -- bash /scripts/restore-elasticsearch.sh "/tmp/$(basename "$BACKUP_FILE")"
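
Whether the optional restore step is needed depends on whether the old volumes were re-attached and still hold the indices. The following is a minimal sketch of that check, assuming the Elasticsearch PVC names contain "elasticsearch" and that curl is available inside the Elasticsearch container, with the HTTP API reachable on port 9200 without TLS or authentication. The helper names are hypothetical and not part of the original procedure.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, the PVC name pattern, and the curl call are assumptions.

# Reads "NAME PHASE" lines on stdin and succeeds only if at least one line
# is present and every PVC is in phase Bound.
all_pvcs_bound() {
  awk 'NF >= 2 { n++; if ($2 != "Bound") bad++ }
       END { if (n > 0 && bad == 0) exit 0; exit 1 }'
}

# Lists the Elasticsearch PVCs with their phase and checks that all are Bound.
check_elasticsearch_pvcs() {
  oc get pvc --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
    | grep elasticsearch | all_pvcs_bound
}

# Prints the indices known to the cluster; an empty list suggests that the
# data is missing and the backup restore is required.
# $1: name of an Elasticsearch pod.
list_elasticsearch_indices() {
  oc exec "$1" -- curl -s 'http://localhost:9200/_cat/indices?v'
}
```

If check_elasticsearch_pvcs fails or list_elasticsearch_indices shows no indices, that would be a reasonable signal to proceed with the backup restore; otherwise the step can likely be skipped.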