
DR Drill Playbook

Step-by-step runbooks for each disaster-recovery drill. Every drill has: goal, pre-checks, failure-simulation steps, recovery steps, post-verification, and cleanup (return to DC-primary).

Before running any drill

  • All drills are intentionally destructive. Run them in a maintenance window or against a synthetic workload.
  • Announce on the team channel first.
  • Take a pre-drill snapshot of every affected VM via KubeVirt (VirtualMachineSnapshot CR).
  • Open a time-tracking log: record every step's timestamp for RTO measurement.
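The time-tracking log can be a tiny sourced helper rather than manual note-taking. A minimal sketch, assuming GNU date — the log path and function names are illustrative, not an existing script:

```shell
# drill-log.sh — hypothetical helper for RTO measurement; names are illustrative.
DRILL_LOG="${DRILL_LOG:-/tmp/drill-$(date -u +%Y%m%d).log}"

log_step() {
  # Append an ISO-8601 UTC timestamp plus a free-text step description
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$DRILL_LOG"
}

rto_seconds() {
  # Seconds between the first and last logged steps
  local first last
  first=$(head -1 "$DRILL_LOG" | awk '{print $1}')
  last=$(tail -1 "$DRILL_LOG" | awk '{print $1}')
  echo $(( $(date -u -d "$last" +%s) - $(date -u -d "$first" +%s) ))
}
```

Source it at drill start, call `log_step "..."` at every step, and read `rto_seconds` at the end for the report.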

Drill 0 — network cutover (the glue)

Every other drill ends with this step; it gets its own runbook because it's small.

Goal: switch HAProxy backends + PowerDNS records so consumer URLs resolve to DR VMs.

Steps (~1 minute):

  1. On ops-runner, run:

     ```bash
     ssh ze@brac-poc-ops-runner-vm1-dc "/home/ze/drills/cutover-to-dr.sh"
     ```

     The script renders a Jinja template → new haproxy.cfg ConfigMap → commits to staxv-mono gitops → reload-watcher picks it up in ~100 ms.

  2. Verify:

     ```bash
     curl -sI https://minio.apps.brac-poc.comptech-lab.com/ | grep -i server
     # Should show the DR backend header
     ```

  3. Similarly, update the PowerDNS records that point directly at VMs (names like vault-dc.opp.brac-poc.comptech-lab.com → swap to the vault-dr IPs). Script: /home/ze/drills/pdns-switch-to-dr.sh

Cutback: run cutover-to-dc.sh when DC is healthy again.
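For illustration, the backend swap that cutover-to-dr.sh performs can be sketched as a suffix substitution over the HAProxy config (the real script renders a Jinja template; this sed-based version and the function name are hypothetical):

```shell
# cutover-sketch.sh — illustrative only; the real script renders a Jinja template.
# Swap every "-dc" backend server for its "-dr" twin in a haproxy.cfg fragment.
swap_backends() {
  # $1 = target site suffix: "dr" or "dc"
  local target="$1" other
  if [ "$target" = dr ]; then other=dc; else other=dr; fi
  sed -E "s/(server[[:space:]]+[[:alnum:]-]+)-${other}\b/\1-${target}/g"
}
```

Usage would be piping the current config through `swap_backends dr` (and `swap_backends dc` for cutback); committing the result to gitops is left to the real script.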


Drill 1 — MinIO DR

RPO target: seconds · RTO target: near-zero

Pre-checks

```bash
# Upload a test object to DC
mc cp test-file.txt minio-dc/test-bucket/

# Verify it replicated to DR within 10 seconds
sleep 10 && mc ls minio-dr/test-bucket/ | grep test-file.txt
```

Failure simulation

```bash
# Stop the MinIO service on all 3 DC nodes (simulate site outage)
for n in 1 2 3; do ssh ze@minio-vm${n}-dc "sudo systemctl stop minio"; done
```

Recovery (network cutover only — MinIO DR is hot)

  1. Run Drill 0 network cutover
  2. Verify from external:

     ```bash
     mc alias set mc-poc https://minio.apps.brac-poc.comptech-lab.com \
       $ACCESS_KEY $SECRET_KEY
     mc ls mc-poc/test-bucket/
     # Should list the test file
     ```

Post-verification

  • Upload a new object after failover → verify it lands on DR
  • Vault backup upload still works (Vault cron writes to minio.apps... → goes to DR)
  • Nexus blob reads still work (Nexus is configured via same MinIO hostname)

Cleanup (return to DC)

```bash
# Restart DC MinIO
for n in 1 2 3; do ssh ze@minio-vm${n}-dc "sudo systemctl start minio"; done

# Wait for replication to catch up (DR → DC now)
mc admin replicate status minio-dr
# Wait for "Healing" → "Healthy"

# Network cutback
/home/ze/drills/cutover-to-dc.sh
```

Expected timings: cutover <2 min; full resync depends on delta.
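Rather than eyeballing the replicate status until it reads "Healthy", the wait can be scripted with a small retry helper. A sketch — the helper and its defaults are illustrative, not part of the drill scripts:

```shell
# wait_for — retry a command until it succeeds or a timeout (seconds) expires.
# Illustrative usage:
#   wait_for 600 'mc admin replicate status minio-dr | grep -q Healthy'
wait_for() {
  local timeout="$1" cmd="$2" waited=0 interval=5
  until eval "$cmd"; do
    waited=$(( waited + interval ))
    if [ "$waited" -ge "$timeout" ]; then
      echo "wait_for: timed out after ${timeout}s: $cmd" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```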


Drill 2 — Vault DR

RPO target: ≤15 min · RTO target: ≤10 min

Pre-checks

```bash
# Verify the latest snapshot exists in MinIO
mc ls minio/vault-snapshots/ | tail -5
# Should show recent timestamps <15 min old
```

Failure simulation

```bash
# Stop all 3 DC Vault nodes
for n in 1 2 3; do ssh ze@vault-vm${n}-dc "sudo systemctl stop vault"; done
```

Recovery

  1. Fetch the latest snapshot on the DR leader:

     ```bash
     ssh ze@vault-vm1-dr
     mc cp minio/vault-snapshots/"$(mc ls minio/vault-snapshots/ | tail -1 | awk '{print $NF}')" \
       /tmp/latest.snap
     ```

  2. Init DR cluster + restore snapshot:

     ```bash
     # If this is first-time init, initialize with 5 unseal keys (hand back to custodians)
     # If previously initialized, just restore:
     vault operator raft snapshot restore /tmp/latest.snap
     ```

  3. Unseal (need 3 of 5 Shamir shares):

     ```bash
     vault operator unseal <share-1>
     vault operator unseal <share-2>
     vault operator unseal <share-3>
     ```

  4. Verify:

     ```bash
     vault status                # should show sealed=false, standby=false on leader
     vault kv get secret/test    # should read the test path
     vault read pki_int/cert/ca  # intermediate CA should be back
     ```

  5. Repeat unseal on vault-vm2-dr and vault-vm3-dr — they auto-join the DR Raft cluster as followers.

  6. Network cutover (Drill 0) — cert-manager + External Secrets Operator now hit DR Vault.
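The `tail -1` in step 1 assumes `mc ls` output is already time-ordered. A slightly more defensive sketch sorts on the object name itself — this assumes snapshot filenames embed a sortable timestamp, which is an assumption about the naming scheme:

```shell
# latest_snapshot — pick the newest snapshot name from `mc ls`-style output.
# Assumes filenames embed a sortable timestamp (naming scheme is an assumption).
latest_snapshot() {
  # mc ls prints the object name in the last whitespace-separated field
  awk '{print $NF}' | sort | tail -1
}
```

Usage: `mc ls minio/vault-snapshots/ | latest_snapshot`.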

Post-verification

  • External Secrets Operator reconciles an ExternalSecret → new Secret materialises → workload pod reads it: proves the whole pipeline
  • Issue a test leaf cert from the intermediate CA — should succeed

Data-loss assessment

  • Any secret written in the last ≤15 min of DC life is lost (it was still in the DC Vault's Raft state and never made it into a snapshot)
  • If this is unacceptable, shrink the snapshot interval to 5 min
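Shrinking the interval is a one-line change wherever the snapshot cron lives (the Vault Ansible role, per the open items). A sketch of what the entry might look like — the paths, script name, and MinIO alias are illustrative:

```bash
# /etc/cron.d/vault-snapshot — illustrative sketch; the real entry lives in the Vault Ansible role
# Every 5 minutes: take a Raft snapshot and push it to the MinIO snapshot bucket
*/5 * * * * vault /usr/local/bin/vault-snapshot.sh

# /usr/local/bin/vault-snapshot.sh (sketch):
#   ts=$(date -u +%Y-%m-%dT%H-%M)
#   vault operator raft snapshot save /tmp/vault-${ts}.snap
#   mc cp /tmp/vault-${ts}.snap minio/vault-snapshots/ && rm -f /tmp/vault-${ts}.snap
```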

Cleanup

Once DC comes back up:

```bash
# On DC nodes: destroy the stale Raft state, then rejoin the DR cluster as followers
ssh ze@vault-vm1-dc "sudo systemctl stop vault; sudo rm -rf /var/lib/vault/raft/*"

# Join the DR cluster (run from the DC node, pointing at the DR leader):
ssh ze@vault-vm1-dc "vault operator raft join https://vault-vm1-dr:8201"

# Unseal the DC node, then repeat for vm2 and vm3

# Cutback
/home/ze/drills/cutover-to-dc.sh
```


Drill 3 — PostgreSQL failover (any app's PG)

RPO target: seconds · RTO target: ~5 min

Demonstrated here for Keycloak's dedicated PG; the steps are identical for any per-app PG.

Pre-checks

```bash
# On DC PG: write a sentinel row
ssh ze@keycloak-pg-vm1-dc
sudo -u postgres psql -d keycloak -c \
  "INSERT INTO public.dr_test (ts) VALUES (NOW());"

# On DR PG: verify replication caught up (<1 sec)
ssh ze@keycloak-pg-vm1-dr
sudo -u postgres psql -d keycloak -c \
  "SELECT * FROM public.dr_test ORDER BY ts DESC LIMIT 1;"
```

Failure simulation

```bash
ssh ze@keycloak-pg-vm1-dc "sudo systemctl stop postgresql"
```

Recovery

  1. Promote DR to primary:

     ```bash
     ssh ze@keycloak-pg-vm1-dr "sudo -u postgres pg_ctlcluster 16 main promote"
     ```

     Verify:

     ```bash
     sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
     # Expected: f (false)
     ```

  2. Update Keycloak to point at the DR PG. Via AWX: run the job template keycloak-switch-db with var DB_HOST=keycloak-pg-vm1-dr. Or manually: edit /etc/keycloak/keycloak.conf and restart Keycloak on both the DC and DR Keycloak VMs.

  3. Verify Keycloak login still works:

     ```bash
     curl -s https://keycloak.brac-poc.comptech-lab.com/realms/brac-poc
     # Should return JSON for the realm
     ```

Cleanup

  1. On the old DC PG (now stopped): re-init it as a streaming replica of DR:

     ```bash
     sudo -u postgres pg_basebackup -h keycloak-pg-vm1-dr -D /var/lib/postgresql/16/main \
       -U replicator -P -R
     sudo systemctl start postgresql
     ```

  2. When ready for cutback: reverse the promotion — stop the DR primary, promote the DC replica, and reconnect the app.
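The lag comparison behind the "<1 sec" pre-check (and the smoke drills' ">60 s" alert) can be factored into a tiny helper. A sketch — the function name and the psql query in the comment are illustrative:

```shell
# pg_lag_ok — compare a replication lag (in seconds) to a threshold (default 60).
# The lag value would come from something like:
#   psql -tAc "SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) FROM pg_stat_replication;"
pg_lag_ok() {
  local lag="$1" threshold="${2:-60}"
  # awk handles fractional-second lag values; exit 0 = within threshold
  awk -v l="$lag" -v t="$threshold" 'BEGIN { exit (l <= t) ? 0 : 1 }'
}
```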

Drill 4 — Redis failover

RPO target: seconds · RTO target: ~2 min

Pre-checks

```bash
# Write a sentinel key to the master
redis-cli -h redis.apps.brac-poc.comptech-lab.com -a $PASS SET dr-test "pre-failover"
```

Failure simulation

```bash
# Stop the Redis master (assume redis-vm1-dc is currently master)
ssh ze@redis-vm1-dc "sudo systemctl stop redis-server redis-sentinel"
```

Recovery

Sentinel quorum (5 sentinels across DC + DR sites) detects loss, elects new master automatically.

Observe:

```bash
# Via a remaining sentinel
redis-cli -h redis-vm1-dr -p 26379 SENTINEL master mymaster
# Should show the new master IP (a DR node)
```

Verify from the app side:

```bash
redis-cli -h redis.apps.brac-poc.comptech-lab.com -a $PASS GET dr-test
# Returns "pre-failover" — data preserved; connection automatically routed
```

Cleanup

```bash
# Restart the DC master; it auto-rejoins as a replica of the new master
ssh ze@redis-vm1-dc "sudo systemctl start redis-server redis-sentinel"

# Optionally fail back (not necessary)
redis-cli -h redis-vm1-dr -p 26379 SENTINEL failover mymaster
```


Drill 5 — Hub failover (hub-dc → hub-dr)

RPO target: ≤15 min · RTO target: ~30 min

Pre-checks

```bash
# Verify the latest ACM backup exists
oc -n open-cluster-management-backup get backup | tail -5

# Verify the backups are in MinIO
mc ls minio/acm-hub-backup/ | tail -5
```

Failure simulation

Shut down hub-dc's SNO node (simulate site outage):

```bash
ssh ze@hub-dc-node "sudo shutdown -h now"
```

(Alternatively: block hub-dc's API/ingress VIPs at HAProxy to simulate network partition.)

Recovery

  1. Verify hub-dc is truly down:

     ```bash
     curl -kI https://api.hub-dc.opp.brac-poc.comptech-lab.com:6443
     # Should time out
     ```

  2. Log in to hub-dr:

     ```bash
     oc login api.hub-dr.opp.brac-poc.comptech-lab.com:6443
     ```

  3. Restore ACM from backup by applying a Restore CR:

     ```yaml
     apiVersion: cluster.open-cluster-management.io/v1beta1
     kind: Restore
     metadata:
       name: restore-acm-hub-dr
       namespace: open-cluster-management-backup
     spec:
       veleroManagedClustersBackupName: latest
       veleroCredentialsBackupName: latest
       veleroResourcesBackupName: latest
     ```

  4. Wait for ManagedClusters to re-register:

     ```bash
     oc get managedcluster
     # Spokes may show "Unknown" initially; bootstrap kubeconfig rotation fixes it
     ```

  5. Re-bootstrap the klusterlet on spokes (one-time, scripted):

     ```bash
     # For each spoke, extract the new import.yaml from hub-dr and apply it:
     oc get secret -n <cluster-name> import-yaml -o jsonpath='{.data.import\.yaml}' \
       | base64 -d | oc --context=<spoke> apply -f -
     ```

  6. Verify end-to-end:
    • ACM UI on hub-dr shows all 3 spokes Healthy
    • ApplicationSets continue syncing (ArgoCD on hub-dr was part of the restore)
    • Policy framework compliance re-reports

Cleanup (cutback after hub-dc repair)

This is non-trivial — hub-dr has been accumulating state. Options:

  • A) Promote hub-dr permanently — hub-dc becomes the new DR when it comes back, and the backup/restore direction flips. Simpler.
  • B) Back up hub-dr → restore on hub-dc → demote hub-dr. Traditional, harder to script.

For the POC we demo option A.


Drill 6 — Spoke failover (spoke-dc → spoke-dr)

RPO target: workloads' own RPO (depends on the PG/Redis tiers above) · RTO target: ~5 min

Pre-checks

  • ArgoCD ApplicationSets show both spoke-dc and spoke-dr Healthy
  • Workloads are deployed on both (run oc get pods -A on each)

Failure simulation

```bash
# Take all 3 spoke-dc nodes offline
for n in 1 2 3; do ssh ze@spoke-dc-node${n} "sudo shutdown -h now"; done
```

Recovery

Automatic: HAProxy health-checks mark spoke-dc API and ingress backends as down → traffic flips to spoke-dr backends. End-user URLs unchanged.

Observe:

```bash
# Route traffic
curl -sI https://console-openshift-console.routes.spoke-dc.opp.brac-poc.comptech-lab.com/
# May 5xx for a moment, then resolve to spoke-dr's cert

# Check ArgoCD status — should show DC pods unhealthy, DR continues
oc --context=hub-dc get application -A
```

Cleanup

  • Bring spoke-dc back online: nodes PXE/reboot back to Ready
  • ArgoCD self-heals → workloads return to Running on spoke-dc
  • ApplicationSet targets both spokes → no manual intervention needed


Drill 7 — Full site failover (integrated)

Goal: demonstrate end-to-end DC-outage → DR operation → full application access. Combines drills 0–6.

Sequence

  1. Announce + snapshot all affected VMs
  2. Drill 0: pre-flight (verify MinIO, Vault, PG all healthy at DR)
  3. Simulate DC outage: shut down all DC-named VMs + OCP nodes at DC site
  4. Drill 1: MinIO cutover (most urgent — everything depends on it)
  5. Drill 2: Vault restore + unseal on DR
  6. Drill 3: PG failovers (Keycloak PG, WSO2 APIM PG, WSO2 IS PG, GitLab PG, Temporal PG, n8n PG, AWX PG, Terrakube PG)
  7. Drill 4: Redis Sentinel auto-failover
  8. Drill 5: Hub OCP failover (ACM restore on hub-dr)
  9. Drill 6: Spoke workload cutover (automatic via HAProxy)
  10. Verify user journeys:
    • Log in to Keycloak → DR
    • Browse GitLab → DR
    • Hit OCP console for spoke-dr
    • View SigNoz dashboard → traces from spoke-dr
    • Run a Temporal workflow
    • Trigger a Jenkins build (if Jenkins-dr has config)

Target total RTO

  • Individual drills: ~5–30 min each
  • Sequential full failover: ~60–90 min
  • Parallelized (multiple operators, drills in parallel): ~30–40 min

Demo posture for BRAC

For the Day-6 demo: rehearse the full drill during Days 4–5 and record a 10–15 min video. On demo day, play the recording with live narration and take Q&A — running the full drill live is too risky.


Smoke drills (automated, run daily)

Small scripted checks on ops-runner that prove DR endpoints are alive without actually failing over. Runs via cron + AWX job.

| Check | How | Alert if fails |
| --- | --- | --- |
| minio.apps.brac-poc… resolves to DC | dig +short | PagerDuty |
| MinIO-DR health | mc admin info minio-dr | PagerDuty |
| Vault-DR unsealed + replicated | vault status + list policies | PagerDuty |
| DR PG streaming lag | query pg_stat_replication | PagerDuty if lag >60 s |
| ACM cluster-backup recent | list Velero backups, check timestamp | PagerDuty if gap >30 min |
| MinIO replication healthy | mc admin replicate status | PagerDuty |
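These checks can be wrapped in one runner that prints PASS/FAIL per check and exits non-zero so cron/AWX can page. A sketch — the script name and the commented wiring are illustrative:

```shell
# dr-smoke.sh — run each smoke check, print PASS/FAIL, exit 1 if any failed.
FAILED=0

run_check() {
  local name="$1" cmd="$2"
  if eval "$cmd" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    FAILED=1
  fi
}

# Example wiring (real commands per the table above):
#   run_check "minio-dr health"   "mc admin info minio-dr"
#   run_check "vault-dr status"   "vault status"
#   run_check "minio replication" "mc admin replicate status minio-dr"
# End the script with: exit "$FAILED"
```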

Roles and responsibilities

| Role | During drill | Handle |
| --- | --- | --- |
| Project Lead | Coordinates, announces | Timer + narrative |
| Infrastructure Lead | Shut-down commands, OCP steps | hub/spoke drills |
| Platform Lead | Vault, MinIO, Redis, SigNoz | tier-specific drills |
| DevOps Lead | GitLab, Jenkins, Nexus, AWX | CI/CD tier drills |
| Security Lead | Policy compliance re-check, audit log | drill evidence package |

After each drill, compile a short report (timings, issues, fixes) — add to reports/dr-drills/YYYY-MM-DD.md in the repo.


Open items to execute

  • Scripts on ops-runner under /home/ze/drills/ — written in the execution session that provisions ops-runner
  • Vault snapshot cron — in the Vault Ansible role
  • ACM BackupSchedule CR — in the GitOps policies/ folder
  • PostgreSQL streaming replication — in the PG Ansible role (per-app)
  • MinIO site replication setup — in the MinIO Ansible role

Created: 2026-04-24 · Owner: Infrastructure + Platform Leads · Status: playbook drafted, scripts pending execution session