
DR Drill Playbook

Step-by-step runbooks for each disaster-recovery drill. Every drill has: goal, pre-checks, failure-simulation steps, recovery steps, post-verification, and cleanup (return to DC-primary).

Before running any drill

  • All drills are intentionally destructive. Run them in a maintenance window or against a synthetic workload.
  • Announce on the team channel first.
  • Take a pre-drill snapshot of every affected VM via KubeVirt (VirtualMachineSnapshot CR).
  • Open a time-tracking log: record every step's timestamp for RTO measurement.
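The time-tracking log can be a tiny sourced helper rather than manual note-taking. A minimal sketch, assuming GNU date — the log path and function names are illustrative, not an existing script:

```shell
# drill-log.sh — hypothetical helper for RTO measurement; names are illustrative.
DRILL_LOG="${DRILL_LOG:-/tmp/drill-$(date -u +%Y%m%d).log}"

log_step() {
  # Append an ISO-8601 UTC timestamp plus a free-text step description
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$DRILL_LOG"
}

rto_seconds() {
  # Seconds between the first and last logged steps
  local first last
  first=$(head -1 "$DRILL_LOG" | awk '{print $1}')
  last=$(tail -1 "$DRILL_LOG" | awk '{print $1}')
  echo $(( $(date -u -d "$last" +%s) - $(date -u -d "$first" +%s) ))
}
```

Source it at drill start, call `log_step "..."` at every step, and read `rto_seconds` at the end for the report.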

Drill 0 — network cutover (the glue)

Every other drill ends with this step; it gets its own runbook because it's small.

Goal: switch HAProxy backends + PowerDNS records so consumer URLs resolve to DR VMs.

Steps (~1 minute):

  1. On ops-runner, run:

     ```bash
     ssh ze@brac-poc-ops-runner-vm1-dc "/home/ze/drills/cutover-to-dr.sh"
     ```

     The script renders a Jinja template → new haproxy.cfg ConfigMap → commits to staxv-mono gitops → reload-watcher picks it up in ~100 ms.

  2. Verify:

     ```bash
     curl -sI https://minio.apps.brac-poc.comptech-lab.com/ | grep -i server
     # Should show the DR backend header
     ```

  3. Similarly, update the PowerDNS records that point directly at VMs (names like vault-dc.opp.brac-poc.comptech-lab.com → swap to the vault-dr IPs). Script: /home/ze/drills/pdns-switch-to-dr.sh

Cutback: run cutover-to-dc.sh when DC is healthy again.
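For illustration, the backend swap that cutover-to-dr.sh performs can be sketched as a suffix substitution over the HAProxy config (the real script renders a Jinja template; this sed-based version and the function name are hypothetical):

```shell
# cutover-sketch.sh — illustrative only; the real script renders a Jinja template.
# Swap every "-dc" backend server for its "-dr" twin in a haproxy.cfg fragment.
swap_backends() {
  # $1 = target site suffix: "dr" or "dc"
  local target="$1" other
  if [ "$target" = dr ]; then other=dc; else other=dr; fi
  sed -E "s/(server[[:space:]]+[[:alnum:]-]+)-${other}\b/\1-${target}/g"
}
```

Usage would be piping the current config through `swap_backends dr` (and `swap_backends dc` for cutback); committing the result to gitops is left to the real script.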


Drill 1 — MinIO DR

RPO target: seconds · RTO target: near-zero

Pre-checks

```bash
# Upload a test object to DC
mc cp test-file.txt minio-dc/test-bucket/

# Verify it replicated to DR within 10 seconds
sleep 10 && mc ls minio-dr/test-bucket/ | grep test-file.txt
```

Failure simulation

```bash
# Stop the MinIO service on all 3 DC nodes (simulate site outage)
for n in 1 2 3; do ssh ze@minio-vm${n}-dc "sudo systemctl stop minio"; done
```

Recovery (network cutover only — MinIO DR is hot)

  1. Run Drill 0 network cutover
  2. Verify from external:

     ```bash
     mc alias set mc-poc https://minio.apps.brac-poc.comptech-lab.com \
       $ACCESS_KEY $SECRET_KEY
     mc ls mc-poc/test-bucket/
     # Should list the test file
     ```

Post-verification

  • Upload a new object after failover → verify it lands on DR
  • Vault backup upload still works (Vault cron writes to minio.apps... → goes to DR)
  • Nexus blob reads still work (Nexus is configured via same MinIO hostname)

Cleanup (return to DC)

```bash
# Restart DC MinIO
for n in 1 2 3; do ssh ze@minio-vm${n}-dc "sudo systemctl start minio"; done

# Wait for replication to catch up (DR → DC now)
mc admin replicate status minio-dr
# Wait for "Healing" → "Healthy"

# Network cutback
/home/ze/drills/cutover-to-dc.sh
```

Expected timings: cutover <2 min; full resync depends on delta.
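Rather than eyeballing the replicate status until it reads "Healthy", the wait can be scripted with a small retry helper. A sketch — the helper and its defaults are illustrative, not part of the drill scripts:

```shell
# wait_for — retry a command until it succeeds or a timeout (seconds) expires.
# Illustrative usage:
#   wait_for 600 'mc admin replicate status minio-dr | grep -q Healthy'
wait_for() {
  local timeout="$1" cmd="$2" waited=0 interval=5
  until eval "$cmd"; do
    waited=$(( waited + interval ))
    if [ "$waited" -ge "$timeout" ]; then
      echo "wait_for: timed out after ${timeout}s: $cmd" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```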


Drill 2 — Vault DR

RPO target: ≤15 min · RTO target: ≤10 min

Pre-checks

```bash
# Verify the latest snapshot exists in MinIO
mc ls minio/vault-snapshots/ | tail -5
# Should show recent timestamps <15 min old
```

Failure simulation

```bash
# Stop all 3 DC Vault nodes
for n in 1 2 3; do ssh ze@vault-vm${n}-dc "sudo systemctl stop vault"; done
```

Recovery

  1. Fetch the latest snapshot on the DR leader:

     ```bash
     ssh ze@vault-vm1-dr
     mc cp minio/vault-snapshots/"$(mc ls minio/vault-snapshots/ | tail -1 | awk '{print $NF}')" \
       /tmp/latest.snap
     ```

  2. Init DR cluster + restore snapshot:

     ```bash
     # If this is first-time init, initialize with 5 unseal keys (hand back to custodians)
     # If previously initialized, just restore:
     vault operator raft snapshot restore /tmp/latest.snap
     ```

  3. Unseal (need 3 of 5 Shamir shares):

     ```bash
     vault operator unseal <share-1>
     vault operator unseal <share-2>
     vault operator unseal <share-3>
     ```

  4. Verify:

     ```bash
     vault status                # should show sealed=false, standby=false on leader
     vault kv get secret/test    # should read the test path
     vault read pki_int/cert/ca  # intermediate CA should be back
     ```

  5. Repeat unseal on vault-vm2-dr and vault-vm3-dr — they auto-join the DR Raft cluster as followers.

  6. Network cutover (Drill 0) — cert-manager + External Secrets Operator now hit DR Vault.
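The `tail -1` in step 1 assumes `mc ls` output is already time-ordered. A slightly more defensive sketch sorts on the object name itself — this assumes snapshot filenames embed a sortable timestamp, which is an assumption about the naming scheme:

```shell
# latest_snapshot — pick the newest snapshot name from `mc ls`-style output.
# Assumes filenames embed a sortable timestamp (naming scheme is an assumption).
latest_snapshot() {
  # mc ls prints the object name in the last whitespace-separated field
  awk '{print $NF}' | sort | tail -1
}
```

Usage: `mc ls minio/vault-snapshots/ | latest_snapshot`.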

Post-verification

  • External Secrets Operator reconciles an ExternalSecret → new Secret materialises → workload pod reads it: proves the whole pipeline
  • Issue a test leaf cert from the intermediate CA — should succeed

Data-loss assessment

  • Any secret written in the last ≤15 min of DC life is lost (it was still in the DC Vault's Raft state and never made it into a snapshot)
  • If this is unacceptable, shrink the snapshot interval to 5 min
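Shrinking the interval is a one-line change wherever the snapshot cron lives (the Vault Ansible role, per the open items). A sketch of what the entry might look like — the paths, script name, and MinIO alias are illustrative:

```bash
# /etc/cron.d/vault-snapshot — illustrative sketch; the real entry lives in the Vault Ansible role
# Every 5 minutes: take a Raft snapshot and push it to the MinIO snapshot bucket
*/5 * * * * vault /usr/local/bin/vault-snapshot.sh

# /usr/local/bin/vault-snapshot.sh (sketch):
#   ts=$(date -u +%Y-%m-%dT%H-%M)
#   vault operator raft snapshot save /tmp/vault-${ts}.snap
#   mc cp /tmp/vault-${ts}.snap minio/vault-snapshots/ && rm -f /tmp/vault-${ts}.snap
```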

Cleanup

Once DC comes back up:

```bash
# On DC nodes: destroy the stale Raft state, then rejoin the DR cluster as followers
ssh ze@vault-vm1-dc "sudo systemctl stop vault; sudo rm -rf /var/lib/vault/raft/*"

# Join the DR cluster (run from the DC node, pointing at the DR leader):
ssh ze@vault-vm1-dc "vault operator raft join https://vault-vm1-dr:8201"

# Unseal the DC node, then repeat for vm2 and vm3

# Cutback
/home/ze/drills/cutover-to-dc.sh
```


Drill 3 — PostgreSQL failover (any app's PG)

RPO target: seconds · RTO target: ~5 min

Demonstrated here for Keycloak's dedicated PG; the steps are identical for any per-app PG.

Pre-checks

```bash
# On DC PG: write a sentinel row
ssh ze@keycloak-pg-vm1-dc
sudo -u postgres psql -d keycloak -c \
  "INSERT INTO public.dr_test (ts) VALUES (NOW());"

# On DR PG: verify replication caught up (<1 sec)
ssh ze@keycloak-pg-vm1-dr
sudo -u postgres psql -d keycloak -c \
  "SELECT * FROM public.dr_test ORDER BY ts DESC LIMIT 1;"
```

Failure simulation

```bash
ssh ze@keycloak-pg-vm1-dc "sudo systemctl stop postgresql"
```

Recovery

  1. Promote DR to primary:

     ```bash
     ssh ze@keycloak-pg-vm1-dr "sudo -u postgres pg_ctlcluster 16 main promote"
     ```

     Verify:

     ```bash
     sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
     # Expected: f (false)
     ```

  2. Update Keycloak to point at the DR PG. Via AWX: run the job template keycloak-switch-db with var DB_HOST=keycloak-pg-vm1-dr. Or manually: edit /etc/keycloak/keycloak.conf and restart Keycloak on both the DC and DR Keycloak VMs.

  3. Verify Keycloak login still works:

     ```bash
     curl -s https://keycloak.brac-poc.comptech-lab.com/realms/brac-poc
     # Should return JSON for the realm
     ```

Cleanup

  1. On the old DC PG (now stopped): re-init it as a streaming replica of DR:

     ```bash
     sudo -u postgres pg_basebackup -h keycloak-pg-vm1-dr -D /var/lib/postgresql/16/main \
       -U replicator -P -R
     sudo systemctl start postgresql
     ```

  2. When ready for cutback: reverse the promotion — stop the DR primary, promote the DC replica, and reconnect the app.
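The lag comparison behind the "<1 sec" pre-check (and the smoke drills' ">60 s" alert) can be factored into a tiny helper. A sketch — the function name and the psql query in the comment are illustrative:

```shell
# pg_lag_ok — compare a replication lag (in seconds) to a threshold (default 60).
# The lag value would come from something like:
#   psql -tAc "SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) FROM pg_stat_replication;"
pg_lag_ok() {
  local lag="$1" threshold="${2:-60}"
  # awk handles fractional-second lag values; exit 0 = within threshold
  awk -v l="$lag" -v t="$threshold" 'BEGIN { exit (l <= t) ? 0 : 1 }'
}
```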

Drill 4 — Redis failover

RPO target: seconds · RTO target: ~2 min

Pre-checks

```bash
# Write a sentinel key to the master
redis-cli -h redis.apps.brac-poc.comptech-lab.com -a $PASS SET dr-test "pre-failover"
```

Failure simulation

```bash
# Stop the Redis master (assume redis-vm1-dc is currently master)
ssh ze@redis-vm1-dc "sudo systemctl stop redis-server redis-sentinel"
```

Recovery

Sentinel quorum (5 sentinels across DC + DR sites) detects loss, elects new master automatically.

Observe:

```bash
# Via a remaining sentinel
redis-cli -h redis-vm1-dr -p 26379 SENTINEL master mymaster
# Should show the new master IP (a DR node)
```

Verify from the app side:

```bash
redis-cli -h redis.apps.brac-poc.comptech-lab.com -a $PASS GET dr-test
# Returns "pre-failover" — data preserved; connection automatically routed
```

Cleanup

```bash
# Restart the DC master; it auto-rejoins as a replica of the new master
ssh ze@redis-vm1-dc "sudo systemctl start redis-server redis-sentinel"

# Optionally fail back (not necessary)
redis-cli -h redis-vm1-dr -p 26379 SENTINEL failover mymaster
```


Drill 5 — Hub failover (hub-dc → hub-dr)

RPO target: ≤15 min · RTO target: ~30 min

Pre-checks

```bash
# Verify the latest ACM backup exists
oc -n open-cluster-management-backup get backup | tail -5

# Verify the backups are in MinIO
mc ls minio/acm-hub-backup/ | tail -5
```

Failure simulation

Shut down hub-dc's SNO node (simulate site outage):

```bash
ssh ze@hub-dc-node "sudo shutdown -h now"
```

(Alternatively: block hub-dc's API/ingress VIPs at HAProxy to simulate network partition.)

Recovery

  1. Verify hub-dc is truly down:

     ```bash
     curl -kI https://api.hub-dc.opp.brac-poc.comptech-lab.com:6443
     # Should time out
     ```

  2. Log in to hub-dr:

     ```bash
     oc login api.hub-dr.opp.brac-poc.comptech-lab.com:6443
     ```

  3. Restore ACM from backup by applying a Restore CR:

     ```yaml
     apiVersion: cluster.open-cluster-management.io/v1beta1
     kind: Restore
     metadata:
       name: restore-acm-hub-dr
       namespace: open-cluster-management-backup
     spec:
       veleroManagedClustersBackupName: latest
       veleroCredentialsBackupName: latest
       veleroResourcesBackupName: latest
     ```

  4. Wait for ManagedClusters to re-register:

     ```bash
     oc get managedcluster
     # Spokes may show "Unknown" initially; bootstrap kubeconfig rotation fixes it
     ```

  5. Re-bootstrap the klusterlet on spokes (one-time, scripted):

     ```bash
     # For each spoke, extract the new import.yaml from hub-dr and apply it:
     oc get secret -n <cluster-name> import-yaml -o jsonpath='{.data.import\.yaml}' \
       | base64 -d | oc --context=<spoke> apply -f -
     ```

  6. Verify end-to-end:
    • ACM UI on hub-dr shows all 3 spokes Healthy
    • ApplicationSets continue syncing (ArgoCD on hub-dr was part of the restore)
    • Policy framework compliance re-reports

Cleanup (cutback after hub-dc repair)

This is non-trivial — hub-dr has been accumulating state. Options:

  • A) Promote hub-dr permanently — hub-dc becomes the new DR when it comes back, and the backup/restore direction flips. Simpler.
  • B) Back up hub-dr → restore on hub-dc → demote hub-dr. Traditional, harder to script.

For the POC we demo option A.


Drill 6 — Spoke failover (spoke-dc → spoke-dr)

RPO target: workloads' own RPO (depends on the PG/Redis tiers above) · RTO target: ~5 min

Pre-checks

  • ArgoCD ApplicationSets show both spoke-dc and spoke-dr Healthy
  • Workloads are deployed on both (run oc get pods -A on each)

Failure simulation

```bash
# Take all 3 spoke-dc nodes offline
for n in 1 2 3; do ssh ze@spoke-dc-node${n} "sudo shutdown -h now"; done
```

Recovery

Automatic: HAProxy health-checks mark spoke-dc API and ingress backends as down → traffic flips to spoke-dr backends. End-user URLs unchanged.

Observe:

```bash
# Route traffic
curl -sI https://console-openshift-console.routes.spoke-dc.opp.brac-poc.comptech-lab.com/
# May 5xx for a moment, then resolve to spoke-dr's cert

# Check ArgoCD status — should show DC pods unhealthy, DR continues
oc --context=hub-dc get application -A
```

Cleanup

  • Bring spoke-dc back online: nodes PXE/reboot back to Ready
  • ArgoCD self-heals → workloads return to Running on spoke-dc
  • ApplicationSet targets both spokes → no manual intervention needed


Drill 7 — Full site failover (integrated)

Goal: demonstrate end-to-end DC-outage → DR operation → full application access. Combines drills 0–6.

Sequence

  1. Announce + snapshot all affected VMs
  2. Drill 0: pre-flight (verify MinIO, Vault, PG all healthy at DR)
  3. Simulate DC outage: shut down all DC-named VMs + OCP nodes at DC site
  4. Drill 1: MinIO cutover (most urgent — everything depends on it)
  5. Drill 2: Vault restore + unseal on DR
  6. Drill 3: PG failovers (Keycloak PG, WSO2 APIM PG, WSO2 IS PG, GitLab PG, Temporal PG, n8n PG, AWX PG, Terrakube PG)
  7. Drill 4: Redis Sentinel auto-failover
  8. Drill 5: Hub OCP failover (ACM restore on hub-dr)
  9. Drill 6: Spoke workload cutover (automatic via HAProxy)
  10. Verify user journeys:
    • Log in to Keycloak → DR
    • Browse GitLab → DR
    • Hit OCP console for spoke-dr
    • View SigNoz dashboard → traces from spoke-dr
    • Run a Temporal workflow
    • Trigger a Jenkins build (if Jenkins-dr has config)

Target total RTO

  • Individual drills: ~5–30 min each
  • Sequential full failover: ~60–90 min
  • Parallelized (multiple operators, drills in parallel): ~30–40 min

Demo posture for BRAC

For the Day-6 demo: rehearse the full drill during Days 4–5 and record a 10–15 min video. On demo day, play the recording with live narration and take Q&A — running the full drill live is too risky.


Smoke drills (automated, run daily)

Small scripted checks on ops-runner that prove DR endpoints are alive without actually failing over. Runs via cron + AWX job.

| Check | How | Alert if fails |
| --- | --- | --- |
| minio.apps.brac-poc… resolves to DC | dig +short | PagerDuty |
| MinIO-DR health | mc admin info minio-dr | PagerDuty |
| Vault-DR unsealed + replicated | vault status + list policies | PagerDuty |
| DR PG streaming lag | query pg_stat_replication | PagerDuty if lag >60 s |
| ACM cluster-backup recent | list Velero backups, check timestamp | PagerDuty if gap >30 min |
| MinIO replication healthy | mc admin replicate status | PagerDuty |
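These checks can be wrapped in one runner that prints PASS/FAIL per check and exits non-zero so cron/AWX can page. A sketch — the script name and the commented wiring are illustrative:

```shell
# dr-smoke.sh — run each smoke check, print PASS/FAIL, exit 1 if any failed.
FAILED=0

run_check() {
  local name="$1" cmd="$2"
  if eval "$cmd" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    FAILED=1
  fi
}

# Example wiring (real commands per the table above):
#   run_check "minio-dr health"   "mc admin info minio-dr"
#   run_check "vault-dr status"   "vault status"
#   run_check "minio replication" "mc admin replicate status minio-dr"
# End the script with: exit "$FAILED"
```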

Roles and responsibilities

| Role | During drill | Handle |
| --- | --- | --- |
| Project Lead | Coordinates, announces | Timer + narrative |
| Infrastructure Lead | Shut-down commands, OCP steps | hub/spoke drills |
| Platform Lead | Vault, MinIO, Redis, SigNoz | tier-specific drills |
| DevOps Lead | GitLab, Jenkins, Nexus, AWX | CI/CD tier drills |
| Security Lead | Policy compliance re-check, audit log | drill evidence package |

After each drill, compile a short report (timings, issues, fixes) — add to reports/dr-drills/YYYY-MM-DD.md in the repo.


Open items to execute

  • Scripts on ops-runner under /home/ze/drills/ — written in the execution session that provisions ops-runner
  • Vault snapshot cron — in the Vault Ansible role
  • ACM BackupSchedule CR — in the GitOps policies/ folder
  • PostgreSQL streaming replication — in the PG Ansible role (per-app)
  • MinIO site replication setup — in the MinIO Ansible role

Created: 2026-04-24 · Owner: Infrastructure + Platform Leads · Status: playbook drafted, scripts pending execution session