DR Drill Playbook¶
Step-by-step runbooks for each disaster-recovery drill. Every drill has: goal, pre-checks, failure-simulation steps, recovery steps, post-verification, and cleanup (return to DC-primary).
Before running any drill
- All drills are deliberately destructive. Run in a maintenance window or on synthetic workload.
- Announce on the team channel first.
- Take a pre-drill snapshot of every affected VM via KubeVirt (`VirtualMachineSnapshot` CR).
- Open a time-tracking log: record every step's timestamp for RTO measurement.
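A pre-drill snapshot can be requested with a `VirtualMachineSnapshot` CR; a minimal sketch (namespace and VM name are assumptions, one CR per affected VM):

```yaml
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: pre-drill-minio-vm1-dc   # one snapshot per affected VM
  namespace: vms                 # assumed namespace
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: minio-vm1-dc
```

Apply it, then wait for `status.readyToUse: true` before starting the drill.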
Drill 0 — network cutover (the glue)¶
Every other drill ends up doing this. Separate runbook because it's small.
Goal: switch HAProxy backends + PowerDNS records so consumer URLs resolve to DR VMs.
Steps (~1 minute):
- On ops-runner, run:

    ```bash
    ssh ze@brac-poc-ops-runner-vm1-dc \
      "/home/ze/drills/cutover-to-dr.sh"
    ```

    The script renders a Jinja template → new `haproxy.cfg` ConfigMap → commits to `staxv-mono` GitOps → reload-watcher picks up in ~100ms.

- Verify:

    ```bash
    curl -sI https://minio.apps.brac-poc.comptech-lab.com/ | grep -i server
    # Should show DR backend header
    ```

- Similarly update the PowerDNS records that point directly at VMs (names like `vault-dc.opp.brac-poc.comptech-lab.com` → swap to `vault-dr` IPs). Script: `/home/ze/drills/pdns-switch-to-dr.sh`
Cutback: run `cutover-to-dc.sh` when DC is healthy again.
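A minimal sketch of what a cutover script like `cutover-to-dr.sh` might look like (the real one renders Jinja and commits to GitOps; here `sed` and `/tmp` paths stand in, and all names are assumptions):

```shell
#!/bin/sh
set -eu
# Sketch only: swap the active backend site in a templated haproxy.cfg.
SITE="${1:-dr}"                          # target site: dc | dr
TEMPLATE=/tmp/haproxy.cfg.tmpl           # stand-in for the Jinja template
OUT=/tmp/haproxy.cfg

# Stand-in template: __SITE__ marks the active backend pool
printf 'backend minio\n  server s1 minio-vm1-__SITE__:9000 check\n' > "$TEMPLATE"

# Render (the real script uses Jinja instead of sed)
sed "s/__SITE__/${SITE}/g" "$TEMPLATE" > "$OUT"
grep "minio-vm1-${SITE}" "$OUT"

# The real script would now commit the rendered haproxy.cfg ConfigMap to the
# staxv-mono GitOps repo; the reload-watcher applies it in ~100ms.
```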
Drill 1 — MinIO DR¶
RPO target: seconds · RTO target: near-zero
Pre-checks¶
```bash
# Upload test object to DC
mc cp test-file.txt minio-dc/test-bucket/

# Verify replicated to DR within 10 sec
sleep 10 && mc ls minio-dr/test-bucket/ | grep test-file.txt
```
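A fixed `sleep 10` can miss slow replication or waste time; a polling variant is more robust. A sketch with a generic helper (the `mc` invocation is commented out because it needs live aliases; the local demonstration uses a marker file instead):

```shell
#!/bin/sh
# poll_until CMD...: retry a check every 2s until it succeeds or 30s elapse.
poll_until() {
  deadline=$(( $(date +%s) + 30 ))
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 2
  done
}

# Drill usage (aliases/bucket from the pre-check above):
#   poll_until sh -c 'mc ls minio-dr/test-bucket/ | grep -q test-file.txt'

# Local demonstration: the check passes once a marker file appears
rm -f /tmp/replicated.marker
( sleep 3; touch /tmp/replicated.marker ) &
poll_until test -f /tmp/replicated.marker && echo "RPO check OK"
```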
Failure simulation¶
```bash
# Stop MinIO service on all 3 DC nodes (simulate site outage)
for n in 1 2 3; do ssh ze@minio-vm${n}-dc "sudo systemctl stop minio"; done
```
Recovery (network cutover only — MinIO DR is hot)¶
- Run Drill 0 network cutover
- Verify from external:

    ```bash
    mc alias set mc-poc https://minio.apps.brac-poc.comptech-lab.com \
      $ACCESS_KEY $SECRET_KEY
    mc ls mc-poc/test-bucket/
    # Should list the test file
    ```
Post-verification¶
- Upload a new object after failover → verify it lands on DR
- Vault backup upload still works (Vault cron writes to `minio.apps...` → goes to DR)
- Nexus blob reads still work (Nexus is configured via the same MinIO hostname)
Cleanup (return to DC)¶
```bash
# Restart DC MinIO
for n in 1 2 3; do ssh ze@minio-vm${n}-dc "sudo systemctl start minio"; done

# Wait for replication to catch up (DR → DC now)
mc admin replicate status minio-dr
# Wait for "Healing" → "Healthy"

# Network cutback
/home/ze/drills/cutover-to-dc.sh
```
Expected timings: cutover <2 min; full resync depends on delta.
Drill 2 — Vault DR¶
RPO target: ≤15 min · RTO target: ≤10 min
Pre-checks¶
```bash
# Verify latest snapshot exists at MinIO
mc ls minio/vault-snapshots/ | tail -5
# Should show recent timestamps <15 min old
```
Failure simulation¶
```bash
# Stop all 3 DC Vault nodes
for n in 1 2 3; do ssh ze@vault-vm${n}-dc "sudo systemctl stop vault"; done
```
Recovery¶
- Fetch latest snapshot on DR leader:

    ```bash
    ssh ze@vault-vm1-dr
    mc cp minio/vault-snapshots/$(mc ls minio/vault-snapshots/ | tail -1 | awk '{print $NF}') \
      /tmp/latest.snap
    ```

- Init DR cluster + restore snapshot:

    ```bash
    # If this is first-time init, initialize with 5 unseal keys (hand back to custodians)
    # If previously initialized, just restore:
    vault operator raft snapshot restore /tmp/latest.snap
    ```

- Unseal (need 3 of 5 Shamir shares):

    ```bash
    vault operator unseal <share-1>
    vault operator unseal <share-2>
    vault operator unseal <share-3>
    ```

- Verify:

    ```bash
    vault status                # should show sealed=false, standby=false on leader
    vault kv get secret/test    # should read the test path
    vault read pki_int/cert/ca  # intermediate CA should be back
    ```

- Repeat unseal on `vault-vm2-dr` and `vault-vm3-dr` — they auto-join the DR Raft cluster as followers.

- Network cutover (Drill 0) — cert-manager + External Secrets Operator now hit DR Vault.
Post-verification¶
- External Secrets Operator reconciles an `ExternalSecret` → a new `Secret` materialises → a workload pod reads it: proves the whole pipeline
- Issue a test leaf cert from the intermediate CA — should succeed
Data-loss assessment¶
- Any secret written in the last ≤15 min of DC life is lost (still in the DC Vault's Raft state that was never snapshotted)
- If this is unacceptable, shrink the snapshot interval to 5 min
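The snapshot side of this RPO is a small cron job. A sketch (paths and names are assumptions; `vault`/`mc` are stubbed here so the flow is runnable anywhere; drop the stubs on the real host):

```shell
#!/bin/sh
set -eu
# Stubs so the sketch runs without a live Vault/MinIO; remove on the real host.
vault() { echo "stub: vault $*"; : > "$5"; }   # pretend snapshot save
mc()    { echo "stub: mc $*"; }                # pretend upload

STAMP=$(date -u +%Y%m%dT%H%M%SZ)
SNAP="/tmp/vault-${STAMP}.snap"

vault operator raft snapshot save "$SNAP"      # real host: needs VAULT_ADDR + token
mc cp "$SNAP" "minio/vault-snapshots/vault-${STAMP}.snap"
rm -f "$SNAP"
echo "snapshot ${STAMP} uploaded"

# crontab entry for a 15-min RPO (use */5 to tighten to 5 min):
# */15 * * * * /home/ze/drills/vault-snapshot.sh
```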
Cleanup¶
Once DC comes back up:

```bash
# On DC nodes: destroy Raft state (it's stale) and rejoin the DR cluster as followers
ssh ze@vault-vm1-dc "sudo systemctl stop vault; sudo rm -rf /var/lib/vault/raft/*"

# Then rejoin, pointing the DC node at the DR leader:
ssh ze@vault-vm1-dc "vault operator raft join https://vault-vm1-dr:8201"

# Unseal the DC node
# Repeat for vm2, vm3

# Cutback
/home/ze/drills/cutover-to-dc.sh
```
Drill 3 — PostgreSQL failover (any app's PG)¶
RPO target: seconds · RTO target: ~5 min
Demonstrated here for Keycloak's dedicated PG; the steps are identical for any per-app PG.
Pre-checks¶
```bash
# On DC PG: write a sentinel row
ssh ze@keycloak-pg-vm1-dc sudo -u postgres psql -d keycloak -c \
  "INSERT INTO public.dr_test (ts) VALUES (NOW());"

# On DR PG: verify replication caught up (<1 sec)
ssh ze@keycloak-pg-vm1-dr sudo -u postgres psql -d keycloak -c \
  "SELECT * FROM public.dr_test ORDER BY ts DESC LIMIT 1;"
```
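Beyond the sentinel row, lag can be measured quantitatively on the primary; a query sketch against the standard `pg_stat_replication` view (run as `postgres` on the current primary):

```sql
-- Lag in bytes per connected standby; near 0 means the DR replica is caught up
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```

The same query backs the "DR PG streaming lag" smoke drill below.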
Failure simulation¶
```bash
ssh ze@keycloak-pg-vm1-dc "sudo systemctl stop postgresql"
```
Recovery¶
- Promote DR to primary:

    ```bash
    ssh ze@keycloak-pg-vm1-dr "sudo -u postgres pg_ctlcluster 16 main promote"
    ```

    Verify:

    ```bash
    sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
    # Expected: f (false)
    ```

- Update Keycloak to point at the DR PG. Via AWX: run job template `keycloak-switch-db` with var `DB_HOST=keycloak-pg-vm1-dr`. Or manually edit `/etc/keycloak/keycloak.conf` + restart Keycloak on both the DC and DR Keycloak VMs.

- Verify Keycloak login still works:

    ```bash
    curl -s https://keycloak.brac-poc.comptech-lab.com/realms/brac-poc
    # Should return JSON for the realm
    ```
Cleanup¶
- On the old DC PG (now stopped): re-init as a streaming replica of DR:

    ```bash
    sudo -u postgres pg_basebackup -h keycloak-pg-vm1-dr -D /var/lib/postgresql/16/main \
      -U replicator -P -R
    sudo systemctl start postgresql
    ```

- When ready for cutback: reverse the promotion by stopping the DR primary, promoting the DC replica, and reconnecting the app.
Drill 4 — Redis failover¶
RPO target: seconds · RTO target: ~2 min
Pre-checks¶
```bash
# Write sentinel key to master
redis-cli -h redis.apps.brac-poc.comptech-lab.com -a $PASS SET dr-test "pre-failover"
```
Failure simulation¶
```bash
# Stop Redis master (assume redis-vm1-dc is currently master)
ssh ze@redis-vm1-dc "sudo systemctl stop redis-server redis-sentinel"
```
Recovery¶
Sentinel quorum (5 sentinels across DC + DR sites) detects loss, elects new master automatically.
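The automatic election depends on Sentinel configuration; a sketch of the relevant `sentinel.conf` lines (values are assumptions; with 5 sentinels, failover also needs a live majority, so place at least 3 outside any single failure domain):

```
# sentinel.conf (sketch; same on all 5 sentinels)
sentinel monitor mymaster <current-master-ip> 6379 3   # quorum 3
sentinel down-after-milliseconds mymaster 5000         # declare SDOWN after 5s
sentinel failover-timeout mymaster 60000
```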
Observe:

```bash
# Via a remaining sentinel
redis-cli -h redis-vm1-dr -p 26379 SENTINEL master mymaster
# Should show new master IP (a DR node)
```

Verify from the app side:

```bash
redis-cli -h redis.apps.brac-poc.comptech-lab.com -a $PASS GET dr-test
# Returns "pre-failover" — data preserved; connection automatically routed
```
Cleanup¶
```bash
# Restart DC master; it auto-rejoins as a replica of the new master
ssh ze@redis-vm1-dc "sudo systemctl start redis-server redis-sentinel"

# Optionally fail back (not necessary)
redis-cli -h redis-vm1-dr -p 26379 SENTINEL failover mymaster
```
Drill 5 — Hub failover (hub-dc → hub-dr)¶
RPO target: ≤15 min · RTO target: ~30 min
Pre-checks¶
```bash
# Verify latest ACM backup exists
oc -n open-cluster-management-backup get backup | tail -5

# Verify backups are in MinIO
mc ls minio/acm-hub-backup/ | tail -5
```
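The backups checked above are produced by an ACM `BackupSchedule`; a sketch of the CR (schedule and TTL are assumptions matching the ≤15 min RPO):

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "*/15 * * * *"   # cron; matches the ≤15 min RPO target
  veleroTtl: 24h                   # keep a day of backups
```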
Failure simulation¶
Shut down hub-dc's SNO node (simulate site outage):
```bash
ssh ze@hub-dc-node "sudo shutdown -h now"
```
(Alternatively: block hub-dc's API/ingress VIPs at HAProxy to simulate network partition.)
Recovery¶
- Verify hub-dc is truly down:

    ```bash
    curl -kI https://api.hub-dc.opp.brac-poc.comptech-lab.com:6443
    # Should time out
    ```

- Log in to hub-dr: `oc login api.hub-dr.opp.brac-poc.comptech-lab.com:6443`

- Restore ACM from backup by applying a Restore CR:

    ```yaml
    apiVersion: cluster.open-cluster-management.io/v1beta1
    kind: Restore
    metadata:
      name: restore-acm-hub-dr
      namespace: open-cluster-management-backup
    spec:
      veleroManagedClustersBackupName: latest
      veleroCredentialsBackupName: latest
      veleroResourcesBackupName: latest
    ```

- Wait for ManagedClusters to re-register:

    ```bash
    oc get managedcluster
    # Spokes may show "Unknown" initially; bootstrap kubeconfig rotation fixes it
    ```

- Re-bootstrap the klusterlet on spokes (one-time, scripted):

    ```bash
    # For each spoke, extract the new import.yaml from hub-dr and apply it:
    oc get secret -n <cluster-name> import-yaml -o jsonpath='{.data.import\.yaml}' \
      | base64 -d | oc --context=<spoke> apply -f -
    ```

- Verify end-to-end:
    - ACM UI on hub-dr shows all 3 spokes Healthy
    - ApplicationSets continue syncing (ArgoCD on hub-dr was part of the restore)
    - Policy framework compliance re-reporting resumes
Cleanup (cutback after hub-dc repair)¶
This is non-trivial — hub-dr has been accumulating state. Options:

- A) Promote hub-dr permanently — hub-dc becomes the new DR when it comes back; the backup/restore direction flips. Simpler.
- B) Backup hub-dr → restore on hub-dc → demote hub-dr. Traditional, harder to script.
For the POC we demo option A.
Drill 6 — Spoke failover (spoke-dc → spoke-dr)¶
RPO target: workloads' own RPO (deps on PG/Redis above) · RTO target: ~5 min
Pre-checks¶
- ArgoCD ApplicationSets show both `spoke-dc` and `spoke-dr` Healthy
- Workloads are deployed on both (run `oc get pods -A` on each)
Failure simulation¶
```bash
# Take all 3 spoke-dc nodes offline
for n in 1 2 3; do ssh ze@spoke-dc-node${n} "sudo shutdown -h now"; done
```
Recovery¶
Automatic: HAProxy health-checks mark spoke-dc API and ingress backends as down → traffic flips to spoke-dr backends. End-user URLs unchanged.
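The flip works because the DR servers are configured as `backup` in HAProxy; a sketch of such a backend (names and ports are assumptions):

```
backend spoke-ingress
    mode tcp
    # DC servers serve while healthy; the "backup" DR servers take over only
    # when all DC health checks fail. Consumer URLs never change.
    server spoke-dc-1 spoke-dc-node1:443 check
    server spoke-dc-2 spoke-dc-node2:443 check
    server spoke-dc-3 spoke-dc-node3:443 check
    server spoke-dr-1 spoke-dr-node1:443 check backup
    server spoke-dr-2 spoke-dr-node2:443 check backup
    server spoke-dr-3 spoke-dr-node3:443 check backup
```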
Observe:

```bash
# Route traffic
curl -sI https://console-openshift-console.routes.spoke-dc.opp.brac-poc.comptech-lab.com/
# May 5xx for a moment, then resolve to spoke-dr's cert

# Check ArgoCD status — should show DC pods unhealthy, DR continues
oc --context=hub-dc get application -A
```
Cleanup¶
- Bring spoke-dc back online (nodes PXE/reboot back to Ready)
- ArgoCD self-heals → workloads return to Running on spoke-dc
- ApplicationSet targets both spokes → no manual intervention
Drill 7 — Full site failover (integrated)¶
Goal: demonstrate end-to-end DC-outage → DR operation → full application access. Combines drills 0–6.
Sequence¶
- Announce + snapshot all affected VMs
- Drill 0: pre-flight (verify MinIO, Vault, PG all healthy at DR)
- Simulate DC outage: shut down all DC-named VMs + OCP nodes at DC site
- Drill 1: MinIO cutover (most urgent — everything depends on it)
- Drill 2: Vault restore + unseal on DR
- Drill 3: PG failovers (Keycloak PG, WSO2 APIM PG, WSO2 IS PG, GitLab PG, Temporal PG, n8n PG, AWX PG, Terrakube PG)
- Drill 4: Redis Sentinel auto-failover
- Drill 5: Hub OCP failover (ACM restore on hub-dr)
- Drill 6: Spoke workload cutover (automatic via HAProxy)
- Verify user journeys:
- Log in to Keycloak → DR
- Browse GitLab → DR
- Hit OCP console for spoke-dr
- View SigNoz dashboard → traces from spoke-dr
- Run a Temporal workflow
- Trigger a Jenkins build (if Jenkins-dr has config)
Target total RTO¶
- Individual drills: ~5–30 min each
- Sequential full failover: ~60–90 min
- Parallelized (multiple operators, drills in parallel): ~30–40 min
Demo posture for BRAC¶
For the Day-6 demo: rehearse the full drill during Days 4-5, record a 10-15 min video. Demo day: show the pre-recorded drill + narrate, take Q&A. (Doing the full drill live on demo day is too risky; record beforehand and show.)
Smoke drills (automated, run daily)¶
Small scripted checks on ops-runner that prove DR endpoints are alive without actually failing over. Runs via cron + AWX job.
| Check | How | Alert if fails |
|---|---|---|
| `minio.apps.brac-poc…` resolves to DC | `dig +short` | PagerDuty |
| MinIO-DR health | `mc admin info minio-dr` | PagerDuty |
| Vault-DR unsealed + replicated | `vault status` + list policies | PagerDuty |
| DR PG streaming lag | query `pg_stat_replication` | PagerDuty if lag >60s |
| ACM cluster-backup recent | list Velero backups, check timestamp | PagerDuty if gap >30 min |
| MinIO replication healthy | `mc admin replicate status` | PagerDuty |
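A minimal shape for wiring these checks into one cron-driven script on ops-runner (a sketch; the real checks are commented out and stand-ins `true`/`false` let the skeleton run anywhere):

```shell
#!/bin/sh
# Smoke-drill runner sketch: run each check, collect failures, alert once.
FAILED=""

run_check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    FAILED="${FAILED}${name}; "
  fi
}

# Real checks (uncomment on ops-runner):
#   run_check "minio-dr-health" mc admin info minio-dr
#   run_check "acm-backup-recent" sh -c 'oc -n open-cluster-management-backup get backup | tail -1'
run_check "demo-pass" true
run_check "demo-fail" false

# The real runner would page (e.g. PagerDuty) instead of echoing
[ -n "$FAILED" ] && echo "ALERT: $FAILED"
```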
Roles and responsibilities¶
| Role | During drill | Handle |
|---|---|---|
| Project Lead | Coordinates, announces | Timer + narrative |
| Infrastructure Lead | Shut-down commands, OCP steps | hub/spoke drills |
| Platform Lead | Vault, MinIO, Redis, SigNoz | tier-specific drills |
| DevOps Lead | GitLab, Jenkins, Nexus, AWX | CI/CD tier drills |
| Security Lead | Policy compliance re-check, audit log of drill | evidence package |
After each drill, compile a short report (timings, issues, fixes) — add to `reports/dr-drills/YYYY-MM-DD.md` in the repo.
Open items to execute¶
- Scripts on ops-runner under `/home/ze/drills/` — written in the execution session that provisions ops-runner
- Vault snapshot cron — in the Vault Ansible role
- ACM BackupSchedule CR — in the GitOps `policies/` folder
- PostgreSQL streaming replication — in the PG Ansible role (per-app)
- MinIO site replication setup — in the MinIO Ansible role
Created: 2026-04-24 · Owner: Infrastructure + Platform Leads · Status: playbook drafted, scripts pending execution session