DC/DR Strategy¶
Disaster-recovery approach for every tier of the BRAC POC. Every component has a documented replication mode, RPO/RTO target, failover trigger, and recovery sequence.
Posture: active-passive (DR is hot-standby). On DC failure, a one-click (or short scripted) failover brings DR to primary. After DC recovery, a failback returns normal operation.
Glossary¶
| Term | Meaning |
|---|---|
| RPO (Recovery Point Objective) | How much data loss is acceptable; "we lose at most N minutes of data" |
| RTO (Recovery Time Objective) | How quickly service must be restored; "we are back up within N minutes" |
| Active-passive | DR stands idle (or read-only); fails over on DC loss |
| Async replication | Data replicated with lag (seconds to minutes) |
| Sync replication | Data committed only when written to both sites (adds latency, guarantees no loss) |
Target SLAs per tier¶
| Tier | RPO | RTO | Mode |
|---|---|---|---|
| Secrets (Vault) | ≤15 min | ≤10 min | Async Raft snapshot + restore |
| Object storage (MinIO) | seconds | near-zero | Async bucket replication |
| Cache (Redis) | seconds | ~2 min | Async replication + manual promote |
| PostgreSQL (all dedicated PGs) | seconds | ~5 min | Streaming WAL replication |
| OpenShift hub clusters | ~15 min | ~30 min | ACM cluster-backup to MinIO |
| OpenShift spoke clusters | seconds (app state via Git) | ~5 min | ArgoCD AppSet targeting both spokes |
| ClickHouse (SigNoz backend) | seconds | near-zero | Native ClickHouse replication |
| GitLab CE | ~1 hour | ~30 min | Nightly backup to MinIO + restore |
| Jenkins | ~1 hour | ~15 min | Backup JENKINS_HOME to MinIO |
| Nexus | ~1 hour | ~15 min | Backup Nexus data dir to MinIO |
| WSO2 APIM (all profiles) | seconds | ~10 min | Shared PG replication + WSO2 restart against DR DB |
| WSO2 IS | seconds | ~10 min | PG replication; session state lost (re-login) |
| Temporal | seconds | ~5 min | PG replication; in-flight workflows may be lost |
| n8n | seconds | ~5 min | PG replication |
| Splunk | ~1 day | ~15 min | Nightly backup (2-day retention anyway) |
| SigNoz UI | — | seconds | Stateless; restart on DR |
| AWX / Terrakube | ~1 hour | ~15 min | PG replication + backup |
Per-component strategy¶
1. Vault (secrets + PKI intermediate CA)¶
Replication: Vault OSS doesn't ship with cross-site DR (that's Enterprise). We use Raft snapshot + restore scheduled every 15 minutes.
```bash
# Cron on vault-vm1-dc (leader) — every 15 min:
vault operator raft snapshot save - | mc pipe minio/vault-snapshots/$(date +%s).snap
```
- On DC failure: a technician retrieves the latest snapshot from MinIO (still accessible — MinIO itself has cross-site replication) and restores it on `vault-vm1-dr`:

```bash
vault operator raft snapshot restore /tmp/latest.snap
# Then unseal with the 5-of-5 keys
```

- Unseal keys: 5-of-5 Shamir shares; custodians are distributed so no single party can unseal DC or DR alone.
- PKI implication: the intermediate CA material is in Vault's storage — restoring the snapshot brings the intermediate CA back up. Issued certs in rotation continue to validate.
Runbook: DR-DRILL-PLAYBOOK.md → Drill 2
2. MinIO (object storage)¶
Replication: MinIO's native site replication (async, bucket-level). All 3 DC nodes continuously replicate to all 3 DR nodes.
```bash
mc admin replicate add minio-dc minio-dr
```
- Characteristics: async with sub-second lag; both sites always readable; writes land on DC, replicate to DR
- On DC failure: update the DNS/HAProxy backend for `minio.apps.brac-poc.comptech-lab.com` to point to the DR nodes. Applications reading from MinIO (Vault snapshots, Splunk backups, Nexus blob store) continue transparently
- Failback: after DC recovery, MinIO reverses replication direction (DR → DC) until caught up, then traffic swaps back to primary
Runbook: Drill 1
3. PostgreSQL (each dedicated per-app PG)¶
Replication: async streaming replication via pg_basebackup + continuous WAL shipping.
- DC is primary; DR is hot standby (read-only replica)
- WAL shipped every few seconds; replication lag typically <1 sec
- Managed via Ansible role `postgresql-streaming-replica`
On DC failure:
```bash
# On DR PG VM, promote to primary:
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main

# Update app config to point DB host → DR
# For WSO2: ansible-playbook wso2-switch-db-to-dr.yml
```
- Data loss: seconds of unreplicated WAL (acceptable for our POC; production would use sync replication for tier-1 apps)
- Failback: re-initialise DC from DR as the new primary via `pg_basebackup`, then swap roles back
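On the standby side, the replication role boils down to two pieces of state in the DR data directory — a minimal sketch assuming PostgreSQL 16 (the hostname and replication user here are illustrative, not the deployed values):

```ini
# postgresql.auto.conf on the DR standby — typically written by pg_basebackup -R
primary_conninfo = 'host=pg-app-vm1-dc port=5432 user=replicator application_name=dr_standby'

# Plus an empty standby.signal file in the data directory: its presence keeps
# the instance in read-only recovery; `pg_ctl promote` removes it on failover.
```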
Runbook: Drill 3
4. Redis + Sentinel (cache)¶
Replication: cross-site async replication. Master on redis-vm1-dc; replicas on redis-vm2-dc, redis-vm3-dc, redis-vm1-dr, redis-vm2-dr, redis-vm3-dr. Sentinel quorum (5 sentinels across both sites) monitors and fails over.
- Caveat: cross-site Sentinel failover works but can flap on network hiccups. For the POC we accept this; production usually uses either per-site Sentinel clusters + app-level cutover or Redis Enterprise with Active-Active.
On DC failure: Sentinel in DR detects lost quorum, elects a new master from DR replicas automatically. App connections reconnect via Sentinel client.
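The Sentinel quorum can be sketched as a config fragment — the master name, timeouts, and quorum value here are illustrative, not the deployed ones:

```ini
# sentinel.conf — identical on all 5 sentinels across both sites (illustrative)
sentinel monitor brac-cache redis-vm1-dc 6379 3
sentinel down-after-milliseconds brac-cache 10000
sentinel failover-timeout brac-cache 60000
```

Placement matters: quorum only triggers down-detection, but the actual failover must be authorized by a majority of all sentinels (3 of 5 here), so at least 3 sentinels must survive at DR for the automatic DC-loss failover described above to proceed.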
Runbook: Drill 4
5. OpenShift — hub clusters (hub-dc + hub-dr)¶
Replication: ACM's cluster-backup-chart (part of RHACM 2.16), which uses the OADP Operator (v1.5.5, Velero under the hood) as the backup engine. Runs scheduled backups of:
- ManagedCluster, Placement, Policy, GitOpsCluster, KlusterletAddonConfig CRs
- ACM MultiClusterHub state
- Secrets (ACM hub credentials, image-pull secrets)
- Backups sent to MinIO bucket acm-hub-backup/
```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "*/15 * * * *"  # every 15 min
  veleroTtl: 72h
```
On hub-dc failure:
1. Restore ACM state from latest MinIO backup onto hub-dr using Restore CR
2. hub-dr's MultiClusterHub becomes primary; spokes re-register under hub-dr's klusterlet agent
3. ArgoCD continues syncing from Git — no workload disruption since spoke-dc/dr are independent
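Step 1 is a single CR apply against hub-dr — a sketch of the Restore shape (the `latest` keyword and spec fields follow the ACM cluster-backup chart; the name is illustrative):

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: latest   # re-attaches spokes to hub-dr
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```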
Runbook: Drill 5
6. OpenShift — spoke clusters (spoke-dc + spoke-dr)¶
Replication: workload deployment via ArgoCD ApplicationSet targets both spokes. Same Git commit lands the same manifests on spoke-dc and spoke-dr simultaneously.
- Stateless apps: run active-active on both spokes
- Stateful apps with per-cluster PV data (if any): see per-workload strategy
On spoke-dc failure:
1. HAProxy health checks mark spoke-dc's API/ingress as down
2. Traffic auto-shifts to spoke-dr (same wildcard DNS pattern keeps users on the same URL)
3. spoke-dr workloads (already running) pick up
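The dual-spoke targeting can be sketched with a cluster-generator ApplicationSet — the label selector, repo URL, and path below are placeholders, not the actual POC values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: brac-workloads
  namespace: openshift-gitops
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: brac-poc   # assumed label carried by both spoke-dc and spoke-dr
  template:
    metadata:
      name: "{{name}}-workloads"
    spec:
      project: default
      source:
        repoURL: https://gitlab.example/brac/manifests.git   # placeholder URL
        targetRevision: main
        path: apps/
      destination:
        server: "{{server}}"
      syncPolicy:
        automated: {}
```

The same Git commit therefore renders one Application per matching spoke, which is what keeps both sites converged on identical manifests.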
Runbook: Drill 6
7. ClickHouse (SigNoz observability backend)¶
Replication: ClickHouse native ReplicatedMergeTree engine — 1 shard × 2 replicas (one per site). Coordinated via ClickHouse Keeper (ZooKeeper-compatible).
- Writes go to either replica; replicate to the other in near-real-time
- Reads load-balanced between both
- Survives single-replica loss transparently
On DC failure: DR replica serves reads + writes; SigNoz UI automatically reconnects (DNS failover).
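Replication is declared per table — a sketch of the engine declaration (the table, columns, and `ON CLUSTER` name are placeholders, not SigNoz's actual schema):

```sql
-- Each replica registers under the same Keeper path; the {shard}/{replica}
-- macros are filled in from each node's local config.
CREATE TABLE logs_local ON CLUSTER brac_poc
(
    timestamp DateTime64(9),
    body      String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs_local', '{replica}')
ORDER BY timestamp;
```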
8. GitLab CE¶
Replication: GitLab CE lacks built-in DR (that's Geo — Enterprise). We use a nightly backup via gitlab-backup create to MinIO plus a Git repository mirror to the DR site every 15 minutes.
```bash
# Cron on gitlab-vm1-dc — nightly:
gitlab-backup create SKIP=uploads STRATEGY=copy
mc cp /var/opt/gitlab/backups/latest.tar minio/gitlab-backup/
```
- On DC failure: restore the latest backup on `gitlab-vm1-dr`: `gitlab-backup restore BACKUP=latest`
- Data loss window: up to 24 h (nightly backup) plus anything committed since — acceptable for a POC
- Git repositories specifically are mirrored every 15 min via `git push --mirror`, so loss there is ≤15 min
9. Jenkins¶
Replication: backup JENKINS_HOME (job configs, credentials, build history) nightly to MinIO via tarball. Pipeline definitions live in Git so they're recoverable regardless.
On DC failure: restore JENKINS_HOME on jenkins-vm1-dr from MinIO; restart Jenkins; running builds lost (accept).
10. Nexus¶
Replication: Nexus backs up its data dir nightly to MinIO. Since Nexus's blob store itself is configured with an S3-backed MinIO backend (the MinIO bucket has cross-site replication), blobs are essentially free (replicated by MinIO).
Config/metadata (search index, components.db) = nightly tarball to MinIO.
On DC failure: stand up nexus-vm1-dr, restore metadata, point at replicated MinIO bucket — blobs immediately available.
11. WSO2 APIM (distributed)¶
Replication:
- Shared APIM DB on PG replicates DC → DR via streaming replication (§3)
- APIM config artifacts (API definitions, policies) backed up nightly to MinIO
- Container images in Nexus, replicated via MinIO
On DC failure: WSO2 nodes at DR are already running; PG on DR promoted; WSO2 restarts against new PG endpoint; gateway routes via HAProxy switch to DR backends.
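The "restarts against new PG endpoint" step is a datasource change in each profile's deployment.toml — a sketch with illustrative host and database names (WSO2's config uses `type = "postgre"` for PostgreSQL):

```toml
[database.apim_db]
type = "postgre"
url = "jdbc:postgresql://pg-wso2-vm1-dr:5432/apim_db"   # was the -dc host before failover
username = "wso2"
password = "CHANGE_ME"   # placeholder; real deployments keep this in secure vault
```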
12. WSO2 IS¶
Same pattern as APIM (DB replication + restart against new endpoint). Session tokens are not replicated; users re-authenticate after failover (acceptable).
13. Temporal¶
Replication: Temporal stores workflow state in PG. PG streaming replication = Temporal DR.
- In-flight workflows: running workflows are in memory + workflow-task queue in DB; async replication means the last few milliseconds of state may be lost on failover
- Accepted trade-off for OSS; Temporal Enterprise has multi-cluster replication
14. n8n¶
PG replication + restart. n8n is mostly idle worker+scheduler, so failover is clean.
15. Splunk¶
Replication: Splunk Free doesn't support clustering or replication. Nightly splunkd index backup + config tarball → MinIO.
- Retention is 2 days anyway — a 1-day backup gap is within the retention window, so even full DC loss doesn't destroy logs older than what would already be deleted
16. SigNoz UI¶
Stateless. SigNoz-vm1-dr always running; on DC failure, DNS points to DR; ClickHouse (§7) provides the data.
17. AWX + Terrakube¶
PG replication (§3) + backup of workspace artifacts to MinIO. On failover, run the bootstrap Ansible on awx-vm1-dr to re-init against restored PG; Terrakube similarly.
Automation matrix per component¶
Many of our tools are OSS — some failovers will be automatic, some script-assisted, some manual with runbook. Every row has a documented way forward.
| Tier | Failover mode | Scripted fallback | Manual runbook |
|---|---|---|---|
| MinIO | 🟢 Automatic (site replication + HAProxy health check) | `/home/ze/drills/minio-cutover.sh` | Drill 1 |
| Vault | 🟡 Semi-auto (snapshot cron → script restore) | `vault-restore-snapshot.sh` (pulls from MinIO, restores, unseals with piped shares) | Drill 2 |
| PostgreSQL (per-app) | 🟢 Automatic if pg_auto_failover enabled; else 🟡 scripted | `pg-failover-$APP.sh` (promote DR, reconfigure app via AWX) | Drill 3 |
| Redis + Sentinel | 🟢 Fully automatic (Sentinel quorum) | — | Drill 4 |
| OCP hub (ACM) | 🟡 Script-assisted (Velero Restore CR apply) | `hub-failover.sh` (kicks Restore CR via `oc apply`) | Drill 5 |
| OCP spoke | 🟢 Automatic (HAProxy + ApplicationSet on both spokes) | — | Drill 6 |
| ClickHouse | 🟢 Automatic (native replication) | — | — |
| GitLab CE | 🔴 Manual (backup restore) | `gitlab-restore.sh` (pulls backup from MinIO + `gitlab-backup restore`) | inline in DC-DR-STRATEGY §8 |
| Jenkins | 🔴 Manual (JENKINS_HOME restore) | `jenkins-restore.sh` | inline §9 |
| Nexus | 🟡 Semi-auto (blob store via replicated MinIO is automatic; metadata restore is scripted) | `nexus-restore.sh` | inline §10 |
| WSO2 APIM / IS | 🟡 Script (PG promote + WSO2 restart against DR DB) | `wso2-switch-to-dr.sh` | inline §11/§12 |
| Temporal | 🟡 Script (PG promote + Temporal restart) | `temporal-failover.sh` | inline §13 |
| n8n | 🟡 Script (PG promote + n8n restart) | `n8n-failover.sh` | inline §14 |
| Splunk | 🔴 Manual (nightly backup restore) | `splunk-restore.sh` | inline §15 |
| AWX / Terrakube | 🔴 Manual (PG + artifact restore) | `awx-restore.sh`, `terrakube-restore.sh` | inline §17 |
Legend: 🟢 automatic (no human needed) · 🟡 script-assisted (one command runs the drill) · 🔴 manual runbook (follow steps)
Principle: even "manual" entries have a script — we never rely on human-remembered command sequences. The "manual" label just means a human needs to decide to run the script.
Scripts directory¶
All scripts live in brac-poc-ansible/scripts/drills/, deployed by Ansible to /home/ze/drills/ on both ops-runner VMs. Each script:
- Is idempotent (safe to re-run)
- Logs to /var/log/drills/YYYY-MM-DD-<name>.log
- Emits start/end timestamps for RTO measurement
- Exits non-zero on failure so the runbook can branch
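The conventions above can be sketched as a skeleton that every drill script follows — the drill name is illustrative, and the log directory falls back to /tmp here so the sketch runs without root (real scripts log to /var/log/drills/):

```shell
#!/usr/bin/env bash
# Skeleton of a drill script: timestamped START/END for RTO measurement,
# fail-fast so the runbook can branch on a non-zero exit.
set -euo pipefail

name="example-drill"                          # illustrative drill name
logdir="${DRILL_LOG_DIR:-/tmp/drills}"        # real value: /var/log/drills
log="${logdir}/$(date +%F)-${name}.log"
mkdir -p "$logdir"

echo "START $(date -u +%FT%TZ)" | tee -a "$log"

# ... idempotent drill steps go here; any failing step aborts via set -e ...

echo "END $(date -u +%FT%TZ)" | tee -a "$log"
```

Re-running appends a fresh START/END pair to the same day's log, so repeated drills stay auditable.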
Cross-cutting concerns¶
DNS and HAProxy cutover¶
DR failover switches one HAProxy config (or a few DNS records) at the staxv edge. All consumer-facing URLs (*.apps.brac-poc.comptech-lab.com, *.routes.<cluster>.opp.brac-poc.comptech-lab.com) are wildcarded; flipping backends is:
- HAProxy ConfigMap edit: backend IPs swap from `-dc` VMs to `-dr` VMs
- Reload-watcher sidecar picks up the change in ~100 ms
- PowerDNS records update (scripted) for names that point directly at VMs
Runbook: Drill 0 — network cutover
MinIO as the convergence point for backup¶
Almost every component ships its backup to MinIO. MinIO itself has cross-site replication, so backups are automatically available at DR. This means:
- Loss of DC MinIO → backups still exist at DR MinIO
- Loss of ONLY an app VM → restore from DC MinIO (fast)
- Loss of entire DC → restore each app from DR MinIO on DR VMs
This makes MinIO the single recovery dependency — MinIO comes up first on DR (it has its own cross-site replication), then everything else recovers from its buckets.
Clock sync¶
All VMs run chrony pointed at an internal NTP source (one of the ops-runners hosts it, stratum 3). Clock skew breaks TLS, Vault unseal, and Kerberos-ish WSO2 flows. Non-negotiable.
Test frequency¶
- Quarterly full DR drill (site switch) — demonstrated once during POC as proof
- Monthly component drills (drill one app end-to-end)
- Automated smoke drill nightly: scripted reachability test of DR endpoints, just to prove they're alive
Decision: multi-site vs single-site MinIO for backup targets¶
For a POC we use the same MinIO cluster pair for both backups and hot data (object storage = Velero target + Nexus blobs + Splunk buckets + Vault snapshots + ACM backups). This is simpler and realistic for smaller banks.
Production alternative: dedicate a separate "DR archive" MinIO tier in a third site. Out of scope for this POC; worth calling out to BRAC.
Created: 2026-04-24 · Owner: Infrastructure Lead · Status: ready for execution · Next: per-component DR runbooks in DR-DRILL-PLAYBOOK.md