
DC/DR Strategy

Disaster-recovery approach for every tier of the BRAC POC. Every component has a documented replication mode, RPO/RTO target, failover trigger, and recovery sequence.

Posture: active-passive (DR is hot-standby). On DC failure, a one-click (or short scripted) failover brings DR to primary. After DC recovery, a failback returns normal operation.


Glossary

| Term | Meaning |
|---|---|
| RPO (Recovery Point Objective) | How much data loss is acceptable; "we lose at most N minutes of data" |
| RTO (Recovery Time Objective) | How quickly service must be restored; "we are back up within N minutes" |
| Active-passive | DR stands idle (or read-only); fails over on DC loss |
| Async replication | Data replicated with lag (seconds to minutes) |
| Sync replication | Data committed only when written to both sites (adds latency, guarantees no loss) |

Target SLAs per tier

| Tier | RPO | RTO | Mode |
|---|---|---|---|
| Secrets (Vault) | ≤15 min | ≤10 min | Async Raft snapshot + restore |
| Object storage (MinIO) | seconds | near-zero | Async bucket replication |
| Cache (Redis) | seconds | ~2 min | Async replication + manual promote |
| PostgreSQL (all dedicated PGs) | seconds | ~5 min | Streaming WAL replication |
| OpenShift hub clusters | ~15 min | ~30 min | ACM cluster-backup to MinIO |
| OpenShift spoke clusters | seconds (app state via Git) | ~5 min | ArgoCD AppSet targeting both spokes |
| ClickHouse (SigNoz backend) | seconds | near-zero | Native ClickHouse replication |
| GitLab CE | ~1 hour | ~30 min | Nightly backup to MinIO + restore |
| Jenkins | ~1 hour | ~15 min | Backup JENKINS_HOME to MinIO |
| Nexus | ~1 hour | ~15 min | Backup Nexus data dir to MinIO |
| WSO2 APIM (all profiles) | seconds | ~10 min | Shared PG replication + WSO2 restart against DR DB |
| WSO2 IS | seconds | ~10 min | PG replication; session state lost (re-login) |
| Temporal | seconds | ~5 min | PG replication; in-flight workflows may be lost |
| n8n | seconds | ~5 min | PG replication |
| Splunk | ~1 day | ~15 min | Nightly backup (2-day retention anyway) |
| SigNoz UI | — | seconds | Stateless; restart on DR |
| AWX / Terrakube | ~1 hour | ~15 min | PG replication + backup |

Per-component strategy

1. Vault (secrets + PKI intermediate CA)

Replication: Vault OSS doesn't ship with cross-site DR (that's Enterprise). We use Raft snapshot + restore scheduled every 15 minutes.

```bash
# Cron on vault-vm1-dc (leader) — every 15 min:
vault operator raft snapshot save - | mc pipe minio/vault-snapshots/$(date +%s).snap
```

  • On DC failure: a technician retrieves the latest snapshot from MinIO (still accessible — MinIO itself has cross-site replication) and restores it on vault-vm1-dr:

    ```bash
    vault operator raft snapshot restore /tmp/latest.snap
    # Then unseal with the 5-of-5 keys
    ```
  • Unseal keys: 5-of-5 Shamir shares; custodians distributed so no single party can unseal DC or DR alone.
  • PKI implication: the intermediate CA material is in Vault's storage — restoring the snapshot brings the intermediate CA back up. Issued certs in rotation continue to validate.

Runbook: DR-DRILL-PLAYBOOK.md → Drill 2


2. MinIO (object storage)

Replication: MinIO's native site replication (async, bucket-level). All 3 DC nodes continuously replicate to all 3 DR nodes.

```bash
mc admin replicate add minio-dc minio-dr
```

  • Characteristics: async with sub-second lag; both sites always readable; writes land on DC, replicate to DR
  • On DC failure: update DNS/HAProxy backend for minio.apps.brac-poc.comptech-lab.com to point to DR nodes. Applications reading from MinIO (Vault snapshots, Splunk backups, Nexus blob store) continue transparently
  • Failback: after DC recovery, MinIO reverses direction (DR → DC) until caught up, then swap back to primary

Runbook: Drill 1


3. PostgreSQL (each dedicated per-app PG)

Replication: async streaming replication via pg_basebackup + continuous WAL shipping.

  • DC is primary; DR is hot standby (read-only replica)
  • WAL shipped every few seconds; replication lag typically <1 sec
  • Managed via Ansible role postgresql-streaming-replica
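
Replication health on both sides can be checked with the standard pg_stat views (a monitoring sketch, not part of the Ansible role):

```sql
-- On the DC primary: per-standby lag in bytes
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the DR standby: time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```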

On DC failure:

```bash
# On DR PG VM, promote to primary:
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main

# Update app config to point DB host → DR.
# For WSO2: ansible-playbook wso2-switch-db-to-dr.yml
```

  • Data loss: seconds of unreplicated WAL (acceptable for our POC; production would use sync replication for tier-1 apps)
  • Failback: re-initialise DC from DR as new primary via pg_basebackup, then swap roles back

Runbook: Drill 3


4. Redis + Sentinel (cache)

Replication: cross-site async replication. Master on redis-vm1-dc; replicas on redis-vm2-dc, redis-vm3-dc, redis-vm1-dr, redis-vm2-dr, redis-vm3-dr. Sentinel quorum (5 sentinels across both sites) monitors and fails over.

  • Caveat: cross-site Sentinel failover works but can flap on network hiccups. For the POC we accept this; production usually uses either per-site Sentinel clusters + app-level cutover or Redis Enterprise with Active-Active.

On DC failure: Sentinel in DR detects lost quorum, elects a new master from DR replicas automatically. App connections reconnect via Sentinel client.
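
The Sentinel side of this can be sketched as a sentinel.conf fragment (the master name, timeouts, and the 3-of-5 quorum are assumptions, not our deployed values):

```
# /etc/redis/sentinel.conf on each of the 5 sentinel nodes (sketch)
port 26379
sentinel monitor brac-cache redis-vm1-dc 6379 3
sentinel down-after-milliseconds brac-cache 10000
sentinel failover-timeout brac-cache 60000
sentinel parallel-syncs brac-cache 1
```

A quorum of 3 means a majority of sentinels must agree the master is down before a failover starts, which is what makes the cross-site flapping caveat above a tuning question rather than a design flaw.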

Runbook: Drill 4


5. OpenShift — hub clusters (hub-dc + hub-dr)

Replication: ACM's cluster-backup-chart (part of RHACM 2.16), which uses the OADP Operator (v1.5.5, Velero under the hood) as the backup engine. It runs scheduled backups of:

  • ManagedCluster, Placement, Policy, GitOpsCluster, KlusterletAddonConfig CRs
  • ACM MultiClusterHub state
  • Secrets (ACM hub credentials, image-pull secrets)

Backups are sent to the MinIO bucket acm-hub-backup/.

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "*/15 * * * *"  # every 15 min
  veleroTtl: 72h
```

On hub-dc failure:

1. Restore ACM state from the latest MinIO backup onto hub-dr using a Restore CR
2. hub-dr's MultiClusterHub becomes primary; spokes re-register under hub-dr's klusterlet agent
3. ArgoCD continues syncing from Git — no workload disruption, since spoke-dc/dr are independent
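
The restore step can be sketched as a Restore CR applied on hub-dr (field names and the `latest` keyword follow the ACM cluster-backup documentation; validate against our RHACM version before relying on it):

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: latest
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```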

Runbook: Drill 5


6. OpenShift — spoke clusters (spoke-dc + spoke-dr)

Replication: workload deployment via ArgoCD ApplicationSet targets both spokes. Same Git commit lands the same manifests on spoke-dc and spoke-dr simultaneously.

  • Stateless apps: run active-active on both spokes
  • Stateful apps with per-cluster PV data (if any): see per-workload strategy
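
The dual-target deployment can be sketched as an ApplicationSet with a list generator (cluster names match this doc; the repo URL, path, and namespace are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: brac-apps
  namespace: openshift-gitops
spec:
  generators:
    - list:
        elements:
          - cluster: spoke-dc
          - cluster: spoke-dr
  template:
    metadata:
      name: 'brac-apps-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://gitlab.example/brac/manifests.git  # placeholder
        targetRevision: main
        path: apps
      destination:
        name: '{{cluster}}'   # cluster as registered in ArgoCD
        namespace: brac-apps  # placeholder
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

One Git commit fans out to one Application per spoke, which is what gives the "seconds (app state via Git)" RPO in the SLA table.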

On spoke-dc failure: 1. HAProxy health checks mark spoke-dc's API/ingress as down 2. Traffic auto-shifts to spoke-dr (same wildcard DNS pattern keeps users on the same URL) 3. spoke-dr workloads (already running) pick up

Runbook: Drill 6


7. ClickHouse (SigNoz observability backend)

Replication: ClickHouse native ReplicatedMergeTree engine — 1 shard × 2 replicas (one per site). Coordinated via ClickHouse Keeper (ZooKeeper-compatible).

  • Writes go to either replica; replicate to the other in near-real-time
  • Reads load-balanced between both
  • Survives single-replica loss transparently
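
The replicated layout can be sketched in ClickHouse DDL (table, column, and Keeper path names are illustrative; the `{shard}`/`{replica}` macros come from each node's own config):

```sql
-- Run on each replica; Keeper coordinates the two copies
CREATE TABLE IF NOT EXISTS signoz_traces_local
(
    timestamp DateTime64(9),
    traceID   String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/signoz_traces_local', '{replica}')
ORDER BY timestamp;
```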

On DC failure: DR replica serves reads + writes; SigNoz UI automatically reconnects (DNS failover).


8. GitLab CE

Replication: GitLab CE lacks built-in DR (that's Geo — Enterprise). We use nightly backup via gitlab-backup create to MinIO + hourly incremental Git repo mirror to DR site.

```bash
# Cron on gitlab-vm1-dc — nightly:
gitlab-backup create SKIP=uploads STRATEGY=copy
aws s3 cp /var/opt/gitlab/backups/latest.tar s3://gitlab-backup/ --endpoint-url "$MINIO_ENDPOINT"
```

  • On DC failure: restore the latest backup on gitlab-vm1-dr: gitlab-backup restore BACKUP=latest
  • Data loss window: up to 24h (nightly backup) plus anything committed since — acceptable for a POC
  • Git repositories specifically are mirrored every 15 min via push --mirror so loss there is ≤15 min
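
The 15-minute mirror can be sketched as a small cron helper (the repository path and DR remote prefix are assumptions, not our deployed values):

```shell
#!/usr/bin/env bash
# Hypothetical helper for the 15-min Git mirror cron.
# mirror_repos SRC_DIR DST_PREFIX — pushes every bare repo under SRC_DIR
# to the same-named repo at DST_PREFIX (a local path or ssh:// URL).
mirror_repos() {
  local src=$1 dst=$2 repo name
  for repo in "$src"/*.git; do
    [ -d "$repo" ] || continue                  # skip if glob matched nothing
    name=$(basename "$repo")
    git -C "$repo" push --mirror "$dst/$name" || return 1
  done
}

# Cron on gitlab-vm1-dc — every 15 min (both arguments are assumptions):
# mirror_repos /var/opt/gitlab/git-data/repositories ssh://git@gitlab-vm1-dr/mirror
```

`push --mirror` also propagates deletions and force-pushes, so the DR copy tracks DC exactly rather than accumulating stale refs.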

9. Jenkins

Replication: backup JENKINS_HOME (job configs, credentials, build history) nightly to MinIO via tarball. Pipeline definitions live in Git so they're recoverable regardless.

On DC failure: restore JENKINS_HOME on jenkins-vm1-dr from MinIO and restart Jenkins; running builds are lost (accepted).


10. Nexus

Replication: Nexus backs up its data dir nightly to MinIO. Since Nexus's blob store is configured with an S3-backed MinIO backend, and that bucket has cross-site replication, blob replication comes for free via MinIO.

Config/metadata (the search index and components.db) goes into a nightly tarball to MinIO.

On DC failure: stand up nexus-vm1-dr, restore metadata, point at replicated MinIO bucket — blobs immediately available.


11. WSO2 APIM (distributed)

Replication:

  • Shared APIM DB on PG replicates DC → DR via streaming replication (§3)
  • APIM config artifacts (API definitions, policies) backed up nightly to MinIO
  • Container images in Nexus, replicated via MinIO

On DC failure: WSO2 nodes at DR are already running; PG on DR promoted; WSO2 restarts against new PG endpoint; gateway routes via HAProxy switch to DR backends.


12. WSO2 IS

Same pattern as APIM (DB replication + restart against new endpoint). Session tokens are not replicated; users re-authenticate after failover (acceptable).


13. Temporal

Replication: Temporal stores workflow state in PG. PG streaming replication = Temporal DR.

  • In-flight workflows: running workflows are in memory + workflow-task queue in DB; async replication means the last few milliseconds of state may be lost on failover
  • Accepted trade-off for OSS; Temporal Enterprise has multi-cluster replication

14. n8n

PG replication + restart. n8n is mostly idle worker+scheduler, so failover is clean.


15. Splunk

Replication: Splunk Free doesn't support clustering or replication. Nightly splunkd index backup + config tarball → MinIO.

  • Retention is 2 days anyway — a 1-day backup gap is within the retention window, so even full DC loss doesn't destroy logs older than what would already be deleted

16. SigNoz UI

Stateless. SigNoz-vm1-dr always running; on DC failure, DNS points to DR; ClickHouse (§7) provides the data.


17. AWX + Terrakube

PG replication (§3) + backup of workspace artifacts to MinIO. On failover, run the bootstrap Ansible on awx-vm1-dr to re-init against restored PG; Terrakube similarly.


Automation matrix per component

Many of our tools are OSS — some failovers will be automatic, some script-assisted, some manual with runbook. Every row has a documented way forward.

| Tier | Failover mode | Scripted fallback | Manual runbook |
|---|---|---|---|
| MinIO | 🟢 Automatic (site replication + HAProxy health check) | /home/ze/drills/minio-cutover.sh | Drill 1 |
| Vault | 🟡 Semi-auto (snapshot cron → script restore) | vault-restore-snapshot.sh (pulls from MinIO, restores, unseals with piped shares) | Drill 2 |
| PostgreSQL (per-app) | 🟢 Automatic if pg_auto_failover enabled; else 🟡 scripted | pg-failover-$APP.sh (promote DR, reconfigure app via AWX) | Drill 3 |
| Redis + Sentinel | 🟢 Fully automatic (Sentinel quorum) | — | Drill 4 |
| OCP hub (ACM) | 🟡 Script-assisted (Velero Restore CR apply) | hub-failover.sh (kicks Restore CR via oc apply) | Drill 5 |
| OCP spoke | 🟢 Automatic (HAProxy + ApplicationSet on both spokes) | — | Drill 6 |
| ClickHouse | 🟢 Automatic (native replication) | — | — |
| GitLab CE | 🔴 Manual (backup restore) | gitlab-restore.sh (pulls backup from MinIO + gitlab-backup restore) | inline in DC-DR-STRATEGY §8 |
| Jenkins | 🔴 Manual (JENKINS_HOME restore) | jenkins-restore.sh | inline §9 |
| Nexus | 🟡 Semi-auto (blob store via replicated MinIO is automatic; metadata restore is scripted) | nexus-restore.sh | inline §10 |
| WSO2 APIM / IS | 🟡 Script (PG promote + WSO2 restart against DR DB) | wso2-switch-to-dr.sh | inline §11/§12 |
| Temporal | 🟡 Script (PG promote + Temporal restart) | temporal-failover.sh | inline §13 |
| n8n | 🟡 Script (PG promote + n8n restart) | n8n-failover.sh | inline §14 |
| Splunk | 🔴 Manual (nightly backup restore) | splunk-restore.sh | inline §15 |
| AWX / Terrakube | 🔴 Manual (PG + artifact restore) | awx-restore.sh, terrakube-restore.sh | inline §17 |

Legend: 🟢 automatic (no human needed) · 🟡 script-assisted (one command runs the drill) · 🔴 manual runbook (follow steps)

Principle: even "manual" entries have a script — we never rely on human-remembered command sequences. The "manual" label just means a human needs to decide to run the script.

Scripts directory

All scripts live in brac-poc-ansible/scripts/drills/, deployed by Ansible to /home/ze/drills/ on both ops-runner VMs. Each script:

  • Is idempotent (safe to re-run)
  • Logs to /var/log/drills/YYYY-MM-DD-<name>.log
  • Emits start/end timestamps for RTO measurement
  • Exits non-zero on failure so the runbook can branch
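
The shared skeleton those conventions imply can be sketched as (the run_drill helper is illustrative, not the actual script contents):

```shell
#!/usr/bin/env bash
# Illustrative drill-script skeleton: timestamped log for RTO measurement,
# non-zero exit on failure so the runbook can branch.
set -u

LOG_DIR=${DRILL_LOG_DIR:-/var/log/drills}             # overridable for testing
LOG_FILE="$LOG_DIR/$(date +%F)-${DRILL_NAME:-example}.log"

# run_drill <command...> — wraps the drill body in START/END log lines.
run_drill() {
  mkdir -p "$LOG_DIR"
  echo "START $(date -u +%FT%TZ)" >>"$LOG_FILE"       # RTO clock starts here
  local rc=0
  "$@" >>"$LOG_FILE" 2>&1 || rc=$?                    # the actual drill steps
  echo "END $(date -u +%FT%TZ) rc=$rc" >>"$LOG_FILE"  # RTO clock stops here
  return "$rc"                                        # non-zero on failure
}
```

A real drill would call `run_drill` around its promote/restore commands; the START/END pair in the log is what the quarterly drill report measures RTO from.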


Cross-cutting concerns

DNS and HAProxy cutover

DR failover switches one HAProxy config (or a few DNS records) at the staxv edge. All consumer-facing URLs (*.apps.brac-poc.comptech-lab.com, *.routes.<cluster>.opp.brac-poc.comptech-lab.com) are wildcarded; flipping backends means:

  • HAProxy ConfigMap edit: backend IPs swap from -dc VMs to -dr VMs
  • Reload-watcher sidecar picks up change in ~100ms
  • PowerDNS records update (scripted) for names that point directly at VMs
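
The backend layout can be sketched as an haproxy.cfg fragment (server IPs are placeholders; `backup` servers take traffic only when every DC server fails its health check):

```
backend minio_s3
    balance roundrobin
    option httpchk GET /minio/health/live
    server minio-vm1-dc 10.0.1.11:9000 check
    server minio-vm2-dc 10.0.1.12:9000 check
    server minio-vm3-dc 10.0.1.13:9000 check
    # DR servers carry traffic only when all DC servers are down
    server minio-vm1-dr 10.0.2.11:9000 check backup
    server minio-vm2-dr 10.0.2.12:9000 check backup
    server minio-vm3-dr 10.0.2.13:9000 check backup
```

For tiers marked 🟢 in the automation matrix, the `backup` keyword makes the cutover fully automatic; for the scripted tiers the ConfigMap edit swaps the active server lines instead.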

Runbook: Drill 0 — network cutover

MinIO as the convergence point for backup

Almost every component ships its backup to MinIO. MinIO itself has cross-site replication, so backups are automatically available at DR. This means:

  • Loss of DC MinIO → backups still exist at DR MinIO
  • Loss of ONLY an app VM → restore from DC MinIO (fast)
  • Loss of entire DC → restore each app from DR MinIO on DR VMs

This makes MinIO the single recovery dependency — MinIO comes up first on DR (it has its own cross-site replication), then everything else recovers from its buckets.

Clock sync

All VMs run chrony pointed at an internal NTP source (one of the ops-runners hosts it, stratum 3). Clock skew breaks TLS, Vault unseal, and Kerberos-ish WSO2 flows. Non-negotiable.
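
A minimal chrony.conf sketch for the client VMs (the NTP hostname is a placeholder following this doc's naming):

```
# /etc/chrony.conf on every VM (sketch — server name is a placeholder)
server ops-runner1.brac-poc.comptech-lab.com iburst
makestep 1.0 3          # step the clock on the first syncs if badly skewed
driftfile /var/lib/chrony/drift
```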

Test frequency

  • Quarterly full DR drill (site switch) — demonstrated once during POC as proof
  • Monthly component drills (drill one app end-to-end)
  • Automated smoke drill nightly: scripted reachability test of DR endpoints, just to prove they're alive

Decision: multi-site vs single-site MinIO for backup targets

For a POC we use the same MinIO cluster pair for both backups and hot data (object storage = Velero target + Nexus blobs + Splunk buckets + Vault snapshots + ACM backups). This is simpler and realistic for smaller banks.

Production alternative: dedicate a separate "DR archive" MinIO tier in a third site. Out of scope for this POC; worth calling out to BRAC.


Created: 2026-04-24 · Owner: Infrastructure Lead · Status: ready for execution · Next: per-component DR runbooks in DR-DRILL-PLAYBOOK.md