DC/DR Strategy¶
Disaster-recovery approach for every tier of the BRAC POC. Every component has a documented replication mode, RPO/RTO target, failover trigger, and recovery sequence.
Posture: active-passive (DR is hot-standby). On DC failure, a one-click (or short scripted) failover brings DR to primary. After DC recovery, a failback returns normal operation.
Glossary¶
| Term | Meaning |
|---|---|
| RPO (Recovery Point Objective) | How much data loss is acceptable; "we lose at most N minutes of data" |
| RTO (Recovery Time Objective) | How quickly service must be restored; "we are back up within N minutes" |
| Active-passive | DR stands idle (or read-only); fails over on DC loss |
| Async replication | Data replicated with lag (seconds to minutes) |
| Sync replication | Data committed only when written to both sites (adds latency, guarantees no loss) |
Target SLAs per tier¶
| Tier | RPO | RTO | Mode |
|---|---|---|---|
| Secrets (Vault) | ≤15 min | ≤10 min | Async Raft snapshot + restore |
| Object storage (MinIO) | seconds | near-zero | Async bucket replication |
| Cache (Redis) | seconds | ~2 min | Async replication + manual promote |
| PostgreSQL (all dedicated PGs) | seconds | ~5 min | Streaming WAL replication |
| OpenShift hub clusters | ~15 min | ~30 min | ACM cluster-backup to MinIO |
| OpenShift spoke clusters | seconds (app state via Git) | ~5 min | ArgoCD AppSet targeting both spokes |
| ClickHouse (SigNoz backend) | seconds | near-zero | Native ClickHouse replication |
| GitLab CE | ~1 hour | ~30 min | Nightly backup to MinIO + restore |
| Jenkins | ~1 hour | ~15 min | Backup JENKINS_HOME to MinIO |
| Nexus | ~1 hour | ~15 min | Backup Nexus data dir to MinIO |
| WSO2 APIM (all profiles) | seconds | ~10 min | Shared PG replication + WSO2 restart against DR DB |
| WSO2 IS | seconds | ~10 min | PG replication; session state lost (re-login) |
| Temporal | seconds | ~5 min | PG replication; in-flight workflows may be lost |
| n8n | seconds | ~5 min | PG replication |
| Splunk | ~1 day | ~15 min | Nightly backup (2-day retention anyway) |
| SigNoz UI | — | seconds | Stateless; restart on DR |
| AWX / Terrakube | ~1 hour | ~15 min | PG replication + backup |
Per-component strategy¶
1. Vault (secrets + PKI intermediate CA)¶
Replication: Vault OSS doesn't ship with cross-site DR (that's Enterprise). We use Raft snapshot + restore scheduled every 15 minutes.
```bash
# Cron on vault-vm1-dc (leader) — every 15 min:
vault operator raft snapshot save - | mc pipe minio/vault-snapshots/$(date +%s).snap
```
- On DC failure: a technician retrieves the latest snapshot from MinIO (still accessible — MinIO itself has cross-site replication) and restores it on `vault-vm1-dr`:

```bash
vault operator raft snapshot restore /tmp/latest.snap
# Then unseal with the 5-of-5 keys
```

- Unseal keys: 5-of-5 Shamir shares; custodians are distributed so no single party can unseal DC or DR alone.
- PKI implication: the intermediate CA material is in Vault's storage — restoring the snapshot brings the intermediate CA back up. Issued certs in rotation continue to validate.
Runbook: DR-DRILL-PLAYBOOK.md → Drill 2
2. MinIO (object storage)¶
Replication: MinIO's native site replication (async, bucket-level). All 3 DC nodes continuously replicate to all 3 DR nodes.
```bash
mc admin replicate add minio-dc minio-dr
```
- Characteristics: async with sub-second lag; both sites always readable; writes land on DC, replicate to DR
- On DC failure: update the DNS/HAProxy backend for `minio.apps.brac-poc.comptech-lab.com` to point to the DR nodes. Applications reading from MinIO (Vault snapshots, Splunk backups, Nexus blob store) continue transparently
- Failback: after DC recovery, MinIO reverses replication direction (DR → DC) until caught up, then traffic swaps back to primary
Runbook: Drill 1
3. PostgreSQL (each dedicated per-app PG)¶
Replication: async streaming replication via pg_basebackup + continuous WAL shipping.
- DC is primary; DR is hot standby (read-only replica)
- WAL shipped every few seconds; replication lag typically <1 sec
- Managed via Ansible role `postgresql-streaming-replica`
On DC failure:
```bash
# On DR PG VM, promote to primary:
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main

# Update app config to point DB host → DR
# For WSO2: ansible-playbook wso2-switch-db-to-dr.yml
```
- Data loss: seconds of unreplicated WAL (acceptable for our POC; production would use sync replication for tier-1 apps)
- Failback: re-initialise DC from DR as the new primary via `pg_basebackup`, then swap roles back
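On the standby side, the replication role boils down to two pieces of state in the DR data directory — a minimal sketch assuming PostgreSQL 16 (the hostname and replication user here are illustrative, not the deployed values):

```ini
# postgresql.auto.conf on the DR standby — typically written by pg_basebackup -R
primary_conninfo = 'host=pg-app-vm1-dc port=5432 user=replicator application_name=dr_standby'

# Plus an empty standby.signal file in the data directory: its presence keeps
# the instance in read-only recovery; `pg_ctl promote` removes it on failover.
```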
Runbook: Drill 3
4. Redis + Sentinel (cache)¶
Replication: cross-site async replication. Master on redis-vm1-dc; replicas on redis-vm2-dc, redis-vm3-dc, redis-vm1-dr, redis-vm2-dr, redis-vm3-dr. Sentinel quorum (5 sentinels across both sites) monitors and fails over.
- Caveat: cross-site Sentinel failover works but can flap on network hiccups. For the POC we accept this; production usually uses either per-site Sentinel clusters + app-level cutover or Redis Enterprise with Active-Active.
On DC failure: Sentinel in DR detects lost quorum, elects a new master from DR replicas automatically. App connections reconnect via Sentinel client.
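The Sentinel quorum can be sketched as a config fragment — the master name, timeouts, and quorum value here are illustrative, not the deployed ones:

```ini
# sentinel.conf — identical on all 5 sentinels across both sites (illustrative)
sentinel monitor brac-cache redis-vm1-dc 6379 3
sentinel down-after-milliseconds brac-cache 10000
sentinel failover-timeout brac-cache 60000
```

Placement matters: quorum only triggers down-detection, but the actual failover must be authorized by a majority of all sentinels (3 of 5 here), so at least 3 sentinels must survive at DR for the automatic DC-loss failover described above to proceed.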
Runbook: Drill 4
5. OpenShift — hub clusters (hub-dc + hub-dr)¶
Replication: ACM's cluster-backup-chart (part of RHACM 2.16), which uses the OADP Operator (v1.5.5, Velero under the hood) as the backup engine. Runs scheduled backups of:
- ManagedCluster, Placement, Policy, GitOpsCluster, KlusterletAddonConfig CRs
- ACM MultiClusterHub state
- Secrets (ACM hub credentials, image-pull secrets)
- Backups sent to MinIO bucket acm-hub-backup/
```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "*/15 * * * *"  # every 15 min
  veleroTtl: 72h
```
On hub-dc failure:
1. Restore ACM state from latest MinIO backup onto hub-dr using Restore CR
2. hub-dr's MultiClusterHub becomes primary; spokes re-register under hub-dr's klusterlet agent
3. ArgoCD continues syncing from Git — no workload disruption since spoke-dc/dr are independent
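Step 1 is a single CR apply against hub-dr — a sketch of the Restore shape (the `latest` keyword and spec fields follow the ACM cluster-backup chart; the name is illustrative):

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: latest   # re-attaches spokes to hub-dr
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```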
Runbook: Drill 5
6. OpenShift — spoke clusters (spoke-dc + spoke-dr)¶
Replication: workload deployment via ArgoCD ApplicationSet targets both spokes. Same Git commit lands the same manifests on spoke-dc and spoke-dr simultaneously.
- Stateless apps: run active-active on both spokes
- Stateful apps with per-cluster PV data (if any): see per-workload strategy
On spoke-dc failure:
1. HAProxy health checks mark spoke-dc's API/ingress as down
2. Traffic auto-shifts to spoke-dr (same wildcard DNS pattern keeps users on the same URL)
3. spoke-dr workloads (already running) pick up
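The dual-spoke targeting can be sketched with a cluster-generator ApplicationSet — the label selector, repo URL, and path below are placeholders, not the actual POC values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: brac-workloads
  namespace: openshift-gitops
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: brac-poc   # assumed label carried by both spoke-dc and spoke-dr
  template:
    metadata:
      name: "{{name}}-workloads"
    spec:
      project: default
      source:
        repoURL: https://gitlab.example/brac/manifests.git   # placeholder URL
        targetRevision: main
        path: apps/
      destination:
        server: "{{server}}"
      syncPolicy:
        automated: {}
```

The same Git commit therefore renders one Application per matching spoke, which is what keeps both sites converged on identical manifests.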
Runbook: Drill 6
7. ClickHouse (SigNoz observability backend)¶
Replication: ClickHouse native ReplicatedMergeTree engine — 1 shard × 2 replicas (one per site). Coordinated via ClickHouse Keeper (ZooKeeper-compatible).
- Writes go to either replica; replicate to the other in near-real-time
- Reads load-balanced between both
- Survives single-replica loss transparently
On DC failure: DR replica serves reads + writes; SigNoz UI automatically reconnects (DNS failover).
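Replication is declared per table — a sketch of the engine declaration (the table, columns, and `ON CLUSTER` name are placeholders, not SigNoz's actual schema):

```sql
-- Each replica registers under the same Keeper path; the {shard}/{replica}
-- macros are filled in from each node's local config.
CREATE TABLE logs_local ON CLUSTER brac_poc
(
    timestamp DateTime64(9),
    body      String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs_local', '{replica}')
ORDER BY timestamp;
```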
8. GitLab CE¶
Replication: GitLab CE lacks built-in DR (that's Geo — Enterprise). We use a nightly backup via gitlab-backup create to MinIO plus a Git repository mirror to the DR site every 15 minutes.
```bash
# Cron on gitlab-vm1-dc — nightly:
gitlab-backup create SKIP=uploads STRATEGY=copy
mc cp /var/opt/gitlab/backups/latest.tar minio/gitlab-backup/
```
- On DC failure: restore the latest backup on `gitlab-vm1-dr`: `gitlab-backup restore BACKUP=latest`
- Data loss window: up to 24 h (nightly backup) plus anything committed since — acceptable for a POC
- Git repositories specifically are mirrored every 15 min via `git push --mirror`, so loss there is ≤15 min
9. Jenkins¶
Replication: backup JENKINS_HOME (job configs, credentials, build history) nightly to MinIO via tarball. Pipeline definitions live in Git so they're recoverable regardless.
On DC failure: restore JENKINS_HOME on jenkins-vm1-dr from MinIO; restart Jenkins; running builds lost (accept).
10. Nexus¶
Replication: Nexus backs up its data dir nightly to MinIO. Since Nexus's blob store itself is configured with an S3-backed MinIO backend (the MinIO bucket has cross-site replication), blobs are essentially free (replicated by MinIO).
Config/metadata (search index, components.db) = nightly tarball to MinIO.
On DC failure: stand up nexus-vm1-dr, restore metadata, point at replicated MinIO bucket — blobs immediately available.
11. WSO2 APIM (distributed)¶
Replication:
- Shared APIM DB on PG replicates DC → DR via streaming replication (§3)
- APIM config artifacts (API definitions, policies) backed up nightly to MinIO
- Container images in Nexus, replicated via MinIO
On DC failure: WSO2 nodes at DR are already running; PG on DR promoted; WSO2 restarts against new PG endpoint; gateway routes via HAProxy switch to DR backends.
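The "restarts against new PG endpoint" step is a datasource change in each profile's deployment.toml — a sketch with illustrative host and database names (WSO2's config uses `type = "postgre"` for PostgreSQL):

```toml
[database.apim_db]
type = "postgre"
url = "jdbc:postgresql://pg-wso2-vm1-dr:5432/apim_db"   # was the -dc host before failover
username = "wso2"
password = "CHANGE_ME"   # placeholder; real deployments keep this in secure vault
```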
12. WSO2 IS¶
Same pattern as APIM (DB replication + restart against new endpoint). Session tokens are not replicated; users re-authenticate after failover (acceptable).
13. Temporal¶
Replication: Temporal stores workflow state in PG. PG streaming replication = Temporal DR.
- In-flight workflows: running workflows are in memory + workflow-task queue in DB; async replication means the last few milliseconds of state may be lost on failover
- Accepted trade-off for OSS; Temporal Enterprise has multi-cluster replication
14. n8n¶
PG replication + restart. n8n is mostly idle worker+scheduler, so failover is clean.
15. Splunk¶
Replication: Splunk Free doesn't support clustering or replication. Nightly splunkd index backup + config tarball → MinIO.
- Retention is 2 days anyway — a 1-day backup gap is within the retention window, so even full DC loss doesn't destroy logs older than what would already be deleted
16. SigNoz UI¶
Stateless. SigNoz-vm1-dr always running; on DC failure, DNS points to DR; ClickHouse (§7) provides the data.
17. AWX + Terrakube¶
PG replication (§3) + backup of workspace artifacts to MinIO. On failover, run the bootstrap Ansible on awx-vm1-dr to re-init against restored PG; Terrakube similarly.
Automation matrix per component¶
Many of our tools are OSS — some failovers will be automatic, some script-assisted, some manual with runbook. Every row has a documented way forward.
| Tier | Failover mode | Scripted fallback | Manual runbook |
|---|---|---|---|
| MinIO | 🟢 Automatic (site replication + HAProxy health check) | `/home/ze/drills/minio-cutover.sh` | Drill 1 |
| Vault | 🟡 Semi-auto (snapshot cron → script restore) | `vault-restore-snapshot.sh` (pulls from MinIO, restores, unseals with piped shares) | Drill 2 |
| PostgreSQL (per-app) | 🟢 Automatic if pg_auto_failover enabled; else 🟡 scripted | `pg-failover-$APP.sh` (promote DR, reconfigure app via AWX) | Drill 3 |
| Redis + Sentinel | 🟢 Fully automatic (Sentinel quorum) | — | Drill 4 |
| OCP hub (ACM) | 🟡 Script-assisted (Velero Restore CR apply) | `hub-failover.sh` (kicks Restore CR via `oc apply`) | Drill 5 |
| OCP spoke | 🟢 Automatic (HAProxy + ApplicationSet on both spokes) | — | Drill 6 |
| ClickHouse | 🟢 Automatic (native replication) | — | — |
| GitLab CE | 🔴 Manual (backup restore) | `gitlab-restore.sh` (pulls backup from MinIO + `gitlab-backup restore`) | inline in DC-DR-STRATEGY §8 |
| Jenkins | 🔴 Manual (JENKINS_HOME restore) | `jenkins-restore.sh` | inline §9 |
| Nexus | 🟡 Semi-auto (blob store via replicated MinIO is automatic; metadata restore is scripted) | `nexus-restore.sh` | inline §10 |
| WSO2 APIM / IS | 🟡 Script (PG promote + WSO2 restart against DR DB) | `wso2-switch-to-dr.sh` | inline §11/§12 |
| Temporal | 🟡 Script (PG promote + Temporal restart) | `temporal-failover.sh` | inline §13 |
| n8n | 🟡 Script (PG promote + n8n restart) | `n8n-failover.sh` | inline §14 |
| Splunk | 🔴 Manual (nightly backup restore) | `splunk-restore.sh` | inline §15 |
| AWX / Terrakube | 🔴 Manual (PG + artifact restore) | `awx-restore.sh`, `terrakube-restore.sh` | inline §17 |
Legend: 🟢 automatic (no human needed) · 🟡 script-assisted (one command runs the drill) · 🔴 manual runbook (follow steps)
Principle: even "manual" entries have a script — we never rely on human-remembered command sequences. The "manual" label just means a human needs to decide to run the script.
Scripts directory¶
All scripts live in brac-poc-ansible/scripts/drills/, deployed by Ansible to /home/ze/drills/ on both ops-runner VMs. Each script:
- Is idempotent (safe to re-run)
- Logs to /var/log/drills/YYYY-MM-DD-<name>.log
- Emits start/end timestamps for RTO measurement
- Exits non-zero on failure so the runbook can branch
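The conventions above can be sketched as a skeleton that every drill script follows — the drill name is illustrative, and the log directory falls back to /tmp here so the sketch runs without root (real scripts log to /var/log/drills/):

```shell
#!/usr/bin/env bash
# Skeleton of a drill script: timestamped START/END for RTO measurement,
# fail-fast so the runbook can branch on a non-zero exit.
set -euo pipefail

name="example-drill"                          # illustrative drill name
logdir="${DRILL_LOG_DIR:-/tmp/drills}"        # real value: /var/log/drills
log="${logdir}/$(date +%F)-${name}.log"
mkdir -p "$logdir"

echo "START $(date -u +%FT%TZ)" | tee -a "$log"

# ... idempotent drill steps go here; any failing step aborts via set -e ...

echo "END $(date -u +%FT%TZ)" | tee -a "$log"
```

Re-running appends a fresh START/END pair to the same day's log, so repeated drills stay auditable.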
Cross-cutting concerns¶
DNS and HAProxy cutover¶
DR failover switches one HAProxy config (or a few DNS records) at the staxv edge. All consumer-facing URLs (*.apps.brac-poc.comptech-lab.com, *.routes.<cluster>.opp.brac-poc.comptech-lab.com) are wildcarded; flipping backends is:
- HAProxy ConfigMap edit: backend IPs swap from `-dc` VMs to `-dr` VMs
- Reload-watcher sidecar picks up the change in ~100 ms
- PowerDNS records update (scripted) for names that point directly at VMs
Runbook: Drill 0 — network cutover
MinIO as the convergence point for backup¶
Almost every component ships its backup to MinIO. MinIO itself has cross-site replication, so backups are automatically available at DR. This means:
- Loss of DC MinIO → backups still exist at DR MinIO
- Loss of ONLY an app VM → restore from DC MinIO (fast)
- Loss of entire DC → restore each app from DR MinIO on DR VMs
This makes MinIO the single recovery dependency — MinIO comes up first on DR (it has its own cross-site replication), then everything else recovers from its buckets.
Clock sync¶
All VMs run chrony pointed at an internal NTP source (one of the ops-runners hosts it, stratum 3). Clock skew breaks TLS, Vault unseal, and Kerberos-ish WSO2 flows. Non-negotiable.
Test frequency¶
- Quarterly full DR drill (site switch) — demonstrated once during POC as proof
- Monthly component drills (drill one app end-to-end)
- Automated smoke drill nightly: scripted reachability test of DR endpoints, just to prove they're alive
Decision: multi-site vs single-site MinIO for backup targets¶
For a POC we use the same MinIO cluster pair for both backups and hot data (object storage = Velero target + Nexus blobs + Splunk buckets + Vault snapshots + ACM backups). This is simpler and realistic for smaller banks.
Production alternative: dedicate a separate "DR archive" MinIO tier in a third site. Out of scope for this POC; worth calling out to BRAC.
Created: 2026-04-24 · Owner: Infrastructure Lead · Status: ready for execution · Next: per-component DR runbooks in DR-DRILL-PLAYBOOK.md