
Risk Register & Pre-Mortem Analysis

Identifies risks, their likelihood and impact, and mitigation strategies for the aggressive 6-day timeline.

Risk Assessment Matrix

```
Impact 5 │ CRITICAL: Stop work, escalate
       4 │ HIGH: Major timeline impact
       3 │ MEDIUM: Recoverable
       2 │ LOW: Minor impact
       1 │ NEGLIGIBLE: Proceed as planned
         └─────────────────────────────
           1    2    3    4    5   (Likelihood)
```


Identified Risks

TIER 1: CRITICAL (Must Monitor)

Risk 1.1: OpenShift Cluster Provisioning Fails (Phase 1 Blocker)

| Attribute | Value |
| --- | --- |
| Likelihood | 2 (Low) |
| Impact | 5 (Critical) |
| Risk Score | 10 (CRITICAL) |
| Category | Infrastructure |
| Blocks | Everything (Phases 2-3 dependent) |

Description: Terraform automation fails to provision 3-node cluster. Could be VM quota issues, network misconfiguration, or IaaS API changes.

Pre-Mortem Questions:

- What if we don't have enough vCPU quota in the cloud account?
- What if networking is misconfigured for cluster nodes?
- What if the pull secret is invalid or expired?

Mitigation Strategies (Priority Order):

1. Preparation (NOW):
   - Verify the cloud account has 12+ vCPU quota available before starting
   - Test Terraform modules locally against a test environment
   - Have a backup VM provisioning script (manual setup) ready
   - Validate that the pull secret is current and correct
2. During Phase 1:
   - Start OpenShift provisioning on Day 0 (tonight if possible)
   - Monitor Terraform logs in real time
   - Have an alternate cloud account prepared
3. If it fails:
   - Switch to manual VM provisioning + OpenShift installer (adds 2-4 hours)
   - Fall back to an existing OpenShift cluster if available
   - Escalate to BRAC Bank for environment support

Escalation Trigger: If provisioning not complete by end of Day 1 morning

Contingency Plan:

- Manual alternative: use `openshift-install` with pre-provisioned VMs (documented in DEPLOYMENT.md)
- Estimated recovery time: 4 hours
- Still leaves 2 days for Phase 2-3 work
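The Day-0 quota check from the preparation step can be sketched as a simple pre-flight script. The quota value below is a placeholder; in practice it would come from the cloud provider's CLI, which varies by provider:

```shell
# Pre-flight: do 3 nodes x 8 vCPU fit within the account quota?
NODES=3
VCPU_PER_NODE=8
QUOTA=32   # placeholder: substitute the real quota from your cloud account
REQUIRED=$((NODES * VCPU_PER_NODE))
if [ "$REQUIRED" -le "$QUOTA" ]; then
  echo "quota OK: need $REQUIRED vCPU, have $QUOTA"
else
  echo "quota SHORT: need $REQUIRED vCPU, have $QUOTA" >&2
fi
```

Running this before `terraform apply` turns a mid-provisioning quota failure into a 30-second check.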


Risk 1.2: Observability Pipeline Data Loss (Phase 2 Critical Path)

| Attribute | Value |
| --- | --- |
| Likelihood | 3 (Moderate) |
| Impact | 4 (High) |
| Risk Score | 12 (CRITICAL) |
| Category | Architecture |
| Blocks | SigNoz validation, demo credibility |

Description: OTel collector drops traces, Kafka fills up, or ClickHouse doesn't persist data—breaking the observability demo.

Pre-Mortem Questions:

- What if the OTel Collector runs out of memory?
- What if the Kafka topic fills up and drops messages?
- What if ClickHouse cold archiving breaks mid-POC?

Mitigation Strategies:

1. Design:
   - Set Kafka retention to 7 days (not 24h) to avoid early drops
   - Configure the OTel batch processor with a large queue size (1000 max batches)
   - Pre-stage ClickHouse with the correct DDL; test inserts
   - Set resource limits: OTel Collector 1GB memory, 2 CPU (add more if needed)
2. Validation:
   - Load test: send 100k spans in 5 minutes, verify all reach SigNoz
   - Kafka topic monitoring: alert if lag > 100k messages
   - ClickHouse space: ensure 50GB free before starting
3. If data loss is detected:
   - Re-run the sample app with fresh traffic
   - Restart the observability stack (clears in-memory buffers)
   - Skip the cold-archiving demo if ClickHouse fails

Escalation Trigger: If more than 10% of traces lost after 1 hour of traffic
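The lag alert from the validation step can be sketched as a shell check. The heredoc stands in for real `kafka-consumer-groups.sh --describe` output; the group and topic names are illustrative:

```shell
# Alert when total consumer lag across partitions exceeds the threshold.
LAG_LIMIT=100000
# Sample --describe output (columns: GROUP TOPIC PARTITION CURRENT LOG-END LAG)
cat > /tmp/lag.txt <<'EOF'
GROUP           TOPIC        PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
otel-consumer   otlp_spans   0          500000          560000          60000
otel-consumer   otlp_spans   1          480000          530000          50000
EOF
# Sum the LAG column (field 6), skipping the header row.
TOTAL_LAG=$(awk 'NR > 1 { sum += $6 } END { print sum }' /tmp/lag.txt)
if [ "$TOTAL_LAG" -gt "$LAG_LIMIT" ]; then
  echo "ALERT: consumer lag $TOTAL_LAG exceeds $LAG_LIMIT"
else
  echo "lag OK: $TOTAL_LAG"
fi
```

In the live cluster, the heredoc would be replaced by piping the `--describe` command directly into `awk`.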


Risk 1.3: Kubernetes Resource Exhaustion (Phase 2-3)

| Attribute | Value |
| --- | --- |
| Likelihood | 2 (Low) |
| Impact | 4 (High) |
| Risk Score | 8 (CRITICAL) |
| Category | Infrastructure |
| Blocks | Multiple component deployments |

Description: 9 major components + testing consume too much cluster memory/CPU → pod evictions → demo failures.

Pre-Mortem Questions:

- What if SigNoz + ClickHouse need 20GB RAM total?
- What if all Kafka brokers + observability pods hit 100% CPU?
- What if storage fills during the POC?

Mitigation Strategies:

1. Upfront Planning:
   - Resource estimate per component (documented in DEPLOYMENT.md):
     - OpenShift nodes: 3 × (8 vCPU, 32 GB) = baseline
     - Kafka: 3 brokers × 2GB = 6GB
     - SigNoz: 8GB
     - WSO2 APIM: 4GB
     - Redis: 2GB
     - Total: ~22GB, 12+ vCPU needed
2. Monitoring:
   - Daily cluster health check: `kubectl top nodes`, `kubectl top pods`
   - Alert if any node > 80% memory
   - Set pod resource requests/limits strictly
3. If exhausted:
   - Evict non-critical components (Nexus and Trivy can be skipped)
   - Reduce pod replicas: 2 → 1 where safe
   - Disable unused services

Escalation Trigger: If any node hits 90% memory or any pod is OOMKilled
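The 80%-memory alert from the monitoring step can be sketched as a one-liner over `kubectl top nodes` output. The heredoc below is sample output standing in for the live command:

```shell
# Flag any node above the 80% memory threshold.
# Sample `kubectl top nodes` output; substitute the live command on cluster.
cat > /tmp/top.txt <<'EOF'
NAME     CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
node-1   3200m       40%   26000Mi        81%
node-2   2800m       35%   19000Mi        59%
node-3   3000m       37%   21000Mi        65%
EOF
# Field 5 is MEMORY%; strip the % sign and compare numerically.
ALERTS=$(awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 > 80) print "ALERT: " $1 " memory at " $5 "%" }' /tmp/top.txt)
echo "$ALERTS"
```

Wired into a daily cron job, this catches the eviction risk before pods start getting OOMKilled.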


TIER 2: HIGH (Monitor Weekly)

Risk 2.1: WSO2 Configuration Complexity

| Attribute | Value |
| --- | --- |
| Likelihood | 4 (High) |
| Impact | 2 (Low) |
| Risk Score | 8 (HIGH) |
| Category | Integration |
| Blocks | SSO demo |

Description: WSO2 SAML/OIDC configuration is notoriously finicky. Small misconfigurations break SSO.

Mitigation:

- Use WSO2 Helm chart defaults (tested, known to work)
- Pre-generate a sample SAML assertion for testing
- Have a fallback demo ("API key authentication") if SSO fails

Contingency: Skip SSO demo, focus on APIM rate limiting instead (still valuable)


Risk 2.2: Terraform State Conflicts (Phase 1-2)

| Attribute | Value |
| --- | --- |
| Likelihood | 2 (Low) |
| Impact | 3 (Medium) |
| Risk Score | 6 (HIGH) |
| Category | Process |
| Blocks | Re-deployment |

Description: If state file gets corrupted or out-of-sync, reprovisioning becomes painful.

Mitigation:

- Use a remote state backend (Terraform Cloud or S3 with locking)
- Version Terraform code in Git; never commit `.tfstate`
- Daily state backup: `terraform state pull > backup-$(date).json`
- Only one person runs `terraform apply` at a time

Contingency: Manual terraform import of existing resources if needed
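The remote-backend mitigation can be sketched as a Terraform backend block. This is a config fragment, not a tested setup: the bucket, region, and lock-table names are placeholders, and it assumes the S3 bucket and DynamoDB table already exist:

```hcl
# Sketch of a locked remote backend; all names are placeholders.
terraform {
  backend "s3" {
    bucket         = "poc-terraform-state"      # pre-created S3 bucket
    key            = "poc/terraform.tfstate"
    region         = "ap-southeast-1"           # placeholder region
    dynamodb_table = "terraform-locks"          # enables state locking
    encrypt        = true
  }
}
```

With locking enabled, a second concurrent `terraform apply` blocks instead of corrupting state, which enforces the one-operator rule above mechanically.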


Risk 2.3: Kafka Schema Registry Validation Breaks

| Attribute | Value |
| --- | --- |
| Likelihood | 3 (Moderate) |
| Impact | 2 (Low) |
| Risk Score | 6 (HIGH) |
| Category | Configuration |
| Blocks | Data pipeline demo |

Description: If schema validation is too strict, test traffic is rejected.

Mitigation:

- Start with schema validation OFF (demo the feature separately)
- Test schema registration before enabling validation
- Have sample payloads pre-validated


TIER 3: MEDIUM (Monitor, Plan Fallbacks)

Risk 3.1: JBoss Domain Mode Learning Curve

  • Likelihood: 4 | Impact: 1 | Score: 4
  • Fallback: Deploy standalone JBoss instead of domain mode (simpler, still demonstrates app server)

Risk 3.2: Trivy Dashboard Performance

  • Likelihood: 2 | Impact: 2 | Score: 4
  • Fallback: Use Trivy CLI for scan results, skip dashboard if slow

Risk 3.3: ArgoCD GitOps Sync Delays

  • Likelihood: 2 | Impact: 2 | Score: 4
  • Fallback: Manual kubectl apply if GitOps is too slow for demo

Risk 3.4: Demo Recording Failures

  • Likelihood: 3 | Impact: 1 | Score: 3
  • Fallback: Live demo instead of pre-recorded video

Timeline Risk Analysis

Critical Path (Days 1-3 → Phase 1 → Blocks Everything)

```
Day 1: Start → OpenShift Provisioning (8h)
       + GitLab HA (3h, parallel)
       + Kafka (2h, parallel)
       + Redis (1h, parallel)
       END: All foundation infra ready by EOD Day 1

Day 2: Compliance Scan (1h)
       OTel Observability (4h, CRITICAL)
       WSO2 APIM (3h)
       END: Observability pipeline validated by EOD Day 2

Day 3: Middleware (2h)
       ArgoCD (2h)
       Buffer (4h)
       END: Phase 2 complete, ready for Phase 3

Days 5-6: Trivy, Nexus, JBoss, Validation, Demo
```

Slack Time Analysis:

- Day 1: 6 hours slack (if infra done faster)
- Day 2: 2 hours slack (observability is critical)
- Day 3: 4 hours slack (good buffer)
- Days 5-6: 1 hour total slack (tight)

Risk: Days 5-6 have almost NO slack. If Phase 3 components slip, no recovery time.

Mitigation:

- Start Phase 3 components on Day 4 evening if Phase 2 is ahead of schedule
- Trivy and Nexus are independent, so parallelize them
- Pre-stage JBoss manifests now; just deploy on Day 6


Pre-Mortem: "It's Day 7, POC Failed. What Happened?"

Likely Failure Scenarios (in order of probability):

1. OpenShift Didn't Provision (30% probability)
   - Root cause: Terraform errors, cloud quota, network issues
   - Prevention: Test on Day 0, have manual fallback ready
2. Observability Lost Traces (25% probability)
   - Root cause: Kafka filled, OTel collector crashed, ClickHouse issues
   - Prevention: Load test early, monitor resource usage constantly
3. Resource Exhaustion Mid-Demo (20% probability)
   - Root cause: 9 components fighting for memory on 3 small nodes
   - Prevention: Right-size nodes upfront, monitor daily
4. WSO2 SSO Never Worked (15% probability)
   - Root cause: Configuration complexity, SAML metadata issues
   - Prevention: Use tested Helm values, test early
5. Time Overrun (Phase 3 Incomplete) (10% probability)
   - Root cause: Underestimated Phase 2 complexity, no buffer
   - Prevention: Aggressive Phase 1-2 timeline, parallelize Phase 3

Escalation Matrix

| Scenario | Owner | Timeline | Action |
| --- | --- | --- | --- |
| OpenShift fails | Infrastructure Lead | Day 1, 2pm | Try manual install, or use backup account |
| Observability data loss | Platform Lead | Day 2, ongoing | Reduce traffic load, restart stack, skip demo |
| Resource exhaustion | Cluster Admin | Daily | Reduce replicas, evict non-critical pods |
| WSO2 SSO broken | Integration Lead | Day 4 | Skip SSO, focus on API gateway |
| Behind timeline | Project Lead | Day 4 | Skip Phase 3 optional components |
| Demo request changes | Project Lead | Day 5 | Scope lock, communicate with BRAC |

Risk Monitoring Checklist

Daily:

- [ ] Cluster health: `kubectl get nodes` (all Ready?)
- [ ] Pod status: any CrashLoopBackOff or Pending?
- [ ] Resource usage: any node > 80% memory?
- [ ] Terraform state: clean, no conflicts?
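The daily pod-status item can be scripted. The heredoc stands in for `kubectl get pods -A` output; the namespaces and pod names are illustrative:

```shell
# Flag pods in CrashLoopBackOff or Pending across all namespaces.
# Sample `kubectl get pods -A` output; substitute the live command.
cat > /tmp/pods.txt <<'EOF'
NAMESPACE   NAME            READY  STATUS            RESTARTS  AGE
signoz      otel-collector  1/1    Running           0         2d
kafka       broker-2        0/1    CrashLoopBackOff  12        2d
wso2        apim-gw         0/1    Pending           0         1h
EOF
# Field 4 is STATUS; print namespace/name for each unhealthy pod.
ISSUES=$(awk 'NR > 1 && ($4 == "CrashLoopBackOff" || $4 == "Pending") { print "CHECK: " $1 "/" $2 " is " $4 }' /tmp/pods.txt)
echo "$ISSUES"
```

Run at standup, this turns the checklist item into a copy-pasteable list of pods to investigate.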

Per Phase:

- [ ] Phase 1 complete: all infra deployed, validated
- [ ] Phase 2 complete: observability pipeline flowing, traces visible
- [ ] Phase 3 complete: all components accessible, ready for demo

Weekly (or as needed):

- [ ] Update this risk register
- [ ] Review timeline vs. actual progress
- [ ] Identify new risks
- [ ] Escalate if needed


Decision Log

| Date | Risk | Decision | Owner |
| --- | --- | --- | --- |
| 2026-04-24 | OpenShift | Start provisioning on Day 0 (tonight) | Infra Lead |
| 2026-04-24 | Observability | Load test before demo | Platform Lead |
| 2026-04-24 | Phase 3 | Parallelize components to recover time | Project Lead |

How to Use This Document

  1. Before Each Phase: Review relevant risks
  2. Daily Standup: Check monitoring checklist
  3. Issue Found: Update risk status, escalate if needed
  4. Phase Complete: Validate mitigation strategies worked
  5. Lessons Learned: Update for future POCs

Risk Register Created: 2026-04-24
Next Review: EOD Day 1 (after Phase 1 starts)
Owner: Project Lead
Status: Active (6-day POC in progress)