Risk Register & Pre-Mortem Analysis¶
Identifies risks, their likelihood and impact, and mitigation strategies for the aggressive 6-day timeline.
Risk Assessment Matrix¶
```
Impact
  5 │ CRITICAL: Stop work, escalate
  4 │ HIGH: Major timeline impact
  3 │ MEDIUM: Recoverable
  2 │ LOW: Minor impact
  1 │ NEGLIGIBLE: Proceed as planned
    └─────────────────────────────
      1    2    3    4    5   (Likelihood)
```
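The scores in the tables below follow the usual likelihood × impact convention, each factor on a 1-5 scale. A minimal sketch of that scoring:

```python
# Risk score = likelihood x impact, each on a 1-5 scale,
# matching the scoring used in the risk tables below.
def risk_score(likelihood: int, impact: int) -> int:
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    return likelihood * impact

# Examples from this register:
print(risk_score(2, 5))  # Risk 1.1 -> 10
print(risk_score(3, 4))  # Risk 1.2 -> 12
```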
Identified Risks¶
TIER 1: CRITICAL (Must Monitor)¶
Risk 1.1: OpenShift Cluster Provisioning Fails (Phase 1 Blocker)¶
| Attribute | Value |
|---|---|
| Likelihood | 2 (Low) |
| Impact | 5 (Critical) |
| Risk Score | 10 (CRITICAL) |
| Category | Infrastructure |
| Blocks | Everything (Phase 2, 3 dependent) |
Description: Terraform automation fails to provision the 3-node cluster. Possible causes: VM quota limits, network misconfiguration, or IaaS API changes.
Pre-Mortem Questions:
- What if we don't have enough vCPU quota in the cloud account?
- What if networking is misconfigured for the cluster nodes?
- What if the pull secret is invalid or expired?
Mitigation Strategies (Priority Order):
1. Preparation (now):
   - Verify the cloud account has 12+ vCPU quota available before starting
   - Test Terraform modules locally against a test environment
   - Have a backup VM provisioning script (manual setup) ready
   - Validate that the pull secret is current and correct
2. During Phase 1:
   - Start OpenShift provisioning on Day 0 (tonight if possible)
   - Monitor Terraform logs in real time
   - Have an alternate cloud account prepared
3. If it fails:
   - Switch to manual VM provisioning + OpenShift installer (adds 2-4 hours)
   - Fall back to an existing OpenShift cluster if available
   - Escalate to BRAC Bank for environment support
Escalation Trigger: If provisioning is not complete by the end of Day 1 morning
Contingency Plan:
- Manual alternative: Use openshift-install with pre-provisioned VMs (documented in DEPLOYMENT.md)
- Estimated recovery time: 4 hours
- Still have 2 days for Phase 2-3 work
Risk 1.2: Observability Pipeline Data Loss (Phase 2 Critical Path)¶
| Attribute | Value |
|---|---|
| Likelihood | 3 (Moderate) |
| Impact | 4 (High) |
| Risk Score | 12 (CRITICAL) |
| Category | Architecture |
| Blocks | SigNoz validation, demo credibility |
Description: The OTel Collector drops traces, Kafka fills up, or ClickHouse fails to persist data, breaking the observability demo.
Pre-Mortem Questions:
- What if the OTel Collector runs out of memory?
- What if the Kafka topic fills up and drops messages?
- What if ClickHouse cold archiving breaks mid-POC?
Mitigation Strategies:
1. Design:
   - Set Kafka retention to 7 days (not 24 hours) to avoid early drops
   - Configure the OTel batch processor with a large queue (up to 1,000 batches)
   - Pre-stage ClickHouse with the correct DDL and test inserts
   - Set resource limits for the OTel Collector: 1 GB memory, 2 CPU (raise if needed)
2. Validation:
   - Load test: send 100k spans in 5 minutes and verify they all reach SigNoz
   - Monitor Kafka topics: alert if consumer lag exceeds 100k messages
   - ClickHouse disk space: ensure 50 GB free before starting
3. If data loss is detected:
   - Re-run the sample app with fresh traffic
   - Restart the observability stack (clears in-memory buffers)
   - Skip the cold-archiving demo if ClickHouse fails
Escalation Trigger: If more than 10% of traces are lost after 1 hour of traffic
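The batching and memory-limit settings above can be sketched as an OTel Collector config fragment. The values mirror the targets in this register; the broker address and topic name are placeholders, not known values from this environment:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  memory_limiter:            # sheds load before the collector OOMs
    check_interval: 1s
    limit_mib: 800           # stays below the 1 GB container limit
    spike_limit_mib: 200
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  kafka:
    brokers: ["kafka.observability.svc:9092"]   # placeholder address
    topic: otlp_spans                           # placeholder topic name
    sending_queue:
      enabled: true
      queue_size: 1000       # matches the "1000 max batches" target above

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [kafka]
```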
Risk 1.3: Kubernetes Resource Exhaustion (Phase 2-3)¶
| Attribute | Value |
|---|---|
| Likelihood | 2 (Low) |
| Impact | 4 (High) |
| Risk Score | 8 (CRITICAL) |
| Category | Infrastructure |
| Blocks | Multiple component deployments |
Description: Nine major components plus test workloads consume too much cluster memory/CPU, leading to pod evictions and demo failures.
Pre-Mortem Questions:
- What if SigNoz + ClickHouse need 20 GB RAM total?
- What if all Kafka brokers + observability pods hit 100% CPU?
- What if storage fills up during the POC?
Mitigation Strategies:
1. Upfront Planning:
   - Resource estimate per component (documented in DEPLOYMENT.md):
     - OpenShift nodes: 3 × (8 vCPU, 32 GB) = baseline capacity
     - Kafka: 3 brokers × 2 GB = 6 GB
     - SigNoz: 8 GB
     - WSO2 APIM: 4 GB
     - Redis: 2 GB
     - Total: ~22 GB RAM and 12+ vCPU needed
2. Monitoring:
   - Daily cluster health check: `kubectl top nodes`, `kubectl top pods`
   - Alert if any node exceeds 80% memory
   - Set pod resource requests/limits strictly
3. If exhausted:
   - Evict non-critical components (Nexus and Trivy can be skipped)
   - Reduce pod replicas from 2 to 1 where safe
   - Disable unused services
Escalation Trigger: If any node hits 90% memory or any pod is OOMKilled
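The back-of-envelope sizing above can be sanity-checked in a few lines. Component figures come from this register; the per-node system reserve is an assumed value for OS and kubelet overhead, not a measurement:

```python
# Memory budget check for the 3-node POC cluster. Component figures are
# from the risk register; SYSTEM_RESERVE_GB is an assumed per-node
# OS/kubelet reserve. The register's ~22 GB total also includes smaller
# pods not itemized here.
COMPONENTS_GB = {
    "kafka (3 brokers x 2 GB)": 6,
    "signoz": 8,
    "wso2_apim": 4,
    "redis": 2,
}
NODES = 3
NODE_MEMORY_GB = 32
SYSTEM_RESERVE_GB = 2  # assumption

usable = NODES * (NODE_MEMORY_GB - SYSTEM_RESERVE_GB)
requested = sum(COMPONENTS_GB.values())
print(f"requested={requested} GB, usable={usable} GB, "
      f"headroom={usable - requested} GB")
```

A positive headroom here is necessary but not sufficient; per-node placement still matters, which is why the register pins pod requests/limits explicitly.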
TIER 2: HIGH (Monitor Weekly)¶
Risk 2.1: WSO2 Configuration Complexity¶
| Attribute | Value |
|---|---|
| Likelihood | 4 (High) |
| Impact | 2 (Low) |
| Risk Score | 8 (HIGH) |
| Category | Integration |
| Blocks | SSO demo |
Description: WSO2 SAML/OIDC configuration is notoriously finicky. Small misconfigurations break SSO.
Mitigation:
- Use WSO2 Helm chart defaults (tested, known to work)
- Pre-generate a sample SAML assertion for testing
- Have a fallback demo ready: API-key authentication if SSO fails
Contingency: Skip the SSO demo and focus on APIM rate limiting instead (still valuable)
Risk 2.2: Terraform State Conflicts (Phase 1-2)¶
| Attribute | Value |
|---|---|
| Likelihood | 2 (Low) |
| Impact | 3 (Medium) |
| Risk Score | 6 (HIGH) |
| Category | Process |
| Blocks | Re-deployment |
Description: If the state file becomes corrupted or out of sync, reprovisioning becomes painful.
Mitigation:
- Use a remote state backend (Terraform Cloud, or S3 with locking)
- Version Terraform code in Git; never commit `.tfstate`
- Daily state backup: `terraform state pull > backup-$(date +%F).json`
- Only one person runs `terraform apply` at a time
Contingency: Manually `terraform import` existing resources if needed
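A remote state backend with locking, as recommended above, might look like the following sketch. The bucket, key, region, and lock-table names are placeholders to adapt, not values from this project:

```hcl
terraform {
  backend "s3" {
    bucket         = "poc-terraform-state"      # placeholder bucket name
    key            = "poc/cluster/terraform.tfstate"
    region         = "ap-southeast-1"           # placeholder region
    dynamodb_table = "terraform-locks"          # enables state locking
    encrypt        = true
  }
}
```

With locking enabled, a second concurrent `terraform apply` blocks instead of corrupting state, which directly addresses the "only one person applies" rule above.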
Risk 2.3: Kafka Schema Registry Validation Breaks¶
| Attribute | Value |
|---|---|
| Likelihood | 3 (Moderate) |
| Impact | 2 (Low) |
| Risk Score | 6 (HIGH) |
| Category | Configuration |
| Blocks | Data pipeline demo |
Description: If schema validation is too strict, test traffic is rejected.
Mitigation:
- Start with schema validation OFF (demo the feature separately)
- Test schema registration before enabling validation
- Have sample payloads pre-validated
TIER 3: MEDIUM (Monitor, Plan Fallbacks)¶
Risk 3.1: JBoss Domain Mode Learning Curve¶
- Likelihood: 4 | Impact: 1 | Score: 4
- Fallback: Deploy standalone JBoss instead of domain mode (simpler, still demonstrates app server)
Risk 3.2: Trivy Dashboard Performance¶
- Likelihood: 2 | Impact: 2 | Score: 4
- Fallback: Use Trivy CLI for scan results, skip dashboard if slow
Risk 3.3: ArgoCD GitOps Sync Delays¶
- Likelihood: 2 | Impact: 2 | Score: 4
- Fallback: Manual `kubectl apply` if GitOps sync is too slow for the demo
Risk 3.4: Demo Recording Failures¶
- Likelihood: 3 | Impact: 1 | Score: 3
- Fallback: Live demo instead of pre-recorded video
Timeline Risk Analysis¶
Critical Path (Days 1-3 → Phase 1 → Blocks Everything)¶
```
Day 1: Start → OpenShift Provisioning (8h)
         + GitLab HA (3h, parallel)
         + Kafka (2h, parallel)
         + Redis (1h, parallel)
       END: All foundation infra ready by EOD Day 1

Day 2: Compliance Scan (1h) → OTel Observability (4h, CRITICAL) → WSO2 APIM (3h)
       END: Observability pipeline validated by EOD Day 2

Day 3: Middleware (2h) → ArgoCD (2h) → Buffer (4h)
       END: Phase 2 complete, ready for Phase 3

Days 5-6: Trivy, Nexus, JBoss, Validation, Demo
```
Slack Time Analysis:
- Day 1: 6 hours slack (if infra finishes faster)
- Day 2: 2 hours slack (observability is critical)
- Day 3: 4 hours slack (good buffer)
- Days 5-6: 1 hour total slack (tight)
Risk: Days 5-6 have almost no slack. If Phase 3 components slip, there is no recovery time.
Mitigation:
- Start Phase 3 components on Day 4 evening if Phase 2 is ahead of schedule
- Trivy and Nexus are independent, so parallelize them
- Pre-stage JBoss manifests now; just deploy on Day 6
Pre-Mortem: "It's Day 7, POC Failed. What Happened?"¶
Likely Failure Scenarios (in order of probability):

1. OpenShift didn't provision (30% probability)
   - Root cause: Terraform errors, cloud quota, network issues
   - Prevention: Test on Day 0; have the manual fallback ready
2. Observability lost traces (25% probability)
   - Root cause: Kafka filled up, OTel Collector crashed, ClickHouse issues
   - Prevention: Load test early; monitor resource usage constantly
3. Resource exhaustion mid-demo (20% probability)
   - Root cause: 9 components fighting for memory on 3 small nodes
   - Prevention: Right-size nodes upfront; monitor daily
4. WSO2 SSO never worked (15% probability)
   - Root cause: Configuration complexity, SAML metadata issues
   - Prevention: Use tested Helm values; test early
5. Time overrun, Phase 3 incomplete (10% probability)
   - Root cause: Underestimated Phase 2 complexity, no buffer
   - Prevention: Aggressive Phase 1-2 timeline; parallelize Phase 3
Escalation Matrix¶
| Scenario | Owner | Timeline | Action |
|---|---|---|---|
| OpenShift fails | Infrastructure Lead | Day 1, 2pm | Try manual install, or use backup account |
| Observability data loss | Platform Lead | Day 2, ongoing | Reduce traffic load, restart stack, skip demo |
| Resource exhaustion | Cluster Admin | Daily | Reduce replicas, evict non-critical pods |
| WSO2 SSO broken | Integration Lead | Day 4 | Skip SSO, focus on API gateway |
| Behind timeline | Project Lead | Day 4 | Skip Phase 3 optional components |
| Demo request changes | Project Lead | Day 5 | Scope lock, communicate with BRAC |
Risk Monitoring Checklist¶
Daily:
- [ ] Cluster health: `kubectl get nodes`, all nodes Ready?
- [ ] Pod status: Any CrashLoopBackOff or Pending pods?
- [ ] Resource usage: Any node above 80% memory?
- [ ] Terraform state: Clean, no conflicts?
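The daily memory check above can be scripted. A minimal sketch that parses captured `kubectl top nodes` output and flags nodes over the 80% threshold; it assumes the standard five-column layout (`NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%`):

```python
# Flag nodes above a memory threshold from `kubectl top nodes` output.
# Assumes the standard five-column layout; feed it the command's
# captured stdout (e.g. via subprocess or a cron job).
def overloaded_nodes(top_output: str, threshold_pct: int = 80) -> list:
    flagged = []
    for line in top_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        if int(cols[4].rstrip("%")) > threshold_pct:
            flagged.append(cols[0])
    return flagged

# Illustrative output, not from a real cluster:
sample = """NAME    CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
node-1  1200m       15%   29Gi           91%
node-2  900m        11%   12Gi           37%"""
print(overloaded_nodes(sample))  # ['node-1']
```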
Per Phase:
- [ ] Phase 1 complete: All infra deployed and validated
- [ ] Phase 2 complete: Observability pipeline flowing, traces visible
- [ ] Phase 3 complete: All components accessible, ready for demo
Weekly (or as needed):
- [ ] Update this risk register
- [ ] Review timeline vs. actual progress
- [ ] Identify new risks
- [ ] Escalate if needed
Decision Log¶
| Date | Risk | Decision | Owner |
|---|---|---|---|
| 2026-04-24 | OpenShift | Start provisioning on Day 0 (tonight) | Infra Lead |
| 2026-04-24 | Observability | Load test before demo | Platform Lead |
| 2026-04-24 | Phase 3 | Parallelize components to recover time | Project Lead |
How to Use This Document¶
- Before Each Phase: Review relevant risks
- Daily Standup: Check monitoring checklist
- Issue Found: Update risk status, escalate if needed
- Phase Complete: Validate mitigation strategies worked
- Lessons Learned: Update for future POCs
Risk Register Created: 2026-04-24
Next Review: EOD Day 1 (after Phase 1 starts)
Owner: Project Lead
Status: Active (6-day POC in progress)