Risk Register & Pre-Mortem Analysis¶
Identifies risks, their likelihood and impact, and mitigation strategies for the aggressive 6-day timeline.
Risk Assessment Matrix¶
```
Impact
  5 │ CRITICAL: Stop work, escalate
  4 │ HIGH: Major timeline impact
  3 │ MEDIUM: Recoverable
  2 │ LOW: Minor impact
  1 │ NEGLIGIBLE: Proceed as planned
    └─────────────────────────────
      1    2    3    4    5   (Likelihood)
```
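The scores in the tables below follow the usual likelihood × impact convention, each factor on a 1-5 scale. A minimal sketch of that scoring:

```python
# Risk score = likelihood x impact, each on a 1-5 scale,
# matching the scoring used in the risk tables below.
def risk_score(likelihood: int, impact: int) -> int:
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    return likelihood * impact

# Examples from this register:
print(risk_score(2, 5))  # Risk 1.1 -> 10
print(risk_score(3, 4))  # Risk 1.2 -> 12
```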
Identified Risks¶
TIER 1: CRITICAL (Must Monitor)¶
Risk 1.1: OpenShift Cluster Provisioning Fails (Phase 1 Blocker)¶
| Attribute | Value |
|---|---|
| Likelihood | 2 (Low) |
| Impact | 5 (Critical) |
| Risk Score | 10 (CRITICAL) |
| Category | Infrastructure |
| Blocks | Everything (Phase 2, 3 dependent) |
Description: Terraform automation fails to provision the 3-node cluster. Possible causes: VM quota limits, network misconfiguration, or IaaS API changes.
Pre-Mortem Questions:
- What if we don't have enough vCPU quota in the cloud account?
- What if networking is misconfigured for the cluster nodes?
- What if the pull secret is invalid or expired?
Mitigation Strategies (Priority Order):
1. Preparation (now):
   - Verify the cloud account has 12+ vCPU quota available before starting
   - Test Terraform modules locally against a test environment
   - Have a backup VM provisioning script (manual setup) ready
   - Validate that the pull secret is current and correct
2. During Phase 1:
   - Start OpenShift provisioning on Day 0 (tonight if possible)
   - Monitor Terraform logs in real time
   - Have an alternate cloud account prepared
3. If it fails:
   - Switch to manual VM provisioning + OpenShift installer (adds 2-4 hours)
   - Fall back to an existing OpenShift cluster if available
   - Escalate to BRAC Bank for environment support
Escalation Trigger: If provisioning is not complete by the end of Day 1 morning
Contingency Plan:
- Manual alternative: Use openshift-install with pre-provisioned VMs (documented in DEPLOYMENT.md)
- Estimated recovery time: 4 hours
- Still have 2 days for Phase 2-3 work
Risk 1.2: Observability Pipeline Data Loss (Phase 2 Critical Path)¶
| Attribute | Value |
|---|---|
| Likelihood | 3 (Moderate) |
| Impact | 4 (High) |
| Risk Score | 12 (CRITICAL) |
| Category | Architecture |
| Blocks | SigNoz validation, demo credibility |
Description: The OTel Collector drops traces, Kafka fills up, or ClickHouse fails to persist data, breaking the observability demo.
Pre-Mortem Questions:
- What if the OTel Collector runs out of memory?
- What if the Kafka topic fills up and drops messages?
- What if ClickHouse cold archiving breaks mid-POC?
Mitigation Strategies:
1. Design:
   - Set Kafka retention to 7 days (not 24 hours) to avoid early drops
   - Configure the OTel batch processor with a large queue (up to 1,000 batches)
   - Pre-stage ClickHouse with the correct DDL and test inserts
   - Set resource limits for the OTel Collector: 1 GB memory, 2 CPU (raise if needed)
2. Validation:
   - Load test: send 100k spans in 5 minutes and verify they all reach SigNoz
   - Monitor Kafka topics: alert if consumer lag exceeds 100k messages
   - ClickHouse disk space: ensure 50 GB free before starting
3. If data loss is detected:
   - Re-run the sample app with fresh traffic
   - Restart the observability stack (clears in-memory buffers)
   - Skip the cold-archiving demo if ClickHouse fails
Escalation Trigger: If more than 10% of traces are lost after 1 hour of traffic
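The batching and memory-limit settings above can be sketched as an OTel Collector config fragment. The values mirror the targets in this register; the broker address and topic name are placeholders, not known values from this environment:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  memory_limiter:            # sheds load before the collector OOMs
    check_interval: 1s
    limit_mib: 800           # stays below the 1 GB container limit
    spike_limit_mib: 200
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  kafka:
    brokers: ["kafka.observability.svc:9092"]   # placeholder address
    topic: otlp_spans                           # placeholder topic name
    sending_queue:
      enabled: true
      queue_size: 1000       # matches the "1000 max batches" target above

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [kafka]
```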
Risk 1.3: Kubernetes Resource Exhaustion (Phase 2-3)¶
| Attribute | Value |
|---|---|
| Likelihood | 2 (Low) |
| Impact | 4 (High) |
| Risk Score | 8 (CRITICAL) |
| Category | Infrastructure |
| Blocks | Multiple component deployments |
Description: Nine major components plus test workloads consume too much cluster memory/CPU, leading to pod evictions and demo failures.
Pre-Mortem Questions:
- What if SigNoz + ClickHouse need 20 GB RAM total?
- What if all Kafka brokers + observability pods hit 100% CPU?
- What if storage fills up during the POC?
Mitigation Strategies:
1. Upfront Planning:
   - Resource estimate per component (documented in DEPLOYMENT.md):
     - OpenShift nodes: 3 × (8 vCPU, 32 GB) = baseline capacity
     - Kafka: 3 brokers × 2 GB = 6 GB
     - SigNoz: 8 GB
     - WSO2 APIM: 4 GB
     - Redis: 2 GB
     - Total: ~22 GB RAM and 12+ vCPU needed
2. Monitoring:
   - Daily cluster health check: `kubectl top nodes`, `kubectl top pods`
   - Alert if any node exceeds 80% memory
   - Set pod resource requests/limits strictly
3. If exhausted:
   - Evict non-critical components (Nexus and Trivy can be skipped)
   - Reduce pod replicas from 2 to 1 where safe
   - Disable unused services
Escalation Trigger: If any node hits 90% memory or any pod is OOMKilled
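The back-of-envelope sizing above can be sanity-checked in a few lines. Component figures come from this register; the per-node system reserve is an assumed value for OS and kubelet overhead, not a measurement:

```python
# Memory budget check for the 3-node POC cluster. Component figures are
# from the risk register; SYSTEM_RESERVE_GB is an assumed per-node
# OS/kubelet reserve. The register's ~22 GB total also includes smaller
# pods not itemized here.
COMPONENTS_GB = {
    "kafka (3 brokers x 2 GB)": 6,
    "signoz": 8,
    "wso2_apim": 4,
    "redis": 2,
}
NODES = 3
NODE_MEMORY_GB = 32
SYSTEM_RESERVE_GB = 2  # assumption

usable = NODES * (NODE_MEMORY_GB - SYSTEM_RESERVE_GB)
requested = sum(COMPONENTS_GB.values())
print(f"requested={requested} GB, usable={usable} GB, "
      f"headroom={usable - requested} GB")
```

A positive headroom here is necessary but not sufficient; per-node placement still matters, which is why the register pins pod requests/limits explicitly.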
TIER 2: HIGH (Monitor Weekly)¶
Risk 2.1: WSO2 Configuration Complexity¶
| Attribute | Value |
|---|---|
| Likelihood | 4 (High) |
| Impact | 2 (Low) |
| Risk Score | 8 (HIGH) |
| Category | Integration |
| Blocks | SSO demo |
Description: WSO2 SAML/OIDC configuration is notoriously finicky. Small misconfigurations break SSO.
Mitigation:
- Use WSO2 Helm chart defaults (tested, known to work)
- Pre-generate a sample SAML assertion for testing
- Have a fallback demo ready: API-key authentication if SSO fails
Contingency: Skip the SSO demo and focus on APIM rate limiting instead (still valuable)
Risk 2.2: Terraform State Conflicts (Phase 1-2)¶
| Attribute | Value |
|---|---|
| Likelihood | 2 (Low) |
| Impact | 3 (Medium) |
| Risk Score | 6 (HIGH) |
| Category | Process |
| Blocks | Re-deployment |
Description: If the state file becomes corrupted or out of sync, reprovisioning becomes painful.
Mitigation:
- Use a remote state backend (Terraform Cloud, or S3 with locking)
- Version Terraform code in Git; never commit `.tfstate`
- Daily state backup: `terraform state pull > backup-$(date +%F).json`
- Only one person runs `terraform apply` at a time
Contingency: Manually `terraform import` existing resources if needed
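A remote state backend with locking, as recommended above, might look like the following sketch. The bucket, key, region, and lock-table names are placeholders to adapt, not values from this project:

```hcl
terraform {
  backend "s3" {
    bucket         = "poc-terraform-state"      # placeholder bucket name
    key            = "poc/cluster/terraform.tfstate"
    region         = "ap-southeast-1"           # placeholder region
    dynamodb_table = "terraform-locks"          # enables state locking
    encrypt        = true
  }
}
```

With locking enabled, a second concurrent `terraform apply` blocks instead of corrupting state, which directly addresses the "only one person applies" rule above.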
Risk 2.3: Kafka Schema Registry Validation Breaks¶
| Attribute | Value |
|---|---|
| Likelihood | 3 (Moderate) |
| Impact | 2 (Low) |
| Risk Score | 6 (HIGH) |
| Category | Configuration |
| Blocks | Data pipeline demo |
Description: If schema validation is too strict, test traffic is rejected.
Mitigation:
- Start with schema validation OFF (demo the feature separately)
- Test schema registration before enabling validation
- Have sample payloads pre-validated
TIER 3: MEDIUM (Monitor, Plan Fallbacks)¶
Risk 3.1: JBoss Domain Mode Learning Curve¶
- Likelihood: 4 | Impact: 1 | Score: 4
- Fallback: Deploy standalone JBoss instead of domain mode (simpler, still demonstrates app server)
Risk 3.2: Trivy Dashboard Performance¶
- Likelihood: 2 | Impact: 2 | Score: 4
- Fallback: Use Trivy CLI for scan results, skip dashboard if slow
Risk 3.3: ArgoCD GitOps Sync Delays¶
- Likelihood: 2 | Impact: 2 | Score: 4
- Fallback: Manual `kubectl apply` if GitOps sync is too slow for the demo
Risk 3.4: Demo Recording Failures¶
- Likelihood: 3 | Impact: 1 | Score: 3
- Fallback: Live demo instead of pre-recorded video
Timeline Risk Analysis¶
Critical Path (Days 1-3 → Phase 1 → Blocks Everything)¶
```
Day 1: Start → OpenShift Provisioning (8h)
         + GitLab HA (3h, parallel)
         + Kafka (2h, parallel)
         + Redis (1h, parallel)
       END: All foundation infra ready by EOD Day 1

Day 2: Compliance Scan (1h) → OTel Observability (4h, CRITICAL) → WSO2 APIM (3h)
       END: Observability pipeline validated by EOD Day 2

Day 3: Middleware (2h) → ArgoCD (2h) → Buffer (4h)
       END: Phase 2 complete, ready for Phase 3

Days 5-6: Trivy, Nexus, JBoss, Validation, Demo
```
Slack Time Analysis:
- Day 1: 6 hours slack (if infra finishes faster)
- Day 2: 2 hours slack (observability is critical)
- Day 3: 4 hours slack (good buffer)
- Days 5-6: 1 hour total slack (tight)
Risk: Days 5-6 have almost no slack. If Phase 3 components slip, there is no recovery time.
Mitigation:
- Start Phase 3 components on Day 4 evening if Phase 2 is ahead of schedule
- Trivy and Nexus are independent, so parallelize them
- Pre-stage JBoss manifests now; just deploy on Day 6
Pre-Mortem: "It's Day 7, POC Failed. What Happened?"¶
Likely Failure Scenarios (in order of probability):

1. OpenShift didn't provision (30% probability)
   - Root cause: Terraform errors, cloud quota, network issues
   - Prevention: Test on Day 0; have the manual fallback ready
2. Observability lost traces (25% probability)
   - Root cause: Kafka filled up, OTel Collector crashed, ClickHouse issues
   - Prevention: Load test early; monitor resource usage constantly
3. Resource exhaustion mid-demo (20% probability)
   - Root cause: 9 components fighting for memory on 3 small nodes
   - Prevention: Right-size nodes upfront; monitor daily
4. WSO2 SSO never worked (15% probability)
   - Root cause: Configuration complexity, SAML metadata issues
   - Prevention: Use tested Helm values; test early
5. Time overrun, Phase 3 incomplete (10% probability)
   - Root cause: Underestimated Phase 2 complexity, no buffer
   - Prevention: Aggressive Phase 1-2 timeline; parallelize Phase 3
Escalation Matrix¶
| Scenario | Owner | Timeline | Action |
|---|---|---|---|
| OpenShift fails | Infrastructure Lead | Day 1, 2pm | Try manual install, or use backup account |
| Observability data loss | Platform Lead | Day 2, ongoing | Reduce traffic load, restart stack, skip demo |
| Resource exhaustion | Cluster Admin | Daily | Reduce replicas, evict non-critical pods |
| WSO2 SSO broken | Integration Lead | Day 4 | Skip SSO, focus on API gateway |
| Behind timeline | Project Lead | Day 4 | Skip Phase 3 optional components |
| Demo request changes | Project Lead | Day 5 | Scope lock, communicate with BRAC |
Risk Monitoring Checklist¶
Daily:
- [ ] Cluster health: `kubectl get nodes`, all nodes Ready?
- [ ] Pod status: Any CrashLoopBackOff or Pending pods?
- [ ] Resource usage: Any node above 80% memory?
- [ ] Terraform state: Clean, no conflicts?
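The daily memory check above can be scripted. A minimal sketch that parses captured `kubectl top nodes` output and flags nodes over the 80% threshold; it assumes the standard five-column layout (`NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%`):

```python
# Flag nodes above a memory threshold from `kubectl top nodes` output.
# Assumes the standard five-column layout; feed it the command's
# captured stdout (e.g. via subprocess or a cron job).
def overloaded_nodes(top_output: str, threshold_pct: int = 80) -> list:
    flagged = []
    for line in top_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        if int(cols[4].rstrip("%")) > threshold_pct:
            flagged.append(cols[0])
    return flagged

# Illustrative output, not from a real cluster:
sample = """NAME    CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
node-1  1200m       15%   29Gi           91%
node-2  900m        11%   12Gi           37%"""
print(overloaded_nodes(sample))  # ['node-1']
```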
Per Phase:
- [ ] Phase 1 complete: All infra deployed and validated
- [ ] Phase 2 complete: Observability pipeline flowing, traces visible
- [ ] Phase 3 complete: All components accessible, ready for demo
Weekly (or as needed):
- [ ] Update this risk register
- [ ] Review timeline vs. actual progress
- [ ] Identify new risks
- [ ] Escalate if needed
Decision Log¶
| Date | Risk | Decision | Owner |
|---|---|---|---|
| 2026-04-24 | OpenShift | Start provisioning on Day 0 (tonight) | Infra Lead |
| 2026-04-24 | Observability | Load test before demo | Platform Lead |
| 2026-04-24 | Phase 3 | Parallelize components to recover time | Project Lead |
How to Use This Document¶
- Before Each Phase: Review relevant risks
- Daily Standup: Check monitoring checklist
- Issue Found: Update risk status, escalate if needed
- Phase Complete: Validate mitigation strategies worked
- Lessons Learned: Update for future POCs
Risk Register Created: 2026-04-24
Next Review: EOD Day 1 (after Phase 1 starts)
Owner: Project Lead
Status: Active (6-day POC in progress)