ADR 0001: Choose OpenTelemetry for Observability Over Proprietary APM
Status: Accepted
Date: 2026-04-24
Authors: BRAC POC Team
Context
The POC requires comprehensive observability covering logs, metrics, and traces across 9 infrastructure components. BRAC Bank's RFP requirements state a preference for avoiding vendor lock-in. The solution must demonstrate:
- Vendor-neutral instrumentation
- Cost-effective scaling
- Multi-signal aggregation (logs, metrics, traces)
- Integration with multiple data backends
Decision
Adopt OpenTelemetry (OTel) as the observability standard, with SigNoz as the backend visualization and storage layer. This includes:
- OTel SDK for application instrumentation
- OTel Collector for aggregation and pipeline management
- Kafka as the buffer between collector and backend
- SigNoz for unified dashboards
- ClickHouse for long-term storage with retention policies
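The role of the buffer between collector and backend can be illustrated with a minimal sketch. It uses an in-memory `deque` as a stand-in for Kafka; the function name `simulate_buffer` and all rates are hypothetical, chosen only to show how a burst is absorbed and then drained at the backend's own pace.

```python
from collections import deque


def simulate_buffer(arrivals, drain_per_tick):
    """Simulate a queue between a bursty producer and a fixed-rate consumer.

    arrivals: telemetry batches produced at each tick (hypothetical numbers).
    drain_per_tick: batches the backend can ingest per tick (hypothetical).
    Returns (peak queue depth, batches still queued at the end).
    """
    queue = deque()
    peak_depth = 0
    next_id = 0
    for produced in arrivals:
        # Producer side: the collector publishes without waiting for the backend.
        for _ in range(produced):
            queue.append(next_id)
            next_id += 1
        peak_depth = max(peak_depth, len(queue))
        # Consumer side: the backend drains at its own steady rate.
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()
    return peak_depth, len(queue)


# A burst of 50 batches at tick 2; the backend ingests 10 batches per tick.
peak, leftover = simulate_buffer([5, 5, 50, 5, 5, 0, 0, 0], drain_per_tick=10)
print(f"peak depth={peak}, left in queue={leftover}")
```

In this toy run the backend never sees more than its fixed ingest rate, while the queue depth grows during the burst and shrinks afterwards. A real Kafka deployment adds persistence, partitioning, and consumer groups that this sketch does not model.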
Consequences
Positive
- Vendor neutrality: OTel is CNCF-backed, not tied to any single vendor
- Future flexibility: Can swap backends (SigNoz → Datadog, Grafana Loki, etc.)
- Standard instrumentation: Developers learn one framework applicable across organizations
- Cost control: Self-managed backend eliminates per-host and per-seat licensing
- Multi-signal correlation: Single platform for logs, metrics, and traces reduces context-switching
- Scalability: Kafka buffer prevents collector overload during peaks
Negative
- Operational overhead: Must manage OTel Collector, ClickHouse, and SigNoz ourselves
- Learning curve: Team needs OTel SDK knowledge (though adoption is growing)
- Data consistency: Multi-signal correlation requires careful configuration
- Retention management: ClickHouse cold archiving needs manual tuning
Neutral
- The OTel ecosystem is still evolving (most mature for traces, still maturing for logs and metrics)
- SigNoz is younger than enterprise APM vendors (Datadog, New Relic) but improving rapidly
Alternatives Considered
Alternative 1: Datadog APM
- Pros: Enterprise-grade, fully managed, excellent support
- Cons: Vendor lock-in, per-host licensing becomes expensive at scale, does not meet BRAC's vendor-neutrality preference
- Why rejected: Contradicts BRAC's stated vendor-independence requirement
Alternative 2: Elastic Stack (ELK)
- Pros: Open source, good for logs, wide adoption
- Cons: Weak in distributed tracing, separate tools needed for metrics/APM, less unified experience
- Why rejected: Fragmented observability story for this POC's multi-signal needs
Alternative 3: Prometheus + Loki + Jaeger (Best of Breed)
- Pros: All open source, CNCF projects, mature
- Cons: Three separate systems = three dashboards, higher operational complexity, no native correlation
- Why rejected: Inferior UX for unified observability; SigNoz provides better integration
Implementation Notes
OTel Configuration
- Collectors: Deployed as a DaemonSet, so one collector pod runs on each OpenShift node
- Sampling: Trace sampling set to 10% for performance (configurable per service)
- Exporters: OTLP → Kafka → SigNoz ClickHouse
- Resource attributes: Include cluster, namespace, pod, service name for correlation
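The sampling and resource-attribute settings above can be sketched in a few lines. This is a stdlib-only illustration of the general idea behind ratio-based trace sampling (a deterministic decision derived from the trace ID), not the OTel SDK's own code. The names `should_sample`, `resource_attributes`, and `SERVICE_RATIOS`, and the `payments-api` override, are hypothetical. The attribute keys follow OTel semantic conventions, and the output uses the comma-separated `key=value` shape accepted by the `OTEL_RESOURCE_ATTRIBUTES` environment variable.

```python
import random

TRACE_ID_64_BIT_LIMIT = 1 << 64  # the decision uses the low 64 bits of the trace ID

DEFAULT_RATIO = 0.10  # 10% default from this ADR
SERVICE_RATIOS = {"payments-api": 1.0}  # hypothetical per-service override


def should_sample(trace_id: int, service: str) -> bool:
    """Deterministic ratio-based decision: the same trace ID always gets the same answer."""
    ratio = SERVICE_RATIOS.get(service, DEFAULT_RATIO)
    bound = round(ratio * TRACE_ID_64_BIT_LIMIT)
    return (trace_id & (TRACE_ID_64_BIT_LIMIT - 1)) < bound


def resource_attributes(cluster: str, namespace: str, pod: str, service: str) -> str:
    """Join attributes as key=value pairs (values with special characters
    would need percent-encoding, which this sketch does not handle)."""
    attrs = {
        "k8s.cluster.name": cluster,
        "k8s.namespace.name": namespace,
        "k8s.pod.name": pod,
        "service.name": service,
    }
    return ",".join(f"{key}={value}" for key, value in attrs.items())


# Rough check: about one in ten random 128-bit trace IDs should be kept.
rng = random.Random(0)
kept = sum(should_sample(rng.getrandbits(128), "orders-api") for _ in range(100_000))
print(f"sampled fraction: {kept / 100_000:.3f}")
print(resource_attributes("poc-cluster", "demo", "orders-api-0", "orders-api"))
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict at a given ratio, which keeps sampled traces complete rather than fragmented.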
SigNoz Backend
- Deployment: Helm chart on OpenShift
- Data retention: 2-day hot retention in ClickHouse, cold archive to S3-compatible storage (MinIO)
- Dashboards: Pre-built templates for Application, Runtime, System, Tracing views
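The hot/cold retention boundary can be made concrete with a small sketch. It only models the age check that decides which tier a record belongs to; in ClickHouse itself this kind of rule is typically expressed as a table TTL, which is not shown here. The function name `storage_tier` is hypothetical, and treating a record of exactly two days as cold is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=2)  # 2-day hot retention from this ADR


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Return "hot" while the record is younger than the hot window, else "cold"."""
    return "hot" if now - event_time < HOT_RETENTION else "cold"


now = datetime(2026, 4, 24, 12, 0, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(hours=6), now))  # recent record
print(storage_tier(now - timedelta(days=3), now))   # older than the hot window
```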
Sample Application Instrumentation
- Deploy two variants: one with OTel SDK, one without
- Demonstrate SDK benefits: auto-instrumentation for common libraries
- Manual instrumentation for custom business logic
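The difference between the two variants can be sketched with a stdlib-only stand-in. This is not the OTel SDK API: the `span` context manager, the `RECORDED_SPANS` list, and the loan-approval functions are all hypothetical, and a real instrumented service would obtain a tracer from the OTel SDK instead. The sketch only shows what manual instrumentation of business logic adds: a named unit of work, its duration, and business attributes.

```python
import time
from contextlib import contextmanager

RECORDED_SPANS = []  # hypothetical stand-in for an exporter


@contextmanager
def span(name, **attributes):
    """Record the name, attributes, duration, and any error of a unit of work."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        RECORDED_SPANS.append(record)


def approve_loan_plain(amount):
    """Variant without instrumentation: the decision leaves no telemetry."""
    return amount <= 50_000


def approve_loan_instrumented(amount):
    """Variant with manual instrumentation around the same business rule."""
    with span("approve_loan", loan_amount=amount) as current:
        approved = amount <= 50_000
        current["attributes"]["loan_approved"] = approved
        return approved


approve_loan_plain(10_000)
approve_loan_instrumented(10_000)
print(RECORDED_SPANS[-1]["name"], RECORDED_SPANS[-1]["attributes"])
```

Both variants return the same result; only the instrumented one leaves a record that a backend could correlate with logs and metrics.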
References
- OpenTelemetry Official Docs
- SigNoz Documentation
- BRAC POC Requirements, Observability Section (brac_poc_mail.pdf in repo root)
- Related ADR: 0003-kafka-for-telemetry-buffering
Approval
- Approved by: BRAC POC Technical Lead
- Date: 2026-04-24
History
| Date | Status | Notes |
|---|---|---|
| 2026-04-24 | Accepted | Initial decision, approved for Phase 2 implementation |