ADR 0001: Choose OpenTelemetry for Observability Over Proprietary APM

Status: Accepted
Date: 2026-04-24
Authors: BRAC POC Team

Context

The POC requires comprehensive observability covering logs, metrics, and traces across 9 infrastructure components. BRAC Bank's RFP requirements state a preference for avoiding vendor lock-in. The solution must demonstrate:

  • Vendor-neutral instrumentation
  • Cost-effective scaling
  • Multi-layer aggregation (logs, metrics, traces)
  • Integration with multiple data backends

Decision

Adopt OpenTelemetry (OTel) as the observability standard with SigNoz as the backend visualization and storage layer. This includes:

  • OTel SDK for application instrumentation
  • OTel Collector for aggregation and pipeline management
  • Kafka as the buffer between collector and backend
  • SigNoz for unified dashboards
  • ClickHouse for long-term storage with retention policies

Consequences

Positive

  • Vendor neutrality: OTel is CNCF-backed, not tied to any single vendor
  • Future flexibility: Backends can be swapped (for example, SigNoz for Datadog or Grafana Loki) without re-instrumenting applications
  • Standard instrumentation: Developers learn one framework applicable across organizations
  • Cost control: Self-managed backend avoids per-host or per-seat licensing fees
  • Multi-signal correlation: Single platform for logs, metrics, and traces reduces context-switching
  • Scalability: Kafka buffer prevents collector overload during peaks

Negative

  • Operational overhead: Must manage OTel Collector, ClickHouse, and SigNoz ourselves
  • Learning curve: Team needs OTel SDK knowledge (though adoption is growing)
  • Data consistency: Multi-signal correlation requires careful configuration
  • Retention management: ClickHouse cold archiving needs manual tuning

Neutral

  • OTel ecosystem still evolving (mature for traces and metrics; logs support is newer and varies by language SDK)
  • SigNoz is younger than enterprise APM vendors (Datadog, New Relic) but improving rapidly

Alternatives Considered

Alternative 1: Datadog APM

  • Pros: Enterprise-grade, fully managed, excellent support
  • Cons: Vendor lock-in, per-host licensing becomes expensive at scale, does not meet BRAC's vendor-neutrality preference
  • Why rejected: Contradicts BRAC's stated vendor-independence requirement

Alternative 2: Elastic Stack (ELK)

  • Pros: Open source, good for logs, wide adoption
  • Cons: Weak in distributed tracing, separate tools needed for metrics/APM, less unified experience
  • Why rejected: Fragmented observability story for this POC's multi-signal needs

Alternative 3: Prometheus + Loki + Jaeger (Best of Breed)

  • Pros: All open source and mature; Prometheus and Jaeger are CNCF projects (Loki is maintained by Grafana Labs)
  • Cons: Three separate systems = three dashboards, higher operational complexity, no native correlation
  • Why rejected: Inferior UX for unified observability; SigNoz provides better integration

Implementation Notes

OTel Configuration

  • Collectors: Deployed as a DaemonSet, one collector pod per OpenShift node
  • Sampling: Trace sampling set to 10% for performance (configurable per service)
  • Exporters: OTLP → Kafka → SigNoz ClickHouse
  • Resource attributes: Include cluster, namespace, pod, service name for correlation
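
The sampling and resource-attribute settings above could be passed to each service through the standard OTel SDK environment variables. The following is a minimal sketch, not the POC's actual configuration: it assumes sampling is applied in the SDK (head sampling) rather than in the collector, and the service, cluster, namespace, and pod names in the usage note are hypothetical. The should_sample function only illustrates the idea behind ratio-based sampling; the SDK's own implementation may differ in detail.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # the decision uses the low 64 bits of the trace ID


def otel_env(service, cluster, namespace, pod, ratio=0.10):
    """Build OTel SDK environment variables for one service.

    The 10% default mirrors the ADR; pass a different ratio per service.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Attribute keys follow the OTel semantic conventions for Kubernetes.
    resource = {
        "k8s.cluster.name": cluster,
        "k8s.namespace.name": namespace,
        "k8s.pod.name": pod,
    }
    return {
        "OTEL_SERVICE_NAME": service,
        "OTEL_RESOURCE_ATTRIBUTES": ",".join(f"{k}={v}" for k, v in resource.items()),
        # Root spans are sampled by ratio; child spans follow their parent.
        "OTEL_TRACES_SAMPLER": "parentbased_traceidratio",
        "OTEL_TRACES_SAMPLER_ARG": str(ratio),
    }


def should_sample(trace_id: int, ratio: float) -> bool:
    """Illustrative ratio-based decision: keep the trace when the low
    64 bits of its ID fall below ratio * 2**64."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound
```

For example, otel_env("payments", "poc", "brac", "payments-0") returns a dictionary that could be injected into the pod spec. Because the decision depends only on the trace ID, every service that sees the same trace reaches the same keep-or-drop result, which keeps sampled traces complete.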

SigNoz Backend

  • Deployment: Helm chart on OpenShift
  • Data retention: 2-day hot retention in ClickHouse, cold archive to S3-compatible storage (MinIO)
  • Dashboards: Pre-built templates for Application, Runtime, System, Tracing views
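
The retention rule above amounts to a simple age check. The sketch below only illustrates that rule; in ClickHouse itself the move is typically expressed as a table TTL rather than application code. Treating data that is exactly two days old as still hot is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=2)  # from the ADR: 2-day hot retention


def storage_tier(event_time, now=None):
    """Return "hot" for data within the hot window, otherwise "cold".

    "hot" stands for ClickHouse local storage and "cold" for the
    S3-compatible archive (MinIO).
    """
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    if age < timedelta(0):
        raise ValueError("event_time is in the future")
    return "hot" if age <= HOT_RETENTION else "cold"
```

A span recorded six hours ago maps to "hot", while one recorded three days ago maps to "cold" and would be expected in the archive.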

Sample Application Instrumentation

  • Deploy two variants: one with OTel SDK, one without
  • Demonstrate SDK benefits: auto-instrumentation for common libraries
  • Manual instrumentation for custom business logic

Approval

  • Approved by: BRAC POC Technical Lead
  • Date: 2026-04-24

History

Date       | Status   | Notes
2026-04-24 | Accepted | Initial decision, approved for Phase 2 implementation