Three Pillars of Observability
debt (d9/e7/b7/t5)
Closest to 'silent in production until users hit it' (d9). The detection_hints state automated=no, and the absence of structured logging, metrics endpoints, or distributed tracing produces no compiler or linter warnings. The code pattern is purely operational — 'debugging requires SSH to production servers' — meaning gaps are only felt during incidents, never during development or CI.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix outlines a staged three-step process (structured logging → metrics endpoint → OpenTelemetry traces) spanning multiple systems, configuration layers, and potentially infrastructure. Adding trace IDs to logs, exposing a /metrics endpoint, and wiring OpenTelemetry auto-instrumentation touches application code, deployment configuration, and logging pipelines across the entire codebase, not a single file or component.
Closest to 'strong gravitational pull' (b7). The applies_to covers web, cli, and queue-worker contexts — i.e., every runtime context. Once adopted (or absent), observability shapes how every future incident is diagnosed and how every new service must be instrumented. The common_mistakes (no correlation between pillars, no structured fields, no percentiles) compound over time as the codebase grows, making every new feature carry the tax of the chosen observability posture.
Closest to 'notable trap — a documented gotcha most devs eventually learn' (t5). The misconception field states the canonical wrong belief: 'Logs are sufficient for observability.' This is a well-known professional pitfall — developers who know logging well assume it covers the full observability space, not realising metrics and traces address fundamentally different questions. It is documented and commonly encountered but does not fully contradict how a similar concept works elsewhere.
Also Known As
TL;DR
Logs record what happened, metrics measure how much and how often, traces show where a request spent its time. Production debugging needs all three, joined by shared trace IDs.
Explanation
- Logs: timestamped records of discrete events — what happened.
- Metrics: numeric measurements aggregated over time — how much/how often.
- Traces: the path of a single request through distributed services — where is it slow.

Each pillar answers a different question: metrics alert you that something is wrong, traces show you where, and logs explain why. OpenTelemetry standardises all three. Correlation IDs linking logs to traces are the glue that makes the pillars useful together. A missing pillar leaves a blind spot: metrics without traces cannot pinpoint which service is slow.
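The "glue" can be sketched in a few lines of plain PHP, with no logging library. The function name structuredLog and the trace ID value are illustrative assumptions, not a specific API; in practice the ID would come from the active span context.

```php
<?php
// Minimal sketch of a structured log line that carries a trace ID.
function structuredLog(string $message, string $traceId, array $fields = []): string
{
    $entry = array_merge([
        'timestamp' => gmdate('c'),
        'message'   => $message,
        'trace_id'  => $traceId, // the glue joining this log line to its trace
    ], $fields);
    return json_encode($entry);
}

echo structuredLog('Request complete', '4bf92f3577b34da6a3ce929d0e0e4736', [
    'route'       => '/api/orders',
    'duration_ms' => 8423,
]), PHP_EOL;
```

Because every field is a named key rather than free text, the line can be indexed and queried ("all logs where trace_id = X") instead of grepped.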
Common Misconception
'Logs are sufficient for observability.' They are not: logs capture discrete events but cannot answer aggregate questions (what is the p99?) or locate latency across service boundaries — metrics and traces address fundamentally different questions.
Why It Matters
Missing observability is silent during development and CI — no compiler or linter flags it. The gap is only felt during an incident, when debugging requires SSH to production servers and manual log archaeology instead of an alert → dashboard → trace → logs workflow.
Common Mistakes
- Logs without structured fields — free-text logs cannot be queried for specific values efficiently.
- Metrics without percentiles — average response time hides the tail experience that real users suffer.
- Traces without logs — a slow trace tells you where but not why; logs in context explain the reason.
- No correlation between pillars — if trace IDs are not included in logs, the pillars cannot be joined during an investigation.
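The percentile mistake above is easy to demonstrate. A minimal sketch in plain PHP: 98 requests at 100 ms and two at 8 s. The percentile() helper is a simple nearest-rank implementation written for this sketch, not a library function.

```php
<?php
// Why averages hide the tail experience.
function percentile(array $values, float $p): float
{
    sort($values);
    $rank = (int) ceil(($p / 100) * count($values)) - 1; // nearest-rank, 0-based
    return (float) $values[$rank];
}

$latenciesMs = array_fill(0, 98, 100); // 98 fast requests at 100 ms
$latenciesMs[] = 8000;                 // two requests stuck behind a slow query
$latenciesMs[] = 8000;

$avg = array_sum($latenciesMs) / count($latenciesMs); // 258 ms — looks healthy
$p99 = percentile($latenciesMs, 99);                  // 8000 ms — what the tail saw
```

An average of 258 ms would pass most dashboards, while one request in fifty took eight seconds — which is exactly the experience a p99 metric surfaces.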
Code Examples
// Logs only — cannot answer 'what is the p99?' or 'which service is slow?':
error_log('[2026-03-15 12:34:56] Request to /api/orders took 8423ms');
// Thousands of these lines
// To find p99: grep, awk, sort — minutes of manual work
// Which downstream call was slow? Unknown — no traces
// All three pillars with correlation:
// Metric:
$histogram->observe($duration, ['route' => '/api/orders']);
// Structured log with trace ID:
$logger->info('Request complete', [
    'duration_ms' => $duration * 1000,
    'route'       => '/api/orders',
    'trace_id'    => $span->getContext()->getTraceId(), // Links to trace
]);
// Trace automatically captures downstream calls
// In an incident: alert (metric) → dashboard (metrics) → trace → logs
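The final hop of that workflow (trace → logs) works only because every entry carries a trace_id. A sketch in plain PHP with hypothetical log entries: finding the logs for a slow trace reduces to a filter.

```php
<?php
// Joining pillars during an incident: filter structured logs by trace ID.
$logs = [
    ['trace_id' => 'abc123', 'message' => 'Request received'],
    ['trace_id' => 'def456', 'message' => 'Cache miss'],
    ['trace_id' => 'abc123', 'message' => 'DB query took 8100ms'],
];

$slowTraceId = 'abc123'; // taken from the trace view during the incident
$forTrace = array_values(array_filter(
    $logs,
    fn (array $entry): bool => $entry['trace_id'] === $slowTraceId
));
// $forTrace now holds exactly the entries that explain why this trace was slow.
```

Without the shared trace_id field, the same question would mean grepping free text by timestamp and hoping the clocks line up.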