
Three Pillars of Observability

devops Intermediate
debt(d9/e7/b7/t5)
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). The detection_hints field states automated=no, and the absence of structured logging, metrics endpoints, or distributed tracing produces no compiler or linter warnings. The code pattern is purely operational — 'debugging requires SSH to production servers' — meaning gaps are only felt during incidents, never during development or CI.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix outlines a staged three-step process (structured logging → metrics endpoint → OpenTelemetry traces) spanning multiple systems, configuration layers, and potentially infrastructure. Adding trace IDs to logs, exposing a /metrics endpoint, and wiring OpenTelemetry auto-instrumentation touches application code, deployment configuration, and logging pipelines across the entire codebase, not a single file or component.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). The applies_to covers web, cli, and queue-worker contexts — i.e., every runtime context. Once adopted (or absent), observability shapes how every future incident is diagnosed and how every new service must be instrumented. The common_mistakes (no correlation between pillars, no structured fields, no percentiles) compound over time as the codebase grows, making every new feature carry the tax of the chosen observability posture.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap — a documented gotcha most devs eventually learn' (t5). The misconception field states the canonical wrong belief: 'Logs are sufficient for observability.' This is a well-known professional pitfall — developers who know logging well assume it covers the full observability space, not realising metrics and traces address fundamentally different questions. It is documented and commonly encountered but does not fully contradict how a similar concept works elsewhere.

About DEBT scoring →

Also Known As

logs metrics traces observability three pillars

TL;DR

Logs (events), metrics (measurements), and traces (request flows) are the three pillars of observability — together they answer 'what happened, how much, and where.'

Explanation

  • Logs: timestamped records of discrete events — what happened.
  • Metrics: numeric measurements aggregated over time — how much, how often.
  • Traces: the path of a single request through distributed services — where is it slow.

Each pillar answers a different question: logs explain why, metrics alert you that something is wrong, and traces show you where. OpenTelemetry standardises all three. Correlation IDs linking logs to traces are the glue that makes the pillars useful together; a missing pillar leaves a blind spot — metrics without traces cannot pinpoint which service is slow.
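
The correlation glue does not have to be hand-added to every log call. A minimal sketch, assuming Monolog 3 (LogRecord-based processors) and the OpenTelemetry PHP API; the wiring here is illustrative, not from the original:

<?php
// Sketch: auto-inject the current trace ID into every log record (Monolog 3 assumed).
use Monolog\Handler\StreamHandler;
use Monolog\Logger;
use Monolog\LogRecord;
use OpenTelemetry\API\Trace\Span;

$logger = new Logger('app');
$logger->pushHandler(new StreamHandler('php://stdout'));

// Processors run on every record; this one copies the active span's trace ID
// into 'extra', so any log line can be joined to its trace during an incident.
$logger->pushProcessor(function (LogRecord $record): LogRecord {
    $context = Span::getCurrent()->getContext();
    if ($context->isValid()) {
        $record->extra['trace_id'] = $context->getTraceId();
    }
    return $record;
});

$logger->info('Request complete'); // carries trace_id with no per-call plumbing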

Common Misconception

Logs are sufficient for observability — logs tell you what happened but cannot efficiently answer 'what is the p99 latency right now?' (needs metrics) or 'which downstream call is slow?' (needs traces).
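
The 'which downstream call is slow?' half of the misconception is what spans answer. A minimal sketch with the OpenTelemetry PHP API, assuming the SDK is bootstrapped elsewhere; chargeCustomer is a hypothetical downstream call, not from the original:

<?php
// Sketch: wrap a downstream call in its own span (OpenTelemetry PHP API).
use OpenTelemetry\API\Globals;

function chargeCustomer(string $orderId): void { /* hypothetical downstream HTTP call */ }

$tracer = Globals::tracerProvider()->getTracer('app');

// Each downstream call gets a child span, so a slow trace shows which call is slow.
$span = $tracer->spanBuilder('payment-service.charge')->startSpan();
$scope = $span->activate();
try {
    chargeCustomer('o-123');
} finally {
    $scope->detach();
    $span->end();
}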

Why It Matters

An incident investigated with logs alone can take hours to diagnose; with all three pillars it often takes minutes — the pillars are complementary, not substitutes.

Common Mistakes

  • Logs without structured fields — free-text logs cannot be queried for specific values efficiently.
  • Metrics without percentiles — average response time hides the tail experience that real users suffer; see the histogram sketch after this list.
  • Traces without logs — a slow trace tells you where but not why; logs in context explain the reason.
  • No correlation between pillars — trace IDs not included in logs means the pillars cannot be joined during investigation.
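
To make the percentile point concrete: explicit histogram buckets let Prometheus compute p95/p99 with histogram_quantile, which no average can recover. A minimal sketch assuming the promphp/prometheus_client_php package (the original names no metrics library; namespace, metric name, and buckets are illustrative):

<?php
// Sketch: a latency histogram with explicit buckets (promphp/prometheus_client_php assumed).
use Prometheus\CollectorRegistry;
use Prometheus\Storage\InMemory;

$registry = new CollectorRegistry(new InMemory());

// Bucket upper bounds in seconds; Prometheus estimates percentiles from these
// server-side via histogram_quantile(), e.g. p99 over a 5-minute window.
$histogram = $registry->getOrRegisterHistogram(
    'app',                                    // namespace
    'request_duration_seconds',               // metric name
    'Request duration by route',              // help text
    ['route'],                                // label names
    [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]     // buckets
);

// Label values are positional, matching the label names above.
$histogram->observe(8.423, ['/api/orders']);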

Code Examples

✗ Vulnerable
// Logs only — cannot answer 'what is the p99?' or 'which service is slow?':
error_log('[2026-03-15 12:34:56] Request to /api/orders took 8423ms');
// Thousands of these lines
// To find p99: grep, awk, sort — minutes of manual work
// Which downstream call was slow? Unknown — no traces
✓ Fixed
// All three pillars with correlation:
// Metric:
$histogram->observe($duration, ['route' => '/api/orders']);

// Structured log with trace ID:
$logger->info('Request complete', [
    'duration_ms' => $duration * 1000,
    'route'       => '/api/orders',
    'trace_id'    => $span->getContext()->getTraceId(), // Links to trace
]);

// Trace automatically captures downstream calls
// In an incident: alert (metric) → dashboard (metrics) → trace → logs
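
The fixed example assumes the histogram is registered and scraped from somewhere. A sketch of a /metrics endpoint using the same assumed promphp package:

<?php
// Sketch: expose registered metrics in the Prometheus text format.
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\InMemory;

$registry = new CollectorRegistry(new InMemory());

$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());

One storage caveat: InMemory is per-process, so under PHP-FPM the worker serving /metrics would not see samples recorded by other workers; shared adapters such as Redis or APCu are the usual production choice.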

DEV INTEL Tools & Severity
Severity: 🟡 Medium · ⚙ Fix effort: High
⚡ Quick Fix
Start with the easiest pillar: structured logging (add JSON formatter to Monolog) → then metrics (expose /metrics endpoint) → then traces (OpenTelemetry auto-instrumentation)
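
A sketch of that first step; Monolog ships a JSON formatter, so structured logging is a small change (the handler target and field names here are illustrative):

<?php
// Sketch: quick-fix step one, structured JSON logs via Monolog.
use Monolog\Formatter\JsonFormatter;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

$handler = new StreamHandler('php://stdout');
$handler->setFormatter(new JsonFormatter()); // one JSON object per record

$logger = new Logger('app');
$logger->pushHandler($handler);

// Context fields become queryable JSON keys instead of free text.
$logger->info('Request complete', ['route' => '/api/orders', 'duration_ms' => 8423]);
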
📦 Applies To
any, web, cli, queue-worker
🔍 Detection Hints
No structured logging; no metrics endpoint; no distributed tracing; debugging requires SSH to production servers
Auto-detectable: ✗ No
Tools: opentelemetry, prometheus, grafana, loki
🤖 AI Agent
Confidence: Low · False positives: High · ✗ Manual fix · Fix effort: High · Context: File
