Four Golden Signals
debt(d7/e3/b3/t5)
Closest to 'only careful code review or runtime testing' (d7). The detection_hints list prometheus, grafana, and datadog, but automated detection is explicitly marked 'no'. Missing golden signals (e.g. no saturation monitoring) won't be flagged by any tool automatically — it only becomes apparent through manual dashboard review, incident postmortems, or when users start complaining about degraded service.
Closest to 'simple parameterised fix' (e3). The quick_fix describes adding alerts for all four signals with specific thresholds (latency p99, error rate >1%, traffic anomaly, saturation). This is more than a single one-line patch but is a small, contained instrumentation task within one monitoring component rather than a cross-cutting codebase change.
Closest to 'localised tax' (b3). Applies to web, cli, and queue-worker contexts broadly, but the burden is confined to the observability/monitoring layer. Once the four golden signals are instrumented and alerts are set, the rest of the codebase is largely unaffected. It imposes a modest ongoing maintenance tax (keeping thresholds tuned) but does not shape how application code is written.
Closest to 'notable trap' (t5). The canonical misconception is 'more metrics are always better,' leading developers to add dashboards without alerts and creating noise. The specific common mistake of monitoring p50 instead of p99 latency is a documented gotcha that most developers learn after missing slow-tail user experiences. This is a well-known industry pitfall but not a catastrophic or counter-intuitive misread of the concept itself.
TL;DR
Explanation
(1) Latency: time to serve a request. Track the latency of successful and failed requests separately; errors should fail fast, not slowly.
(2) Traffic: demand on the system, such as requests/sec, concurrent users, or messages/sec.
(3) Errors: rate of failed requests, such as 5xx responses, uncaught exceptions, or failed jobs.
(4) Saturation: how full the system is, such as CPU%, memory%, queue depth, or disk I/O.
Two related frameworks: USE (Utilisation, Saturation, Errors) for resources and RED (Rate, Errors, Duration) for services. Start with these four signals before adding more metrics. If any one of them is trending badly, something is wrong.
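To make the four definitions concrete, here is a minimal sketch that reduces one observation window to the four signals. The `Request` record, the `golden_signals` helper, the nearest-rank percentile, and the use of a bounded work queue as the saturated resource are all illustrative assumptions, not part of any monitoring library.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, queue_depth, queue_capacity):
    """Summarise one observation window into the four golden signals."""
    ok = [r.duration_s for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        # Latency: p99 of successful requests only, so fast failures do not mask slowness
        "latency_p99_s": percentile(ok, 99) if ok else None,
        # Traffic: demand, in requests per second
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed with a 5xx status
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how full a bounded resource is (assumed here: a work queue)
        "saturation": queue_depth / queue_capacity,
    }


# Hypothetical 10-second window: 97 fast successes, 1 slow success, 2 fast 5xx errors
window = [Request(0.05, 200)] * 97 + [Request(0.9, 200), Request(0.01, 500), Request(0.01, 503)]
signals = golden_signals(window, window_s=10, queue_depth=40, queue_capacity=50)
print(signals)
```

In this made-up window the two fast 5xx responses are kept out of the latency figure, so the single 0.9 s success still shows up in the p99 instead of being diluted by quick failures.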
Common Misconception
Why It Matters
Common Mistakes
- Only monitoring uptime — not latency, errors, or saturation.
- Monitoring p50 latency but not p99 — p99 reveals the slow tail that users experience.
- No saturation monitoring — running out of CPU/memory/connections causes gradual degradation.
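The p50 versus p99 mistake can be illustrated with a small sketch. The sample durations and the 0.5 s threshold are hypothetical, and the nearest-rank percentile is one of several common definitions; real monitoring systems typically estimate quantiles from histogram buckets instead.

```python
from math import ceil


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests and 5 slow ones, in seconds
durations = [0.05] * 95 + [2.0] * 5

p50 = percentile(durations, 50)
p99 = percentile(durations, 99)
threshold_s = 0.5  # assumed alert threshold

print(f"p50={p50}s alert={p50 > threshold_s}")
print(f"p99={p99}s alert={p99 > threshold_s}")
```

With this sample the median stays at 0.05 s and would never trip the alert, while the p99 is 2.0 s: one request in twenty is forty times slower than the median, and only the tail percentile shows it.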
Code Examples
# Only uptime monitoring:
- alert: ServiceDown
  expr: up == 0
# Misses: slow responses, error rates, resource exhaustion
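# Latency: p99 over the last 5 minutes above 0.5 s (buckets aggregated across series)
- alert: HighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_duration_seconds_bucket[5m]))) > 0.5
# Errors: more than 1% of all requests returning 5xx (sum both sides so the label sets match)
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
# Saturation: more than 90% of host memory in use (assumes node_exporter metrics on Linux)
- alert: HighMemory
  expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.9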