
Four Golden Signals

Observability Beginner

debt(d7/e3/b3/t5)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints list prometheus, grafana, and datadog, but automated detection is explicitly marked 'no'. Missing golden signals (e.g. no saturation monitoring) won't be flagged by any tool automatically — it only becomes apparent through manual dashboard review, incident postmortems, or when users start complaining about degraded service.

e3 Effort Remediation debt — work required to fix once spotted

Closest to 'simple parameterised fix' (e3). The quick_fix describes adding alerts for all four signals with specific thresholds (latency p99, error rate >1%, traffic anomaly, saturation). This is more than a single one-line patch but is a small, contained instrumentation task within one monitoring component rather than a cross-cutting codebase change.

b3 Burden Structural debt — long-term weight of choosing wrong

Closest to 'localised tax' (b3). Applies to web, cli, and queue-worker contexts broadly, but the burden is confined to the observability/monitoring layer. Once the four golden signals are instrumented and alerts are set, the rest of the codebase is largely unaffected. It imposes a modest ongoing maintenance tax (keeping thresholds tuned) but does not shape how application code is written.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap' (t5). The canonical misconception is 'more metrics are always better,' leading developers to add dashboards without alerts and creating noise. The specific common mistake of monitoring p50 instead of p99 latency is a documented gotcha that most developers learn after missing slow-tail user experiences. This is a well-known industry pitfall but not a catastrophic or counter-intuitive misread of the concept itself.

Scored by claude-sonnet-4-6 · 2026-05-08 · reviewed by human

TL;DR

Google SRE's Four Golden Signals — Latency, Traffic, Errors, Saturation — are the four metrics that, if monitored and alerted on, cover most production reliability concerns.

Explanation

(1) Latency: time to serve a request. Track successful and failed requests separately: fast failures can make overall latency look healthier than it is, and a slow error is worse than a fast one.
(2) Traffic: demand on the system, e.g. requests/sec, concurrent users, messages/sec.
(3) Errors: rate of failed requests, e.g. 5xx responses, uncaught exceptions, failed jobs.
(4) Saturation: how full the system is, e.g. CPU%, memory%, queue depth, disk I/O.

Related frameworks: USE (Utilisation, Saturation, Errors) for resources; RED (Rate, Errors, Duration) for services. Start with these four signals before adding more metrics. If any one of them is trending badly, something is probably wrong.
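As a rough illustration of how the four signals relate to raw request data, the following Python sketch summarises one observation window. The `Request` record, the `golden_signals` function, and the use of queue depth as the saturation measure are all hypothetical choices made for this example; a real service would normally export these values through its metrics library instead of computing them by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def golden_signals(requests, window_s, queue_depth, queue_capacity):
    """Summarise one observation window into the four golden signals."""
    ok = sorted(r.duration_s for r in requests if r.status < 500)
    failed = [r for r in requests if r.status >= 500]

    # Latency: nearest-rank p99 over successful requests only, so that
    # fast failures cannot make latency look better than it is.
    p99_ok = None
    if ok:
        rank = (99 * len(ok) + 99) // 100  # integer ceil(0.99 * n)
        p99_ok = ok[rank - 1]

    return {
        "latency_p99_ok_s": p99_ok,
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Errors: share of requests that failed with a 5xx status.
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how full a bounded resource is (here, a work queue).
        "saturation": queue_depth / queue_capacity,
    }


# Hypothetical 10-second window: 98 fast successes, 1 slow success, 1 fast failure.
window = [Request(0.1, 200)] * 98 + [Request(2.0, 200), Request(0.01, 500)]
signals = golden_signals(window, window_s=10, queue_depth=45, queue_capacity=50)
print(signals)
```

In this made-up window the single fast failure is excluded from the latency figure and counted under errors instead, which is the separation described in (1).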

Common Misconception

✗ More metrics are always better. In practice, start with the four golden signals: metrics that nobody alerts on mostly add dashboard noise.

Why It Matters

The four golden signals give a compact picture of system health from the user's perspective. If all four are healthy, the service is very likely working correctly; if one is not, it points to where to look first.

Common Mistakes

Only monitoring uptime, not latency, errors, or saturation.
Monitoring p50 latency but not p99: the median hides the slow tail, which p99 exposes.
No saturation monitoring: running out of CPU, memory, or connections often degrades service gradually before it fails outright.
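The p50 versus p99 point can be seen with a few lines of arithmetic. The sketch below uses a simple nearest-rank percentile and made-up latency samples; the `percentile` helper and the numbers are assumptions for illustration, not output from any monitoring tool.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in seconds: 95 fast requests and 5 slow ones.
latencies = [0.05] * 95 + [3.0] * 5

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50}s p99={p99}s")  # prints "p50=0.05s p99=3.0s"
```

With these samples the median stays at 50 ms even though one request in twenty takes three seconds, so an alert on p50 alone would never fire.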

Code Examples

✗ Vulnerable

# Only uptime monitoring:
- alert: ServiceDown
  expr: up == 0
# Misses: slow responses, error rates, resource exhaustion

✓ Fixed

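# Latency (p99 over 5m, aggregated across instances; 0.5s is an example threshold):
- alert: HighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_duration_seconds_bucket[5m]))) > 0.5
# Traffic (example rule: request rate below half of the same window one day earlier):
- alert: TrafficDrop
  expr: sum(rate(http_requests_total[5m])) < 0.5 * sum(rate(http_requests_total[5m] offset 1d))
# Errors (both sides are summed so the ratio is 5xx requests over all requests):
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
# Saturation (node memory in use; assumes node_exporter metric names on Linux):
- alert: HighMemory
  expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.9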

References

https://sre.google/sre-book/monitoring-distributed-systems/