Four Golden Signals

Observability Beginner
debt(d7/e3/b3/t5)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints list prometheus, grafana, and datadog, but automated detection is explicitly marked 'no'. Missing golden signals (e.g. no saturation monitoring) won't be flagged by any tool automatically — it only becomes apparent through manual dashboard review, incident postmortems, or when users start complaining about degraded service.

e3 Effort Remediation debt — work required to fix once spotted

Closest to 'simple parameterised fix' (e3). The quick_fix describes adding alerts for all four signals with specific thresholds (latency p99, error rate >1%, traffic anomaly, saturation). This is more than a single one-line patch but is a small, contained instrumentation task within one monitoring component rather than a cross-cutting codebase change.

b3 Burden Structural debt — long-term weight of choosing wrong

Closest to 'localised tax' (b3). Applies to web, cli, and queue-worker contexts broadly, but the burden is confined to the observability/monitoring layer. Once the four golden signals are instrumented and alerts are set, the rest of the codebase is largely unaffected. It imposes a modest ongoing maintenance tax (keeping thresholds tuned) but does not shape how application code is written.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap' (t5). The canonical misconception is 'more metrics are always better,' leading developers to add dashboards without alerts and creating noise. The specific common mistake of monitoring p50 instead of p99 latency is a documented gotcha that most developers learn after missing slow-tail user experiences. This is a well-known industry pitfall but not a catastrophic or counter-intuitive misread of the concept itself.

TL;DR

Google SRE's Four Golden Signals — Latency, Traffic, Errors, Saturation — are the four metrics that, if monitored and alerted on, cover most production reliability concerns.

Explanation

(1) Latency: the time taken to serve a request. Track successful and failed requests separately: errors should fail fast, and a slow error is worse for users than a fast one.
(2) Traffic: demand on the system, such as requests/sec, concurrent users, or messages/sec.
(3) Errors: the rate of failed requests, such as 5xx responses, uncaught exceptions, or failed jobs.
(4) Saturation: how full the system is, such as CPU%, memory%, queue depth, or disk I/O.
Related frameworks: USE (Utilisation, Saturation, Errors) for resources; RED (Rate, Errors, Duration) for services.
Start with these four before adding more metrics. If any one of them is trending badly, something is wrong.
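
As a rough illustration of the four definitions above, the sketch below summarises one observation window into the four signals. Everything here is hypothetical: the `Request` record, the `golden_signals` function, the nearest-rank `percentile` helper, and the use of queue depth as the saturation measure are assumptions chosen for the example, not part of any monitoring library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, queue_depth, queue_capacity):
    """Summarise one observation window as the four golden signals."""
    ok = [r.duration_s for r in requests if r.status < 500]
    failed = [r.duration_s for r in requests if r.status >= 500]
    total = len(requests)
    return {
        # Latency: successes and failures are reported separately
        "latency_ok_p99_s": percentile(ok, 99),
        "latency_err_p99_s": percentile(failed, 99),
        # Traffic: demand, here as requests per second
        "traffic_rps": total / window_s,
        # Errors: share of requests that failed with a 5xx status
        "error_ratio": len(failed) / total if total else 0.0,
        # Saturation: how full a bounded resource is (assumed: a work queue)
        "saturation": queue_depth / queue_capacity,
    }


# Hypothetical two-second window with three successes and one slow failure
window = [Request(0.10, 200), Request(0.20, 200),
          Request(0.30, 200), Request(2.00, 503)]
signals = golden_signals(window, window_s=2.0, queue_depth=45, queue_capacity=50)
print(signals)
```

In a real deployment these values would normally come from the metrics system rather than from application code; the sketch only shows what each signal measures.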

Common Misconception

"More metrics are always better." In practice, start with the four golden signals: metrics added without alerts just create dashboard noise.

Why It Matters

The four golden signals give a compact picture of system health from the user's perspective: if all four are green, the service is likely working correctly.

Common Mistakes

  • Only monitoring uptime — not latency, errors, or saturation.
  • Monitoring p50 latency but not p99 — p99 reveals the slow tail that users experience.
  • No saturation monitoring — running out of CPU/memory/connections causes gradual degradation.
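
The p50 versus p99 mistake can be seen with a small made-up sample. The sketch below assumes 100 requests where 95 are fast and 5 are very slow; the numbers are invented for illustration, and the percentiles come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical latencies in seconds: 95 fast requests and 5 slow ones
latencies = [0.1] * 95 + [3.0] * 5

p50 = statistics.median(latencies)
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile
p99 = statistics.quantiles(latencies, n=100)[98]
mean = statistics.fmean(latencies)

# The median stays at the fast value while p99 lands in the slow tail
print(f"p50={p50:.2f}s  mean={mean:.3f}s  p99={p99:.2f}s")
```

An alert on p50 alone would stay quiet for this sample even though one request in twenty takes seconds to complete.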

Code Examples

✗ Vulnerable
# Only uptime monitoring:
- alert: ServiceDown
  expr: up == 0
# Misses: slow responses, error rates, resource exhaustion
✓ Fixed
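# Latency (p99 over 5 minutes above 0.5s):
- alert: HighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_duration_seconds_bucket[5m]))) > 0.5
# Traffic (request rate below half of the same time yesterday; the 50% cut-off is illustrative):
- alert: TrafficDrop
  expr: sum(rate(http_requests_total[5m])) < 0.5 * sum(rate(http_requests_total[5m] offset 1d))
# Errors (5xx share of all requests above 1%; sum() is needed so the two sides match):
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
# Saturation (assumes node_exporter memory metrics):
- alert: HighMemory
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9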

Added 23 Mar 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
Add alerts for all four signals: latency p99 > threshold, error rate > 1%, traffic anomaly, CPU/memory/queue saturation. Use p99, not average.
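
A minimal sketch of the quick fix as a threshold check, assuming the signal values have already been collected elsewhere. The function name `evaluate_alerts`, the alert names, the dictionary keys, and the 50% traffic-deviation threshold are all hypothetical; the latency, error, and saturation thresholds mirror the ones used in the code example on this page.

```python
def evaluate_alerts(current, baseline_rps, thresholds):
    """Return the names of the golden-signal alerts that should fire."""
    firing = []
    # Latency: compare p99, not the average
    if current["latency_p99_s"] > thresholds["latency_p99_s"]:
        firing.append("HighLatency")
    # Errors: failed share of all requests
    if current["error_ratio"] > thresholds["error_ratio"]:
        firing.append("HighErrorRate")
    # Traffic: relative deviation from a known-good baseline, in either direction
    if baseline_rps > 0:
        deviation = abs(current["traffic_rps"] - baseline_rps) / baseline_rps
        if deviation > thresholds["traffic_deviation"]:
            firing.append("TrafficAnomaly")
    # Saturation: fraction of a bounded resource in use
    if current["saturation"] > thresholds["saturation"]:
        firing.append("HighSaturation")
    return firing


thresholds = {
    "latency_p99_s": 0.5,      # seconds
    "error_ratio": 0.01,       # 1%
    "traffic_deviation": 0.5,  # assumed: 50% away from baseline
    "saturation": 0.9,         # 90% full
}
current = {
    "latency_p99_s": 0.8,
    "error_ratio": 0.002,
    "traffic_rps": 40.0,
    "saturation": 0.95,
}
firing = evaluate_alerts(current, baseline_rps=100.0, thresholds=thresholds)
print(firing)
```

Checking traffic against a baseline rather than a fixed number is one possible design choice; it lets the same rule catch both a sudden drop and an unexpected spike.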
📦 Applies To
web cli queue-worker
🔍 Detection Hints
Auto-detectable: ✗ No · Tools: prometheus, grafana, datadog
🤖 AI Agent
Confidence: Low · False Positives: High · ✗ Manual fix · Fix: Medium · Context: File