
P50/P95/P99 Latency Percentiles

Observability Beginner
debt(d7/e3/b5/t7)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints note automated=no and the code_pattern is latency_avg|response_time_avg — meaning the misuse (monitoring averages instead of percentiles) can be flagged by Prometheus query review or dashboard audits, but there is no automated lint rule that catches it. A service can ship with average-only metrics and the problem only surfaces during performance investigation or when users complain, making it closer to d7 than d9 since a careful reviewer inspecting dashboards will spot it.

e3 Effort Remediation debt — work required to fix once spotted

Closest to 'simple parameterised fix' (e3). The quick_fix states: replace average latency metric with histogram, query P99 in dashboards and alerts, and set SLO on P99. This is a small but intentional refactor — updating the instrumentation library call, changing dashboard panels, and revising alert thresholds — touching a few files or config definitions within one component, not a cross-cutting codebase change.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). The choice of metric type (average vs. percentile) affects dashboards, alerts, SLO definitions, and incident response workflows across the team. It applies to web, cli, and queue-worker contexts. While it doesn't rewrite architectural shape, consistently using wrong metrics creates an ongoing productivity tax: every performance investigation, capacity planning exercise, and SLO review is degraded by misleading data.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The misconception field states the false belief that 'average latency is sufficient for monitoring', when in fact the average hides slow outliers. A competent developer familiar with averages in other domains (CPU usage, error rates) will instinctively reach for avg() in their monitoring queries, not realising that latency distributions are skewed and that a 100ms average can coexist with a 5s P99. Additionally, the common mistake of averaging percentiles across instances (statistically invalid) compounds the trap for those who do adopt percentiles but apply them incorrectly.


TL;DR

Latency percentiles (P50, P95, P99) describe how response times are distributed across requests. P99 means '99% of requests are faster than this', so it exposes the worst experiences that averages hide.

Explanation

Average latency is misleading: a 100ms average can mask 1% of requests taking 5 seconds. Percentiles describe the distribution instead:

  • P50 (median): half of requests are faster. Roughly the typical user.
  • P95: 95% of requests are faster. Almost everyone.
  • P99: 99% of requests are faster. The worst 1%, often power users or requests touching large data sets.
  • P99.9: the worst 0.1%. Outliers, usually caused by infrastructure issues.

Implementation: record latency as a histogram metric in Prometheus and query it with histogram_quantile(0.99, sum by (le) (rate(http_request_duration_bucket[5m]))).

Aggregating percentiles: you cannot average percentiles across instances. Sum the histogram buckets across instances first, then compute the quantile. Set the SLO on P99, not on the average.
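To make the gap between an average and a tail percentile concrete, here is a minimal sketch that computes both from raw samples. The workload (980 requests at 50ms, 20 at 5000ms) is hypothetical, and the percentile() helper uses the nearest-rank definition, which is an assumption of this example rather than what any particular monitoring backend does.

```php
<?php
declare(strict_types=1);

/**
 * Nearest-rank percentile: the smallest sample such that at least
 * $p percent of all samples are less than or equal to it.
 *
 * @param float[] $samples latency samples in milliseconds (must be non-empty)
 */
function percentile(array $samples, float $p): float
{
    sort($samples);
    // 1-based rank of the sample that covers $p percent of the data.
    $rank = (int) ceil($p * count($samples) / 100);
    return $samples[max($rank, 1) - 1];
}

// Hypothetical workload: 980 fast requests (50 ms) and 20 slow ones (5000 ms).
$samples = array_merge(
    array_fill(0, 980, 50.0),
    array_fill(0, 20, 5000.0)
);

$mean = array_sum($samples) / count($samples);
$p50  = percentile($samples, 50);
$p95  = percentile($samples, 95);
$p99  = percentile($samples, 99);

printf("mean=%.0fms P50=%.0fms P95=%.0fms P99=%.0fms\n", $mean, $p50, $p95, $p99);
// mean=149ms P50=50ms P95=50ms P99=5000ms
```

The mean lands at 149ms, a value that no request actually experienced, while P99 shows the 5-second tail directly.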

Common Misconception

The misconception: average latency is sufficient for monitoring. In reality the average hides slow outliers. A service with a 100ms average and a 5s P99 has serious performance issues that the average masks.

Why It Matters

P99 latency determines whether the experience of power users, and of everyone during high-traffic moments, is acceptable. Averages let you ship a slow service believing it is fast.

Common Mistakes

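  • Monitoring the average instead of percentiles.
  • Averaging percentiles from different instances. This is statistically invalid; aggregate the histogram buckets and compute the percentile from the combined data.
  • Setting the SLO on P50. That only promises the target for half of all requests.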
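The second mistake is easier to see with numbers. The sketch below compares averaging two per-instance P99 values against computing P99 from merged histogram buckets. The bucket boundaries, the counts, and the helper names mergeBuckets() and quantileFromBuckets() are all hypothetical; the interpolation is a simplified version of the idea behind histogram_quantile (no +Inf bucket handling), not a reimplementation of it.

```php
<?php
declare(strict_types=1);

/**
 * Sum cumulative bucket counts from several instances.
 * Assumes every instance uses the same bucket boundaries.
 *
 * @param array<int,int> ...$histograms upper bound in ms => cumulative count
 * @return array<int,int>
 */
function mergeBuckets(array ...$histograms): array
{
    $merged = [];
    foreach ($histograms as $histogram) {
        foreach ($histogram as $bound => $count) {
            $merged[$bound] = ($merged[$bound] ?? 0) + $count;
        }
    }
    ksort($merged);
    return $merged;
}

/**
 * Estimate quantile $q (0..1) from cumulative buckets, interpolating
 * linearly inside the bucket that contains the target rank.
 *
 * @param array<int,int> $buckets upper bound in ms => cumulative count, ascending, non-empty
 */
function quantileFromBuckets(array $buckets, float $q): float
{
    $target = $q * end($buckets); // rank we are looking for
    $prevBound = 0.0;
    $prevCount = 0;
    foreach ($buckets as $bound => $count) {
        if ($count >= $target) {
            $inBucket = $count - $prevCount;
            if ($inBucket === 0) {
                return (float) $bound;
            }
            return $prevBound + ($bound - $prevBound) * ($target - $prevCount) / $inBucket;
        }
        $prevBound = (float) $bound;
        $prevCount = $count;
    }
    return (float) array_key_last($buckets);
}

// Hypothetical data: instance A is busy and fast, instance B is quiet and slow.
$instanceA = [100 => 990, 500 => 1000, 5000 => 1000]; // 1000 requests
$instanceB = [100 => 50,  500 => 60,   5000 => 100];  // 100 requests

$p99A = quantileFromBuckets($instanceA, 0.99);
$p99B = quantileFromBuckets($instanceB, 0.99);

$averagedP99 = ($p99A + $p99B) / 2; // the invalid shortcut
$mergedP99   = quantileFromBuckets(mergeBuckets($instanceA, $instanceB), 0.99);

printf("averaged P99=%.2fms, merged P99=%.2fms\n", $averagedP99, $mergedP99);
// averaged P99=2493.75ms, merged P99=3762.50ms
```

The averaged figure treats both instances as equally important even though one served ten times the traffic, so it does not correspond to any percentile of the real request population. Merging the buckets first weights each request equally.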

Code Examples

✗ Vulnerable
// Average latency metric — hides outliers:
Gauge::set('latency_avg', $totalTime / $count);
// 100ms average, but 1% of requests take 5s
✓ Fixed
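// Histogram metric (OpenTelemetry-style meter API), scraped by Prometheus:
$histogram = $meter->createHistogram('http.request.duration');
$histogram->record($durationMs, ['route' => $route]);

// Query P99 across all instances (sum the buckets by le first).
// The exported metric name may differ depending on your exporter:
// histogram_quantile(0.99, sum by (le) (rate(http_request_duration_bucket[5m])))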

// SLO: P99 < 500ms

Added 23 Mar 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: Low
⚡ Quick Fix
Replace the average latency metric with a histogram metric. Query P99 in dashboards and alerts. Set the SLO on P99, not P50. Use P99.9 to find infrastructure outliers.
📦 Applies To
web cli queue-worker
🔗 Prerequisites
🔍 Detection Hints
latency_avg|response_time_avg
Auto-detectable: ✗ No prometheus
⚠ Related Problems
🤖 AI Agent
Confidence: Medium False Positives: Medium ✗ Manual fix Fix: Low Context: File

