P50/P95/P99 Latency Percentiles
debt(d7/e3/b5/t7)
Closest to 'only careful code review or runtime testing' (d7). The detection_hints note automated=no and the code_pattern is latency_avg|response_time_avg — meaning the misuse (monitoring averages instead of percentiles) can be flagged by Prometheus query review or dashboard audits, but there is no automated lint rule that catches it. A service can ship with average-only metrics and the problem only surfaces during performance investigation or when users complain, making it closer to d7 than d9 since a careful reviewer inspecting dashboards will spot it.
Closest to 'simple parameterised fix' (e3). The quick_fix states: replace average latency metric with histogram, query P99 in dashboards and alerts, and set SLO on P99. This is a small but intentional refactor — updating the instrumentation library call, changing dashboard panels, and revising alert thresholds — touching a few files or config definitions within one component, not a cross-cutting codebase change.
Closest to 'persistent productivity tax' (b5). The choice of metric type (average vs. percentile) affects dashboards, alerts, SLO definitions, and incident response workflows across the team. It applies to web, cli, and queue-worker contexts. While it doesn't rewrite architectural shape, consistently using wrong metrics creates an ongoing productivity tax: every performance investigation, capacity planning exercise, and SLO review is degraded by misleading data.
Closest to 'serious trap' (t7). The misconception field names the belief that 'Average latency is sufficient for monitoring', when in fact the average hides slow outliers. A competent developer familiar with averages in other domains (CPU usage, error rates) will instinctively reach for avg() in their monitoring queries, not realising that latency distributions are skewed and that a 100ms average can coexist with a 5s P99. The common mistake of averaging percentiles across instances (statistically invalid) compounds the trap for those who adopt percentiles but apply them incorrectly.
TL;DR
Explanation
Average latency is misleading: a 100ms average can mask 1% of requests taking 5 seconds. Percentiles describe the distribution instead:
- P50 (median): half of requests are faster. This approximates the typical user.
- P95: 95% of requests are faster, which covers almost everyone.
- P99: 99% of requests are faster. The remaining 1% is often power users or requests over large data sets.
- P99.9: the slowest 0.1%, outliers that usually point to infrastructure issues.
Implementation: record latency as a histogram metric and query it in Prometheus, for example histogram_quantile(0.99, sum by (le) (rate(http_duration_bucket[5m]))). Percentiles cannot be averaged across instances. Sum the histogram buckets across instances first (the sum by (le) step), then compute the quantile once. Set the SLO on P99, not on the average.
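To make the gap between the average and the tail concrete, here is a minimal sketch that computes both from raw samples. The workload (980 requests at 50 ms, 20 at 5000 ms) and the helper name percentile() are hypothetical, and the nearest-rank definition used here is only one of several common percentile definitions.

```php
<?php
declare(strict_types=1);

/**
 * Nearest-rank percentile: the smallest sample such that at least
 * $p percent of all samples are less than or equal to it.
 *
 * @param float[] $samples non-empty list of latencies
 */
function percentile(array $samples, float $p): float
{
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return $samples[max($rank, 1) - 1];
}

// Hypothetical workload: 980 fast requests and 20 slow ones (2% of traffic).
$latenciesMs = array_merge(array_fill(0, 980, 50.0), array_fill(0, 20, 5000.0));

$mean = array_sum($latenciesMs) / count($latenciesMs);
$p50 = percentile($latenciesMs, 50);
$p95 = percentile($latenciesMs, 95);
$p99 = percentile($latenciesMs, 99);

printf("mean=%.0fms p50=%.0fms p95=%.0fms p99=%.0fms\n", $mean, $p50, $p95, $p99);
// prints "mean=149ms p50=50ms p95=50ms p99=5000ms"
```

The mean of 149 ms matches no request that was actually served: most users saw 50 ms and the slow tail saw 5 seconds. Only the P99 exposes the tail.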
Common Misconception
Why It Matters
Common Mistakes
- Monitoring average instead of percentiles.
- Averaging precomputed percentiles from different instances, which is statistically invalid. Aggregate the histogram buckets instead.
- Setting the SLO on P50, which says nothing about the slower half of requests.
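The second mistake can be illustrated with a small sketch. It assumes two hypothetical instances that export per-bucket counts with identical bucket bounds; bucketQuantile() is an illustrative helper that returns the upper bound of the bucket containing the target rank, not a Prometheus function.

```php
<?php
declare(strict_types=1);

/**
 * Smallest bucket upper bound whose cumulative count covers quantile $q.
 *
 * @param array<int, int> $counts per-bucket (non-cumulative) counts keyed by upper bound in ms
 */
function bucketQuantile(array $counts, float $q): int
{
    ksort($counts);
    $target = $q * array_sum($counts);
    $seen = 0;
    foreach ($counts as $upperBoundMs => $count) {
        $seen += $count;
        if ($seen >= $target) {
            return $upperBoundMs;
        }
    }
    return (int) array_key_last($counts);
}

// Hypothetical per-instance histograms sharing the same bucket bounds (ms).
$instanceA = [100 => 9000, 500 => 0, 2500 => 0];  // healthy, most of the traffic
$instanceB = [100 => 0, 500 => 0, 2500 => 1000];  // degraded, 10% of the traffic

// Invalid: average the two per-instance P99 values.
$averagedP99 = (bucketQuantile($instanceA, 0.99) + bucketQuantile($instanceB, 0.99)) / 2;

// Valid: add the counts bucket by bucket, then take the quantile once.
$merged = [];
foreach ([$instanceA, $instanceB] as $histogram) {
    foreach ($histogram as $upperBoundMs => $count) {
        $merged[$upperBoundMs] = ($merged[$upperBoundMs] ?? 0) + $count;
    }
}
$fleetP99 = bucketQuantile($merged, 0.99);

printf("averaged P99 = %d ms, fleet P99 = %d ms\n", $averagedP99, $fleetP99);
// prints "averaged P99 = 1300 ms, fleet P99 = 2500 ms"
```

Averaging gives each instance equal weight regardless of traffic, so the result (1300 ms) corresponds to no real percentile of the combined requests. Bucket counts are plain counters and can be added safely, which is why the aggregation has to happen before the quantile is computed.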
Code Examples
// Average latency metric — hides outliers:
Gauge::set('latency_avg', $totalTime / $count);
// 100ms average, but 1% of requests take 5s
// Histogram metric, exported to Prometheus as buckets, so percentiles stay correct:
$histogram = $meter->createHistogram('http.request.duration');
$histogram->record($durationMs, ['route' => $route]);
// Query P99 across all instances (sum the buckets by le before taking the quantile):
// histogram_quantile(0.99, sum by (le) (rate(http_request_duration_bucket[5m])))
// SLO: P99 < 500ms
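A quantile read from a histogram is an estimate, not an exact value. The sketch below approximates how a classic-histogram quantile can be estimated by interpolating linearly inside the bucket that contains the target rank. It is a simplification under stated assumptions: interpolatedQuantile() is a hypothetical helper, the bucket counts are invented, and Prometheus's own histogram_quantile handles additional edge cases (such as the +Inf bucket) that are omitted here.

```php
<?php
declare(strict_types=1);

/**
 * Estimate a quantile from cumulative bucket counts by interpolating
 * linearly inside the bucket that contains the target rank.
 *
 * @param array<int, int> $cumulative cumulative counts keyed by upper bound (ms), ascending
 */
function interpolatedQuantile(array $cumulative, float $q): float
{
    $rank = $q * end($cumulative);  // target rank among all observations
    $lowerBound = 0.0;              // the first bucket is assumed to start at 0
    $countBelow = 0;
    foreach ($cumulative as $upperBound => $count) {
        if ($count >= $rank) {
            $inBucket = $count - $countBelow;
            $fraction = $inBucket > 0 ? ($rank - $countBelow) / $inBucket : 0.0;
            return $lowerBound + ($upperBound - $lowerBound) * $fraction;
        }
        $lowerBound = (float) $upperBound;
        $countBelow = $count;
    }
    return $lowerBound;
}

// Hypothetical cumulative counts: 9000 requests <= 100 ms, 9800 <= 500 ms, 10000 <= 2500 ms.
$buckets = [100 => 9000, 500 => 9800, 2500 => 10000];

$p99 = interpolatedQuantile($buckets, 0.99);
printf("estimated P99 = %.0f ms\n", $p99);
// prints "estimated P99 = 1500 ms"
```

The estimate of 1500 ms only says that the true P99 lies somewhere between 500 ms and 2500 ms. With an SLO of P99 < 500ms, it helps to place a bucket boundary at or near 500 ms so the histogram can tell whether the target is met.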