Error Budget
debt(d9/e5/b7/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints note automated=no, and tools like Prometheus/Datadog can surface metrics but only if you've deliberately instrumented and wired up budget tracking dashboards and alerts. Per common_mistakes, teams often only discover a breach at month end — meaning the misuse (not tracking in real time, no policy) is entirely silent in production until the damage is done.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix requires calculating the budget formula, setting up real-time tracking instrumentation, configuring multi-threshold alerts (50% and 100%), and defining an error budget policy with consequences. This spans monitoring configuration, alerting rules, SLO definitions, and team process — more than a single-line patch but not a full architectural rework.
Closest to 'strong gravitational pull' (b7). Error budgets apply across web, cli, and queue-worker contexts and shape every release decision and reliability trade-off. Once adopted, every deployment, incident response, and feature velocity decision is evaluated against budget consumption. This is a persistent cross-cutting concern that influences how the entire engineering team operates, making it load-bearing across the system.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states: developers treat error budget as 'a threshold before escalation' (like an alert threshold) rather than an 'operational currency that governs release velocity.' This is a meaningful conceptual inversion — the intuitive read (budget = warning limit) is wrong, and the correct model (budget = spendable resource with policy consequences) contradicts how most developers reason about availability thresholds.
TL;DR
Explanation
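Error budget = (1 - SLO) × time window. A 99.9% SLO over 30 days allows 0.1% × 43,200 min = 43.2 min of unavailability. Error budget policy: when more than 50% is consumed, review reliability; when the budget is exhausted (100% consumed), freeze feature releases and focus on reliability. Budget tracking uses burn-rate alerts, where burn rate is the observed error rate divided by the error rate the SLO allows, so faster consumption triggers an earlier alert. Slow burn: 1x exhausts the budget exactly at the end of the window. Fast burn: 14.4x consumes 2% of a 30-day budget in 1 hour and exhausts it in about 50 hours. Multi-window alerts combine a fast-burn check (14.4x over 1h) with a slower-burn check (6x over 6h, about 5% of the budget) so that both sudden outages and sustained degradation are caught. The error budget creates a shared incentive: the product team wants to ship, and the SRE team controls the budget gate.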
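The budget and burn-rate arithmetic can be checked with a short sketch. The function names below are illustrative only and do not come from any monitoring library; the 99.9% SLO and 30-day window are the values used in this entry.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed unavailability in minutes: (1 - SLO) x window."""
    return (1.0 - slo) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate: float, window_days: float) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate: float, alert_window_hours: float, window_days: float) -> float:
    """Fraction of the budget used by burning at burn_rate for alert_window_hours."""
    return burn_rate * alert_window_hours / (window_days * 24)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
    print(round(hours_to_exhaustion(14.4, 30), 1))    # 50.0
    print(round(budget_consumed(14.4, 1, 30), 3))     # 0.02
    print(round(budget_consumed(6, 6, 30), 3))        # 0.05
```

A burn rate of 1 gives `hours_to_exhaustion(1, 30) == 720.0`, which is the full 30-day window.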
Common Misconception
Why It Matters
Common Mistakes
- Not tracking real-time budget consumption — only discovering breach at month end.
- No error budget policy — budget means nothing without consequences.
- Counting planned downtime against the budget without allowing for it when setting the SLO target.
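The first two mistakes can be avoided by computing consumption continuously and attaching an action to each threshold. The following is a minimal sketch, not a definitive implementation: `budget_consumed_fraction`, `policy_action`, and the action strings are hypothetical names, `expected_window_requests` is an assumed traffic forecast for the whole SLO window, and the 50% and 100% thresholds are the ones used in this entry.

```python
def budget_consumed_fraction(
    expected_window_requests: int, failed_requests: int, slo: float
) -> float:
    """Fraction of the window's error budget used so far (request-based SLO).

    expected_window_requests is an assumed forecast of total traffic in the
    SLO window; the budget is the number of failures that forecast allows.
    """
    if expected_window_requests <= 0:
        raise ValueError("expected_window_requests must be positive")
    allowed_failures = (1.0 - slo) * expected_window_requests
    return failed_requests / allowed_failures


def policy_action(consumed: float) -> str:
    """Map budget consumption to the policy consequence (hypothetical labels)."""
    if consumed >= 1.0:
        return "feature freeze"
    if consumed > 0.5:
        return "reliability review"
    return "ship normally"


if __name__ == "__main__":
    # 99.9% SLO with 10M expected requests allows about 10,000 failures.
    consumed = budget_consumed_fraction(10_000_000, 6_000, 0.999)
    print(policy_action(consumed))  # reliability review
```

Running a check like this on a schedule, rather than at month end, is what turns the budget into a gate on releases instead of a retrospective statistic.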
Code Examples
# No error budget tracking:
# 'Uptime was 99.5% this month'
# No action taken, no budget policy

# Prometheus error budget alert (99.9% SLO, so the allowed error ratio is 0.001).
# Assumes the job:slo_requests_errors:rate* recording rules hold error ratios (errors / requests).
# Fires on a fast burn (> 14.4x over 1h) or a slower burn (> 6x over 6h):
- alert: FastErrorBudgetBurn
  expr: |
    job:slo_requests_errors:rate1h > (14.4 * 0.001)
    or
    job:slo_requests_errors:rate6h > (6 * 0.001)
# Policy: > 50% consumed → reliability sprint
#         100% consumed → feature freeze