Error Budget

observability Intermediate

debt(d9/e5/b7/t7)

d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). The detection_hints note automated=no, and tools like Prometheus/Datadog can surface metrics but only if you've deliberately instrumented and wired up budget tracking dashboards and alerts. Per common_mistakes, teams often only discover a breach at month end — meaning the misuse (not tracking in real time, no policy) is entirely silent in production until the damage is done.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix requires calculating the budget formula, setting up real-time tracking instrumentation, configuring multi-threshold alerts (50% and 100%), and defining an error budget policy with consequences. This spans monitoring configuration, alerting rules, SLO definitions, and team process — more than a single-line patch but not a full architectural rework.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (e7, mapped to b7). Error budgets apply across web, cli, and queue-worker contexts and shape every release decision and reliability trade-off. Once adopted, every deployment, incident response, and feature velocity decision is evaluated against budget consumption. This is a persistent cross-cutting concern that influences how the entire engineering team operates, making it load-bearing across the system.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states: developers treat error budget as 'a threshold before escalation' (like an alert threshold) rather than an 'operational currency that governs release velocity.' This is a meaningful conceptual inversion — the intuitive read (budget = warning limit) is wrong, and the correct model (budget = spendable resource with policy consequences) contradicts how most developers reason about availability thresholds.

About DEBT scoring → scored by claude-sonnet-4-6 · 2026-05-07 · reviewed by human

TL;DR

Error budget is the allowed amount of unreliability within an SLO period: a 99.9% SLO allows about 43 minutes of downtime per month (43.2 min over 30 days). When the budget is exhausted, reliability takes priority over features.

Explanation

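Error budget = (1 - SLO) × time window. A 99.9% SLO over 30 days allows 0.001 × 43,200 min = 43.2 min of downtime (43.8 min for an average calendar month of about 30.44 days). Error budget policy: when more than 50% is consumed, review reliability; when 100% is consumed, freeze feature releases and focus on reliability. Budget tracking uses burn rate alerts, where burn rate is the observed error rate divided by the error rate the SLO allows, so faster consumption triggers an earlier alert. Fast burn: a 14.4x burn rate spends 2% of a 30-day budget in 1 hour and exhausts it in about 50 hours. Slow burn: 1x exhausts the budget exactly at the end of the window. Multi-window alerts combine a fast-burn condition (14.4x over 1h) with a slower one (6x over 6h) so that both sudden outages and sustained degradation are caught. The error budget creates a shared incentive: the product team wants to ship, and the SRE team controls the budget gate.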
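The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a constant burn rate; the function names are illustrative and do not come from any particular library.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed unreliability: (1 - SLO) x time window."""
    return window * (1.0 - slo)


def time_to_exhaustion(burn_rate: float, window: timedelta) -> timedelta:
    """At a constant burn rate, the whole budget lasts window / burn_rate."""
    return window / burn_rate


def budget_spent(burn_rate: float, elapsed: timedelta, window: timedelta) -> float:
    """Fraction of the budget consumed after `elapsed` at a constant burn rate."""
    return burn_rate * (elapsed / window)


window = timedelta(days=30)

# 99.9% SLO over 30 days: roughly 43.2 minutes of budget.
budget_minutes = error_budget(0.999, window).total_seconds() / 60

# A 14.4x burn rate exhausts the budget in roughly 50 hours...
hours_left = time_to_exhaustion(14.4, window).total_seconds() / 3600

# ...and spends roughly 2% of it in a single hour.
spent_in_one_hour = budget_spent(14.4, timedelta(hours=1), window)
```

A 1x burn rate passed to `time_to_exhaustion` returns the full window, which matches the slow-burn case described above.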

Common Misconception

✗ Error budget is a threshold before escalation. In fact it is an operational currency that governs release velocity. Spending it on planned downtime is valid; unexpected failures should trigger reliability work.

Why It Matters

Error budget transforms reliability from a vague 'be careful' to a quantified resource — engineering teams make explicit trade-offs between velocity and reliability.

Common Mistakes

Not tracking real-time budget consumption — only discovering breach at month end.
No error budget policy — budget means nothing without consequences.
Counting planned downtime against budget without accounting for it in the SLA.
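To make the first two mistakes concrete, here is a hedged sketch of a policy check that could run continuously rather than at month end. It assumes a time-based budget measured in minutes of downtime; `evaluate_budget` and `BudgetStatus` are hypothetical names, and the thresholds mirror the 50% and 100% levels from the explanation.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    consumed: float  # fraction of the budget spent so far (1.0 = exhausted)
    action: str      # what the policy asks the team to do


def evaluate_budget(bad_minutes: float, budget_minutes: float) -> BudgetStatus:
    """Map downtime so far in the SLO window to a policy action."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    consumed = bad_minutes / budget_minutes
    if consumed >= 1.0:
        action = "freeze feature releases"
    elif consumed > 0.5:
        action = "review reliability"
    else:
        action = "ship normally"
    return BudgetStatus(consumed, action)


# 25 minutes of downtime against a 43.2-minute budget is past the 50% mark.
status = evaluate_budget(bad_minutes=25.0, budget_minutes=43.2)
```

Keeping the policy in code, or in alert rules that encode the same thresholds, means the consequence is triggered by the numbers instead of by whoever happens to look at a dashboard.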

Code Examples

✗ Vulnerable

// No error budget tracking:
// 'Uptime was 99.5% this month'
// No action taken, no budget policy

✓ Fixed

# Prometheus error budget alert for a 99.9% SLO (allowed error ratio = 0.001).
# Fires when burn rate > 14.4 over 1h OR burn rate > 6 over 6h.
# The recorded series are assumed to hold error ratios (errors / requests).
alert: FastErrorBudgetBurn
expr: |
  job:slo_requests_errors:rate1h > (14.4 * 0.001)
  or
  job:slo_requests_errors:rate6h > (6 * 0.001)

# Policy: > 50% consumed → reliability sprint
# 100% consumed → feature freeze
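The threshold `14.4 * 0.001` in the rule above is a burn rate converted into an error ratio. The sketch below shows that conversion in both directions; the helper names are illustrative, and it assumes the recorded series are error ratios (errors divided by requests).

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than the SLO allows the budget is being spent."""
    return error_ratio / (1.0 - slo)


def ratio_threshold(target_burn_rate: float, slo: float) -> float:
    """Error ratio at which the given burn rate is reached."""
    return target_burn_rate * (1.0 - slo)


# For a 99.9% SLO, a 14.4x burn rate corresponds to an error ratio of about 0.0144.
fast_threshold = ratio_threshold(14.4, 0.999)

# An observed 2% error ratio against the same SLO is a burn rate of about 20x.
observed = burn_rate(0.02, 0.999)
```

Changing the SLO changes the allowed error ratio, so the alert thresholds should be recomputed rather than copied between services.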

References

↗ https://sre.google/workbook/error-budget-policy/