
Error Budget

observability Intermediate
debt(d9/e5/b7/t7)
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). The detection_hints note automated=no, and tools like Prometheus/Datadog can surface metrics but only if you've deliberately instrumented and wired up budget tracking dashboards and alerts. Per common_mistakes, teams often only discover a breach at month end — meaning the misuse (not tracking in real time, no policy) is entirely silent in production until the damage is done.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix requires calculating the budget formula, setting up real-time tracking instrumentation, configuring multi-threshold alerts (50% and 100%), and defining an error budget policy with consequences. This spans monitoring configuration, alerting rules, SLO definitions, and team process — more than a single-line patch but not a full architectural rework.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). Error budgets apply across web, cli, and queue-worker contexts and shape every release decision and reliability trade-off. Once adopted, every deployment, incident response, and feature velocity decision is evaluated against budget consumption. This is a persistent cross-cutting concern that influences how the entire engineering team operates, making it load-bearing across the system.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states: developers treat error budget as 'a threshold before escalation' (like an alert threshold) rather than an 'operational currency that governs release velocity.' This is a meaningful conceptual inversion — the intuitive read (budget = warning limit) is wrong, and the correct model (budget = spendable resource with policy consequences) contradicts how most developers reason about availability thresholds.


TL;DR

Error budget is the allowed amount of unreliability within an SLO period: a 99.9% SLO over 30 days allows 43.2 minutes of downtime (about 43.8 minutes in an average calendar month). When the budget is exhausted, reliability takes priority over features.

Explanation

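Error budget = (1 - SLO) × time window. A 99.9% SLO over 30 days allows 0.001 × 43,200 min = 43.2 min of downtime. An error budget policy ties consumption to action: above 50% consumed, review reliability; at 100% consumed (budget exhausted), freeze feature releases and focus on reliability. Track the budget with burn-rate alerts, where burn rate is the speed of consumption relative to the rate that would spend the budget exactly over the window, so faster consumption triggers an earlier alert. Slow burn: 1x exhausts the budget at the end of the window. Fast burn: 14.4x spends 2% of a 30-day budget in 1 hour and would exhaust it in about 50 hours. Multi-window alerts combine a fast-burn check (14.4x over 1h) with a slower one (6x over 6h) so that both sudden outages and sustained degradation are caught. The error budget creates a shared incentive: the product team wants to ship, and the SRE team controls the budget gate.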
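The budget and burn-rate arithmetic can be checked with a short sketch. The function names here are illustrative rather than taken from any library, and a 30-day window with a constant burn rate is assumed. Under those assumptions a 14.4x burn spends 2% of the budget per hour and lasts about 50 hours.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Allowed downtime in minutes: (1 - SLO) x window length."""
    return (1.0 - slo) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """Hours until a full budget is spent at a constant burn rate.

    A burn rate of 1 spends the budget exactly over the window.
    """
    return window_days * 24 / burn_rate


def budget_fraction_consumed(burn_rate: float, hours: float,
                             window_days: float = 30.0) -> float:
    """Fraction of the budget spent after `hours` at a constant burn rate."""
    return burn_rate * hours / (window_days * 24)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))        # 43.2
    print(round(hours_to_exhaustion(14.4), 1))          # 50.0
    print(round(budget_fraction_consumed(14.4, 1), 3))  # 0.02
    print(round(budget_fraction_consumed(6, 6), 3))     # 0.05
```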

Common Misconception

Misconception: error budget is a threshold before escalation. In fact it is an operational currency that governs release velocity. Spending it on planned downtime is valid; unexpected failures should trigger reliability work.

Why It Matters

Error budget transforms reliability from a vague 'be careful' to a quantified resource — engineering teams make explicit trade-offs between velocity and reliability.

Common Mistakes

  • Not tracking real-time budget consumption — only discovering breach at month end.
  • No error budget policy — budget means nothing without consequences.
  • Counting planned downtime against the budget without having accounted for it when the SLO target was set.
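To avoid discovering a breach only at month end, consumption can be computed continuously from request counters. The following is a minimal sketch, assuming a request-based SLO and two counters (total and failed requests) for the window so far. `BudgetStatus` and `budget_status` are hypothetical names, and measuring the allowance against traffic seen so far (rather than forecast traffic for the whole window) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_errors: float  # errors the SLO tolerates for the traffic seen so far
    consumed: float        # fraction of that allowance used (can exceed 1.0)
    remaining: float       # fraction left, floored at 0.0


def budget_status(total_requests: int, failed_requests: int,
                  slo: float) -> BudgetStatus:
    """Request-based error budget for the window so far."""
    if total_requests <= 0:
        # No traffic yet: nothing has been spent.
        return BudgetStatus(0.0, 0.0, 1.0)
    allowed = (1.0 - slo) * total_requests
    if allowed <= 0:
        # A 100% SLO leaves no budget at all.
        consumed = float("inf") if failed_requests else 0.0
    else:
        consumed = failed_requests / allowed
    return BudgetStatus(allowed, consumed, max(0.0, 1.0 - consumed))


if __name__ == "__main__":
    status = budget_status(total_requests=1_000_000, failed_requests=400, slo=0.999)
    print(round(status.consumed, 2), round(status.remaining, 2))  # 0.4 0.6
```

In practice the two counters would come from whatever metrics system is already instrumented; the sketch only shows the calculation that a dashboard or alert would be built on.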

Code Examples

✗ Vulnerable
// No error budget tracking:
// 'Uptime was 99.5% this month'
// No action taken, no budget policy
✓ Fixed
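# Prometheus burn-rate alert for a 99.9% SLO (error budget = 0.001).
# Assumes each recording rule below yields an error ratio
# (failed requests / total requests) over the named window.
# Fires when either window burns too fast: 14.4x over 1h or 6x over 6h.
alert: FastErrorBudgetBurn
expr: |
  job:slo_requests_errors:rate1h > (14.4 * 0.001)
  or
  job:slo_requests_errors:rate6h > (6 * 0.001)

# Policy: > 50% consumed → reliability sprint
# 100% consumed (budget exhausted) → feature freeze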
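The policy in the comments above can also be written down as code so that it is applied consistently. This is a sketch only: the `Action` enum and `policy_action` function are hypothetical names, the 50% and 100% thresholds come from the policy stated here, and a real policy would be agreed by the team and would likely carry more steps.

```python
from enum import Enum


class Action(Enum):
    SHIP_NORMALLY = "ship normally"
    RELIABILITY_SPRINT = "reliability sprint"
    FEATURE_FREEZE = "feature freeze"


def policy_action(consumed: float) -> Action:
    """Map the consumed fraction of the budget (0.5 = 50%) to a policy step."""
    if consumed >= 1.0:
        # Budget exhausted or overspent.
        return Action.FEATURE_FREEZE
    if consumed > 0.5:
        return Action.RELIABILITY_SPRINT
    return Action.SHIP_NORMALLY


if __name__ == "__main__":
    for fraction in (0.2, 0.7, 1.1):
        print(fraction, policy_action(fraction).value)
```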

Added 23 Mar 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
Calculate: (1 - SLO) × days × 24h. Track in real time. Alert on 50% and 100% consumption. Define policy: what changes when budget is exhausted?
📦 Applies To
web cli queue-worker
🔗 Prerequisites
🔍 Detection Hints
Auto-detectable: ✗ No
Tools: prometheus, datadog
⚠ Related Problems
🤖 AI Agent
Confidence: Low
False Positives: High
✗ Manual fix
Fix: High
Context: File
