On-Call & Runbooks

Observability Beginner
debt(d9/e5/b7/t5)
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). The detection_hints field explicitly states 'automated: no' — there is no tool that catches missing or outdated runbooks. The absence only becomes apparent when an incident occurs and on-call engineers are left without guidance, exactly the worst possible moment.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix requires creating a runbook for every alert, linking URLs from alert annotations, including diagnostic commands, escalation paths, and rollback procedures, then establishing a quarterly review cadence. This touches alerting configuration, documentation systems, and team processes — more than a one-liner but short of a full architectural rework.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). The common_mistakes show that every alert must have a linked runbook; every incident response is shaped by whether runbooks exist and are current. The applies_to covers web, cli, and queue-worker contexts — the entire operational surface. Outdated or missing runbooks actively slow every on-call rotation and incident, creating a persistent and wide-reaching burden on all future maintainers and responders.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap (a documented gotcha most devs eventually learn)' (t5). The misconception field identifies a specific, well-documented wrong belief: that runbooks are only for junior engineers. In reality they are critical for everyone under stress. Additionally, the common_mistakes flag that outdated runbooks are worse than none — a subtle but serious trap where having documentation creates false confidence and actively misleads responders.

TL;DR

A runbook documents how to diagnose and resolve a specific alert, so that on-call engineers do not have to work everything out from scratch at 3am.

Explanation

A runbook is a step-by-step guide for responding to a specific alert. It is a living document: update it after every incident.

A runbook should contain:

  • What the alert means.
  • What to check first.
  • Common causes and known false positives.
  • Diagnostic commands and links to dashboards.
  • Expected steady-state values.
  • The escalation path, including who to contact for domain-specific issues.
  • The rollback procedure.

On-call practices that go with runbooks:

  • A primary and a secondary rotation.
  • An escalation policy, for example paging the secondary when the primary has not responded within 15 minutes.
  • Alert-fatigue reviews that aim to keep pages below 5 per week.
  • Post-incident reviews (PIR, also called a blameless postmortem).

Runbooks are typically hosted in a wiki, in PagerDuty runbooks, or in Confluence.
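
The two thresholds mentioned here, escalation after 15 minutes without a response and a target of fewer than 5 pages per week, can be written down as a small sketch. The `Rotation` class, the function names, and the engineer names are hypothetical; real paging tools express the same policy in their own configuration.

```python
from dataclasses import dataclass

ESCALATION_TIMEOUT_MIN = 15  # from the text: no response in 15 min -> secondary
WEEKLY_PAGE_BUDGET = 5       # from the text: aim for fewer than 5 pages per week


@dataclass
class Rotation:
    primary: str
    secondary: str


def who_to_page(rotation: Rotation, minutes_unacknowledged: float) -> list[str]:
    """Return everyone who should have been paged by now for an unacknowledged alert."""
    targets = [rotation.primary]
    if minutes_unacknowledged >= ESCALATION_TIMEOUT_MIN:
        targets.append(rotation.secondary)
    return targets


def over_page_budget(pages_this_week: int) -> bool:
    """True when the week's page count calls for an alert-fatigue review."""
    return pages_this_week >= WEEKLY_PAGE_BUDGET


rotation = Rotation(primary="alice", secondary="bob")  # hypothetical names
print(who_to_page(rotation, 5))    # only the primary so far
print(who_to_page(rotation, 20))   # primary and secondary
print(over_page_budget(7))
```

Keeping the thresholds as named constants makes the policy easy to find and adjust when the team revisits it.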

Common Misconception

The misconception is that runbooks are only for junior engineers. In practice, runbooks are for everyone at 3am: even experts make mistakes when exhausted. Runbooks also capture institutional knowledge before it walks out the door.

Why It Matters

Runbooks reduce mean time to recovery (MTTR) by giving on-call engineers a starting point — without them, each incident starts from scratch and relies on heroics.
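
For reference, MTTR is simply the average time from detection to resolution across incidents. The sketch below shows the calculation only; the timestamps are made up for illustration and do not measure the effect of any runbook.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident timestamps, for illustration only.
incidents = [
    (datetime(2026, 1, 5, 3, 0), datetime(2026, 1, 5, 3, 40)),      # 40 minutes
    (datetime(2026, 2, 11, 14, 10), datetime(2026, 2, 11, 14, 30)), # 20 minutes
]
print(mean_time_to_recovery(incidents))  # 0:30:00
```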

Common Mistakes

  • No runbook for alerts — on-call has to figure it out every time.
  • Outdated runbooks — worse than none because they mislead.
  • Runbooks without diagnostic commands — 'check the database' isn't actionable.
  • No escalation path — on-call stuck without domain expert contact.
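
The four mistakes above can be partly checked by a script if runbook metadata is kept in a structured form. This is a minimal sketch under that assumption: the `Runbook` fields, the `lint_runbooks` function, the 92-day reading of "quarterly", and the `QueueBacklog` and `DiskFull` alert names are all hypothetical. A date check only shows that a runbook has not been reviewed recently; it cannot tell whether the content is still correct.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=92)  # assumption: "quarterly" taken as about 92 days


@dataclass
class Runbook:
    alert: str
    last_reviewed: date
    diagnostic_commands: list[str] = field(default_factory=list)
    escalation_contact: str = ""


def lint_runbooks(alert_names, runbooks, today):
    """Return one human-readable finding per common mistake detected."""
    by_alert = {rb.alert: rb for rb in runbooks}
    findings = []
    for name in alert_names:
        rb = by_alert.get(name)
        if rb is None:
            findings.append(f"{name}: no runbook")
            continue
        if today - rb.last_reviewed > REVIEW_INTERVAL:
            findings.append(f"{name}: runbook not reviewed since {rb.last_reviewed}")
        if not rb.diagnostic_commands:
            findings.append(f"{name}: no diagnostic commands")
        if not rb.escalation_contact:
            findings.append(f"{name}: no escalation path")
    return findings


runbooks = [
    Runbook("HighErrorRate", date(2025, 6, 1),
            ["kubectl logs -n prod -l app=api | grep ERROR | head -20"],
            "@backend-oncall"),
    Runbook("QueueBacklog", date(2026, 3, 1)),  # hypothetical alert, empty runbook
]
alerts = ["HighErrorRate", "QueueBacklog", "DiskFull"]
for finding in lint_runbooks(alerts, runbooks, today=date(2026, 3, 23)):
    print(finding)
```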

Code Examples

✗ Vulnerable
# Alert without runbook:
alert: HighErrorRate
expr: error_rate > 0.05
annotations:
  summary: Error rate high
  # No runbook link — on-call guesses what to do
✓ Fixed
alert: HighErrorRate
expr: error_rate > 0.05
labels:
  severity: page  # a label, so Alertmanager can route on it
annotations:
  summary: Error rate above 5%
  runbook: https://wiki/runbooks/high-error-rate
  dashboard: https://grafana/d/xyz

# Runbook content:
# 1. Check dashboard link above
# 2. grep 'ERROR' in logs: `kubectl logs -n prod -l app=api | grep ERROR | head -20`
# 3. Common causes: DB connection pool exhausted, upstream timeout
# 4. Rollback: `kubectl rollout undo deployment/api -n prod`
# 5. Escalate to: @backend-oncall in Slack
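
A missing runbook link on an alert is one gap that a small script can flag. The sketch below assumes the alert rules have already been parsed into dictionaries (YAML parsing is left out to keep the example self-contained) and that the link lives under the `runbook` annotation key used in the example above. The function name and the `QueueBacklog` rule are hypothetical. It only checks that a link is present, not that the runbook behind it exists or is current.

```python
def alerts_missing_runbook(rules, annotation_key="runbook"):
    """Return names of alert rules whose annotations lack a non-empty runbook link."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            continue  # skip recording rules, which have no alert name
        link = (rule.get("annotations") or {}).get(annotation_key, "")
        if not str(link).strip():
            missing.append(rule["alert"])
    return missing


# Rules written as already-parsed dictionaries.
rules = [
    {
        "alert": "HighErrorRate",
        "expr": "error_rate > 0.05",
        "annotations": {
            "summary": "Error rate above 5%",
            "runbook": "https://wiki/runbooks/high-error-rate",
        },
    },
    {
        "alert": "QueueBacklog",  # hypothetical rule with no runbook link
        "expr": "queue_depth > 1000",
        "annotations": {"summary": "Queue backlog growing"},
    },
]
print(alerts_missing_runbook(rules))  # ['QueueBacklog']
```

Run in CI against the rule files, a check like this fails the build when a new alert is added without a runbook link.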

Added 23 Mar 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
Create a runbook for every alert. Include what the alert means, diagnostic commands, common causes, the rollback procedure, and the escalation path. Link the runbook URL from the alert annotations. Review each runbook quarterly.
📦 Applies To
web cli queue-worker
🔍 Detection Hints
Auto-detectable: ✗ No
🤖 AI Agent
Confidence: Low False Positives: High ✗ Manual fix Fix: Low Context: File

