On-Call & Runbooks
debt(d9/e5/b7/t5)
Closest to 'silent in production until users hit it' (d9). The detection_hints field explicitly states 'automated: no' — there is no tool that catches missing or outdated runbooks. The absence only becomes apparent when an incident occurs and on-call engineers are left without guidance, exactly the worst possible moment.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix requires creating a runbook for every alert, linking URLs from alert annotations, including diagnostic commands, escalation paths, and rollback procedures, then establishing a quarterly review cadence. This touches alerting configuration, documentation systems, and team processes — more than a one-liner but short of a full architectural rework.
Closest to 'strong gravitational pull' (b7). The common_mistakes show that every alert must have a linked runbook; every incident response is shaped by whether runbooks exist and are current. The applies_to covers web, cli, and queue-worker contexts — the entire operational surface. Outdated or missing runbooks actively slow every on-call rotation and incident, creating a persistent and wide-reaching burden on all future maintainers and responders.
Closest to 'notable trap (a documented gotcha most devs eventually learn)' (t5). The misconception field identifies a specific, well-documented wrong belief: that runbooks are only for junior engineers. In reality they are critical for everyone under stress. Additionally, the common_mistakes flag that outdated runbooks are worse than none — a subtle but serious trap where having documentation creates false confidence and actively misleads responders.
TL;DR
Explanation
Runbook: step-by-step guide for an alert. Contents: what this alert means, what to check first, common causes, diagnostic commands, escalation path, rollback procedure. Living document — update after every incident. On-call practices: primary + secondary rotation, escalation policy (no response in 15min → secondary), alerting fatigue review (reduce pages < 5/week), post-incident reviews (PIR/blameless postmortem). Runbook hosting: wiki, PagerDuty runbooks, Confluence. Include: expected steady-state values, links to dashboards, known false positives, who to escalate to for domain-specific issues.
Common Misconception
Why It Matters
Common Mistakes
- No runbook for alerts — on-call has to figure it out every time.
- Outdated runbooks — worse than none because they mislead.
- Runbooks without diagnostic commands — 'check the database' isn't actionable.
- No escalation path — on-call stuck without domain expert contact.
Code Examples
# Alert without runbook:
alert: HighErrorRate
expr: error_rate > 0.05
annotations:
summary: Error rate high
# No runbook link — on-call guesses what to do
alert: HighErrorRate
expr: error_rate > 0.05
annotations:
summary: Error rate above 5%
runbook: https://wiki/runbooks/high-error-rate
dashboard: https://grafana/d/xyz
severity: page
# Runbook content:
# 1. Check dashboard link above
# 2. grep 'ERROR' in logs: `kubectl logs -n prod -l app=api | grep ERROR | head -20`
# 3. Common causes: DB connection pool exhausted, upstream timeout
# 4. Rollback: `kubectl rollout undo deployment/api`
# 5. Escalate to: @backend-oncall in Slack