On-Call & Runbooks
TL;DR
A runbook documents how to diagnose and resolve a specific alert, so on-call engineers don't have to work it out from scratch at 3am.
Explanation
A runbook is a step-by-step guide for a single alert. It covers what the alert means, what to check first, common causes, diagnostic commands, the escalation path, and the rollback procedure. It should also include expected steady-state values, links to dashboards, known false positives, and who to escalate to for domain-specific issues. Treat it as a living document and update it after every incident.
On-call practices: a primary and secondary rotation; an escalation policy (no response within 15 minutes → page the secondary); a regular alert-fatigue review that aims for fewer than 5 pages per week; and blameless post-incident reviews (PIRs, also called postmortems).
Runbook hosting: a wiki, Confluence, or PagerDuty runbooks.
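The escalation rule and the paging target described above can be sketched in a few lines. This is a minimal illustration, not how any particular paging tool implements it: the names `ESCALATION_POLICY`, `who_to_page`, and `over_page_budget` are hypothetical, and only the 15-minute step and the 5-pages-per-week target come from the text.

```python
from datetime import timedelta

# Hypothetical policy: each level is paged once the alert has gone
# unacknowledged for at least the given delay.
ESCALATION_POLICY = [
    (timedelta(minutes=0), "primary"),
    (timedelta(minutes=15), "secondary"),
]


def who_to_page(unacked_for: timedelta) -> list[str]:
    """Return every rotation level that should have been paged by now."""
    return [level for delay, level in ESCALATION_POLICY if unacked_for >= delay]


def over_page_budget(pages_this_week: int, budget: int = 5) -> bool:
    """True when the weekly page count reaches the budget and needs a review."""
    return pages_this_week >= budget


print(who_to_page(timedelta(minutes=5)))   # primary only
print(who_to_page(timedelta(minutes=20)))  # primary and secondary
```

Keeping the policy as data rather than branching logic makes it easy to add a third level (for example, a manager) without touching the function.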
Common Misconception
✗ "Runbooks are for junior engineers." Runbooks are for everyone at 3am: even experts make mistakes when exhausted. They also capture institutional knowledge before it walks out the door.
Why It Matters
Runbooks reduce mean time to recovery (MTTR) by giving on-call engineers a starting point — without them, each incident starts from scratch and relies on heroics.
Common Mistakes
- No runbook for alerts — on-call has to figure it out every time.
- Outdated runbooks — worse than none because they mislead.
- Runbooks without diagnostic commands — 'check the database' isn't actionable.
- No escalation path — on-call stuck without domain expert contact.
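Several of the mistakes above (no diagnostic commands, no escalation path) can be caught by a simple completeness check. The sketch below assumes runbooks are stored as structured data with the hypothetical section names shown; a wiki-hosted runbook would need a different check.

```python
# Hypothetical section names, mirroring the contents listed in the Explanation.
REQUIRED_SECTIONS = (
    "meaning",
    "first_checks",
    "common_causes",
    "diagnostic_commands",
    "escalation_path",
    "rollback",
)


def missing_sections(runbook: dict) -> list[str]:
    """List required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name)]


runbook = {
    "meaning": "More than 5% of requests are failing.",
    "first_checks": ["Open the linked dashboard"],
    "common_causes": ["DB connection pool exhausted", "upstream timeout"],
    "diagnostic_commands": [],  # "check the database" alone is not a command
    "rollback": "kubectl rollout undo deployment/api",
}

print(missing_sections(runbook))  # → ['diagnostic_commands', 'escalation_path']
```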
Code Examples
✗ Vulnerable
# Alert without a runbook:
- alert: HighErrorRate
  expr: error_rate > 0.05
  annotations:
    summary: Error rate high
  # No runbook link: on-call has to guess what to do
✓ Fixed
- alert: HighErrorRate
  expr: error_rate > 0.05
  labels:
    severity: page
  annotations:
    summary: Error rate above 5%
    runbook: https://wiki/runbooks/high-error-rate
    dashboard: https://grafana/d/xyz
# Runbook content:
# 1. Check the dashboard linked above
# 2. Look for errors in the logs: `kubectl logs -n prod -l app=api | grep ERROR | head -20`
# 3. Common causes: DB connection pool exhausted, upstream timeout
# 4. Rollback: `kubectl rollout undo deployment/api -n prod`
# 5. Escalate to: @backend-oncall in Slack
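A check like the following could run in CI to flag alerts that lack a runbook link. It is a sketch under stated assumptions: the rules are represented as plain Python dicts shaped like the example above (in practice they would be loaded from the rule files), the function name `alerts_without_runbook` is hypothetical, and the annotation key defaults to `runbook` as used in this example.

```python
def alerts_without_runbook(rules: list[dict], key: str = "runbook") -> list[str]:
    """Return the names of alert rules that have no runbook annotation."""
    return [
        rule["alert"]
        for rule in rules
        if not rule.get("annotations", {}).get(key)
    ]


rules = [
    {
        "alert": "HighErrorRate",
        "expr": "error_rate > 0.05",
        "annotations": {
            "summary": "Error rate above 5%",
            "runbook": "https://wiki/runbooks/high-error-rate",
        },
    },
    {
        # Hypothetical second alert, added only to show a failing case.
        "alert": "QueueBacklog",
        "expr": "queue_depth > 1000",
        "annotations": {"summary": "Queue is backing up"},
    },
]

print(alerts_without_runbook(rules))  # → ['QueueBacklog']
```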
Added
23 Mar 2026
⚡
DEV INTEL
Tools & Severity
🟠 High
⚙ Fix effort: Medium
⚡ Quick Fix
Create a runbook for every alert. Include what the alert means, diagnostic commands, common causes, the rollback procedure, and the escalation path. Link the runbook URL from the alert's annotations. Review each runbook quarterly.
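The quarterly review can be enforced with a small staleness check. This sketch assumes a record of last-review dates is kept per runbook and approximates "quarterly" as 90 days; the runbook names and dates are illustrative only.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def overdue_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return, sorted, the runbooks whose last review is older than the interval."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    )


# Illustrative data, not taken from any real system.
reviews = {
    "high-error-rate": date(2026, 1, 10),
    "disk-full": date(2025, 9, 1),
}

print(overdue_runbooks(reviews, today=date(2026, 3, 23)))  # → ['disk-full']
```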
📦 Applies To
web
cli
queue-worker
🔍 Detection Hints
Auto-detectable:
✗ No
🤖 AI Agent
Confidence: Low
False Positives: High
✗ Manual fix
Fix: Low
Context: File