
On-Call & Runbooks

observability · Beginner

TL;DR

A runbook documents how to diagnose and resolve specific alerts — on-call engineers shouldn't have to think from scratch at 3am; the runbook provides the playbook.

Explanation

A runbook is a step-by-step guide for responding to a specific alert. A good one covers:

  • What the alert means and its expected steady-state values.
  • What to check first, with exact diagnostic commands.
  • Common causes and known false positives.
  • Links to the relevant dashboards.
  • The rollback procedure.
  • The escalation path, including who to contact for domain-specific issues.

A runbook is a living document: update it after every incident. Surrounding on-call practices matter just as much: run a primary + secondary rotation with an escalation policy (no response within 15 minutes pages the secondary), review alerting fatigue regularly (aim for fewer than 5 pages per week), and hold blameless post-incident reviews (PIRs). Host runbooks somewhere on-call can reach quickly: a wiki, Confluence, or PagerDuty runbooks.
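The escalation policy above (primary paged first, secondary after 15 minutes without acknowledgement) can be sketched as a small state check. This is an illustrative model, not any particular paging tool's API; the `Page` class and the 15-minute threshold are assumptions drawn from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Illustrative threshold from the escalation policy described above.
ESCALATION_TIMEOUT = timedelta(minutes=15)

@dataclass
class Page:
    alert: str
    sent_at: datetime
    acked_at: Optional[datetime] = None  # set when primary acknowledges

def who_to_page(page: Page, now: datetime) -> str:
    """Return which rotation currently owns this page."""
    if page.acked_at is not None:
        return "primary"      # acknowledged: primary is working it
    if now - page.sent_at >= ESCALATION_TIMEOUT:
        return "secondary"    # primary silent too long: escalate
    return "primary"

# Usage: a page sent at 03:00 with no ack escalates by 03:20.
p = Page("HighErrorRate", sent_at=datetime(2026, 3, 23, 3, 0))
print(who_to_page(p, datetime(2026, 3, 23, 3, 20)))  # → secondary
```

Real paging tools (PagerDuty, Opsgenie) implement this server-side via escalation policies; the sketch only shows the timeout logic itself.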

Common Misconception

"Runbooks are just for junior engineers." In reality, runbooks are for everyone at 3am: even experts make mistakes when exhausted. Runbooks also capture institutional knowledge before it walks out the door.

Why It Matters

Runbooks reduce mean time to recovery (MTTR) by giving on-call engineers a starting point — without them, each incident starts from scratch and relies on heroics.
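MTTR is simply the average of (resolved − detected) across incidents. A minimal sketch, with made-up incident timestamps:

```python
from datetime import datetime, timedelta

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery: average of (resolved - detected)."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)

# Two illustrative incidents: 45 min and 15 min to recover.
incidents = [
    (datetime(2026, 3, 1, 3, 0), datetime(2026, 3, 1, 3, 45)),
    (datetime(2026, 3, 9, 14, 0), datetime(2026, 3, 9, 14, 15)),
]
print(mttr(incidents))  # → 0:30:00
```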

Common Mistakes

  • No runbook for alerts — on-call has to figure it out every time.
  • Outdated runbooks — worse than none because they mislead.
  • Runbooks without diagnostic commands — 'check the database' isn't actionable.
  • No escalation path — on-call stuck without domain expert contact.
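The first mistake is easy to catch mechanically: walk your alert rules and flag any whose annotations lack a runbook link. A sketch over rules already parsed into dicts; the `runbook` annotation key follows the example in this entry, though many teams use `runbook_url`, so the check accepts either.

```python
def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Return names of alert rules with no runbook annotation."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations", {})
        if not annotations.get("runbook") and not annotations.get("runbook_url"):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing

# Illustrative rules: one with a runbook link, one without.
rules = [
    {"alert": "HighErrorRate",
     "annotations": {"runbook": "https://wiki/runbooks/high-error-rate"}},
    {"alert": "DiskAlmostFull",
     "annotations": {"summary": "Disk > 90%"}},
]
print(alerts_missing_runbook(rules))  # → ['DiskAlmostFull']
```

Wired into CI against your rules files, this prevents the "alert with no runbook" mistake from shipping at all.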

Code Examples

✗ Vulnerable
# Alert without runbook:
alert: HighErrorRate
expr: error_rate > 0.05
annotations:
  summary: Error rate high
  # No runbook link — on-call guesses what to do
✓ Fixed
alert: HighErrorRate
expr: error_rate > 0.05
annotations:
  summary: Error rate above 5%
  runbook: https://wiki/runbooks/high-error-rate
  dashboard: https://grafana/d/xyz
  severity: page

# Runbook content:
# 1. Check dashboard link above
# 2. grep 'ERROR' in logs: `kubectl logs -n prod -l app=api | grep ERROR | head -20`
# 3. Common causes: DB connection pool exhausted, upstream timeout
# 4. Rollback: `kubectl rollout undo deployment/api`
# 5. Escalate to: @backend-oncall in Slack

Added 23 Mar 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
Create runbook for every alert. Include: what it means, diagnostic commands, common causes, rollback procedure, escalation. Link runbook URL from alert annotations. Review quarterly.
📦 Applies To
web cli queue-worker
🔍 Detection Hints
Auto-detectable: ✗ No
🤖 AI Agent
Confidence: Low · False Positives: High · ✗ Manual fix · Fix: Low · Context: File
