On-Call & Runbooks

Q: How do I fix On-Call & Runbooks?

Create a runbook for every alert. Include what the alert means, diagnostic commands, common causes, a rollback procedure, and an escalation path. Link the runbook URL from the alert's annotations. Review every runbook quarterly.

observability Beginner

TL;DR

A runbook documents how to diagnose and resolve a specific alert, so on-call engineers don't have to work it out from scratch at 3am.

Explanation

A runbook is a step-by-step guide for a single alert. It covers what the alert means, what to check first, common causes, diagnostic commands, the escalation path, and the rollback procedure. It should also include expected steady-state values, links to dashboards, known false positives, and who to escalate to for domain-specific issues. Treat it as a living document and update it after every incident.

On-call practices that go with runbooks: a primary and secondary rotation; an escalation policy (for example, no response within 15 minutes escalates to the secondary); alert-fatigue reviews that aim for fewer than 5 pages per week; and post-incident reviews (PIRs, also called blameless postmortems).

Runbooks are commonly hosted on a wiki, in PagerDuty runbooks, or in Confluence.
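
The escalation rule described above (no response within 15 minutes pages the secondary) can be sketched in a few lines of Python. This is only an illustration of the logic, not the API of any paging tool: `who_to_page`, `ESCALATION_TIMEOUT`, and the rotation names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: 15 minutes, taken from the escalation policy described above.
ESCALATION_TIMEOUT = timedelta(minutes=15)


def who_to_page(
    fired_at: datetime,
    now: datetime,
    acked_at: Optional[datetime] = None,
    primary: str = "primary",
    secondary: str = "secondary",
) -> Optional[str]:
    """Return who should be paged right now, or None once the alert is acknowledged."""
    if acked_at is not None:
        return None  # someone owns the incident, so stop escalating
    if now - fired_at >= ESCALATION_TIMEOUT:
        return secondary  # primary did not respond in time
    return primary


fired = datetime(2024, 1, 1, 3, 0)
print(who_to_page(fired, fired + timedelta(minutes=5)))   # primary
print(who_to_page(fired, fired + timedelta(minutes=20)))  # secondary
```

In practice the paging tool usually implements this rule; the sketch is mainly useful for stating the policy unambiguously.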

Common Misconception

✗ "Runbooks are for junior engineers." Runbooks are for everyone at 3am: even experts make mistakes when exhausted. Runbooks also capture institutional knowledge before it walks out the door.

Why It Matters

Runbooks reduce mean time to recovery (MTTR) by giving on-call engineers a starting point — without them, each incident starts from scratch and relies on heroics.
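
As a rough illustration of the metric, MTTR can be computed as the average time from detection to resolution across incidents (one common convention; teams define the start and end points differently). The function name `mttr` and the incident timestamps below are made up for the example.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery: average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Made-up data: two incidents lasting 30 and 90 minutes.
incidents = [
    (datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 30)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 15, 30)),
]
print(mttr(incidents))  # 1:00:00
```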

Common Mistakes

No runbook for alerts — on-call has to figure it out every time.
Outdated runbooks — worse than none because they mislead.
Runbooks without diagnostic commands — 'check the database' isn't actionable.
No escalation path — on-call stuck without domain expert contact.
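
Several of the mistakes above can be caught mechanically. The sketch below checks one runbook record for missing diagnostic commands, a missing escalation path, and staleness. The `Runbook` fields and the 90-day threshold (standing in for a quarterly review) are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumption: a quarterly review cadence, approximated as 90 days.
MAX_AGE = timedelta(days=90)


@dataclass
class Runbook:
    alert: str
    last_reviewed: date
    diagnostic_commands: list[str] = field(default_factory=list)
    escalation_contact: str = ""


def runbook_problems(rb: Runbook, today: date) -> list[str]:
    """Return human-readable problems found in a single runbook."""
    problems = []
    if not rb.diagnostic_commands:
        problems.append("no diagnostic commands")
    if not rb.escalation_contact:
        problems.append("no escalation path")
    if today - rb.last_reviewed > MAX_AGE:
        problems.append("not reviewed in the last 90 days")
    return problems


rb = Runbook(alert="HighErrorRate", last_reviewed=date(2024, 1, 1))
for problem in runbook_problems(rb, today=date(2024, 6, 1)):
    print(f"{rb.alert}: {problem}")
```

A check like this cannot tell whether the commands in a runbook still work, so it complements the post-incident update rather than replacing it.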

Code Examples

✗ Vulnerable

# Alert without runbook:
alert: HighErrorRate
expr: error_rate > 0.05
annotations:
  summary: Error rate high
  # No runbook link — on-call guesses what to do

✓ Fixed

alert: HighErrorRate
expr: error_rate > 0.05
labels:
  severity: page  # a label, so alert routing can match on it
annotations:
  summary: Error rate above 5%
  runbook: https://wiki/runbooks/high-error-rate
  dashboard: https://grafana/d/xyz

# Runbook content:
# 1. Check the dashboard link above
# 2. Look for errors in the logs: `kubectl logs -n prod -l app=api | grep ERROR | head -20`
# 3. Common causes: DB connection pool exhausted, upstream timeout
# 4. Rollback: `kubectl rollout undo deployment/api -n prod`
# 5. Escalate to: @backend-oncall in Slack
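
One way to keep runbook links from going missing is to check the alert rules automatically, for example in CI. The sketch below works on rules that have already been parsed into dictionaries (loading the YAML is left out) and assumes the annotation key is `runbook`, as in the example above. `alerts_missing_runbook` and the second alert are hypothetical.

```python
def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alert rules that lack a non-empty `runbook` annotation."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        if not annotations.get("runbook"):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


rules = [
    {
        "alert": "HighErrorRate",
        "expr": "error_rate > 0.05",
        "annotations": {
            "summary": "Error rate above 5%",
            "runbook": "https://wiki/runbooks/high-error-rate",
        },
    },
    {
        # Hypothetical second alert with no runbook link.
        "alert": "DiskAlmostFull",
        "expr": "disk_used_ratio > 0.9",
        "annotations": {"summary": "Disk almost full"},
    },
]
print(alerts_missing_runbook(rules))  # ['DiskAlmostFull']
```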
