On-Call Culture & Runbooks
debt(d7/e7/b7/t7)
Closest to 'only careful code review or runtime testing' (d7). The detection_hints note that tools like PagerDuty and OpsGenie exist but automated detection is explicitly 'no'. The signs — missing runbooks, recurring incidents, burnout — only surface through careful operational observation over time (e.g. reviewing alert volumes, postmortem outcomes, attrition). No linter or static tool catches cultural debt; it manifests in production through repeated incidents and engineer exhaustion.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix suggests tracking and reducing toil, but the common_mistakes reveal systemic problems: no runbooks to write, rotation imbalances to fix, compensation structures to change, and postmortem culture to reform. These changes span engineering teams, management, HR, and operational tooling — well beyond a single-component fix. This is a cultural and organizational refactor, not a code change.
Closest to 'strong gravitational pull' (b7). On-call culture applies across all engineers in web and cli contexts per applies_to. A poor on-call culture shapes every future hire, every system design decision (reliability targets, alert thresholds), every incident response, and every postmortem. It imposes a persistent, cross-team productivity and morale tax. It doesn't quite define the entire system's shape (b9) but it strongly influences how every operational change is made.
Closest to 'serious trap' (t7). The canonical misconception is explicit: 'More alerts means better monitoring.' This directly contradicts reasonable intuition — developers naturally assume more monitoring signals more safety. In practice, alert fatigue means more alerts can reduce effective monitoring. This is a well-documented SRE gotcha that contradicts intuitions imported from adjacent practices (logging, metrics), scoring it at t7 rather than t9 because it is documented and increasingly well-known in the SRE community.
Also Known As
TL;DR
Explanation
Healthy on-call culture requires: fair rotation (spread the load, compensate for on-call time), actionable alerts (every alert requires human action — no noise), runbooks for every alert (responder should never need to improvise), blameless postmortems (incidents are systemic failures, not individual failures), and time-boxed escalation (know when to escalate). Engineering metrics: MTTA (Mean Time to Acknowledge < 15 minutes), MTTR (Mean Time to Recover), alert volume per on-call shift. Red flags: >10 alerts per night, same incident recurring, on-call team burning out.
Common Misconception
Why It Matters
Common Mistakes
- No runbooks — responders improvise under pressure, increasing MTTR and mistakes.
- Same engineer on-call every week — burnout and bus factor.
- No compensation for on-call time — engineers resent the additional burden.
- Postmortems that blame individuals — systemic fixes prevent recurrence; blame does not.
Code Examples
# Unsustainable on-call:
# Alert: CPU > 70% for 1 minute — pages on-call
# Alert: any 500 error — pages on-call
# Alert: disk > 80% — pages on-call
# Alert: memory > 75% — pages on-call
# Result: 40 pages per night, all noise
# MTTR for real incidents: 2 hours (fatigue + no runbooks)
# Engineer turnover: high
# Sustainable on-call:
# Alert criteria: user-impacting only, 3-month review to prune noise
# Every alert: links to runbook with exact diagnostic steps
# Rotation: weekly, 5 engineers, no back-to-back
# Compensation: 1 day off per on-call week
# Postmortem: every P1/P2, blameless, action items tracked
# MTTA: < 5 minutes | MTTR: < 30 minutes
# Alert volume: < 5 pages per shift
# Engineer turnover: low