On-Call Culture & Runbooks
Also Known As
on-call
PagerDuty
blameless postmortem
MTTR
MTTA
TL;DR
Sustainable on-call practices: fair rotation, blameless postmortems, actionable alerts, and well-maintained runbooks. Together they reduce mean time to recovery and prevent burnout.
Explanation
A healthy on-call culture has five requirements: fair rotation (spread the load and compensate for on-call time), actionable alerts (every alert requires human action, with no noise), a runbook for every alert (the responder should never need to improvise), blameless postmortems (incidents are treated as systemic failures, not individual ones), and time-boxed escalation (responders know when to escalate). Metrics to track: MTTA (mean time to acknowledge, target under 15 minutes), MTTR (mean time to recover), and alert volume per on-call shift. Red flags: more than 10 alerts per night, the same incident recurring, and an on-call team that is burning out.
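To make the metrics above concrete, here is a minimal sketch of how MTTA and MTTR could be computed from incident timestamps. The `Incident` class and its field names are assumptions for illustration and do not come from any particular paging tool. MTTR is measured here from the moment the page fired, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One paged incident. Field names are illustrative, not from any tool."""
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time from the page firing to a human acknowledging it."""
    seconds = mean(
        (i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from the page firing to service recovery."""
    seconds = mean(
        (i.resolved_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


# Two made-up incidents that start at the same time.
t0 = datetime(2026, 3, 1, 2, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=50)),
]

print(mtta(incidents))  # 0:07:00
print(mttr(incidents))  # 0:40:00
print(mtta(incidents) < timedelta(minutes=15))  # True: within the MTTA target
```

In practice the timestamps would come from the paging tool's export. The sample data here is invented.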
Common Misconception
✗ More alerts means better monitoring — too many alerts cause fatigue and are ignored; the goal is the minimum number of high-signal alerts that each require human action.
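One way to find low-signal alerts is to record, for each page, whether the responder actually had to act, and then flag alerts that rarely did. The sketch below assumes a hypothetical page log of `(alert name, needed action)` pairs. The alert names and the 50% threshold are illustrative choices, not recommendations from the source.

```python
from collections import Counter

# Hypothetical page log: (alert name, whether the responder had to act).
pages = [
    ("cpu_high", False),
    ("cpu_high", False),
    ("cpu_high", False),
    ("checkout_error_rate", True),
    ("disk_usage", False),
    ("disk_usage", True),
]


def noisy_alerts(pages, min_actionable=0.5):
    """Return alerts whose share of actionable pages is below the threshold."""
    fired = Counter(name for name, _ in pages)
    acted = Counter(name for name, needed_action in pages if needed_action)
    return sorted(
        name
        for name, count in fired.items()
        if acted[name] / count < min_actionable
    )


print(noisy_alerts(pages))  # ['cpu_high']
```

Alerts returned by `noisy_alerts` are candidates for retuning, downgrading to a ticket, or deletion.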
Why It Matters
An on-call engineer who receives 50 alerts per night for two weeks burns out — sustainable on-call is a prerequisite for retaining experienced engineers.
Common Mistakes
- No runbooks: responders improvise under pressure, which increases both MTTR and the number of mistakes.
- Same engineer on-call every week: this causes burnout and leaves the team with a bus factor of one.
- No compensation for on-call time: engineers resent the additional burden.
- Postmortems that blame individuals: systemic fixes prevent recurrence; blame does not.
Code Examples
✗ Vulnerable
# Unsustainable on-call:
# Alert: CPU > 70% for 1 minute — pages on-call
# Alert: any 500 error — pages on-call
# Alert: disk > 80% — pages on-call
# Alert: memory > 75% — pages on-call
# Result: 40 pages per night, all noise
# MTTR for real incidents: 2 hours (fatigue + no runbooks)
# Engineer turnover: high
✓ Fixed
# Sustainable on-call:
# Alert criteria: user-impacting only, 3-month review to prune noise
# Every alert: links to runbook with exact diagnostic steps
# Rotation: weekly, 5 engineers, no back-to-back
# Compensation: 1 day off per on-call week
# Postmortem: every P1/P2, blameless, action items tracked
# MTTA: < 5 minutes | MTTR: < 30 minutes
# Alert volume: < 5 pages per shift
# Engineer turnover: low
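The weekly rotation described in the fixed example (5 engineers, no back-to-back weeks) can be sketched as a simple round-robin. The function name, the engineer names, and the start date are placeholders. A real schedule would also need to handle holidays, swaps, and time zones, which this sketch ignores.

```python
from datetime import date, timedelta


def weekly_rotation(
    engineers: list[str], start: date, weeks: int
) -> list[tuple[date, str]]:
    """Round-robin schedule: one engineer per week, cycling through the list."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to avoid back-to-back weeks")
    return [
        (start + timedelta(weeks=w), engineers[w % len(engineers)])
        for w in range(weeks)
    ]


team = ["ana", "ben", "chi", "dev", "eli"]  # placeholder names
schedule = weekly_rotation(team, date(2026, 3, 2), weeks=10)

for week_start, engineer in schedule[:3]:
    print(week_start, engineer)
# 2026-03-02 ana
# 2026-03-09 ben
# 2026-03-16 chi
```

With five engineers, each person is on call one week in five, and consecutive weeks always go to different people.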
Added
16 Mar 2026
Edited
22 Mar 2026
⚡
DEV INTEL
Tools & Severity
🟡 Medium
⚙ Fix effort: High
⚡ Quick Fix
Track and reduce on-call toil (alerts that fire but don't need action). If someone is paged more than 2-3 times per shift, the system needs fixing, not the person.
📦 Applies To
any
web
cli
🔍 Detection Hints
No on-call rotation documentation; runbooks missing; same incidents recurring without follow-up; engineers burning out from overnight pages
Auto-detectable:
✗ No
pagerduty
opsgenie
🤖 AI Agent
Confidence: Low
False Positives: High
✗ Manual fix
Fix: Medium
Context: File