On-Call Culture & Runbooks

DevOps Intermediate
debt(d7/e7/b7/t7)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints note that tools like PagerDuty and OpsGenie exist but automated detection is explicitly 'no'. The signs — missing runbooks, recurring incidents, burnout — only surface through careful operational observation over time (e.g. reviewing alert volumes, postmortem outcomes, attrition). No linter or static tool catches cultural debt; it manifests in production through repeated incidents and engineer exhaustion.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix suggests tracking and reducing toil, but the common_mistakes reveal systemic problems: missing runbooks to write, rotation imbalances to fix, compensation structures to change, and postmortem culture to reform. These changes span engineering teams, management, HR, and operational tooling, well beyond a single-component fix. This is a cultural and organizational refactor, not a code change.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). On-call culture applies across all engineers in web and cli contexts per applies_to. A poor on-call culture shapes every future hire, every system design decision (reliability targets, alert thresholds), every incident response, and every postmortem. It imposes a persistent, cross-team productivity and morale tax. It doesn't quite define the entire system's shape (b9) but it strongly influences how every operational change is made.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The canonical misconception is explicit: 'More alerts means better monitoring.' The misconception matches a reasonable intuition: developers naturally assume that more monitoring signals mean more safety. In practice, alert fatigue means that more alerts can reduce effective monitoring, so the correct behaviour contradicts intuitions imported from adjacent practices (logging, metrics). It scores t7 rather than t9 because the gotcha is documented and increasingly well known in the SRE community.

Also Known As

on-call, PagerDuty, blameless postmortem, MTTR, MTTA

TL;DR

Sustainable on-call practices — fair rotation, blameless postmortems, actionable alerts, and well-maintained runbooks that reduce mean time to recovery and prevent burnout.

Explanation

Healthy on-call culture requires: fair rotation (spread the load and compensate for on-call time), actionable alerts (every alert requires human action; anything else is noise), a runbook for every alert (the responder should never need to improvise), blameless postmortems (incidents are systemic failures, not individual failures), and time-boxed escalation (responders know when to escalate). Engineering metrics: MTTA (Mean Time to Acknowledge, target < 15 minutes), MTTR (Mean Time to Recover), and alert volume per on-call shift. Red flags: more than 10 alerts per night, the same incident recurring, and an on-call team burning out.
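
The metrics above can be computed from a simple page log. The following is a minimal sketch, not a definitive implementation: the `Page` record shape, the `shift_report` helper, and the choice to measure recovery time from the moment the alert fired are all assumptions made for illustration. The 15-minute MTTA target and the 10-page limit come from the text above. An empty page list is not handled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Page:
    """One page sent to the on-call engineer (hypothetical record shape)."""
    fired: datetime
    acknowledged: datetime
    resolved: datetime


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def shift_report(pages, mtta_target=timedelta(minutes=15), max_pages=10):
    """Summarise one shift and list the red flags it trips."""
    mtta = mean_delta([p.acknowledged - p.fired for p in pages])
    # Assumption: recovery time is measured from when the alert fired.
    mttr = mean_delta([p.resolved - p.fired for p in pages])
    red_flags = []
    if mtta >= mtta_target:
        red_flags.append("MTTA at or above target")
    if len(pages) > max_pages:
        red_flags.append("too many pages in one shift")
    return {"mtta": mtta, "mttr": mttr, "pages": len(pages), "red_flags": red_flags}


# Illustrative data only: two pages during one night shift.
start = datetime(2026, 3, 16, 22, 0)
pages = [
    Page(start, start + timedelta(minutes=4), start + timedelta(minutes=30)),
    Page(start + timedelta(hours=1),
         start + timedelta(hours=1, minutes=6),
         start + timedelta(hours=1, minutes=50)),
]
report = shift_report(pages)
print(report["mtta"], report["mttr"], report["red_flags"])
```

Tracking these numbers per shift rather than per month makes it easier to see which rotations are absorbing most of the load.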

Common Misconception

More alerts means better monitoring — too many alerts cause fatigue and are ignored; the goal is the minimum number of high-signal alerts that each require human action.
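
One way to find low-signal alerts is to record, for each page, whether the responder actually had to do something. The sketch below assumes a hypothetical log of `(alert_name, needed_action)` tuples and an arbitrary 50% cut-off; neither comes from a specific tool, and the alert names are invented for illustration.

```python
from collections import Counter


def actionable_ratio(alert_log):
    """Return, per alert name, the fraction of firings that needed human action."""
    fired = Counter()
    actionable = Counter()
    for name, needed_action in alert_log:
        fired[name] += 1
        if needed_action:
            actionable[name] += 1
    return {name: actionable[name] / fired[name] for name in fired}


def prune_candidates(alert_log, min_ratio=0.5):
    """List alerts whose actionable ratio is below min_ratio (assumed threshold)."""
    ratios = actionable_ratio(alert_log)
    return sorted(name for name, ratio in ratios.items() if ratio < min_ratio)


# Illustrative log: (alert name, did the responder need to act?)
log = [
    ("cpu_above_70", False),
    ("cpu_above_70", False),
    ("cpu_above_70", False),
    ("checkout_error_rate", True),
    ("checkout_error_rate", True),
    ("disk_above_80", False),
    ("disk_above_80", True),
]
candidates = prune_candidates(log)
print(candidates)
```

Alerts on the candidate list are not necessarily wrong to collect; they may simply belong on a dashboard or in a ticket queue rather than in a page.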

Why It Matters

An on-call engineer who receives 50 alerts per night for two weeks burns out — sustainable on-call is a prerequisite for retaining experienced engineers.

Common Mistakes

  • No runbooks — responders improvise under pressure, increasing MTTR and mistakes.
  • Same engineer on-call every week — burnout and bus factor.
  • No compensation for on-call time — engineers resent the additional burden.
  • Postmortems that blame individuals — systemic fixes prevent recurrence; blame does not.
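
The rotation mistake in particular is easy to check mechanically. The following sketch builds a plain round-robin weekly schedule and verifies that nobody is on call two weeks in a row. The function names and engineer names are hypothetical, and real schedules would also need to account for holidays, swaps, and time zones.

```python
from collections import Counter
from itertools import cycle, islice


def weekly_rotation(engineers, weeks):
    """Assign one engineer per week in round-robin order."""
    if len(engineers) < 2 or len(set(engineers)) != len(engineers):
        raise ValueError("need at least two distinct engineers")
    return list(islice(cycle(engineers), weeks))


def has_back_to_back(schedule):
    """True if any engineer is on call in two consecutive weeks."""
    return any(a == b for a, b in zip(schedule, schedule[1:]))


def load_per_engineer(schedule):
    """Count on-call weeks per engineer to spot imbalances."""
    return Counter(schedule)


# Illustrative team of five covering ten weeks.
team = ["ana", "ben", "chi", "dev", "eli"]
schedule = weekly_rotation(team, weeks=10)
print(schedule)
print(has_back_to_back(schedule), dict(load_per_engineer(schedule)))
```

The same two checks can be run against an exported schedule from whatever paging tool is in use, which turns "same engineer every week" from an anecdote into a number.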

Code Examples

✗ Vulnerable
# Unsustainable on-call:
# Alert: CPU > 70% for 1 minute — pages on-call
# Alert: any 500 error — pages on-call
# Alert: disk > 80% — pages on-call
# Alert: memory > 75% — pages on-call
# Result: 40 pages per night, all noise
# MTTR for real incidents: 2 hours (fatigue + no runbooks)
# Engineer turnover: high
✓ Fixed
# Sustainable on-call:
# Alert criteria: user-impacting only, reviewed every 3 months to prune noise
# Every alert: links to runbook with exact diagnostic steps
# Rotation: weekly, 5 engineers, no back-to-back
# Compensation: 1 day off per on-call week
# Postmortem: every P1/P2, blameless, action items tracked
# MTTA: < 5 minutes | MTTR: < 30 minutes
# Alert volume: < 5 pages per shift
# Engineer turnover: low
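
The first two rules of the fixed setup (user-impacting alerts only, every alert linked to a runbook) can be enforced with a small review script. This is a sketch under stated assumptions: `AlertRule` is a hypothetical in-house record, not the schema of PagerDuty, OpsGenie, or any other tool, and the runbook URL is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """A paging alert definition (hypothetical shape, for illustration)."""
    name: str
    user_impacting: bool
    runbook_url: str = ""


def review(rules):
    """Return a list of human-readable problems found in the paging rules."""
    problems = []
    for rule in rules:
        if not rule.user_impacting:
            problems.append(f"{rule.name}: not user-impacting, should not page")
        if not rule.runbook_url:
            problems.append(f"{rule.name}: no runbook linked")
    return problems


# Illustrative rules; the URL is a placeholder, not a real runbook.
rules = [
    AlertRule("checkout_error_rate", True, "https://runbooks.example.com/checkout-errors"),
    AlertRule("cpu_above_70", False),
]
problems = review(rules)
for line in problems:
    print(line)
```

Running a check like this in the periodic alert review keeps new alerts from reaching the pager without a runbook.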

Added 16 Mar 2026
Edited 22 Mar 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: High
⚡ Quick Fix
Track and reduce on-call toil (alerts that fire but don't need action). If someone is paged more than 2-3 times per shift, the system needs fixing, not the person.
📦 Applies To
any web cli
🔗 Prerequisites
🔍 Detection Hints
No on-call rotation documentation; runbooks missing; same incidents recurring without follow-up; engineers burning out from overnight pages
Auto-detectable: ✗ No · Tools: PagerDuty, OpsGenie
⚠ Related Problems
🤖 AI Agent
Confidence: Low · False Positives: High · ✗ Manual fix · Fix: Medium · Context: File

