On-Call Culture & Runbooks

DevOps Intermediate

debt(d7/e7/b7/t7)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints note that tools like PagerDuty and OpsGenie exist but automated detection is explicitly 'no'. The signs — missing runbooks, recurring incidents, burnout — only surface through careful operational observation over time (e.g. reviewing alert volumes, postmortem outcomes, attrition). No linter or static tool catches cultural debt; it manifests in production through repeated incidents and engineer exhaustion.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix suggests tracking and reducing toil, but the common_mistakes reveal systemic problems: missing runbooks to write, rotation imbalances to fix, compensation structures to change, and postmortem culture to reform. These changes span engineering teams, management, HR, and operational tooling, well beyond a single-component fix. This is a cultural and organizational refactor, not a code change.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). On-call culture applies across all engineers in web and cli contexts per applies_to. A poor on-call culture shapes every future hire, every system design decision (reliability targets, alert thresholds), every incident response, and every postmortem. It imposes a persistent, cross-team productivity and morale tax. It doesn't quite define the entire system's shape (b9) but it strongly influences how every operational change is made.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The canonical misconception is explicit: 'More alerts means better monitoring.' The correct behaviour contradicts a reasonable intuition: developers naturally assume that more monitoring signals mean more safety. In practice, alert fatigue means more alerts can reduce effective monitoring. This is a well-documented SRE gotcha that contradicts intuitions imported from adjacent practices (logging, metrics). It scores t7 rather than t9 because it is documented and increasingly well-known in the SRE community.

About DEBT scoring → scored by claude-sonnet-4-6 · 2026-05-11 · reviewed by human

Also Known As

on-call, PagerDuty, blameless postmortem, MTTR, MTTA

TL;DR

Sustainable on-call practices — fair rotation, blameless postmortems, actionable alerts, and well-maintained runbooks that reduce mean time to recovery and prevent burnout.

Explanation

Healthy on-call culture requires fair rotation (spread the load and compensate for on-call time), actionable alerts (every alert requires human action, with no noise), a runbook for every alert (the responder should never need to improvise), blameless postmortems (incidents are systemic failures, not individual failures), and time-boxed escalation (know when to escalate). Engineering metrics: MTTA (Mean Time to Acknowledge, target under 15 minutes), MTTR (Mean Time to Recover), and alert volume per on-call shift. Red flags: more than 10 alerts per night, the same incident recurring, and an on-call team burning out.
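The metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming each incident record carries three timestamps; the `Incident` class and its field names are hypothetical and not taken from PagerDuty or any other tool. MTTR is measured here from trigger to resolution, which is only one common convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real paging tools expose similar timestamps.
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mtta_minutes(incidents: list[Incident]) -> float:
    """Mean time from the alert firing to a human acknowledging it."""
    return mean((i.acknowledged - i.triggered).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time from the alert firing to recovery (one common definition)."""
    return mean((i.resolved - i.triggered).total_seconds() / 60 for i in incidents)


# Made-up sample data for illustration only.
t0 = datetime(2026, 1, 1, 2, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=90)),
]
mtta = mtta_minutes(incidents)  # (4 + 20) / 2
mttr = mttr_minutes(incidents)  # (30 + 90) / 2
print(f"MTTA {mtta:.1f} min (target < 15), MTTR {mttr:.1f} min")
```

Averages hide outliers: in the sample data one incident took 20 minutes to acknowledge even though the mean stays under the 15-minute target, so it may be worth reviewing the slowest incidents as well as the mean.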

Common Misconception

✗ More alerts means better monitoring — too many alerts cause fatigue and are ignored; the goal is the minimum number of high-signal alerts that each require human action.

Why It Matters

An on-call engineer who receives 50 alerts per night for two weeks burns out — sustainable on-call is a prerequisite for retaining experienced engineers.

Common Mistakes

No runbooks: responders improvise under pressure, which increases both MTTR and mistakes.
Same engineer on-call every week: leads to burnout and a bus factor of one.
No compensation for on-call time: engineers resent the additional burden.
Postmortems that blame individuals: systemic fixes prevent recurrence; blame does not.
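The rotation mistake in the list above is one of the few that can be checked mechanically. The following is a rough sketch, assuming a schedule is simply a list with one engineer name per week; the function names and the team members are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice


def weekly_rotation(engineers: list[str], weeks: int) -> list[str]:
    """Round-robin: one engineer per week, cycling through the whole team."""
    return list(islice(cycle(engineers), weeks))


def rotation_problems(schedule: list[str]) -> list[str]:
    """Flag back-to-back shifts and uneven load in a weekly schedule."""
    problems = []
    # Compare each week with the one before it; the first pair ends at week 2.
    for week, (prev, cur) in enumerate(zip(schedule, schedule[1:]), start=2):
        if prev == cur:
            problems.append(f"week {week}: {cur} is on-call back-to-back")
    # Only engineers who appear in the schedule are counted here.
    load = Counter(schedule)
    if max(load.values()) - min(load.values()) > 1:
        problems.append(f"uneven load: {dict(load)}")
    return problems


team = ["ana", "ben", "chloe", "dev", "eli"]  # hypothetical names
fair = weekly_rotation(team, 10)
unfair = ["ana", "ana", "ana", "ben"]
print(rotation_problems(fair))
print(rotation_problems(unfair))
```

A check like this catches schedule imbalance only. Compensation and postmortem tone still need human review.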

Code Examples

✗ Vulnerable

# Unsustainable on-call:
# Alert: CPU > 70% for 1 minute — pages on-call
# Alert: any 500 error — pages on-call
# Alert: disk > 80% — pages on-call
# Alert: memory > 75% — pages on-call
# Result: 40 pages per night, all noise
# MTTR for real incidents: 2 hours (fatigue + no runbooks)
# Engineer turnover: high

✓ Fixed

# Sustainable on-call:
# Alert criteria: user-impacting only, reviewed every 3 months to prune noise
# Every alert: links to runbook with exact diagnostic steps
# Rotation: weekly, 5 engineers, no back-to-back
# Compensation: 1 day off per on-call week
# Postmortem: every P1/P2, blameless, action items tracked
# MTTA: < 5 minutes | MTTR: < 30 minutes
# Alert volume: < 5 pages per shift
# Engineer turnover: low
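The contrast between the two examples above can be approximated with a small audit script. This is a sketch under stated assumptions: the `AlertRule` schema is invented for illustration and is not the format of PagerDuty, OpsGenie, or any monitoring system, and the runbook URL is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    # Hypothetical schema for illustration; not the format of any real tool.
    name: str
    pages_on_call: bool
    user_impacting: bool
    runbook_url: str = ""


def audit(rules: list[AlertRule]) -> dict[str, list[str]]:
    """Return, per paging rule, the reasons it should not page as defined."""
    findings: dict[str, list[str]] = {}
    for rule in rules:
        if not rule.pages_on_call:
            continue  # tickets and dashboards are out of scope here
        reasons = []
        if not rule.user_impacting:
            reasons.append("not user-impacting: demote to ticket or dashboard")
        if not rule.runbook_url:
            reasons.append("no runbook linked")
        if reasons:
            findings[rule.name] = reasons
    return findings


rules = [
    AlertRule("cpu_over_70_for_1m", pages_on_call=True, user_impacting=False),
    AlertRule("checkout_error_rate_high", True, True,
              "https://wiki.example.com/runbooks/checkout"),  # placeholder URL
    AlertRule("disk_over_80", pages_on_call=False, user_impacting=False),
]
findings = audit(rules)
for name, reasons in findings.items():
    print(name, "->", "; ".join(reasons))
```

Running a check of this kind during the periodic alert review is one possible way to keep noisy rules from creeping back in.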

Added 16 Mar 2026

Edited 22 Mar 2026

⚡ DEV INTEL Tools & Severity

🟡 Medium ⚙ Fix effort: High

⚡ Quick Fix

Track and reduce on-call toil (alerts that fire but need no action). If someone is paged more than 2-3 times per shift, the system needs fixing, not the person.
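One possible way to track toil is sketched below. It assumes every page is logged together with a flag recording whether it needed human action. The `Page` record and the `toil_report` function are hypothetical, and the default budget of 3 pages per shift is taken from the 2-3 figure in the quick fix.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical log entry: the shift it fired in and whether it needed action.
    shift: str
    alert: str
    actionable: bool


def toil_report(pages: list[Page], max_pages_per_shift: int = 3) -> dict[str, dict]:
    """Summarise page count and non-actionable (toil) share for each shift."""
    report: dict[str, dict] = {}
    for page in pages:
        entry = report.setdefault(page.shift, {"pages": 0, "toil": 0})
        entry["pages"] += 1
        if not page.actionable:
            entry["toil"] += 1
    for entry in report.values():
        entry["toil_ratio"] = entry["toil"] / entry["pages"]
        entry["over_budget"] = entry["pages"] > max_pages_per_shift
    return report


# Made-up sample data for illustration only.
pages = [
    Page("week-01", "cpu_high", False),
    Page("week-01", "cpu_high", False),
    Page("week-01", "cpu_high", False),
    Page("week-01", "checkout_down", True),
    Page("week-02", "checkout_down", True),
]
report = toil_report(pages)
for shift, stats in report.items():
    print(shift, stats)
```

A shift flagged as over budget with a high toil ratio points at the alert definitions rather than at the responder.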

📦 Applies To

any web cli

🔗 Prerequisites

Incident Response, Alert Fatigue, Blameless Post-Mortem

🔍 Detection Hints

No on-call rotation documentation; runbooks missing; same incidents recurring without follow-up; engineers burning out from overnight pages

Auto-detectable: ✗ No | Related tools: PagerDuty, OpsGenie

⚠ Related Problems

Alert Fatigue, Incident Response, Blameless Culture

🤖 AI Agent

Confidence: Low | False Positives: High | ✗ Manual fix | Fix: Medium | Context: File

References

https://sre.google/sre-book/being-on-call/