Blameless Post-Mortem
debt(d9/e5/b3/t7)
Closest to 'silent in production until users hit it' (d9), no automated tool detects missing or blame-laden postmortems; per detection_hints.automated: no, this is a process gap that only surfaces when incidents repeat or institutional knowledge is lost.
Closest to 'touches multiple files / significant refactor in one component' (e5), establishing blameless postmortem culture requires templates, scheduled reviews, action-item tracking in tools like Jira/Confluence, and cultural shift — not a one-line fix but contained within the ops/eng process layer.
Closest to 'localised tax' (b3), the practice imposes a recurring ceremony cost on the engineering/ops process but does not shape system architecture; applies_to web/cli broadly but the burden lives in team workflow.
Closest to 'serious trap' (t7), per misconception, the natural reading of 'postmortem' suggests finding who caused the incident, which directly contradicts the blameless intent; engineers reverting to blame produces self-censorship and surface fixes — the obvious approach is wrong.
Also Known As
TL;DR
Explanation
A blameless post-mortem (Google SRE practice) analyses incidents by assuming engineers made reasonable decisions given the information they had. The review documents: timeline, root cause(s), contributing factors, impact, and action items to prevent recurrence. Blame creates a culture where failures are hidden — blamelessness enables honest reporting and systemic improvement. A good post-mortem identifies the failure in the system (missing monitoring, lack of testing, unclear runbook) rather than the human who triggered it. Action items should have owners and deadlines; post-mortems are shared internally to spread learning.
Common Misconception
Why It Matters
Common Mistakes
- Blame-focused postmortems — engineers self-censor and hide information to avoid punishment.
- Postmortem action items with no owner or deadline — they are never completed.
- Only doing postmortems for severe incidents — smaller incidents often contain the same systemic lessons.
- Not sharing postmortems — other teams repeat the same mistakes because they never learned from yours.
Code Examples
# Blame-oriented postmortem (anti-pattern):
Incident: DB outage — 45 min downtime
Root cause: John ran DROP TABLE in production
Action: John received formal warning
# Missing: why was DROP TABLE possible in production?
# Missing: why was there no backup tested recently?
# Blameless version addresses system gaps, not individuals
# Post-mortem template (blameless — focus on systems, not people)
## Incident Summary
- **Date/Duration:** 2024-03-15 14:32–16:10 UTC (98 minutes)
- **Impact:** Checkout unavailable for ~40% of users
- **Severity:** SEV1
## Timeline
- 14:32 Alert: 5xx rate > 5% on /api/orders
- 14:35 On-call acknowledged, began investigation
- 14:50 Identified: new deploy introduced N+1 query → DB CPU 100%
- 15:00 Rolled back deployment → metrics recovering
- 16:10 Fully resolved
## Root Cause
Missing eager-load on order.items introduced in PR #1203.
## Action Items
- [ ] Add query count assertion to integration test suite
- [ ] Configure DB CPU alert threshold
- [ ] Add Debugbar query count check to PR checklist