Incident Response
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints note automated=no and the code_pattern is the absence of process artifacts (no on-call rotation, no runbooks, no severity definitions). The tools listed (pagerduty, opsgenie, statuspage, slack) are coordination tools, not detectors of the gap itself. The absence of an incident response process is discovered only when a real incident strikes and chaos ensues — there is no automated or static check that flags it.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix describes defining severity levels with SLAs, an on-call rotation, war room channels, and runbooks for the top 5 incidents (a sketch of such severity tiers follows these scoring notes). This is not a code change — it is an organizational and process initiative touching team structure, tooling configuration (PagerDuty/Opsgenie), documentation, and stakeholder communication contracts. It spans multiple teams and systems, putting it firmly at e7 rather than e5 (single component) or e9 (full architectural rework).
Closest to 'strong gravitational pull' (b7). The applies_to contexts are web and cli broadly, and the tags include team-process and reliability, signaling this is a cross-cutting operational concern. Without a defined incident response process, every production incident is handled ad-hoc, shaping how all engineering work is structured around on-call duties, postmortems, and runbook maintenance. The common_mistakes confirm ongoing drag: missing incident commanders, skipped postmortems, recurring incidents. Slightly below b9 because it does not fully redefine system architecture.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field explicitly states the canonical wrong belief: that incident response is just fixing the problem as fast as possible. This directly contradicts effective practice, which also requires stakeholder communication, evidence preservation, avoiding hasty fixes, and blameless retrospectives. The common_mistakes reinforce this: teams naturally optimize for speed of fix while ignoring coordination, root cause, communication cadence, and postmortem — all the systemic elements that prevent recurrence.
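As a concrete starting point for the quick_fix, severity tiers and their SLAs can live in code rather than only on a wiki. A minimal sketch, assuming PHP 8.1+; the tier names, SLA minutes, and paging rules are illustrative assumptions, not taken from the source:

<?php

// Hypothetical severity tiers with acknowledge SLAs (values illustrative).
enum Severity: string
{
    case P1 = 'P1'; // production down
    case P2 = 'P2'; // degraded, significant user impact
    case P3 = 'P3'; // minor impact, business-hours fix

    // Minutes within which the on-call must acknowledge.
    public function acknowledgeSlaMinutes(): int
    {
        return match ($this) {
            self::P1 => 5,
            self::P2 => 15,
            self::P3 => 240,
        };
    }

    // Only P1/P2 page the on-call; P3 goes to the team queue.
    public function pagesOnCall(): bool
    {
        return $this !== self::P3;
    }
}

Keeping the tiers in code gives alerting rules and dashboards one canonical definition to reference.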
Also Known As
TL;DR
Incident response is the defined process for detecting, triaging, mitigating, communicating about, and learning from production incidents. Without one, every incident is handled ad-hoc and the same failures recur.
Explanation
An incident response plan defines: detection (monitoring alerts, user reports), triage (assess severity/impact), containment (isolate affected systems, revoke credentials), eradication (patch root cause, remove attacker foothold), recovery (restore service from clean backups), and lessons learned (blameless post-mortem, improved monitoring). For PHP applications, this means having audit logs that survive the incident, the ability to roll back deployments, database snapshots, and clear escalation procedures. Practising incident response through game days and tabletop exercises reduces response time under real pressure.
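The 'audit logs that survive the incident' point is worth making concrete: logs written only to the application host can be lost to the very rollback or compromise under investigation. A minimal sketch that also hands each event to syslog so the daemon can keep an off-host copy; the file path and event shape are assumptions:

<?php

// Write audit events locally for fast grepping during triage, and to
// syslog so an off-host copy exists (evidence preservation).
function auditLog(string $event, array $context = []): void
{
    $line = json_encode(['ts' => date(DATE_ATOM), 'event' => $event, 'context' => $context]);
    file_put_contents('/var/log/app/audit.log', $line . PHP_EOL, FILE_APPEND | LOCK_EX);
    syslog(LOG_INFO, $line);
}

auditLog('deploy.rollback', ['release' => 'v2.4.1', 'actor' => 'on-call']);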
Diagram
flowchart TD
ALERT[Alert fires] --> TRIAGE[Triage<br/>what is broken?]
TRIAGE --> SEVERITY{Severity}
SEVERITY -->|P1 production down| PAGE[Page on-call<br/>start incident channel]
SEVERITY -->|P2 degraded| NOTIFY[Notify team<br/>monitor closely]
PAGE --> MITIGATE[Mitigate first<br/>rollback or feature flag]
MITIGATE --> COMMS[Customer comms<br/>status page update]
COMMS --> ROOT[Root cause analysis]
ROOT --> FIX[Permanent fix]
FIX --> POSTMORTEM[Blameless postmortem<br/>action items]
style PAGE fill:#f85149,color:#fff
style MITIGATE fill:#d29922,color:#fff
style POSTMORTEM fill:#238636,color:#fff
style FIX fill:#238636,color:#fff
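The MITIGATE step prefers a rollback or feature flag over a hand-written hotfix because both restore a known-good state. A sketch of such a kill switch, assuming the phpredis extension and a hypothetical flag key, so the flag can be flipped mid-incident without a deploy:

<?php

// Feature flag read from Redis: setting 'feature:checkout' to 'off'
// disables the code path immediately, no deploy required.
function checkoutEnabled(Redis $redis): bool
{
    return $redis->get('feature:checkout') !== 'off'; // absent key = enabled
}

// In the request path (maintenanceResponse() is a stand-in handler):
// if (!checkoutEnabled($redis)) { return maintenanceResponse(); }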
Common Misconception
That incident response is just fixing the problem as fast as possible. Effective response also covers stakeholder communication, evidence preservation, avoiding hasty fixes that destroy system state, and a blameless retrospective afterwards.
Why It Matters
Without a defined process, every production incident is handled ad-hoc: responders duplicate or undo each other's work, stakeholders escalate because nobody sends updates, and the same incidents recur because root causes are never addressed.
Common Mistakes
- No designated incident commander — everyone acts independently with no coordination.
- Fixing the symptom without finding the root cause — the incident recurs.
- No communication cadence during an incident — stakeholders escalate because they have no status updates (a minimal cadence helper is sketched after this list).
- Skipping the postmortem — without a postmortem, the same incident happens again.
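To address the cadence mistake, the update itself should be a one-liner that responders will actually run. A minimal sketch using a Slack incoming webhook, which accepts a JSON payload with a text field; the webhook URL is a placeholder:

<?php

// Post a status update to the incident channel via a Slack incoming webhook.
function postIncidentUpdate(string $webhookUrl, string $status): void
{
    $ch = curl_init($webhookUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode(['text' => $status]),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    curl_close($ch);
}

// Example cadence: repeat every 30 minutes until resolved.
postIncidentUpdate(
    'https://hooks.slack.com/services/XXX/YYY/ZZZ', // placeholder URL
    '10:30 - Still investigating elevated 500s on /checkout; rollback in progress.'
);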
Code Examples
// Incident response anti-pattern:
// 10:00 - Alert fires, three engineers start investigating independently
// 10:15 - Each has a different theory, all making changes simultaneously
// 10:30 - Changes conflict, system state unknown
// 10:45 - Fixed accidentally when one engineer reverted their change
// 11:00 - No postmortem scheduled — 'we'll remember not to do that again'
# Incident response checklist (P2 — significant user impact)
# 1. DETECT
# Alert fires in PagerDuty/Opsgenie → on-call acknowledges
# 2. COMMUNICATE
# Post in #incidents Slack: "Investigating elevated 500s on /checkout"
# Update status page: "Investigating"
# 3. CONTAIN
# Roll back last deployment if correlated: kubectl rollout undo deployment/<name>
# Toggle off feature flag if related
# 4. DIAGNOSE
# Check: error rates, latency graphs, DB slow query log, recent deploys
# Distributed trace: find the slow/failing span
# 5. RESOLVE
# Apply fix, verify metrics return to baseline
# 6. POST-MORTEM (within 48h)
# Timeline, root cause, action items — blameless
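Step 5's "verify metrics return to baseline" can be partially scripted. A hypothetical smoke check that confirms the affected endpoint has stopped returning 5xx before the incident is declared resolved; the URL, sample size, and the choice of HEAD requests are all assumptions:

<?php

// Hit the affected endpoint repeatedly and count 5xx responses.
$url = 'https://example.com/checkout'; // placeholder endpoint
$failures = 0;
for ($i = 0; $i < 10; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD is enough here
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) >= 500) {
        $failures++;
    }
    curl_close($ch);
    sleep(1);
}
echo $failures === 0 ? "Baseline restored\n" : "$failures/10 requests still failing\n";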