Incident Response
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints note automated=no and the code_pattern is the absence of process artifacts (no on-call rotation, no runbooks, no severity definitions). The tools listed (pagerduty, opsgenie, statuspage, slack) are coordination tools, not detectors of the gap itself. The absence of an incident response process is discovered only when a real incident strikes and chaos ensues — there is no automated or static check that flags it.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix describes defining severity levels with SLAs, an on-call rotation, war room channels, and runbooks for the top 5 incidents (a sketch of such severity tiers follows these scoring notes). This is not a code change — it is an organizational and process initiative touching team structure, tooling configuration (PagerDuty/Opsgenie), documentation, and stakeholder communication contracts. It spans multiple teams and systems, putting it firmly at e7 rather than e5 (single component) or e9 (full architectural rework).
Closest to 'strong gravitational pull' (b7). The applies_to contexts are web and cli broadly, and the tags include team-process and reliability, signaling this is a cross-cutting operational concern. Without a defined incident response process, every production incident is handled ad-hoc, shaping how all engineering work is structured around on-call duties, postmortems, and runbook maintenance. The common_mistakes confirm ongoing drag: missing incident commanders, skipped postmortems, recurring incidents. Slightly below b9 because it does not fully redefine system architecture.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field explicitly states the canonical wrong belief: that incident response is just fixing the problem as fast as possible. This directly contradicts effective practice, which also requires stakeholder communication, evidence preservation, avoiding hasty fixes, and blameless retrospectives. The common_mistakes reinforce this: teams naturally optimize for speed of fix while ignoring coordination, root cause, communication cadence, and postmortem — all the systemic elements that prevent recurrence.
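As a concrete starting point for the quick_fix, severity tiers and their SLAs can live in code rather than only on a wiki. A minimal sketch, assuming PHP 8.1+; the tier names, SLA minutes, and paging rules are illustrative assumptions, not taken from the source:

<?php

// Hypothetical severity tiers with acknowledge SLAs (values illustrative).
enum Severity: string
{
    case P1 = 'P1'; // production down
    case P2 = 'P2'; // degraded, significant user impact
    case P3 = 'P3'; // minor impact, business-hours fix

    // Minutes within which the on-call must acknowledge.
    public function acknowledgeSlaMinutes(): int
    {
        return match ($this) {
            self::P1 => 5,
            self::P2 => 15,
            self::P3 => 240,
        };
    }

    // Only P1/P2 page the on-call; P3 goes to the team queue.
    public function pagesOnCall(): bool
    {
        return $this !== self::P3;
    }
}

Keeping the tiers in code gives alerting rules and dashboards one canonical definition to reference.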
Also Known As
TL;DR
Incident response is the defined process for detecting, triaging, mitigating, communicating about, and learning from production incidents. Without one, every incident is handled ad-hoc and the same failures recur.
Explanation
An incident response plan defines: detection (monitoring alerts, user reports), triage (assess severity/impact), containment (isolate affected systems, revoke credentials), eradication (patch root cause, remove attacker foothold), recovery (restore service from clean backups), and lessons learned (blameless post-mortem, improved monitoring). For PHP applications, this means having audit logs that survive the incident, the ability to roll back deployments, database snapshots, and clear escalation procedures. Practising incident response through game days and tabletop exercises reduces response time under real pressure.
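The 'audit logs that survive the incident' point is worth making concrete: logs written only to the application host can be lost to the very rollback or compromise under investigation. A minimal sketch that also hands each event to syslog so the daemon can keep an off-host copy; the file path and event shape are assumptions:

<?php

// Write audit events locally for fast grepping during triage, and to
// syslog so an off-host copy exists (evidence preservation).
function auditLog(string $event, array $context = []): void
{
    $line = json_encode(['ts' => date(DATE_ATOM), 'event' => $event, 'context' => $context]);
    file_put_contents('/var/log/app/audit.log', $line . PHP_EOL, FILE_APPEND | LOCK_EX);
    syslog(LOG_INFO, $line);
}

auditLog('deploy.rollback', ['release' => 'v2.4.1', 'actor' => 'on-call']);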
Diagram
flowchart TD
ALERT[Alert fires] --> TRIAGE[Triage<br/>what is broken?]
TRIAGE --> SEVERITY{Severity}
SEVERITY -->|P1 production down| PAGE[Page on-call<br/>start incident channel]
SEVERITY -->|P2 degraded| NOTIFY[Notify team<br/>monitor closely]
PAGE --> MITIGATE[Mitigate first<br/>rollback or feature flag]
MITIGATE --> COMMS[Customer comms<br/>status page update]
COMMS --> ROOT[Root cause analysis]
ROOT --> FIX[Permanent fix]
FIX --> POSTMORTEM[Blameless postmortem<br/>action items]
style PAGE fill:#f85149,color:#fff
style MITIGATE fill:#d29922,color:#fff
style POSTMORTEM fill:#238636,color:#fff
style FIX fill:#238636,color:#fff
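The MITIGATE step prefers a rollback or feature flag over a hand-written hotfix because both restore a known-good state. A sketch of such a kill switch, assuming the phpredis extension and a hypothetical flag key, so the flag can be flipped mid-incident without a deploy:

<?php

// Feature flag read from Redis: setting 'feature:checkout' to 'off'
// disables the code path immediately, no deploy required.
function checkoutEnabled(Redis $redis): bool
{
    return $redis->get('feature:checkout') !== 'off'; // absent key = enabled
}

// In the request path (maintenanceResponse() is a stand-in handler):
// if (!checkoutEnabled($redis)) { return maintenanceResponse(); }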
Common Misconception
That incident response is just fixing the problem as fast as possible. Effective response also covers stakeholder communication, evidence preservation, avoiding hasty fixes that destroy system state, and a blameless retrospective afterwards.
Why It Matters
Without a defined process, every production incident is handled ad-hoc: responders duplicate or undo each other's work, stakeholders escalate because nobody sends updates, and the same incidents recur because root causes are never addressed.
Common Mistakes
- No designated incident commander — everyone acts independently with no coordination.
- Fixing the symptom without finding the root cause — the incident recurs.
- No communication cadence during an incident — stakeholders escalate because they have no status updates (a minimal cadence helper is sketched after this list).
- Skipping the postmortem — without a postmortem, the same incident happens again.
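To address the cadence mistake, the update itself should be a one-liner that responders will actually run. A minimal sketch using a Slack incoming webhook, which accepts a JSON payload with a text field; the webhook URL is a placeholder:

<?php

// Post a status update to the incident channel via a Slack incoming webhook.
function postIncidentUpdate(string $webhookUrl, string $status): void
{
    $ch = curl_init($webhookUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode(['text' => $status]),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    curl_close($ch);
}

// Example cadence: repeat every 30 minutes until resolved.
postIncidentUpdate(
    'https://hooks.slack.com/services/XXX/YYY/ZZZ', // placeholder URL
    '10:30 - Still investigating elevated 500s on /checkout; rollback in progress.'
);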
Code Examples
// Incident response anti-pattern:
// 10:00 - Alert fires, three engineers start investigating independently
// 10:15 - Each has a different theory, all making changes simultaneously
// 10:30 - Changes conflict, system state unknown
// 10:45 - Fixed accidentally when one engineer reverted their change
// 11:00 - No postmortem scheduled — 'we'll remember not to do that again'
# Incident response checklist (P2 — significant user impact)
# 1. DETECT
# Alert fires in PagerDuty/Opsgenie → on-call acknowledges
# 2. COMMUNICATE
# Post in #incidents Slack: "Investigating elevated 500s on /checkout"
# Update status page: "Investigating"
# 3. CONTAIN
# Roll back last deployment if correlated: kubectl rollout undo deployment/<name>
# Toggle off feature flag if related
# 4. DIAGNOSE
# Check: error rates, latency graphs, DB slow query log, recent deploys
# Distributed trace: find the slow/failing span
# 5. RESOLVE
# Apply fix, verify metrics return to baseline
# 6. POST-MORTEM (within 48h)
# Timeline, root cause, action items — blameless
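Step 5's "verify metrics return to baseline" can be partially scripted. A hypothetical smoke check that confirms the affected endpoint has stopped returning 5xx before the incident is declared resolved; the URL, sample size, and the choice of HEAD requests are all assumptions:

<?php

// Hit the affected endpoint repeatedly and count 5xx responses.
$url = 'https://example.com/checkout'; // placeholder endpoint
$failures = 0;
for ($i = 0; $i < 10; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD is enough here
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) >= 500) {
        $failures++;
    }
    curl_close($ch);
    sleep(1);
}
echo $failures === 0 ? "Baseline restored\n" : "$failures/10 requests still failing\n";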