Chaos Engineering
debt(d9/e7/b5/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints note 'automated: no' and the code_pattern is 'No failure injection testing; resilience only verified when real incidents happen; unknown failure modes.' The absence of chaos engineering is entirely invisible — no compiler, linter, or SAST tool flags it. Tools like chaos-monkey, toxiproxy, pumba, and chaos-toolkit must be deliberately adopted; nothing in the normal development pipeline warns you that resilience is unverified. You only discover the gap when a real incident hits production.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix describes a starting point (kill one pod in staging, verify circuit breaker), but this is just the entry point. Properly adopting chaos engineering requires establishing observability infrastructure, defining hypotheses and blast radii, setting up abort conditions, integrating experiments into CI/CD pipelines, and evolving experiments as the system changes. The common_mistakes list (no hypothesis, no metrics, no abort conditions, not ongoing) shows this is a cross-cutting operational and architectural concern that touches monitoring, deployment pipelines, and resilience patterns across the codebase. It is not architectural rework of the application itself, but it is a significant cross-cutting investment.
Closest to 'persistent productivity tax' (b5). Once adopted, chaos engineering imposes an ongoing operational discipline — every new service, dependency, or deployment must be considered for experiment coverage. The common_mistake 'treating chaos engineering as a one-time exercise' confirms it must be maintained as the system evolves. It applies to web and API contexts broadly. However, it does not define the system's shape or force every individual code change to account for it, so it sits at b5 rather than b7.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception is explicit and significant: most developers hear 'chaos engineering' and think it means randomly breaking production systems to find weaknesses. The actual practice is a disciplined scientific method — hypothesis, controlled blast radius, specific failure injection, measurement, and improvement. This directly contradicts the intuitive reading of 'chaos,' which implies randomness and disorder. A competent developer unfamiliar with the term will confidently guess wrong, potentially causing real outages by running uncontrolled experiments, which maps to the common_mistakes about no hypothesis, no abort conditions, and skipping staging.
Also Known As
TL;DR
Explanation
Chaos engineering, popularised by Netflix with Chaos Monkey, proactively tests system resilience by introducing controlled failures: terminating random instances, adding network latency, exhausting CPU or memory, or cutting off a downstream dependency. The approach is hypothesis-driven: define steady-state metrics, hypothesise that the system will maintain them under a given failure, inject the failure, and observe. Deviations from steady state reveal weaknesses. Start in non-production, then graduate to production during low-traffic hours. Example experiments for a PHP application: kill a PHP-FPM worker pool, simulate Redis unavailability, or introduce 500ms of MySQL latency. Tools: Chaos Monkey (Netflix, originally built for AWS), Chaos Toolkit, LitmusChaos (Kubernetes), Gremlin.
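The hypothesis-driven loop described above can be sketched in a few lines. This is a minimal illustration, not a real chaos tool: `run_experiment`, `ExperimentResult`, and the `ToySystem` stand-in are hypothetical names, and the "metric" is a made-up success rate rather than a value read from real monitoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    under_failure: float
    hypothesis_held: bool


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    """Measure steady state, inject a failure, observe, compare, always roll back."""
    baseline = measure()
    try:
        inject()
        under_failure = measure()
    finally:
        rollback()  # restore the system even if injection or measurement raised
    held = abs(under_failure - baseline) <= tolerance
    return ExperimentResult(baseline, under_failure, held)


class ToySystem:
    """Hypothetical system whose success rate drops when its dependency is down."""

    def __init__(self, has_fallback: bool) -> None:
        self.has_fallback = has_fallback
        self.dependency_up = True

    def success_rate(self) -> float:
        if self.dependency_up or self.has_fallback:
            return 1.0
        return 0.4  # assumed degraded value, for illustration only

    def kill_dependency(self) -> None:
        self.dependency_up = False

    def restore_dependency(self) -> None:
        self.dependency_up = True
```

With `ToySystem(has_fallback=False)` the hypothesis fails, which is the useful outcome: the experiment has exposed a weakness before a real incident did.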
Diagram
flowchart TD
HYPO[Define hypothesis<br/>System stays healthy<br/>when service X fails] -->
BASELINE[Measure steady state<br/>normal metrics] -->
INJECT[Inject failure<br/>kill instance, add latency<br/>fill disk] -->
OBSERVE[Observe<br/>metrics, alerts, user impact] -->
COMPARE{Compare to<br/>steady state}
COMPARE -->|resilient| CONFIRM[Hypothesis confirmed<br/>system is robust]
COMPARE -->|degraded| FIND[Weakness found<br/>fix before real incident]
FIND --> IMPROVE[Improve resilience]
IMPROVE --> HYPO
style CONFIRM fill:#238636,color:#fff
style FIND fill:#f85149,color:#fff
style IMPROVE fill:#238636,color:#fff
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Running chaos experiments without a hypothesis and metrics — chaos without measurement is just breaking things.
- Starting with production before validating in staging — always start in a controlled environment.
- No abort conditions — running an experiment without a clear 'stop if X happens' risks real outages.
- Treating chaos engineering as a one-time exercise — it should be ongoing as the system evolves.
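The "stop if X happens" rule from the list above can be made concrete as a guard around the experiment. The sketch below is an assumption-laden illustration: `run_with_abort` is a hypothetical helper, and the error-rate samples would in practice come from a monitoring system rather than a list.

```python
from typing import Callable, Iterable


def run_with_abort(
    error_rates: Iterable[float],
    abort_threshold: float,
    rollback: Callable[[], None],
) -> str:
    """Watch error-rate samples during an experiment and abort on the first breach.

    The rollback runs whether the experiment completes or aborts.
    """
    try:
        for sample in error_rates:
            if sample > abort_threshold:
                return "aborted"
        return "completed"
    finally:
        rollback()
```

Defining the threshold before the experiment starts is the point: the decision to stop is made calmly in advance, not improvised while metrics are degrading.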
Avoid When
- Do not run chaos experiments in production without a game day plan, rollback procedure, and an on-call engineer monitoring live.
- Avoid chaos engineering on systems without circuit breakers, retries, or graceful degradation — you will just cause outages, not learn from them.
- Do not run experiments during high-traffic periods or near business-critical events — the blast radius is unpredictable.
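For readers unsure what a circuit breaker with graceful degradation looks like, here is a minimal sketch of the pattern. It is not taken from any particular library: `CircuitBreaker`, `max_failures`, and `reset_after` are hypothetical names, and a production breaker would also need thread safety and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int,
        reset_after: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast without touching the dependency
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()  # graceful degradation
        self.failures = 0  # a success closes the breaker again
        return result
```

A chaos experiment against a system with this in place tests something specific: whether the breaker actually trips and whether the fallback is acceptable to users.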
When To Use
- Run chaos experiments after you have basic observability (metrics, tracing, alerting) — without it you cannot tell whether an experiment revealed a real weakness.
- Start with known-safe failure modes in staging before targeting production: kill one replica, throttle a dependency, introduce latency.
- Use chaos to validate specific resilience hypotheses: "the system stays available when service X is unavailable" — not just to break things randomly.
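One of the known-safe failure modes listed above, introducing latency, can be approximated in application code with a small wrapper. This is a sketch under stated assumptions: `with_latency` is a hypothetical helper, and real experiments more often inject latency at the network layer with a dedicated tool.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_latency(
    operation: Callable[[], T],
    delay_seconds: float,
    probability: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Callable[[], T]:
    """Wrap a dependency call so that a fraction of calls are delayed."""

    def wrapped() -> T:
        if rng() < probability:
            sleep(delay_seconds)  # the injected failure: added latency
        return operation()

    return wrapped
```

The `probability` parameter is one way to keep the blast radius small: start by delaying a small share of calls and widen only once the system has been shown to cope.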
Code Examples
// No resilience testing: assumes dependencies are always reliable.
// (Method of a class that holds $this->userService.)
function getUser(int $id): User {
    // No timeout set, no fallback, no circuit breaker.
    // Chaos test: kill user-service → this call blocks and the entire app hangs.
    return $this->userService->fetch($id); // What if this times out?
}
# Chaos engineering — deliberately break things in controlled conditions
# to find weaknesses before they cause incidents
# Principles:
# 1. Define steady state (normal behaviour metrics)
# 2. Hypothesise: 'If we kill one DB, orders still process from replica'
# 3. Run experiment in staging first, then low-traffic production
# 4. Minimise blast radius: start small, avoid peak traffic, have engineers ready to abort
# PHP/infrastructure experiments:
# - Kill one Redis node — does session failover work?
# - Slow DB queries by 3s — does the circuit breaker trip?
# - Fill disk — does app fail gracefully or corrupt data?
# - Simulate OOM on PHP-FPM worker — does it restart cleanly?
# Tools: Gremlin, AWS Fault Injection Simulator, Chaos Monkey
# Chaos Toolkit (open source):
$ pip install chaostoolkit
$ chaos run experiments/kill-redis.yaml
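The contents of an experiment file are not shown above, so the following is only a sketch of what a Redis experiment might contain. It builds the experiment as a Python dictionary and serialises it to JSON, which Chaos Toolkit accepts alongside YAML. The field names follow the Chaos Toolkit experiment format as I understand it and should be checked against the official documentation; the health URL, container name, and `docker` commands are assumptions for illustration.

```python
import json


def build_kill_redis_experiment(health_url: str, container: str) -> dict:
    """Build a Chaos Toolkit style experiment: hypothesis, method, rollbacks."""
    return {
        "version": "1.0.0",
        "title": "App stays available when Redis is stopped",
        "description": "Stop the Redis container and check the health endpoint.",
        "steady-state-hypothesis": {
            "title": "Health endpoint returns 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "app-responds",
                    "tolerance": 200,  # expected HTTP status
                    "provider": {"type": "http", "url": health_url, "timeout": 3},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "stop-redis",
                "provider": {
                    "type": "process",
                    "path": "docker",  # assumes Redis runs in a local container
                    "arguments": f"stop {container}",
                },
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "start-redis",
                "provider": {
                    "type": "process",
                    "path": "docker",
                    "arguments": f"start {container}",
                },
            }
        ],
    }


experiment_json = json.dumps(
    build_kill_redis_experiment("http://localhost:8080/health", "redis"), indent=2
)
```

Written to a file such as `experiments/kill-redis.json`, the output could be passed to `chaos run` in the same way as the YAML file above. The rollback section matters as much as the method: it is what restores Redis when the experiment ends.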