Chaos Engineering
Also Known As
chaos monkey
resilience testing
fault injection
TL;DR
Deliberately injecting failures into a production system to discover weaknesses before they cause unplanned outages.
Explanation
Chaos Engineering, popularised by Netflix's Chaos Monkey, proactively tests system resilience by introducing controlled failures: terminating random instances, adding network latency, exhausting CPU or memory, or cutting off a downstream dependency. The approach is hypothesis-driven: define steady-state metrics, hypothesise that the system will maintain them under a given failure, inject the failure, and observe. Any deviation from steady state reveals a weakness. Start in non-production environments, then graduate to production during low-traffic hours. Typical chaos experiments for a PHP application: kill a PHP-FPM worker pool, simulate Redis unavailability, or introduce 500ms of MySQL latency. Tools: Chaos Monkey (Netflix), Chaos Toolkit, LitmusChaos (Kubernetes), Gremlin.
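The hypothesis-driven loop maps naturally onto a Chaos Toolkit experiment file. A minimal sketch, assuming a Redis container named `redis` and a hypothetical health endpoint — service names and URLs are illustrative, not from the original:

```yaml
# Minimal Chaos Toolkit experiment (illustrative names/URLs)
version: 1.0.0
title: "App stays healthy when Redis is unavailable"
description: "Steady state: /health returns 200. Hypothesis: it still does with Redis stopped."
steady-state-hypothesis:
  title: "Health endpoint responds"
  probes:
    - type: probe
      name: app-responds-200
      tolerance: 200          # expected HTTP status
      provider:
        type: http
        url: "http://localhost:8080/health"
method:
  - type: action
    name: stop-redis
    provider:
      type: process
      path: docker
      arguments: ["stop", "redis"]
rollbacks:
  - type: action
    name: restart-redis
    provider:
      type: process
      path: docker
      arguments: ["start", "redis"]
```

The toolkit checks the steady-state hypothesis before and after the method runs, and always applies the rollbacks, which is the scientific-method loop described above in executable form.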
Diagram
flowchart TD
HYPO[Define hypothesis<br/>System stays healthy<br/>when service X fails] --> BASELINE[Measure steady state<br/>normal metrics]
BASELINE --> INJECT[Inject failure<br/>kill instance, add latency,<br/>fill disk]
INJECT --> OBSERVE[Observe<br/>metrics, alerts, user impact]
OBSERVE --> COMPARE{Compare to<br/>steady state}
COMPARE -->|resilient| CONFIRM[Hypothesis confirmed<br/>system is robust]
COMPARE -->|degraded| FIND[Weakness found<br/>fix before real incident]
FIND --> IMPROVE[Improve resilience]
IMPROVE --> HYPO
style CONFIRM fill:#238636,color:#fff
style FIND fill:#f85149,color:#fff
style IMPROVE fill:#238636,color:#fff
Watch Out
⚠ Chaos engineering without a defined steady-state hypothesis is just random sabotage — every experiment must state what normal behaviour looks like so you can measure whether the system maintained it.
Common Misconception
✗ Chaos engineering means randomly breaking production to find weaknesses. Chaos engineering follows a scientific method — formulate a hypothesis, define a blast radius, inject a specific failure, observe the system, and improve. Random destruction without hypothesis and measurement is just sabotage.
Why It Matters
Chaos engineering proactively injects failures into production systems to find weaknesses before real incidents do — a system whose failure modes have never been exercised will eventually fail in ways nobody has tested or planned for.
Common Mistakes
- Running chaos experiments without a hypothesis and metrics — chaos without measurement is just breaking things.
- Starting with production before validating in staging — always start in a controlled environment.
- No abort conditions — running an experiment without a clear 'stop if X happens' risks real outages.
- Treating chaos engineering as a one-time exercise — it should be ongoing as the system evolves.
Avoid When
- Do not run chaos experiments in production without a game day plan, rollback procedure, and an on-call engineer monitoring live.
- Avoid chaos engineering on systems without circuit breakers, retries, or graceful degradation — you will just cause outages, not learn from them.
- Do not run experiments during high-traffic periods or near business-critical events — the blast radius is unpredictable.
When To Use
- Run chaos experiments after you have basic observability (metrics, tracing, alerting) — without it you cannot tell whether an experiment revealed a real weakness.
- Start with known-safe failure modes in staging before targeting production: kill one replica, throttle a dependency, introduce latency.
- Use chaos to validate specific resilience hypotheses: "the system stays available when service X is unavailable" — not just to break things randomly.
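The discipline above — measure steady state, inject, observe, compare, with an explicit abort condition — can be sketched in a few lines of Python. The probe and injection callables are placeholders you would wire to real metrics and fault tooling; the thresholds are illustrative:

```python
def run_experiment(measure_steady_state, inject_failure, rollback,
                   tolerance=0.05, abort_error_rate=0.10):
    """Hypothesis-driven chaos experiment with an explicit abort condition.

    measure_steady_state() -> error rate between 0.0 and 1.0 (placeholder probe).
    inject_failure() / rollback() are placeholders for real fault injection.
    """
    baseline = measure_steady_state()          # 1. record steady state
    inject_failure()                           # 2. inject the specific failure
    try:
        observed = measure_steady_state()      # 3. observe under failure
        if observed >= abort_error_rate:       # abort condition: "stop if X happens"
            return "aborted: blast radius exceeded"
        if abs(observed - baseline) <= tolerance:  # 4. compare to steady state
            return "hypothesis confirmed"
        return "weakness found"
    finally:
        rollback()                             # always restore the system

# Usage: fake probes standing in for real metrics/fault tooling
print(run_experiment(lambda: 0.01, lambda: None, lambda: None))
# → hypothesis confirmed
```

The `finally` block matters: rollback must run whether the hypothesis holds, fails, or the experiment aborts.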
Code Examples
💡 Note
The bad code assumes its dependency is always reliable — no timeout, no fallback, no circuit breaker — so a chaos test that kills the service hangs the entire app; the good example shows the controlled, hypothesis-driven process that surfaces such weaknesses before a real incident does.
✗ Vulnerable
// No resilience testing — assuming dependencies are reliable:
function getUser(int $id): User {
return $this->userService->fetch($id); // What if this times out?
// No timeout set, no fallback, no circuit breaker
// Chaos test: kill user-service → entire app hangs
}
✓ Fixed
# Chaos engineering — deliberately break things in controlled conditions
# to find weaknesses before they cause incidents
# Principles:
# 1. Define steady state (normal behaviour metrics)
# 2. Hypothesise: 'If we kill one DB, orders still process from replica'
# 3. Run experiment in staging first, then low-traffic production
# 4. Minimise blast radius — start small and run while engineers are watching and ready to abort
# PHP/infrastructure experiments:
# - Kill one Redis node — does session failover work?
# - Slow DB queries by 3s — does the circuit breaker trip?
# - Fill disk — does app fail gracefully or corrupt data?
# - Simulate OOM on PHP-FPM worker — does it restart cleanly?
# Tools: Gremlin, AWS Fault Injection Simulator, Chaos Monkey
# Chaos Toolkit (open source):
$ pip install chaostoolkit
$ chaos run experiments/kill-redis.yaml
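The experiments above repeatedly ask "does the circuit breaker trip?" — the pattern itself can be sketched language-agnostically. A minimal sketch in Python (the class, thresholds, and fallback are illustrative, not a specific library):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N failures, retry after a cooldown."""
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # Open state: skip the dependency entirely until the cooldown elapses
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None          # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
            self.failures = 0              # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback()

# Usage: a chaos experiment that kills the dependency should now see
# fast fallbacks instead of cascading timeouts.
breaker = CircuitBreaker(max_failures=2, reset_timeout=60)
def flaky():
    raise TimeoutError("dependency down")
print(breaker.call(flaky, lambda: "cached user"))
# → cached user
```

This is exactly what a "kill the user-service" experiment verifies: after the breaker trips, calls return the fallback immediately rather than hanging the app.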
Added
15 Mar 2026
Edited
19 Apr 2026
Tools & Severity
🔵 Info
⚙ Fix effort: High
⚡ Quick Fix
Start small: randomly kill one non-critical pod in staging, verify the circuit breaker trips and the system degrades gracefully — document failures and fix them before going to production
📦 Applies To
PHP 5.0+
web
api
🔗 Prerequisites
🔍 Detection Hints
No failure injection testing; resilience only verified when real incidents happen; unknown failure modes
Auto-detectable:
✗ No
chaos-monkey
toxiproxy
pumba
chaos-toolkit
⚠ Related Problems
🤖 AI Agent
Confidence: Low
False Positives: High
✗ Manual fix
Fix: High
Context: File