Chaos Engineering

DevOps PHP 5.0+ Advanced
debt(d9/e7/b5/t7)
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). The detection_hints note 'automated: no' and the code_pattern is 'No failure injection testing; resilience only verified when real incidents happen; unknown failure modes.' The absence of chaos engineering is entirely invisible — no compiler, linter, or SAST tool flags it. Tools like chaos-monkey, toxiproxy, pumba, and chaos-toolkit must be deliberately adopted; nothing in the normal development pipeline warns you that resilience is unverified. You only discover the gap when a real incident hits production.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix describes a starting point (kill one pod in staging, verify circuit breaker), but this is just the entry point. Properly adopting chaos engineering requires establishing observability infrastructure, defining hypotheses and blast radii, setting up abort conditions, integrating experiments into CI/CD pipelines, and evolving experiments as the system changes. The common_mistakes list (no hypothesis, no metrics, no abort conditions, not ongoing) shows this is a cross-cutting operational and architectural concern that touches monitoring, deployment pipelines, and resilience patterns across the codebase. It is not architectural rework of the application itself, but it is a significant cross-cutting investment.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). Once adopted, chaos engineering imposes an ongoing operational discipline — every new service, dependency, or deployment must be considered for experiment coverage. The common_mistake 'treating chaos engineering as a one-time exercise' confirms it must be maintained as the system evolves. It applies to web and API contexts broadly. However, it does not define the system's shape or force every individual code change to account for it, so it sits at b5 rather than b7.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception is explicit and significant: most developers hear 'chaos engineering' and think it means randomly breaking production systems to find weaknesses. The actual practice is a disciplined scientific method — hypothesis, controlled blast radius, specific failure injection, measurement, and improvement. This directly contradicts the intuitive reading of 'chaos,' which implies randomness and disorder. A competent developer unfamiliar with the term will confidently guess wrong, potentially causing real outages by running uncontrolled experiments, which maps to the common_mistakes about no hypothesis, no abort conditions, and skipping staging.

Also Known As

chaos monkey, resilience testing, fault injection

TL;DR

Deliberately injecting controlled failures into a system, in staging first and then in production, to discover weaknesses before they cause unplanned outages.

Explanation

Chaos Engineering, popularised by Netflix with Chaos Monkey, proactively tests system resilience by introducing controlled failures: terminating random instances, adding network latency, exhausting CPU or memory, or cutting off a downstream dependency. The approach is hypothesis-driven: define steady-state metrics, hypothesise that the system will maintain them under a given failure, inject the failure, and observe. Deviations reveal weaknesses. Start in non-production, then graduate to production during low-traffic hours. Typical experiments for a PHP application: kill a PHP-FPM worker pool, simulate Redis unavailability, or introduce 500ms of MySQL latency. Tools: Chaos Monkey (Netflix, originally built for AWS), Chaos Toolkit, LitmusChaos (Kubernetes), Gremlin.
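
The hypothesis-driven loop described above can be sketched in a few lines of PHP. This is a minimal illustration, not a real tool: `runExperiment`, its parameters, and the simulated error rates are all assumptions made for the example, and it assumes PHP 7.1 or later. A real experiment would read the metric from your monitoring system and inject the failure through your infrastructure.

```php
<?php

/**
 * Minimal hypothesis-driven experiment loop (hypothetical helper).
 *
 * $measure  returns the steady-state metric, e.g. an error rate
 * $inject   starts the failure
 * $rollback stops the failure
 */
function runExperiment(
    callable $measure,
    callable $inject,
    callable $rollback,
    float $tolerance
): array {
    $baseline = $measure();          // 1. steady state before the failure

    $inject();                       // 2. inject the failure
    try {
        $observed = $measure();      // 3. observe the system under failure
    } finally {
        $rollback();                 // always undo the failure
    }

    $deviation = abs($observed - $baseline);

    return [
        'baseline'  => $baseline,
        'observed'  => $observed,
        'confirmed' => $deviation <= $tolerance, // 4. compare to steady state
    ];
}

// Simulated dependency: the error rate jumps when "Redis" is down.
$redisUp = true;
$errorRate = function () use (&$redisUp): float {
    return $redisUp ? 0.01 : 0.20;
};

$result = runExperiment(
    $errorRate,
    function () use (&$redisUp): void { $redisUp = false; },
    function () use (&$redisUp): void { $redisUp = true; },
    0.05
);
```

Here the simulated error rate moves well outside the tolerance, so `$result['confirmed']` is `false`: the hypothesis is rejected and a weakness has been found.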

Diagram

flowchart TD
    HYPO[Define hypothesis<br/>System stays healthy<br/>when service X fails] --> BASELINE[Measure steady state<br/>normal metrics]
    BASELINE --> INJECT[Inject failure<br/>kill instance, add latency<br/>fill disk]
    INJECT --> OBSERVE[Observe<br/>metrics, alerts, user impact]
    OBSERVE --> COMPARE{Compare to<br/>steady state}
    COMPARE -->|resilient| CONFIRM[Hypothesis confirmed<br/>system is robust]
    COMPARE -->|degraded| FIND[Weakness found<br/>fix before real incident]
    FIND --> IMPROVE[Improve resilience]
    IMPROVE --> HYPO
    style CONFIRM fill:#238636,color:#fff
    style FIND fill:#f85149,color:#fff
    style IMPROVE fill:#238636,color:#fff

Watch Out

Chaos engineering without a defined steady-state hypothesis is just random sabotage — every experiment must state what normal behaviour looks like so you can measure whether the system maintained it.

Common Misconception

Myth: "Chaos engineering means randomly breaking production to find weaknesses." In reality, chaos engineering follows a scientific method: formulate a hypothesis, define a blast radius, inject a specific failure, observe the system, and improve. Random destruction without hypothesis and measurement is just sabotage.

Why It Matters

Chaos engineering proactively injects failures into systems to find weaknesses before real incidents do. A system whose failure modes have never been exercised tends to fail in ways nobody anticipated.

Common Mistakes

  • Running chaos experiments without a hypothesis and metrics — chaos without measurement is just breaking things.
  • Starting with production before validating in staging — always start in a controlled environment.
  • No abort conditions — running an experiment without a clear 'stop if X happens' risks real outages.
  • Treating chaos engineering as a one-time exercise — it should be ongoing as the system evolves.
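
The abort-condition mistake in the list above is the easiest to illustrate. The sketch below polls a metric while the failure is active and stops as soon as a "stop if X happens" threshold is crossed. `runWithAbort`, the threshold, and the canned readings are hypothetical names and values chosen for the example, and it assumes PHP 7.1 or later.

```php
<?php

/**
 * Runs an experiment while polling a metric, and aborts as soon as
 * the abort condition is met (hypothetical helper).
 */
function runWithAbort(
    callable $inject,
    callable $rollback,
    callable $errorRate,
    float $abortAbove,
    int $maxChecks
): string {
    $inject();
    try {
        for ($i = 0; $i < $maxChecks; $i++) {
            if ($errorRate() > $abortAbove) {
                return 'aborted';    // abort condition hit: stop early
            }
            // A real run would wait between checks, e.g. sleep(5).
        }
        return 'completed';
    } finally {
        $rollback();                 // runs on abort, completion, or exception
    }
}

// Canned metric readings standing in for a monitoring query.
$readings = [0.01, 0.02, 0.30, 0.01];
$rolledBack = false;

$outcome = runWithAbort(
    function (): void { /* start the failure here */ },
    function () use (&$rolledBack): void { $rolledBack = true; },
    function () use (&$readings): float { return array_shift($readings); },
    0.10,
    4
);
```

The third reading exceeds the threshold, so the run ends as `'aborted'` and the rollback still executes because it sits in the `finally` block.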

Avoid When

  • Do not run chaos experiments in production without a game day plan, rollback procedure, and an on-call engineer monitoring live.
  • Avoid chaos engineering on systems without circuit breakers, retries, or graceful degradation — you will just cause outages, not learn from them.
  • Do not run experiments during high-traffic periods or near business-critical events — the blast radius is unpredictable.

When To Use

  • Run chaos experiments after you have basic observability (metrics, tracing, alerting) — without it you cannot tell whether an experiment revealed a real weakness.
  • Start with known-safe failure modes in staging before targeting production: kill one replica, throttle a dependency, introduce latency.
  • Use chaos to validate specific resilience hypotheses: "the system stays available when service X is unavailable" — not just to break things randomly.
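
One low-risk way to test a hypothesis like "the system stays available when service X is unavailable" is to inject the failure at the application boundary before touching real infrastructure. The sketch below wraps a dependency in a decorator that can simulate an outage or added latency. `SessionStore`, `FaultInjectingStore`, and `currentUser` are hypothetical names invented for this example, not part of any library, and the code assumes PHP 7.1 or later.

```php
<?php

interface SessionStore
{
    public function get(string $key): ?string;
}

final class InMemorySessionStore implements SessionStore
{
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function get(string $key): ?string
    {
        return $this->data[$key] ?? null;
    }
}

/** Decorator that injects a specific failure into any SessionStore. */
final class FaultInjectingStore implements SessionStore
{
    private $inner;
    private $unavailable = false;
    private $latencyMs = 0;

    public function __construct(SessionStore $inner)
    {
        $this->inner = $inner;
    }

    public function simulateOutage(bool $on): void { $this->unavailable = $on; }
    public function simulateLatency(int $ms): void { $this->latencyMs = $ms; }

    public function get(string $key): ?string
    {
        if ($this->latencyMs > 0) {
            usleep($this->latencyMs * 1000);   // injected delay
        }
        if ($this->unavailable) {
            throw new RuntimeException('injected fault: store unavailable');
        }
        return $this->inner->get($key);
    }
}

// Hypothesis: the page still renders (as a guest) when the store is down.
function currentUser(SessionStore $store): string
{
    try {
        return $store->get('user') ?? 'guest';
    } catch (RuntimeException $e) {
        return 'guest';                        // graceful degradation
    }
}

$store = new FaultInjectingStore(new InMemorySessionStore(['user' => 'alice']));
$before = currentUser($store);   // 'alice'
$store->simulateOutage(true);
$during = currentUser($store);   // 'guest'
$store->simulateOutage(false);
$after = currentUser($store);    // 'alice'
```

Because the fault is switched on and off explicitly, the blast radius is limited to whatever code receives the decorated store.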

Code Examples

💡 Note
The bad code assumes dependencies are always reliable and has no timeout or fallback; the fixed example lists the principles, candidate experiments, and tooling used to expose that gap, so it can be closed with timeouts, fallbacks, and circuit breakers before a real failure cascades.
✗ Vulnerable
// No resilience testing: assumes dependencies are always reliable.
class UserController
{
    private $userService; // hypothetical client for user-service

    public function getUser(int $id): User
    {
        // No timeout set, no fallback, no circuit breaker.
        // Chaos test: kill user-service and the entire app hangs.
        return $this->userService->fetch($id); // What if this times out?
    }
}
✓ Fixed
# Chaos engineering — deliberately break things in controlled conditions
# to find weaknesses before they cause incidents

# Principles:
# 1. Define steady state (normal behaviour metrics)
# 2. Hypothesise: 'If we kill one DB, orders still process from replica'
# 3. Run experiment in staging first, then low-traffic production
# 4. Minimise blast radius — run during business hours with engineers ready

# PHP/infrastructure experiments:
# - Kill one Redis node — does session failover work?
# - Slow DB queries by 3s — does the circuit breaker trip?
# - Fill disk — does app fail gracefully or corrupt data?
# - Simulate OOM on PHP-FPM worker — does it restart cleanly?

# Tools: Gremlin, AWS Fault Injection Simulator, Chaos Monkey
# Chaos Toolkit (open source):
$ pip install chaostoolkit
$ chaos run experiments/kill-redis.yaml
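
Once an experiment such as "kill user-service" exposes the hang in the vulnerable code, one common remedy is a circuit breaker with a fallback. The sketch below is a deliberately minimal, count-based version written for illustration: `CircuitBreaker` is a hypothetical class, it has no reset timeout or half-open state, and a production system would more likely use an established library. It assumes PHP 7.0 or later.

```php
<?php

/** Minimal count-based circuit breaker; illustrative, not production-ready. */
final class CircuitBreaker
{
    private $failures = 0;
    private $threshold;

    public function __construct(int $threshold)
    {
        $this->threshold = $threshold;
    }

    public function isOpen(): bool
    {
        return $this->failures >= $this->threshold;
    }

    public function call(callable $operation, callable $fallback)
    {
        if ($this->isOpen()) {
            return $fallback();          // fail fast, skip the dependency
        }
        try {
            $result = $operation();
            $this->failures = 0;         // a success resets the counter
            return $result;
        } catch (Throwable $e) {
            $this->failures++;           // count the failure, then degrade
            return $fallback();
        }
    }
}

// Simulated outage: every call to the user service fails.
$calls = 0;
$fetchUser = function () use (&$calls) {
    $calls++;
    throw new RuntimeException('user-service timed out');
};
$guest = function () { return 'guest'; };

$breaker = new CircuitBreaker(2);
$results = [];
for ($i = 0; $i < 5; $i++) {
    $results[] = $breaker->call($fetchUser, $guest);
}
```

After two failures the breaker opens, so the remaining three requests return the fallback without calling the failing service at all. A chaos experiment then verifies that this happens under a real outage, not just in a simulation.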

Added 15 Mar 2026
Edited 19 Apr 2026
DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: High
⚡ Quick Fix
Start small: kill one non-critical pod in staging and verify that the circuit breaker trips and the system degrades gracefully. Document any failures and fix them before moving to production.
📦 Applies To
PHP 5.0+ web api
🔍 Detection Hints
No failure injection testing; resilience only verified when real incidents happen; unknown failure modes
Auto-detectable: ✗ No
Tools: chaos-monkey, toxiproxy, pumba, chaos-toolkit
🤖 AI Agent
Confidence: Low False Positives: High ✗ Manual fix Fix: High Context: File

