{
    "slug": "chaos_engineering",
    "term": "Chaos Engineering",
    "category": "devops",
    "difficulty": "advanced",
    "short": "Deliberately injecting failures into a production system to discover weaknesses before they cause unplanned outages.",
    "long": "Chaos Engineering (Netflix, Chaos Monkey) proactively tests system resilience by introducing controlled failures: terminating random instances, introducing network latency, exhausting CPU/memory, or cutting off a downstream dependency. The hypothesis-driven approach: define steady-state metrics, hypothesise the system will maintain them under a given failure, inject the failure, observe. Deviations reveal weaknesses. Start in non-production, then graduate to production during low-traffic hours. PHP application chaos experiments: kill a PHP-FPM worker pool, simulate Redis unavailability, introduce 500ms MySQL latency. Tools: Chaos Monkey (AWS), Chaos Toolkit, LitmusChaos (Kubernetes), Gremlin.",
    "aliases": [
        "chaos monkey",
        "resilience testing",
        "fault injection"
    ],
    "tags": [
        "devops",
        "testing",
        "resilience",
        "reliability"
    ],
    "misconception": "Chaos engineering means randomly breaking production to find weaknesses. Chaos engineering follows a scientific method — formulate a hypothesis, define a blast radius, inject a specific failure, observe the system, and improve. Random destruction without hypothesis and measurement is just sabotage.",
    "why_it_matters": "Chaos engineering proactively injects failures into production systems to find weaknesses before real incidents do — systems that have never been broken are guaranteed to break unexpectedly.",
    "common_mistakes": [
        "Running chaos experiments without a hypothesis and metrics — chaos without measurement is just breaking things.",
        "Starting with production before validating in staging — always start in a controlled environment.",
        "No abort conditions — running an experiment without a clear 'stop if X happens' risks real outages.",
        "Treating chaos engineering as a one-time exercise — it should be ongoing as the system evolves."
    ],
    "when_to_use": [
        "Run chaos experiments after you have basic observability (metrics, tracing, alerting) — without it you cannot tell whether an experiment revealed a real weakness.",
        "Start with known-safe failure modes in staging before targeting production: kill one replica, throttle a dependency, introduce latency.",
        "Use chaos to validate specific resilience hypotheses: \"the system stays available when service X is unavailable\" — not just to break things randomly."
    ],
    "avoid_when": [
        "Do not run chaos experiments in production without a game day plan, rollback procedure, and an on-call engineer monitoring live.",
        "Avoid chaos engineering on systems without circuit breakers, retries, or graceful degradation — you will just cause outages, not learn from them.",
        "Do not run experiments during high-traffic periods or near business-critical events — the blast radius is unpredictable."
    ],
    "related": [
        "observability",
        "sla_slo_sre",
        "health_check",
        "circuit_breaker"
    ],
    "prerequisites": [
        "circuit_breaker",
        "retry_pattern",
        "observability"
    ],
    "refs": [
        "https://principlesofchaos.org/",
        "https://netflix.github.io/chaosmonkey/"
    ],
    "bad_code": "// No resilience testing — assuming dependencies are reliable:\nfunction getUser(int $id): User {\n    return $this->userService->fetch($id); // What if this times out?\n    // No timeout set, no fallback, no circuit breaker\n    // Chaos test: kill user-service → entire app hangs\n}",
    "good_code": "# Chaos engineering — deliberately break things in controlled conditions\n# to find weaknesses before they cause incidents\n\n# Principles:\n# 1. Define steady state (normal behaviour metrics)\n# 2. Hypothesise: 'If we kill one DB, orders still process from replica'\n# 3. Run experiment in staging first, then low-traffic production\n# 4. Minimise blast radius — run during business hours with engineers ready\n\n# PHP/infrastructure experiments:\n# - Kill one Redis node — does session failover work?\n# - Slow DB queries by 3s — does the circuit breaker trip?\n# - Fill disk — does app fail gracefully or corrupt data?\n# - Simulate OOM on PHP-FPM worker — does it restart cleanly?\n\n# Tools: Gremlin, AWS Fault Injection Simulator, Chaos Monkey\n# Chaos Toolkit (open source):\n$ pip install chaostoolkit\n$ chaos run experiments/kill-redis.yaml",
    "example_note": "The bad code assumes dependencies are always reliable and has no fallback; the good example wraps calls in a circuit breaker so a failure in one service degrades gracefully rather than cascading.",
    "quick_fix": "Start small: randomly kill one non-critical pod in staging, verify the circuit breaker trips and the system degrades gracefully — document failures and fix them before going to production",
    "severity": "info",
    "effort": "high",
    "created": "2026-03-15",
    "updated": "2026-04-19",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/chaos_engineering",
        "html_url": "https://codeclaritylab.com/glossary/chaos_engineering",
        "json_url": "https://codeclaritylab.com/glossary/chaos_engineering.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Chaos Engineering](https://codeclaritylab.com/glossary/chaos_engineering) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/chaos_engineering"
            }
        }
    }
}