Chaos Engineering

devops PHP 5.0+ Advanced

Also Known As

chaos monkey, resilience testing, fault injection

TL;DR

Deliberately injecting failures into a production system to discover weaknesses before they cause unplanned outages.

Explanation

Chaos Engineering, popularised by Netflix and its Chaos Monkey tool, proactively tests system resilience by introducing controlled failures: terminating random instances, adding network latency, exhausting CPU or memory, or cutting off a downstream dependency. The approach is hypothesis-driven: define steady-state metrics, hypothesise that the system will maintain them under a given failure, inject the failure, and observe. Deviations from steady state reveal weaknesses. Start in non-production, then graduate to production during low-traffic hours.

Typical chaos experiments for a PHP application: kill a PHP-FPM worker pool, simulate Redis unavailability, or introduce 500ms of MySQL latency. Tools include Chaos Monkey (built by Netflix, targets AWS instances), Chaos Toolkit, LitmusChaos (Kubernetes), and Gremlin.
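Application-level fault injection of the kind described above (added latency, a dependency that fails) can be sketched in plain PHP. This is a minimal illustration, not the API of any of the tools listed: the `FaultInjector` class, its parameters, and the exception message are all hypothetical, and the code assumes PHP 7.4 or later.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical fault injector: wraps a dependency call and, while enabled,
 * adds latency and/or throws before the real call runs.
 */
final class FaultInjector
{
    private bool $enabled;
    private int $latencyMs;
    private float $failureRate;

    public function __construct(bool $enabled, int $latencyMs = 0, float $failureRate = 0.0)
    {
        $this->enabled = $enabled;
        $this->latencyMs = $latencyMs;
        $this->failureRate = $failureRate; // 0.0 = never fail, 1.0 = always fail
    }

    /** @return mixed whatever the wrapped call returns */
    public function call(callable $dependency)
    {
        if ($this->enabled) {
            if ($this->latencyMs > 0) {
                usleep($this->latencyMs * 1000); // simulate a slow dependency
            }
            // Draw 1..1000 and fail when the draw falls inside the failure share.
            if (mt_rand(1, 1000) <= (int) round($this->failureRate * 1000)) {
                throw new RuntimeException('Injected fault: dependency unavailable');
            }
        }

        return $dependency();
    }
}

// Usage: add 50 ms of latency in front of a (stand-in) database call.
$injector = new FaultInjector(true, 50);
$row = $injector->call(function (): string {
    return 'row from MySQL';
});
```

Keeping the `enabled` flag outside the calling code (for example in configuration) means the experiment can be switched off without a deploy, which doubles as a simple rollback.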

Diagram

flowchart TD
    HYPO[Define hypothesis<br/>System stays healthy<br/>when service X fails] -->
    BASELINE[Measure steady state<br/>normal metrics] -->
    INJECT[Inject failure<br/>kill instance, add latency<br/>fill disk] -->
    OBSERVE[Observe<br/>metrics, alerts, user impact] -->
    COMPARE{Compare to<br/>steady state}
    COMPARE -->|resilient| CONFIRM[Hypothesis confirmed<br/>system is robust]
    COMPARE -->|degraded| FIND[Weakness found<br/>fix before real incident]
    FIND --> IMPROVE[Improve resilience]
    IMPROVE --> HYPO
style CONFIRM fill:#238636,color:#fff
style FIND fill:#f85149,color:#fff
style IMPROVE fill:#238636,color:#fff

Watch Out

⚠ Chaos engineering without a defined steady-state hypothesis is just random sabotage — every experiment must state what normal behaviour looks like so you can measure whether the system maintained it.
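One way to make "normal behaviour" measurable is to write the steady state down as explicit thresholds and check samples against them. The sketch below is illustrative only: the `SteadyState` class, the choice of error rate and 95th-percentile latency as metrics, and the threshold values are assumptions, not something prescribed by a particular tool.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical steady-state definition: the thresholds that count as
 * "normal" for one service. The numbers used below are illustrative only.
 */
final class SteadyState
{
    private float $maxErrorRate;
    private float $maxP95LatencyMs;

    public function __construct(float $maxErrorRate, float $maxP95LatencyMs)
    {
        $this->maxErrorRate = $maxErrorRate;
        $this->maxP95LatencyMs = $maxP95LatencyMs;
    }

    /** @param float[] $latenciesMs request latencies sampled during the window */
    public function holds(array $latenciesMs, int $errors, int $requests): bool
    {
        if ($requests === 0 || $latenciesMs === []) {
            return false; // no traffic means nothing was measured
        }

        $errorRate = $errors / $requests;

        return $errorRate <= $this->maxErrorRate
            && self::percentile95($latenciesMs) <= $this->maxP95LatencyMs;
    }

    /** Nearest-rank 95th percentile of the samples. */
    private static function percentile95(array $values): float
    {
        sort($values);
        $rank = (int) ceil(0.95 * count($values));

        return (float) $values[$rank - 1];
    }
}

// At most 1% errors and a p95 latency of 300 ms count as steady state.
$steadyState = new SteadyState(0.01, 300.0);
$baselineOk = $steadyState->holds([120.0, 140.0, 180.0, 210.0], 0, 4);
$duringFaultOk = $steadyState->holds([150.0, 900.0, 1200.0, 1500.0], 2, 4);
```

Running the same check before and during the experiment gives a direct answer to whether the hypothesis held: here the baseline sample passes and the sample taken during the fault does not.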

Common Misconception

✗ Chaos engineering means randomly breaking production to find weaknesses.
✓ Chaos engineering follows a scientific method: formulate a hypothesis, define a blast radius, inject a specific failure, observe the system, and improve. Random destruction without hypothesis and measurement is just sabotage.

Why It Matters

Chaos engineering proactively injects failures into production systems to find weaknesses before real incidents do. A system whose failure modes have never been exercised is likely to fail in ways nobody anticipated.

Common Mistakes

Running chaos experiments without a hypothesis and metrics — chaos without measurement is just breaking things.
Starting with production before validating in staging — always start in a controlled environment.
No abort conditions — running an experiment without a clear 'stop if X happens' risks real outages.
Treating chaos engineering as a one-time exercise — it should be ongoing as the system evolves.
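The "stop if X happens" rule from the list above can be sketched as a small experiment runner. This is a hedged illustration: `runExperiment()` and its parameters are hypothetical names, the error-rate readings are simulated, and a real runner would wait between checks and read the metric from a monitoring system.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical experiment runner with an abort condition.
 * Injects a fault, polls a metric, and stops early if the metric crosses
 * the abort threshold. The fault is always rolled back.
 */
function runExperiment(
    callable $inject,
    callable $rollback,
    callable $observeErrorRate,
    float $abortAbove,
    int $checks
): string {
    $inject();

    try {
        for ($i = 0; $i < $checks; $i++) {
            if ($observeErrorRate() > $abortAbove) {
                return 'aborted'; // abort condition hit: stop the experiment
            }
        }

        return 'completed';
    } finally {
        $rollback(); // runs on completion, abort, or exception
    }
}

// Simulated error-rate readings, one per check.
$readings = [0.01, 0.02, 0.20];

$outcome = runExperiment(
    function (): void { /* e.g. stop one Redis node */ },
    function (): void { /* e.g. start it again */ },
    function () use (&$readings): float { return array_shift($readings); },
    0.05, // abort when more than 5% of requests fail
    5
);

echo $outcome, PHP_EOL; // prints "aborted" (the third reading exceeds 0.05)
```

Putting the rollback in `finally` is the design choice that matters: the fault is removed whether the experiment completes, aborts, or throws.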

Avoid When

Do not run chaos experiments in production without a game day plan, rollback procedure, and an on-call engineer monitoring live.
Avoid chaos engineering on systems without circuit breakers, retries, or graceful degradation — you will just cause outages, not learn from them.
Do not run experiments during high-traffic periods or near business-critical events — the blast radius is unpredictable.

When To Use

Run chaos experiments after you have basic observability (metrics, tracing, alerting) — without it you cannot tell whether an experiment revealed a real weakness.
Start with known-safe failure modes in staging before targeting production: kill one replica, throttle a dependency, introduce latency.
Use chaos to validate specific resilience hypotheses: "the system stays available when service X is unavailable" — not just to break things randomly.

Code Examples

💡 Note: The vulnerable code assumes dependencies are always reliable and has no timeout or fallback. The fixed version describes the experiments that expose this gap; once it is found, wrapping the call in a circuit breaker lets a failure in one service degrade gracefully rather than cascade.

✗ Vulnerable

// No resilience testing: the code assumes dependencies are always reliable.
// (Method of a class that holds the client in $this->userService.)
public function getUser(int $id): User
{
    // No timeout, no fallback, no circuit breaker. What if fetch() times out?
    // Chaos test: kill user-service and the entire app hangs.
    return $this->userService->fetch($id);
}

✓ Fixed

# Chaos engineering — deliberately break things in controlled conditions
# to find weaknesses before they cause incidents

# Principles:
# 1. Define steady state (normal behaviour metrics)
# 2. Hypothesise: 'If we kill one DB, orders still process from replica'
# 3. Run experiment in staging first, then low-traffic production
# 4. Minimise blast radius: start with a single instance, and run during business hours with engineers ready to abort

# PHP/infrastructure experiments:
# - Kill one Redis node — does session failover work?
# - Slow DB queries by 3s — does the circuit breaker trip?
# - Fill disk — does app fail gracefully or corrupt data?
# - Simulate OOM on PHP-FPM worker — does it restart cleanly?

# Tools: Gremlin, AWS Fault Injection Simulator, Chaos Monkey
# Chaos Toolkit (open source):
$ pip install chaostoolkit
$ chaos run experiments/kill-redis.yaml
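The note above mentions wrapping calls in a circuit breaker, which the experiment outline does not show in code. A minimal sketch follows, assuming PHP 7.4 or later. The `CircuitBreaker` class, its thresholds, and the fallback value are hypothetical, and the state lives in memory only, so it is not shared between PHP-FPM workers or requests. A real deployment would need a shared store such as APCu or Redis.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical in-memory circuit breaker.
 * After $failureThreshold consecutive failures the circuit opens and calls
 * go straight to the fallback until $resetAfterSeconds have passed.
 */
final class CircuitBreaker
{
    private int $failureThreshold;
    private int $resetAfterSeconds;
    private int $failures = 0;
    private ?int $openedAt = null;

    public function __construct(int $failureThreshold = 3, int $resetAfterSeconds = 30)
    {
        $this->failureThreshold = $failureThreshold;
        $this->resetAfterSeconds = $resetAfterSeconds;
    }

    /** @return mixed result of $operation, or of $fallback on failure */
    public function call(callable $operation, callable $fallback)
    {
        if ($this->isOpen()) {
            return $fallback(); // fail fast, do not touch the dependency
        }

        try {
            $result = $operation();
            $this->failures = 0; // a success closes the circuit again

            return $result;
        } catch (Throwable $e) {
            $this->failures++;
            if ($this->failures >= $this->failureThreshold) {
                $this->openedAt = time();
            }

            return $fallback();
        }
    }

    private function isOpen(): bool
    {
        if ($this->openedAt === null) {
            return false;
        }
        if (time() - $this->openedAt >= $this->resetAfterSeconds) {
            // Half-open: allow one trial call; a single failure reopens the circuit.
            $this->openedAt = null;
            $this->failures = $this->failureThreshold - 1;

            return false;
        }

        return true;
    }
}

// Usage: the user service is down, so callers get a placeholder instead of hanging.
$breaker = new CircuitBreaker(2, 30);
$user = $breaker->call(
    function () { throw new RuntimeException('user-service timed out'); },
    function (): array { return ['id' => 0, 'name' => 'Guest']; }
);
```

A chaos experiment such as "slow DB queries by 3s" then has something concrete to verify: that the breaker opens after the configured number of failures and that users see the fallback rather than an error page. The breaker only helps if the wrapped call has a timeout, since a call that never returns never counts as a failure.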
