{
    "slug": "error_recovery_patterns",
    "term": "Error Recovery Patterns",
    "category": "general",
    "difficulty": "intermediate",
    "short": "Design strategies for gracefully handling failures and restoring system functionality without data loss or user disruption.",
    "long": "Error recovery patterns are architectural and code-level strategies that allow systems to detect, handle, and recover from failures while preserving data integrity and user experience. Key patterns include retry with exponential backoff (retry transient failures with increasing delays), circuit breaker (stop calling a failing service to prevent cascading failures), fallback (provide degraded functionality when the primary path fails), compensation (undo completed steps when a later step fails), and checkpoint/restart (save progress so work can resume after a crash). Recovery differs from error handling: handling catches the exception, while recovery restores the system to a consistent state. Design for recovery from the start, because retrofitting is expensive. Consider idempotency (safe to retry), observability (know when recovery happened), and graceful degradation (partial functionality beats total outage). Recovery patterns are especially critical in distributed systems, where network partitions, service unavailability, and partial failures are routine rather than exceptional.",
    "aliases": [
        "fault recovery",
        "failure recovery strategies",
        "resilience patterns",
        "graceful degradation"
    ],
    "tags": [
        "general",
        "resilience",
        "distributed-systems",
        "fault-tolerance",
        "error-handling",
        "reliability"
    ],
    "misconception": "Error recovery means catching all exceptions and logging them - true recovery restores the system to a consistent state where operations can continue, not just acknowledging that something went wrong.",
    "why_it_matters": "Systems without recovery patterns turn transient failures into permanent outages - a brief network hiccup becomes a corrupted database state or lost customer order that requires manual intervention to fix.",
    "common_mistakes": [
        "Retrying without exponential backoff - hammering a struggling service makes recovery harder for everyone.",
        "No idempotency in retry logic - retrying a non-idempotent operation can duplicate side effects like payments or emails.",
        "Swallowing exceptions without restoring state - the error is hidden but the system remains in an inconsistent state.",
        "Missing compensation logic for multi-step operations - partial failure leaves data spread across services in conflicting states.",
        "Infinite retry loops without circuit breakers - a permanently failed dependency exhausts resources retrying forever."
    ],
    "when_to_use": [
        "Multi-step operations where partial failure leaves inconsistent state.",
        "External service calls that may fail transiently due to network or load.",
        "Long-running processes that should survive restarts.",
        "Financial or order processing where correctness is more important than availability."
    ],
    "avoid_when": [
        "Simple CRUD operations where database transactions provide atomicity.",
        "Fast-fail scenarios where immediate error feedback is more valuable than retry.",
        "Operations where the cost of retry exceeds the cost of failure."
    ],
    "related": [
        "idempotency",
        "defence_in_depth",
        "observability_pillars"
    ],
    "prerequisites": [
        "idempotency",
        "separation_of_concerns"
    ],
    "refs": [
        "https://docs.microsoft.com/en-us/azure/architecture/patterns/retry",
        "https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker",
        "https://microservices.io/patterns/reliability/circuit-breaker.html",
        "https://martinfowler.com/bliki/CircuitBreaker.html"
    ],
    "bad_code": "// No recovery - partial failure leaves inconsistent state\nfunction processOrder($order) {\n    $this->inventory->reserve($order->items);  // Step 1: succeeds\n    $this->payment->charge($order->total);      // Step 2: fails!\n    // Inventory is reserved but payment failed\n    // No cleanup, no retry, customer stuck, stock locked\n    $this->shipping->schedule($order);          // Never reached\n}\n\n// Retry without backoff - makes outage worse\nfunction callExternalApi($data) {\n    while (true) {\n        try {\n            return $this->api->send($data);\n        } catch (Exception $e) {\n            // Immediate retry - floods struggling service\n            continue;\n        }\n    }\n}",
    "good_code": "// Recovery pattern: compensation on failure\nfunction processOrder($order): OrderResult {\n    $reservationId = null;\n    $paymentId = null;\n    \n    try {\n        $reservationId = $this->inventory->reserve($order->items);\n        $paymentId = $this->payment->charge($order->total);\n        $this->shipping->schedule($order);\n        return OrderResult::success($order->id);\n    } catch (PaymentException $e) {\n        // Compensate: release inventory reservation\n        if ($reservationId) {\n            $this->inventory->release($reservationId);\n        }\n        return OrderResult::failed('Payment declined');\n    } catch (ShippingException $e) {\n        // Compensate: refund payment and release inventory\n        if ($paymentId) {\n            $this->payment->refund($paymentId);\n        }\n        if ($reservationId) {\n            $this->inventory->release($reservationId);\n        }\n        return OrderResult::failed('Shipping unavailable');\n    }\n}\n\n// Retry with exponential backoff and circuit breaker\nfunction callWithRecovery($operation, $maxRetries = 3): mixed {\n    if ($this->circuitBreaker->isOpen()) {\n        return $this->fallback->execute();\n    }\n    \n    $attempt = 0;\n    while ($attempt < $maxRetries) {\n        try {\n            $result = $operation();\n            $this->circuitBreaker->recordSuccess();\n            return $result;\n        } catch (TransientException $e) {\n            $attempt++;\n            if ($attempt >= $maxRetries) {\n                break; // Retries exhausted: do not sleep before falling back\n            }\n            $delay = min(100 * pow(2, $attempt), 10000); // 200ms, 400ms, 800ms... max 10s\n            usleep($delay * 1000);\n        }\n    }\n    \n    // All retries failed: count it against the breaker and degrade gracefully\n    $this->circuitBreaker->recordFailure();\n    return $this->fallback->execute();\n}",
    "quick_fix": "For any multi-step operation, identify the compensating action needed if each step fails, and implement it before the operation goes to production.",
    "severity": "high",
    "effort": "medium",
    "created": "2026-05-02",
    "updated": "2026-05-02",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/error_recovery_patterns",
        "html_url": "https://codeclaritylab.com/glossary/error_recovery_patterns",
        "json_url": "https://codeclaritylab.com/glossary/error_recovery_patterns.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Error Recovery Patterns](https://codeclaritylab.com/glossary/error_recovery_patterns) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/error_recovery_patterns"
            }
        }
    }
}