Error Recovery Patterns
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints explicitly state 'automated: no', and the code_pattern regex only catches the absence of finally blocks in catch clauses — it cannot detect missing compensation logic, non-idempotent retries, or absent circuit breakers. Missing recovery patterns produce no compiler or linter warnings; they surface only when transient failures cascade into corrupted state or outages that real users encounter.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix requires identifying compensating actions for every step of every multi-step operation before going to production. The common_mistakes list spans idempotency design, exponential backoff, circuit breakers, and saga/compensation logic — each touching multiple layers (persistence, messaging, external calls). Retrofitting true recovery patterns into an existing system is not a single-file change; it requires redesigning operation boundaries and state management across services.
Closest to 'strong gravitational pull' (b7). The applies_to field lists web, cli, and queue-worker, which covers all three major contexts. Tags include distributed-systems and fault-tolerance, indicating architectural reach. Once a recovery strategy (or the lack of one) is baked in, every new multi-step operation, every new external dependency, and every retry policy must conform to or work around the existing approach. The choice shapes how new features are built and how incidents are handled.
Closest to 'serious trap' (t7). The misconception field states explicitly that developers conflate 'catching exceptions and logging them' with true recovery. This is a well-documented but widely held wrong belief: the 'obvious' defensive coding (try/catch/log) feels like recovery but leaves the system in an inconsistent state. Because it contradicts the intuition developers build from simpler error-handling patterns (e.g., returning error codes), it is a serious cognitive trap.
Also Known As
TL;DR
Explanation
Error recovery patterns are architectural and code-level strategies that let a system detect, handle, and recover from failures while preserving data integrity and user experience. Key patterns include retry with exponential backoff (retry transient failures with increasing delays), circuit breaker (stop calling a failing service to prevent cascading failure), fallback (provide degraded functionality when the primary path fails), compensation (undo the completed steps of a partially failed operation), and checkpoint/restart (save progress so work can resume after a crash). Recovery differs from error handling: handling catches the exception, whereas recovery restores the system to a consistent state. Design for recovery from the start, because retrofitting is expensive. Consider idempotency (an operation is safe to retry), observability (you can tell when recovery happened), and graceful degradation (partial functionality beats a total outage). Recovery patterns are especially critical in distributed systems, where network partitions, service unavailability, and partial failures are routine rather than exceptional.
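Checkpoint/restart is the one pattern named above that the examples in this entry do not otherwise cover, so here is a minimal sketch of it. The function name `processWithCheckpoint`, the file-based checkpoint, and the `$handler` callback are all assumptions made for illustration; a real system would more likely keep the checkpoint in a database or queue offset.

```php
<?php

/**
 * Processes items in order and persists progress after each one, so a
 * restarted run resumes where the previous run stopped.
 * Hypothetical helper: the name, signature, and file-based storage are
 * illustrative assumptions, not part of any library.
 */
function processWithCheckpoint(array $items, string $checkpointFile, callable $handler): int
{
    // Resume from the saved index, or start at 0 if no checkpoint exists.
    $next = is_file($checkpointFile) ? (int) file_get_contents($checkpointFile) : 0;
    $processed = 0;

    for ($i = $next; $i < count($items); $i++) {
        $handler($items[$i]);

        // Write to a temp file, then rename, so a crash mid-write does not
        // leave a half-written checkpoint behind.
        $tmp = $checkpointFile . '.tmp';
        file_put_contents($tmp, (string) ($i + 1));
        rename($tmp, $checkpointFile);
        $processed++;
    }

    // Number of items handled during this run only.
    return $processed;
}
```

If the process dies after `$handler` returns but before the checkpoint is written, that item is handled again on restart. The sketch therefore gives at-least-once processing, which is why the handler itself should be idempotent.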
Common Misconception
Why It Matters
Common Mistakes
- Retrying without exponential backoff - hammering a struggling service makes recovery harder for everyone.
- No idempotency in retry logic - retrying a non-idempotent operation can duplicate side effects like payments or emails.
- Swallowing exceptions without restoring state - the error is hidden but the system remains in an inconsistent state.
- Missing compensation logic for multi-step operations - partial failure leaves data spread across services in conflicting states.
- Infinite retry loops without circuit breakers - a permanently failed dependency exhausts resources retrying forever.
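The idempotency mistake in the list above can be guarded against with an idempotency key. The following is a minimal sketch, not a production design: `IdempotentExecutor` is a hypothetical class, and its in-memory array stands in for a durable store such as a database table with a unique constraint on the key.

```php
<?php

/**
 * Wraps a side-effecting operation so that repeating it with the same
 * idempotency key returns the stored result instead of running it again.
 * Hypothetical class: the in-memory array is an assumption standing in
 * for durable storage.
 */
final class IdempotentExecutor
{
    /** @var array<string, mixed> results keyed by idempotency key */
    private array $results = [];

    public function run(string $key, callable $operation): mixed
    {
        // A result recorded under this key means the side effect already happened.
        if (array_key_exists($key, $this->results)) {
            return $this->results[$key];
        }

        // If the operation throws, nothing is recorded, so a retry runs it again.
        $result = $operation();
        $this->results[$key] = $result;

        return $result;
    }
}

$charges = 0;
$executor = new IdempotentExecutor();
$charge = function () use (&$charges) {
    $charges++; // stands in for a real payment call
    return 'payment-' . $charges;
};

$first  = $executor->run('order-42', $charge);
$second = $executor->run('order-42', $charge); // retry with the same key
```

Both calls return the same result, and the charge runs only once. In a concurrent setting the check and the store would have to be atomic (for example a unique-key insert), which this sketch does not attempt.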
Avoid When
- Simple CRUD operations where database transactions provide atomicity.
- Fast-fail scenarios where immediate error feedback is more valuable than retry.
- Operations where the cost of retry exceeds the cost of failure.
When To Use
- Multi-step operations where partial failure leaves inconsistent state.
- External service calls that may fail transiently due to network or load.
- Long-running processes that should survive restarts.
- Financial or order processing where correctness is more important than availability.
Code Examples
// No recovery - partial failure leaves inconsistent state
function processOrder($order) {
    $this->inventory->reserve($order->items); // Step 1: succeeds
    $this->payment->charge($order->total);    // Step 2: fails!
    // Inventory is reserved but payment failed
    // No cleanup, no retry, customer stuck, stock locked
    $this->shipping->schedule($order);        // Never reached
}

// Retry without backoff - makes outage worse
function callExternalApi($data) {
    while (true) {
        try {
            return $this->api->send($data);
        } catch (Exception $e) {
            // Immediate retry - floods struggling service
            continue;
        }
    }
}
// Recovery pattern: compensation on failure
function processOrder($order): OrderResult {
    $reservationId = null;
    $paymentId = null;
    try {
        $reservationId = $this->inventory->reserve($order->items);
        $paymentId = $this->payment->charge($order->total);
        $this->shipping->schedule($order);
        return OrderResult::success($order->id);
    } catch (PaymentException $e) {
        // Compensate: release inventory reservation
        if ($reservationId !== null) {
            $this->inventory->release($reservationId);
        }
        return OrderResult::failed('Payment declined');
    } catch (ShippingException $e) {
        // Compensate in reverse order: refund payment, then release inventory
        if ($paymentId !== null) {
            $this->payment->refund($paymentId);
        }
        if ($reservationId !== null) {
            $this->inventory->release($reservationId);
        }
        return OrderResult::failed('Shipping unavailable');
    }
}
// Retry with exponential backoff and circuit breaker
function callWithRecovery(callable $operation, int $maxRetries = 3): mixed {
    if ($this->circuitBreaker->isOpen()) {
        return $this->fallback->execute();
    }
    $attempt = 0;
    while ($attempt < $maxRetries) {
        try {
            $result = $operation();
            $this->circuitBreaker->recordSuccess();
            return $result;
        } catch (TransientException $e) {
            $attempt++;
            if ($attempt >= $maxRetries) {
                break; // Out of attempts: do not sleep before falling back
            }
            $delay = min(100 * 2 ** $attempt, 10000); // 200ms, 400ms, ... capped at 10s
            usleep($delay * 1000); // usleep() takes microseconds
        }
    }
    // All attempts failed: record it so the breaker can open, then degrade
    $this->circuitBreaker->recordFailure();
    return $this->fallback->execute();
}
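The retry example above calls `isOpen()`, `recordSuccess()`, and `recordFailure()` on a circuit breaker it never defines. One way such an object might look is sketched below. `SimpleCircuitBreaker`, its failure threshold, and its reset timeout are assumptions for illustration, and the state is kept in memory for a single process only.

```php
<?php

/**
 * Minimal count-based circuit breaker (hypothetical, in-memory only).
 * Opens after $failureThreshold consecutive recorded failures and lets
 * calls through again once $resetTimeoutSeconds have elapsed.
 */
final class SimpleCircuitBreaker
{
    private int $failures = 0;
    private ?float $openedAt = null;

    public function __construct(
        private int $failureThreshold = 3,
        private float $resetTimeoutSeconds = 30.0
    ) {
    }

    public function isOpen(): bool
    {
        if ($this->openedAt === null) {
            return false; // closed: calls flow normally
        }

        // After the timeout, allow calls again. The failure count is still at
        // the threshold, so the next recorded failure reopens the breaker.
        return (microtime(true) - $this->openedAt) < $this->resetTimeoutSeconds;
    }

    public function recordSuccess(): void
    {
        // Any success fully closes the breaker.
        $this->failures = 0;
        $this->openedAt = null;
    }

    public function recordFailure(): void
    {
        $this->failures++;
        if ($this->failures >= $this->failureThreshold) {
            $this->openedAt = microtime(true);
        }
    }
}
```

A production breaker would usually share its state across workers (for example in a cache) and limit the number of trial calls allowed after the timeout. This sketch leaves both out to keep the state transitions easy to follow.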