Error Recovery Patterns
Also Known As
TL;DR
Explanation
Error recovery patterns are architectural and code-level strategies that let a system detect, handle, and recover from failures while preserving data integrity and user experience. Key patterns include:
- Retry with exponential backoff: retry transient failures with increasing delays.
- Circuit breaker: stop calling a failing service to prevent cascading failures.
- Fallback: provide degraded functionality when the primary path fails.
- Compensation: undo the completed steps of a partially finished operation.
- Checkpoint/restart: save progress so work can resume after a crash.
Recovery differs from error handling: handling catches the exception, whereas recovery restores the system to a consistent state. Design for recovery from the start, because retrofitting it is expensive. Consider idempotency (an operation is safe to retry), observability (you can tell when recovery happened), and graceful degradation (partial functionality beats a total outage). Recovery patterns are especially critical in distributed systems, where network partitions, service unavailability, and partial failures are routine rather than exceptional.
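The fallback and observability ideas above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `withFallback`, the callbacks, and the sample data are hypothetical names, and the `mixed` return type assumes PHP 8.0 or later.

```php
<?php
declare(strict_types=1);

/**
 * Try the primary path; if it throws, report the failure and
 * return a degraded result from the fallback instead.
 */
function withFallback(callable $primary, callable $fallback, callable $onRecovery): mixed
{
    try {
        return $primary();
    } catch (Throwable $e) {
        // Observability: record that a recovery path was taken, and why.
        $onRecovery($e);
        return $fallback();
    }
}

// Usage: the live recommendation call fails, so a static list is served.
$events = [];
$result = withFallback(
    function (): array { throw new RuntimeException('recommendation service timed out'); },
    fn (): array => ['bestsellers'],
    function (Throwable $e) use (&$events): void { $events[] = $e->getMessage(); }
);
```

The caller still gets a usable value, and the `$onRecovery` hook keeps the failure visible instead of silently swallowing it.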
Common Misconception
Why It Matters
Common Mistakes
- Retrying without exponential backoff - hammering a struggling service makes recovery harder for everyone.
- No idempotency in retry logic - retrying a non-idempotent operation can duplicate side effects like payments or emails.
- Swallowing exceptions without restoring state - the error is hidden but the system remains in an inconsistent state.
- Missing compensation logic for multi-step operations - partial failure leaves data spread across services in conflicting states.
- Infinite retry loops without circuit breakers - a permanently failed dependency exhausts resources retrying forever.
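The idempotency mistake above is often addressed with an idempotency key: the caller sends the same key on every retry, and the receiver returns the stored result instead of repeating the side effect. The sketch below keeps the keys in memory purely for illustration; `IdempotentPayments` and its fields are hypothetical, and a real service would persist the keys.

```php
<?php
declare(strict_types=1);

/** Hypothetical payment client that remembers results by idempotency key. */
final class IdempotentPayments
{
    /** @var array<string, string> idempotency key => charge id */
    private array $processed = [];
    public int $chargesMade = 0;

    public function charge(string $idempotencyKey, int $amountCents): string
    {
        // A repeated key returns the stored result instead of charging again.
        if (isset($this->processed[$idempotencyKey])) {
            return $this->processed[$idempotencyKey];
        }
        $this->chargesMade++;
        $chargeId = 'ch_' . $this->chargesMade;
        $this->processed[$idempotencyKey] = $chargeId;
        return $chargeId;
    }
}

$payments = new IdempotentPayments();
$key = 'order-1001-charge';              // derived from the order, stable across retries
$first = $payments->charge($key, 4999);
$retry = $payments->charge($key, 4999);  // e.g. the first response was lost in transit
```

Both calls return the same charge id and only one charge is made, so the retry is safe.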
Avoid When
- Simple CRUD operations where database transactions provide atomicity.
- Fast-fail scenarios where immediate error feedback is more valuable than retry.
- Operations where the cost of retry exceeds the cost of failure.
When To Use
- Multi-step operations where partial failure leaves inconsistent state.
- External service calls that may fail transiently due to network or load.
- Long-running processes that should survive restarts.
- Financial or order processing where correctness is more important than availability.
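For the long-running case in the list above, checkpoint/restart can be sketched as follows. This assumes a single worker and a local checkpoint file; `processWithCheckpoint`, the file layout, and the simulated crash are all hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Process items in order, persisting the index of the next unprocessed
 * item after each one, so a restarted run resumes where the last one stopped.
 */
function processWithCheckpoint(array $items, string $checkpointFile, callable $handle): void
{
    $start = is_file($checkpointFile) ? (int) file_get_contents($checkpointFile) : 0;
    for ($i = $start; $i < count($items); $i++) {
        $handle($items[$i]);
        // Write to a temp file, then rename, so the checkpoint is never half-written.
        $tmp = $checkpointFile . '.tmp';
        file_put_contents($tmp, (string) ($i + 1));
        rename($tmp, $checkpointFile);
    }
}

// Usage: the handler fails once on 'c' to simulate a crash mid-run.
$file = sys_get_temp_dir() . '/checkpoint_' . uniqid('', true);
$done = [];
$crashOnce = true;
$handler = function (string $item) use (&$done, &$crashOnce): void {
    if ($item === 'c' && $crashOnce) {
        $crashOnce = false;
        throw new RuntimeException('simulated crash');
    }
    $done[] = $item;
};

try {
    processWithCheckpoint(['a', 'b', 'c', 'd'], $file, $handler);
} catch (RuntimeException $e) {
    // The first run stopped at 'c'; the checkpoint records that 'a' and 'b' are done.
}
processWithCheckpoint(['a', 'b', 'c', 'd'], $file, $handler); // resumes at 'c'
unlink($file);
```

A crash between handling an item and writing the checkpoint would cause that item to be handled again on restart, so the handler should itself be idempotent.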
Code Examples
// No recovery - partial failure leaves inconsistent state
function processOrder($order) {
    $this->inventory->reserve($order->items); // Step 1: succeeds
    $this->payment->charge($order->total);    // Step 2: fails!
    // Inventory is reserved but payment failed
    // No cleanup, no retry, customer stuck, stock locked
    $this->shipping->schedule($order);        // Never reached
}
// Retry without backoff - makes outage worse
function callExternalApi($data) {
    while (true) {
        try {
            return $this->api->send($data);
        } catch (Exception $e) {
            // Immediate retry - floods struggling service
            continue;
        }
    }
}
// Recovery pattern: compensation on failure
function processOrder($order): OrderResult {
    // Track what has completed so each catch block knows what to undo
    $reservationId = null;
    $paymentId = null;
    try {
        $reservationId = $this->inventory->reserve($order->items);
        $paymentId = $this->payment->charge($order->total);
        $this->shipping->schedule($order);
        return OrderResult::success($order->id);
    } catch (PaymentException $e) {
        // Compensate: release inventory reservation
        if ($reservationId !== null) {
            $this->inventory->release($reservationId);
        }
        return OrderResult::failed('Payment declined');
    } catch (ShippingException $e) {
        // Compensate in reverse order: refund payment, then release inventory
        if ($paymentId !== null) {
            $this->payment->refund($paymentId);
        }
        if ($reservationId !== null) {
            $this->inventory->release($reservationId);
        }
        return OrderResult::failed('Shipping unavailable');
    }
}
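The compensation example above hard-codes the undo logic per exception type. One way to generalize it, sketched here under the assumption that every step can be paired with an undo callback, is to collect compensations as steps complete and run them newest-first on failure. `runWithCompensation` and the step names are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Run steps in order; if one throws, run the compensations of the
 * steps that already completed, newest first, then rethrow.
 *
 * @param array<int, array{0: callable, 1: callable}> $steps [action, compensation] pairs
 */
function runWithCompensation(array $steps): void
{
    $completed = [];
    try {
        foreach ($steps as [$action, $compensate]) {
            $action();
            $completed[] = $compensate;
        }
    } catch (Throwable $e) {
        foreach (array_reverse($completed) as $compensate) {
            $compensate();
        }
        throw $e;
    }
}

// Usage: the third step fails, so the first two are undone in reverse order.
$log = [];
$steps = [
    [function () use (&$log) { $log[] = 'reserve'; }, function () use (&$log) { $log[] = 'release'; }],
    [function () use (&$log) { $log[] = 'charge'; }, function () use (&$log) { $log[] = 'refund'; }],
    [function () { throw new RuntimeException('shipping unavailable'); }, function () {}],
];
try {
    runWithCompensation($steps);
} catch (RuntimeException $e) {
    $log[] = 'failed: ' . $e->getMessage();
}
```

This sketch does not handle a compensation that itself fails; in practice that case usually needs its own retry or an alert for manual repair.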
// Retry with exponential backoff and circuit breaker
function callWithRecovery(callable $operation, int $maxRetries = 3): mixed {
    // Circuit open: skip the call entirely and serve the fallback
    if ($this->circuitBreaker->isOpen()) {
        return $this->fallback->execute();
    }
    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        try {
            $result = $operation();
            $this->circuitBreaker->recordSuccess();
            return $result;
        } catch (TransientException $e) {
            if ($attempt === $maxRetries) {
                break; // Out of attempts: do not sleep before giving up
            }
            $delay = min(100 * 2 ** $attempt, 10000); // 200ms, 400ms, 800ms... max 10s
            usleep($delay * 1000); // usleep() takes microseconds
        }
    }
    // All attempts failed: count it against the breaker and degrade
    $this->circuitBreaker->recordFailure();
    return $this->fallback->execute();
}
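The example above assumes a circuit breaker object exposing `isOpen()`, `recordSuccess()`, and `recordFailure()` but does not show one. A minimal sketch of such a class might look like this; the threshold, the cooldown, and the injectable clock are illustrative assumptions, and constructor property promotion requires PHP 8.0 or later.

```php
<?php
declare(strict_types=1);

/** Minimal circuit breaker sketch; threshold and cooldown values are illustrative. */
final class CircuitBreaker
{
    private int $failures = 0;
    private ?float $openedAt = null;
    /** @var callable(): float returns the current time in seconds */
    private $clock;

    public function __construct(
        private int $failureThreshold = 3,
        private float $cooldownSeconds = 30.0,
        ?callable $clock = null
    ) {
        $this->clock = $clock ?? fn (): float => microtime(true);
    }

    public function isOpen(): bool
    {
        if ($this->openedAt === null) {
            return false;
        }
        // After the cooldown, let a trial call through (half-open state).
        return ($this->clock)() - $this->openedAt < $this->cooldownSeconds;
    }

    public function recordSuccess(): void
    {
        $this->failures = 0;
        $this->openedAt = null;
    }

    public function recordFailure(): void
    {
        $this->failures++;
        if ($this->failures >= $this->failureThreshold) {
            // Open, or re-open after a failed trial call, and restart the cooldown.
            $this->openedAt = ($this->clock)();
        }
    }
}

// Usage with a fake clock so the cooldown can be stepped through without sleeping.
$now = 1000.0;
$breaker = new CircuitBreaker(2, 30.0, function () use (&$now): float { return $now; });
$breaker->recordFailure();
$openAfterOne = $breaker->isOpen();   // still below the threshold
$breaker->recordFailure();
$openAfterTwo = $breaker->isOpen();   // threshold reached, calls are blocked
$now += 31.0;
$openAfterCooldown = $breaker->isOpen(); // cooldown elapsed, a trial call is allowed
$breaker->recordSuccess();
```

Because the breaker only opens after repeated failures and closes again after a successful trial call, a permanently failed dependency stops consuming retries while a recovered one is picked up automatically.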