AI Fallback Routing
debt(d7/e7/b5/t7)
Closest to 'only careful code review or runtime testing' (d7), detection_hints.automated is no and the only signal is a code pattern grep for try/except around provider SDKs; quality drift on fallback paths is invisible without targeted evals.
Closest to 'cross-cutting refactor across the codebase' (e7), the quick_fix requires introducing a router abstraction with per-provider adapters, circuit breakers, schema validation, and tagged metrics — every LLM call site must move through it.
Closest to 'persistent productivity tax' (b5), applies_to spans web/queue/cli/library and every new LLM feature must conform to the router's adapter contract and eval gates, but it's a contained subsystem rather than defining overall system shape.
Closest to 'serious trap' (t7), the misconception explicitly notes developers think it's just try/except swapping models, when in reality naive swaps silently corrupt structured output, break tool-call formats, and bypass policy signals — contradicts how ordinary exception fallback works.
Also Known As
TL;DR
Explanation
AI fallback routing is the pattern of treating LLM calls as unreliable network operations that need explicit failure handling beyond a single retry. A production system rarely depends on one model: the primary might be a frontier model from one provider, with a cheaper model from a second provider as the next hop, an in-house or open-source model as a third hop, and finally a degraded non-AI path (cached response, templated reply, or hard error). The router decides which path to take based on signals: HTTP errors (429 rate limit, 5xx, connection timeout), structured output that fails schema validation, refusals or safety blocks, latency budgets exceeded, or cost ceilings hit for a tenant. Good fallback routing distinguishes transient from terminal failures. A 429 from OpenAI should first retry with backoff against the same model, then fail over to Anthropic; a schema validation error should retry with a stricter prompt before switching models; a content filter block should usually surface to the user, not silently retry on another provider that has weaker safety. Routing also needs to be observable - every fallback hop should emit a metric tagged with reason, source model, target model, and tenant, because a slow drift from primary to fallback is often the first sign of a provider incident, a prompt regression, or a cost attack. Common implementations use a chain-of-responsibility or strategy pattern over a unified client interface (LiteLLM, Portkey, OpenRouter, or in-house wrappers), with circuit breakers to stop hammering a degraded provider and to recover automatically once health checks pass. The trap is that fallback models are not equivalent: response format, tool calling syntax, context windows, and instruction-following all differ, so a fallback that works syntactically can silently degrade quality in ways no exception will reveal.
Common Misconception
Why It Matters
Common Mistakes
- Treating any exception as a reason to fall over, including content-policy blocks that should surface to the user instead.
- Not validating that the fallback model produces the same output shape (JSON schema, tool calls), causing downstream parsers to break on failover.
- Omitting circuit breakers, so a degraded primary gets hammered with retries that worsen the incident and drive up cost.
- Failing over to a cheaper model without re-running quality evals, so users on the fallback path silently get worse answers.
- Missing per-fallback observability, leaving the team blind to a slow shift from primary to backup that signals a real problem.
Avoid When
- Single low-stakes call paths where a hard failure is preferable to a degraded answer (e.g. dev tooling, internal one-shot scripts).
- Strict compliance contexts where the fallback provider is not approved for the data classification involved.
- Cases where the failure is a content policy block - silently re-routing to a less-restrictive provider is a safety regression, not resilience.
- Pipelines without per-model evaluation, where the fallback's quality is unknown and could ship worse answers than an outage would.
When To Use
- User-facing features where a provider outage would otherwise take the product offline.
- High-volume workloads where rate limits on a single provider are a routine bottleneck.
- Cost-sensitive paths that can downshift to a cheaper model under load while preserving the primary for premium tenants.
- Multi-region or regulated deployments that need provider diversity for availability or data-residency reasons.
Code Examples
# Naive try/except fallback - swaps providers blindly
import openai
import anthropic
def get_completion(prompt: str) -> str:
try:
r = openai.chat.completions.create(
model='gpt-4o',
messages=[{'role': 'user', 'content': prompt}],
response_format={'type': 'json_object'},
)
return r.choices[0].message.content
except Exception:
# Any failure flips to Claude - including content policy blocks
# Claude has no 'response_format' equivalent here, so parsers break
# No backoff, no circuit breaker, no metrics, no quality check
r = anthropic.Anthropic().messages.create(
model='claude-3-5-sonnet-latest',
max_tokens=1024,
messages=[{'role': 'user', 'content': prompt}],
)
return r.content[0].text
import time, json, logging
from dataclasses import dataclass
from jsonschema import validate, ValidationError
log = logging.getLogger(__name__)
class ContentPolicyBlock(Exception):
"""Raised by adapters when a provider returns a safety/policy refusal."""
class CircuitBreaker:
"""Minimal interface; real impl tracks failure window + half-open probes."""
def allow(self) -> bool: ...
def record_success(self) -> None: ...
def record_failure(self) -> None: ...
@dataclass
class Route:
name: str
call: callable # adapter that returns a normalised string
breaker: CircuitBreaker
TRANSIENT = (TimeoutError, ConnectionError)
def route_with_fallback(prompt: str, schema: dict, routes: list[Route]) -> str:
last_err = None
for route in routes:
if not route.breaker.allow():
log.info('fallback.skip', extra={'route': route.name, 'reason': 'breaker_open'})
continue
for attempt in range(2): # one retry before failing over
try:
raw = route.call(prompt)
validate(json.loads(raw), schema) # quality gate
route.breaker.record_success()
log.info('fallback.ok', extra={'route': route.name, 'attempt': attempt})
return raw
except ContentPolicyBlock:
raise # surface to user, do not fail over
except (ValidationError, *TRANSIENT) as e:
last_err = e
time.sleep(0.2 * (2 ** attempt))
except Exception as e:
last_err = e
route.breaker.record_failure()
log.warning('fallback.next', extra={'route': route.name, 'err': str(e)})
break
raise RuntimeError(f'all routes exhausted: {last_err}')