AI Fallback Routing

ai_ml Intermediate
debt(d7/e7/b5/t7)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7): detection_hints.automated is 'no', and the only signal is a grep for try/except blocks around provider SDK calls. Quality drift on fallback paths stays invisible without targeted evals.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7): the quick_fix requires introducing a router abstraction with per-provider adapters, circuit breakers, schema validation, and tagged metrics, and every LLM call site must move through it.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5): applies_to spans web/queue/cli/library, and every new LLM feature must conform to the router's adapter contract and eval gates. It remains a contained subsystem, though, rather than one that defines the overall system shape.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7): the misconception is that fallback is just a try/except that swaps models. In reality, naive swaps silently corrupt structured output, break tool-call formats, and bypass policy signals, which contradicts how ordinary exception fallback works.


Also Known As

llm failover, model routing, multi-provider routing, llm fallback chain

TL;DR

Automatically routing LLM requests to alternative models or providers when the primary fails, times out, or returns unusable output.

Explanation

AI fallback routing is the pattern of treating LLM calls as unreliable network operations that need explicit failure handling beyond a single retry. A production system rarely depends on one model: the primary might be a frontier model from one provider, with a cheaper model from a second provider as the next hop, an in-house or open-source model as a third hop, and finally a degraded non-AI path (cached response, templated reply, or hard error).

The router decides which path to take based on signals: HTTP errors (429 rate limit, 5xx, connection timeout), structured output that fails schema validation, refusals or safety blocks, exceeded latency budgets, or a tenant hitting its cost ceiling. Good fallback routing distinguishes transient from terminal failures. A 429 from OpenAI should first retry with backoff against the same model, then fail over to Anthropic; a schema validation error should retry with a stricter prompt before switching models; a content filter block should usually surface to the user, not silently retry on another provider with weaker safety filtering.

Routing also needs to be observable. Every fallback hop should emit a metric tagged with reason, source model, target model, and tenant, because a slow drift from primary to fallback is often the first sign of a provider incident, a prompt regression, or a cost attack.

Common implementations use a chain-of-responsibility or strategy pattern over a unified client interface (LiteLLM, Portkey, OpenRouter, or in-house wrappers), with circuit breakers that stop the router from hammering a degraded provider and let it recover automatically once health checks pass. The trap is that fallback models are not equivalent: response format, tool-calling syntax, context windows, and instruction-following all differ, so a fallback that works syntactically can silently degrade quality in ways no exception will reveal.
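
The transient-versus-terminal distinction above can be made explicit as a small classifier. The sketch below is illustrative only: the `Action` enum, the `classify` function, and its parameters are hypothetical names, and the policy (one retry on the same model before failing over) is an assumption rather than a recommendation from any provider.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_SAME = "retry_same"  # transient: back off, then retry the same model
    FAIL_OVER = "fail_over"    # this route is unusable for now: try the next hop
    SURFACE = "surface"        # terminal: report to the caller, do not re-route


def classify(status: Optional[int] = None, *, timed_out: bool = False,
             schema_invalid: bool = False, policy_block: bool = False,
             attempts: int = 0, max_retries: int = 1) -> Action:
    """Map one failure signal to a routing action (hypothetical policy)."""
    if policy_block:
        # A safety refusal is not an availability problem, so never re-route it.
        return Action.SURFACE
    retryable = (
        timed_out
        or schema_invalid
        or status == 429
        or (status is not None and 500 <= status < 600)
    )
    if retryable and attempts < max_retries:
        return Action.RETRY_SAME
    # Retries are used up, or the signal is not one we retry: move to the next hop.
    return Action.FAIL_OVER
```

Keeping this decision in one function means the retry budget and the "never re-route a policy block" rule live in one place instead of being scattered across except clauses.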

Common Misconception

Fallback routing is just a try/except around the API call that swaps in a different model on error. In reality, models behave differently enough that a naive swap can silently corrupt structured output, change tool-call formats, or bypass safety filters - fallbacks need per-target adapters and quality gates, not just exception handling.
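
One way to picture a per-target adapter is a wrapper that forces every route to return the same output shape. The sketch below assumes, purely for illustration, that a fallback model sometimes wraps its JSON in a Markdown code fence; `normalise_json_reply` and `make_adapter` are hypothetical helpers, and `raw_call` stands in for whatever provider-specific function a real system would use.

```python
import json
import re
from typing import Callable

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing readable
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def normalise_json_reply(text: str) -> str:
    """Return compact JSON text, or raise ValueError if the reply is not JSON."""
    stripped = text.strip()
    match = _FENCED.match(stripped)
    if match:
        stripped = match.group(1)  # keep only the fenced payload
    # json.loads raises json.JSONDecodeError (a ValueError) on non-JSON replies.
    return json.dumps(json.loads(stripped), separators=(",", ":"))


def make_adapter(raw_call: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a provider-specific call so every route returns the same shape."""
    def adapter(prompt: str) -> str:
        return normalise_json_reply(raw_call(prompt))
    return adapter
```

An adapter like this only covers output shape. Tool-call formats and safety behaviour would need their own per-target handling and evals.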

Why It Matters

Single-provider LLM dependencies are operational time bombs - rate limits, regional outages, and model deprecations are routine - and a well-designed fallback chain is the difference between graceful degradation and a feature outage that takes down dependent products.

Common Mistakes

  • Treating any exception as a reason to fail over, including content-policy blocks that should surface to the user instead.
  • Not validating that the fallback model produces the same output shape (JSON schema, tool calls), causing downstream parsers to break on failover.
  • Omitting circuit breakers, so a degraded primary gets hammered with retries that worsen the incident and drive up cost.
  • Failing over to a cheaper model without re-running quality evals, so users on the fallback path silently get worse answers.
  • Missing per-fallback observability, leaving the team blind to a slow shift from primary to backup that signals a real problem.
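
The circuit-breaker mistake in the list above is easier to reason about with a concrete breaker in hand. The following is a minimal sketch, assuming a consecutive-failure threshold and a fixed cooldown; `SimpleBreaker` and its parameters are hypothetical, and a production breaker would typically also limit how many probe requests pass while half-open.

```python
import time
from typing import Callable, Optional


class SimpleBreaker:
    """Opens after `threshold` consecutive failures; probes again after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests do not need to sleep
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: let requests through again only once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # a successful probe closes the breaker

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # (re)open and restart the cooldown
```

Because a failed probe re-opens the breaker and restarts the cooldown, a degraded provider sees a trickle of traffic instead of a retry storm.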

Avoid When

  • Single low-stakes call paths where a hard failure is preferable to a degraded answer (e.g. dev tooling, internal one-shot scripts).
  • Strict compliance contexts where the fallback provider is not approved for the data classification involved.
  • Cases where the failure is a content policy block - silently re-routing to a less-restrictive provider is a safety regression, not resilience.
  • Pipelines without per-model evaluation, where the fallback's quality is unknown and could ship worse answers than an outage would.
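
The compliance case in the list above can be enforced before routing starts, by removing routes that are not approved for the request's data classification. This is a sketch under assumptions: `RouteInfo`, `eligible_routes`, and the classification labels are hypothetical, and a real system would load approvals from its own policy source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteInfo:
    name: str
    approved_for: frozenset[str]  # data classifications this provider may receive


def eligible_routes(routes: list[RouteInfo], classification: str) -> list[RouteInfo]:
    """Keep only approved routes, preserving the configured fallback order."""
    return [r for r in routes if classification in r.approved_for]


# Hypothetical configuration: only the in-house model may see restricted data.
ROUTES = [
    RouteInfo("primary", frozenset({"public", "internal"})),
    RouteInfo("cheaper", frozenset({"public"})),
    RouteInfo("in_house", frozenset({"public", "internal", "restricted"})),
]
```

If the filtered list comes back empty, the safer behaviour is usually a hard failure, since there is no approved place to send the data.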

When To Use

  • User-facing features where a provider outage would otherwise take the product offline.
  • High-volume workloads where rate limits on a single provider are a routine bottleneck.
  • Cost-sensitive paths that can downshift to a cheaper model under load while preserving the primary for premium tenants.
  • Multi-region or regulated deployments that need provider diversity for availability or data-residency reasons.
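
The cost-sensitive case in the list above, downshifting under load while keeping the primary for premium tenants, might be expressed as a route-ordering function. The route names, tier labels, and the 0.8 load threshold below are all assumptions chosen for illustration.

```python
PRIMARY, CHEAPER, IN_HOUSE = "primary", "cheaper", "in_house"  # hypothetical route names


def route_order(tier: str, primary_load: float, shed_above: float = 0.8) -> list[str]:
    """Return route names in the order the router should try them."""
    if tier == "premium" or primary_load < shed_above:
        return [PRIMARY, CHEAPER, IN_HOUSE]
    # Under load, standard tenants skip the primary entirely so its
    # remaining capacity is kept for premium traffic.
    return [CHEAPER, IN_HOUSE]
```

For example, `route_order("standard", 0.9)` starts at the cheaper model, while `route_order("premium", 0.9)` still starts at the primary.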

Code Examples

✗ Vulnerable
# Naive try/except fallback - swaps providers blindly
import openai
import anthropic

def get_completion(prompt: str) -> str:
    try:
        r = openai.chat.completions.create(
            model='gpt-4o',
            messages=[{'role': 'user', 'content': prompt}],
            response_format={'type': 'json_object'},
        )
        return r.choices[0].message.content
    except Exception:
        # Any failure flips to Claude - including content policy blocks
        # Claude has no 'response_format' equivalent here, so parsers break
        # No backoff, no circuit breaker, no metrics, no quality check
        r = anthropic.Anthropic().messages.create(
            model='claude-3-5-sonnet-latest',
            max_tokens=1024,
            messages=[{'role': 'user', 'content': prompt}],
        )
        return r.content[0].text
✓ Fixed
import json
import logging
import time
from dataclasses import dataclass
from typing import Callable

from jsonschema import ValidationError, validate

log = logging.getLogger(__name__)

class ContentPolicyBlock(Exception):
    """Raised by adapters when a provider returns a safety/policy refusal."""

class CircuitBreaker:
    """Minimal interface; real impl tracks failure window + half-open probes."""
    def allow(self) -> bool: ...
    def record_success(self) -> None: ...
    def record_failure(self) -> None: ...

@dataclass
class Route:
    name: str
    call: Callable[[str], str]  # adapter that returns a normalised string
    breaker: CircuitBreaker

TRANSIENT = (TimeoutError, ConnectionError)
MAX_ATTEMPTS = 2  # one retry before failing over

def route_with_fallback(prompt: str, schema: dict, routes: list[Route]) -> str:
    last_err = None
    for route in routes:
        if not route.breaker.allow():
            log.info('fallback.skip', extra={'route': route.name, 'reason': 'breaker_open'})
            continue
        for attempt in range(MAX_ATTEMPTS):
            try:
                raw = route.call(prompt)
                validate(json.loads(raw), schema)  # quality gate
                route.breaker.record_success()
                log.info('fallback.ok', extra={'route': route.name, 'attempt': attempt})
                return raw
            except ContentPolicyBlock:
                raise  # surface to user, do not fail over
            except (ValidationError, json.JSONDecodeError, *TRANSIENT) as e:
                # Retryable: bad or non-JSON output, or a transient network error
                last_err = e
                if attempt + 1 < MAX_ATTEMPTS:
                    time.sleep(0.2 * (2 ** attempt))  # back off, then retry
                    continue
            except Exception as e:
                last_err = e  # unknown error: do not retry this route
            # Retries exhausted or non-retryable error: count it and move on
            route.breaker.record_failure()
            log.warning('fallback.next', extra={'route': route.name, 'err': str(last_err)})
            break
    raise RuntimeError(f'all routes exhausted: {last_err}')
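
Log lines alone make the slow drift from primary to fallback hard to see, so a router would usually also count hops. The sketch below is an in-memory stand-in for a real metrics client: `Hop`, `HopMetrics`, and the reason strings are hypothetical, and the tag set (reason, source, target, tenant) follows the one named in the Explanation.

```python
from collections import Counter
from typing import NamedTuple


class Hop(NamedTuple):
    reason: str  # e.g. "rate_limit", "timeout", "schema_invalid"
    source: str  # route that failed
    target: str  # route tried next
    tenant: str


class HopMetrics:
    """In-memory stand-in for a real metrics client (hypothetical interface)."""

    def __init__(self) -> None:
        self.hops: Counter = Counter()
        self.requests = 0

    def record_request(self) -> None:
        self.requests += 1

    def record_hop(self, reason: str, source: str, target: str, tenant: str) -> None:
        self.hops[Hop(reason, source, target, tenant)] += 1

    def fallback_rate(self, source: str) -> float:
        """Hops away from `source` per request; a rising value is the drift signal."""
        if self.requests == 0:
            return 0.0
        away = sum(n for hop, n in self.hops.items() if hop.source == source)
        return away / self.requests
```

Alerting on `fallback_rate("primary")` over a time window is one way to notice a provider incident or prompt regression before users report it.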

Added 21 May 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
Wrap LLM calls in a router with per-provider adapters, circuit breakers, schema validation, classified exceptions (transient vs terminal vs policy), and metrics tagged with route and reason.
📦 Applies To
any, web, queue-worker, cli, library
🔗 Prerequisites
🔍 Detection Hints
try:\s*[\s\S]{0,300}(openai|anthropic|bedrock)[\s\S]{0,300}except[\s\S]{0,200}(openai|anthropic|bedrock)
Auto-detectable: ✗ No
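
The detection pattern above can be run over source text with Python's `re` module to shortlist files for manual review. In this sketch the regex is copied from the hint, while `find_naive_fallbacks` is a hypothetical helper. Expect false positives and misses, since the pattern only looks for provider names near a try/except.

```python
import re

# Pattern taken from the detection hint, split across two literals for readability.
PATTERN = re.compile(
    r"try:\s*[\s\S]{0,300}(openai|anthropic|bedrock)"
    r"[\s\S]{0,300}except[\s\S]{0,200}(openai|anthropic|bedrock)"
)


def find_naive_fallbacks(source: str) -> list[int]:
    """Return 1-based line numbers where a matching try block starts."""
    return [source.count("\n", 0, m.start()) + 1 for m in PATTERN.finditer(source)]


SAMPLE = (
    "import openai, anthropic\n"
    "def f(p):\n"
    "    try:\n"
    "        return openai.chat.completions.create(model='m', messages=[])\n"
    "    except Exception:\n"
    "        return anthropic.Anthropic().messages.create(model='m', max_tokens=1, messages=[])\n"
)
```

Calling `find_naive_fallbacks(SAMPLE)` reports the line of the `try:` statement. A hit is a prompt for code review, not proof of a problem.
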
⚠ Related Problems
🤖 AI Agent
Confidence: Medium · False Positives: Medium · ✗ Manual fix · Fix: Medium · Context: File · Tests: Update
