{
    "slug": "ai_fallback_routing",
    "term": "AI Fallback Routing",
    "category": "ai_ml",
    "difficulty": "intermediate",
    "short": "Automatically routing LLM requests to alternative models or providers when the primary fails, times out, or returns unusable output.",
    "long": "AI fallback routing is the pattern of treating LLM calls as unreliable network operations that need explicit failure handling beyond a single retry. A production system rarely depends on one model: the primary might be a frontier model from one provider, with a cheaper model from a second provider as the next hop, an in-house or open-source model as a third hop, and finally a degraded non-AI path (cached response, templated reply, or hard error). The router decides which path to take based on signals: transport and HTTP errors (connection timeout, 429 rate limit, 5xx), structured output that fails schema validation, refusals or safety blocks, latency budgets exceeded, or cost ceilings hit for a tenant. Good fallback routing distinguishes transient from terminal failures. A 429 from OpenAI should first retry with backoff against the same model, then fail over to Anthropic; a schema validation error should retry with a stricter prompt before switching models; a content filter block should usually surface to the user, not silently retry on another provider with weaker safety filtering. Routing also needs to be observable - every fallback hop should emit a metric tagged with reason, source model, target model, and tenant, because a slow drift from primary to fallback is often the first sign of a provider incident, a prompt regression, or abusive traffic driving up cost. Common implementations use a chain-of-responsibility or strategy pattern over a unified client interface (LiteLLM, Portkey, OpenRouter, or in-house wrappers), with circuit breakers to stop hammering a degraded provider and to recover automatically once health checks pass. The trap is that fallback models are not equivalent: response format, tool calling syntax, context windows, and instruction-following all differ, so a fallback that works syntactically can silently degrade quality in ways no exception will reveal.",
    "aliases": [
        "llm failover",
        "model routing",
        "multi-provider routing",
        "llm fallback chain"
    ],
    "tags": [
        "ai_ml",
        "reliability",
        "llm-deployment",
        "circuit-breaker",
        "multi-provider",
        "resilience"
    ],
    "misconception": "Fallback routing is just a try/except around the API call that swaps in a different model on error. In reality, models behave differently enough that a naive swap can silently corrupt structured output, change tool-call formats, or bypass safety filters - fallbacks need per-target adapters and quality gates, not just exception handling.",
    "why_it_matters": "Single-provider LLM dependencies are operational time bombs - rate limits, regional outages, and model deprecations are routine - and a well-designed fallback chain is the difference between graceful degradation and a feature outage that takes down dependent products.",
    "common_mistakes": [
        "Treating any exception as a reason to fail over, including content-policy blocks that should surface to the user instead.",
        "Not validating that the fallback model produces the same output shape (JSON schema, tool calls), causing downstream parsers to break on failover.",
        "Omitting circuit breakers, so a degraded primary gets hammered with retries that worsen the incident and drive up cost.",
        "Failing over to a cheaper model without re-running quality evals, so users on the fallback path silently get worse answers.",
        "Missing per-fallback observability, leaving the team blind to a slow shift from primary to backup that signals a real problem."
    ],
    "when_to_use": [
        "User-facing features where a provider outage would otherwise take the product offline.",
        "High-volume workloads where rate limits on a single provider are a routine bottleneck.",
        "Cost-sensitive paths that can downshift to a cheaper model under load while preserving the primary for premium tenants.",
        "Multi-region or regulated deployments that need provider diversity for availability or data-residency reasons."
    ],
    "avoid_when": [
        "Single low-stakes call paths where a hard failure is preferable to a degraded answer (e.g. dev tooling, internal one-shot scripts).",
        "Strict compliance contexts where the fallback provider is not approved for the data classification involved.",
        "Cases where the failure is a content policy block - silently re-routing to a less-restrictive provider is a safety regression, not resilience.",
        "Pipelines without per-model evaluation, where the fallback's quality is unknown and could ship worse answers than an outage would."
    ],
    "related": [
        "ai_cost_management",
        "ai_observability",
        "large_language_models",
        "llm_structured_output",
        "ai_guardrails",
        "ai_evaluation_metrics"
    ],
    "prerequisites": [
        "large_language_models",
        "llm_structured_output",
        "ai_observability"
    ],
    "refs": [
        "https://docs.litellm.ai/docs/routing",
        "https://portkey.ai/docs/product/ai-gateway/fallbacks",
        "https://martinfowler.com/bliki/CircuitBreaker.html",
        "https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/"
    ],
    "bad_code": "# Naive try/except fallback - swaps providers blindly\nimport openai\nimport anthropic\n\ndef get_completion(prompt: str) -> str:\n    try:\n        r = openai.chat.completions.create(\n            model='gpt-4o',\n            messages=[{'role': 'user', 'content': prompt}],\n            response_format={'type': 'json_object'},\n        )\n        return r.choices[0].message.content\n    except Exception:\n        # Any failure flips to Claude - including content policy blocks\n        # Claude has no 'response_format' equivalent here, so parsers break\n        # No backoff, no circuit breaker, no metrics, no quality check\n        r = anthropic.Anthropic().messages.create(\n            model='claude-3-5-sonnet-latest',\n            max_tokens=1024,\n            messages=[{'role': 'user', 'content': prompt}],\n        )\n        return r.content[0].text",
    "good_code": "import json, logging, time\nfrom dataclasses import dataclass\nfrom typing import Callable\nfrom jsonschema import validate, ValidationError\n\nlog = logging.getLogger(__name__)\n\nclass ContentPolicyBlock(Exception):\n    \"\"\"Raised by adapters when a provider returns a safety/policy refusal.\"\"\"\n\n@dataclass\nclass CircuitBreaker:\n    \"\"\"Minimal sketch: opens after consecutive failures, re-probes after a cooldown.\"\"\"\n    threshold: int = 3\n    cooldown_s: float = 30.0\n    failures: int = 0\n    opened_at: float = 0.0\n\n    def allow(self) -> bool:\n        cooled = time.monotonic() - self.opened_at >= self.cooldown_s\n        return self.failures < self.threshold or cooled\n\n    def record_success(self) -> None:\n        self.failures = 0\n\n    def record_failure(self) -> None:\n        self.failures += 1\n        self.opened_at = time.monotonic()\n\n@dataclass\nclass Route:\n    name: str\n    call: Callable[[str], str]  # adapter that returns a normalised string\n    breaker: CircuitBreaker\n\n# Adapters are expected to map provider 429/5xx/timeouts onto these types.\nTRANSIENT = (TimeoutError, ConnectionError)\nMAX_ATTEMPTS = 2  # one retry before failing over\n\ndef route_with_fallback(prompt: str, schema: dict, routes: list[Route]) -> str:\n    last_err = None\n    for route in routes:\n        if not route.breaker.allow():\n            log.info('fallback.skip', extra={'route': route.name, 'reason': 'breaker_open'})\n            continue\n        for attempt in range(MAX_ATTEMPTS):\n            try:\n                raw = route.call(prompt)\n                validate(json.loads(raw), schema)  # quality gate\n                route.breaker.record_success()\n                log.info('fallback.ok', extra={'route': route.name, 'attempt': attempt})\n                return raw\n            except ContentPolicyBlock:\n                raise  # surface to user, do not fail over\n            except (json.JSONDecodeError, ValidationError, *TRANSIENT) as e:\n                last_err = e  # retryable: malformed output or transient fault\n                if attempt + 1 < MAX_ATTEMPTS:\n                    time.sleep(0.2 * (2 ** attempt))  # backoff only before a retry\n            except Exception as e:\n                last_err = e  # terminal for this route: fail over immediately\n                break\n        # Reached only when the route gave up: count it against the breaker.\n        route.breaker.record_failure()\n        log.warning('fallback.next', extra={'route': route.name, 'err': str(last_err)})\n    raise RuntimeError(f'all routes exhausted: {last_err}') from last_err",
    "quick_fix": "Wrap LLM calls in a router with per-provider adapters, circuit breakers, schema validation, classified exceptions (transient vs terminal vs policy), and metrics tagged with route and reason.",
    "severity": "high",
    "effort": "medium",
    "created": "2026-05-21",
    "updated": "2026-05-21",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/ai_fallback_routing",
        "html_url": "https://codeclaritylab.com/glossary/ai_fallback_routing",
        "json_url": "https://codeclaritylab.com/glossary/ai_fallback_routing.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[AI Fallback Routing](https://codeclaritylab.com/glossary/ai_fallback_routing) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/ai_fallback_routing"
            }
        }
    }
}