{
    "slug": "ai_model_selection_criteria",
    "term": "AI Model Selection Criteria",
    "category": "ai_ml",
    "difficulty": "intermediate",
    "short": "The systematic factors engineers weigh when choosing an LLM for a task: capability, cost, latency, context window, modality, hosting, and licensing.",
    "long": "Choosing an LLM for a production feature is an engineering trade-off, not a benchmark beauty contest. The dimensions that matter in practice are: task-specific capability (does it actually solve your task on your data, not just on MMLU), cost per million input and output tokens, latency at p50 and p99, context window size, supported modalities (text, vision, audio), structured output reliability (JSON mode, function calling), rate limits and quota, data residency and privacy (can you send your data to this provider), licensing terms for outputs, fine-tuning availability, and hosting options (managed API vs self-hosted open weights). A common anti-pattern is defaulting to the largest frontier model for every task — most production workloads (classification, extraction, summarisation, routing) run well on smaller, cheaper, faster models. Conversely, teams sometimes pick a tiny model to save costs and then spend weeks on prompt engineering to compensate for capability gaps. The right method is to define an evaluation set drawn from real production data, score 3-5 candidate models against it on accuracy, cost, and latency, and pick the cheapest model that meets your quality bar. Re-evaluate quarterly: model providers ship faster, cheaper variants frequently, and a model swap can cut your bill significantly. Also consider operational concerns: vendor lock-in (proprietary function-calling formats), SLA guarantees, regional availability, and whether the provider trains on your data by default.",
    "aliases": [
        "LLM selection",
        "model selection",
        "choosing an LLM",
        "model evaluation criteria"
    ],
    "tags": [
        "ai_ml",
        "llm",
        "model-selection",
        "cost-optimisation",
        "evaluation"
    ],
    "misconception": "The model with the highest benchmark score is always the right choice — in practice the cheapest model that passes your task-specific evaluation set is usually the right choice, since cost and latency dominate at scale.",
    "why_it_matters": "Model choice is often the single largest lever on the cost and latency of an AI feature: picking wrong can inflate your bill by an order of magnitude and push response times into multiple seconds, while picking right unlocks features that would otherwise be uneconomic.",
    "common_mistakes": [
        "Choosing the most powerful frontier model for every task without testing whether a smaller model is sufficient.",
        "Picking a model based on public benchmarks (MMLU, HumanEval) instead of an evaluation set built from your actual production data.",
        "Ignoring p99 latency: the model's average latency looks fine, but tail responses exceed user-facing request timeouts.",
        "Not checking data privacy terms — some providers train on customer data by default unless you opt out or use an enterprise tier.",
        "Locking into provider-specific features (proprietary tool-calling schemas) without an abstraction layer, making future migration painful."
    ],
    "when_to_use": [
        "Before launching an AI feature to production with meaningful traffic, where model cost or latency will impact the business.",
        "When AI API spend becomes a noticeable line item — re-run selection quarterly as new model variants ship.",
        "When latency SLAs are tight and a smaller, faster model could meet quality requirements.",
        "When data residency, privacy, or licensing constraints rule out certain providers and you need to compare the viable subset."
    ],
    "avoid_when": [
        "Prototyping or early-stage exploration — pick a capable default and defer selection until you have real usage data.",
        "The task is one-off or low-volume — the engineering effort to compare models exceeds the savings."
    ],
    "related": [
        "large_language_models",
        "ai_evaluation_metrics",
        "ai_cost_management",
        "ai_observability",
        "fine_tuning",
        "llm_context_window",
        "llm_structured_output"
    ],
    "prerequisites": [
        "large_language_models",
        "ai_evaluation_metrics",
        "ai_cost_management"
    ],
    "refs": [
        "https://platform.openai.com/docs/models",
        "https://docs.anthropic.com/en/docs/about-claude/models",
        "https://artificialanalysis.ai",
        "https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard"
    ],
    "bad_code": "// Default to the biggest frontier model for every task\n$response = $llm->complete([\n    'model'  => 'gpt-4o',  // expensive, slow, overkill for classification\n    'prompt' => \"Classify this support ticket as: billing, technical, account, other.\\n\\n{$ticket}\",\n]);\n// No evaluation, no cost tracking, no comparison to cheaper alternatives",
    "good_code": "// Evaluate candidates against real data before committing\n// Prices are illustrative input-token rates (USD per 1M); check each provider's current pricing\n$candidates = [\n    ['model' => 'gpt-4o-mini',              'cost_per_1m_in' => 0.15],\n    ['model' => 'claude-3-5-haiku-latest',  'cost_per_1m_in' => 1.00],\n    ['model' => 'gpt-4o',                   'cost_per_1m_in' => 2.50],\n];\n\n$evalSet = ProductionTickets::sample(200); // real labelled data\n$results = [];\n\nforeach ($candidates as $candidate) {\n    $scores = [];\n    $latencies = [];\n    foreach ($evalSet as $ticket) {\n        $start = microtime(true);\n        $prediction = $llm->classify($candidate['model'], $ticket->text);\n        $latencies[] = (microtime(true) - $start) * 1000; // elapsed milliseconds\n        $scores[]    = $prediction === $ticket->label ? 1 : 0;\n    }\n    $results[$candidate['model']] = [\n        'accuracy'    => array_sum($scores) / count($scores),\n        'p99_ms'      => Stats::percentile($latencies, 99),\n        'cost_per_1m' => $candidate['cost_per_1m_in'],\n    ];\n}\n\n// Pick cheapest model meeting 95% accuracy and p99 < 2000ms\n$chosen = ModelPicker::cheapestPassing($results, accuracy: 0.95, p99Ms: 2000);",
    "quick_fix": "Build a 50-200 example evaluation set from real production data, run 3-5 candidate models against it, and pick the cheapest model that meets your quality bar at acceptable latency",
    "severity": "medium",
    "effort": "medium",
    "created": "2026-05-21",
    "updated": "2026-05-21",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/ai_model_selection_criteria",
        "html_url": "https://codeclaritylab.com/glossary/ai_model_selection_criteria",
        "json_url": "https://codeclaritylab.com/glossary/ai_model_selection_criteria.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[AI Model Selection Criteria](https://codeclaritylab.com/glossary/ai_model_selection_criteria) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/ai_model_selection_criteria"
            }
        }
    }
}