AI Model Selection Criteria

ai_ml Intermediate
debt(d9/e7/b7/t7)
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). Detection cannot be automated: no tool flags 'you picked the wrong model', so the mistake surfaces as inflated bills or latency complaints in production.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick fix itself only requires building an eval set and re-testing, but if the codebase hardcodes model names everywhere and depends on provider-specific tool-calling schemas, swapping models becomes a cross-cutting change that touches every AI call site.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). Applies to web, CLI, and queue-worker contexts: model choice shapes cost structure, latency budgets, prompt engineering, and provider lock-in across the entire AI surface of the system, so every new feature is shaped by it.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The misconception that the highest benchmark score means the best choice contradicts the actual economics: competent developers reach for frontier models by default, and the cheapest-model-that-passes-your-eval heuristic feels counterintuitive until you have been burned by a bill or a latency SLA.


Also Known As

LLM selection, model selection, choosing an LLM, model evaluation criteria

TL;DR

The systematic factors engineers weigh when choosing an LLM for a task: capability, cost, latency, context window, modality, hosting, and licensing.

Explanation

Choosing an LLM for a production feature is an engineering trade-off, not a benchmark beauty contest. The dimensions that matter in practice are:

  • Task-specific capability: does it actually solve your task on your data, not just on MMLU?
  • Cost per million input and output tokens.
  • Latency at p50 and p99.
  • Context window size.
  • Supported modalities (text, vision, audio).
  • Structured output reliability (JSON mode, function calling).
  • Rate limits and quota.
  • Data residency and privacy: are you allowed to send your data to this provider?
  • Licensing terms for outputs.
  • Fine-tuning availability.
  • Hosting options (managed API vs self-hosted open weights).

A common anti-pattern is defaulting to the largest frontier model for every task. Most production workloads (classification, extraction, summarisation, routing) run well on smaller, cheaper, faster models. The opposite mistake also happens: teams pick a tiny model to save costs and then spend weeks on prompt engineering to compensate for capability gaps.

The right method is to define an evaluation set drawn from real production data, score 3-5 candidate models against it on accuracy, cost, and latency, and pick the cheapest model that meets your quality bar. Re-evaluate quarterly: model providers frequently ship faster, cheaper variants, and a model swap can cut your bill significantly.

Also weigh operational concerns: vendor lock-in (proprietary function-calling formats), SLA guarantees, regional availability, and whether the provider trains on your data by default.
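
To make the cost dimension concrete, here is a rough sketch of the arithmetic behind a monthly bill. The function name, the workload figures, and the per-million-token prices are all illustrative assumptions, not quotes from any provider; substitute your own traffic numbers and current price sheets.

```php
<?php
declare(strict_types=1);

/**
 * Estimate monthly spend in USD for one model.
 * Prices are per one million tokens, which is how providers usually quote them.
 */
function estimateMonthlyCostUsd(
    int $requestsPerMonth,
    int $avgInputTokens,
    int $avgOutputTokens,
    float $pricePerMillionIn,
    float $pricePerMillionOut
): float {
    $inputCost  = $requestsPerMonth * $avgInputTokens  / 1_000_000 * $pricePerMillionIn;
    $outputCost = $requestsPerMonth * $avgOutputTokens / 1_000_000 * $pricePerMillionOut;

    return $inputCost + $outputCost;
}

// Hypothetical workload: 1M classification requests per month,
// 500 input tokens and 20 output tokens per request.
$small = estimateMonthlyCostUsd(1_000_000, 500, 20, 0.15, 0.60);   // assumed small-model prices
$large = estimateMonthlyCostUsd(1_000_000, 500, 20, 2.50, 10.00);  // assumed large-model prices

printf("small: %.2f USD, large: %.2f USD, ratio: %.1fx\n", $small, $large, $large / $small);
// prints "small: 87.00 USD, large: 1450.00 USD, ratio: 16.7x"
```

Under these assumed numbers the gap is more than an order of magnitude for a short-output task, which is why the comparison is worth running before traffic scales.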

Common Misconception

The model with the highest benchmark score is always the right choice — in practice the cheapest model that passes your task-specific evaluation set is usually the right choice, since cost and latency dominate at scale.

Why It Matters

Model choice is usually the single largest lever on the cost and latency of an AI feature. Picking wrong can inflate your bill by an order of magnitude and push response times into multiple seconds, while picking right makes features viable that would otherwise be uneconomic.

Common Mistakes

  • Choosing the most powerful frontier model for every task without testing whether a smaller model is sufficient.
  • Picking a model based on public benchmarks (MMLU, HumanEval) instead of an evaluation set built from your actual production data.
  • Ignoring p99 latency: the model looks fast on average, but its slowest responses exceed request timeouts.
  • Not checking data privacy terms — some providers train on customer data by default unless you opt out or use an enterprise tier.
  • Locking into provider-specific features (proprietary tool-calling schemas) without an abstraction layer, making future migration painful.
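
One way to soften the lock-in and hardcoding problems is to resolve model names through a single lookup instead of repeating them at every call site. The sketch below is a minimal illustration, not a prescribed design: the `ModelRegistry` class, the task names, and the model identifiers are hypothetical placeholders.

```php
<?php
declare(strict_types=1);

/**
 * Maps a task name to a model identifier so call sites never hardcode one.
 * Swapping a model then means changing one configuration entry.
 */
final class ModelRegistry
{
    /** @param array<string, string> $taskToModel */
    public function __construct(
        private array $taskToModel,
        private string $defaultModel
    ) {}

    /** Returns the configured model for a task, or the default for unknown tasks. */
    public function modelFor(string $task): string
    {
        return $this->taskToModel[$task] ?? $this->defaultModel;
    }
}

// Placeholder task names and model IDs; in practice these would come from config.
$registry = new ModelRegistry(
    taskToModel: [
        'ticket_classification' => 'small-fast-model',
        'contract_summary'      => 'large-capable-model',
    ],
    defaultModel: 'small-fast-model',
);

echo $registry->modelFor('ticket_classification'), "\n"; // small-fast-model
echo $registry->modelFor('unknown_task'), "\n";          // small-fast-model (default)
```

This only centralises the model name. Provider-specific request and tool-calling formats would still need their own adapter layer.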

Avoid When

  • Prototyping or early-stage exploration — pick a capable default and defer selection until you have real usage data.
  • The task is one-off or low-volume — the engineering effort to compare models exceeds the savings.

When To Use

  • Before launching an AI feature to production with meaningful traffic, where model cost or latency will impact the business.
  • When AI API spend becomes a noticeable line item — re-run selection quarterly as new model variants ship.
  • When latency SLAs are tight and a smaller, faster model could meet quality requirements.
  • When data residency, privacy, or licensing constraints rule out certain providers and you need to compare the viable subset.
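
When hard constraints apply, it can help to filter the candidate list before spending any evaluation budget. The following is a small sketch of that filtering step. The candidate metadata, field names, and the `viableCandidates` function are assumptions made for illustration; the real values have to come from each provider's actual terms.

```php
<?php
declare(strict_types=1);

// Hypothetical candidate metadata; fill these in from each provider's documentation.
$candidates = [
    ['model' => 'model-a', 'regions' => ['eu', 'us'], 'trains_on_data_by_default' => false],
    ['model' => 'model-b', 'regions' => ['us'],       'trains_on_data_by_default' => false],
    ['model' => 'model-c', 'regions' => ['eu'],       'trains_on_data_by_default' => true],
];

/** Keep only candidates hosted in the required region that do not train on customer data. */
function viableCandidates(array $candidates, string $requiredRegion): array
{
    return array_values(array_filter(
        $candidates,
        fn (array $c): bool =>
            in_array($requiredRegion, $c['regions'], true)
            && !$c['trains_on_data_by_default']
    ));
}

$viable = viableCandidates($candidates, 'eu');
echo implode(', ', array_column($viable, 'model')), "\n"; // model-a
```

Only the models that survive this filter then need to be scored against the evaluation set.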

Code Examples

✗ Vulnerable
// Default to the biggest frontier model for every task
$response = $llm->complete([
    'model'  => 'gpt-4o',  // expensive, slow, overkill for classification
    'prompt' => "Classify this support ticket as: billing, technical, account, other.\n\n{$ticket}",
]);
// No evaluation, no cost tracking, no comparison to cheaper alternatives
✓ Fixed
// Evaluate candidates against real data before committing
$candidates = [
    ['model' => 'gpt-4o-mini',              'cost_per_1m_in' => 0.15],
    ['model' => 'claude-3-5-haiku-latest',  'cost_per_1m_in' => 0.80], // list prices change; verify before relying on them
    ['model' => 'gpt-4o',                   'cost_per_1m_in' => 2.50],
];

$evalSet = ProductionTickets::sample(200); // real labelled data
$results = [];                             // per-model metrics, keyed by model name

foreach ($candidates as $candidate) {
    $scores = [];
    $latencies = [];
    foreach ($evalSet as $ticket) {
        $start = microtime(true);
        $prediction = $llm->classify($candidate['model'], $ticket->text);
        $latencies[] = (microtime(true) - $start) * 1000;
        $scores[]    = $prediction === $ticket->label ? 1 : 0;
    }
    $results[$candidate['model']] = [
        'accuracy'    => array_sum($scores) / count($scores),
        'p99_ms'      => Stats::percentile($latencies, 99),
        'cost_per_1m' => $candidate['cost_per_1m_in'],
    ];
}

// Pick cheapest model meeting 95% accuracy and p99 < 2000ms
$chosen = ModelPicker::cheapestPassing($results, accuracy: 0.95, p99Ms: 2000);
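
The example above calls `Stats::percentile` and `ModelPicker::cheapestPassing` without defining them. One possible implementation is sketched here, assuming the nearest-rank percentile method and the same result keys (`accuracy`, `p99_ms`, `cost_per_1m`) used above. Both classes are hypothetical helpers, not part of any library.

```php
<?php
declare(strict_types=1);

final class Stats
{
    /** Nearest-rank percentile: the smallest sample with at least $p percent of values at or below it. */
    public static function percentile(array $values, float $p): float
    {
        if ($values === [] || $p <= 0 || $p > 100) {
            throw new InvalidArgumentException('Need a non-empty sample and 0 < p <= 100.');
        }
        sort($values); // sorts the local copy only
        $rank = (int) ceil($p * count($values) / 100);

        return (float) $values[$rank - 1];
    }
}

final class ModelPicker
{
    /** Returns the name of the cheapest model meeting both thresholds, or null if none does. */
    public static function cheapestPassing(array $results, float $accuracy, float $p99Ms): ?string
    {
        $passing = array_filter(
            $results,
            fn (array $r): bool => $r['accuracy'] >= $accuracy && $r['p99_ms'] < $p99Ms
        );
        if ($passing === []) {
            return null;
        }
        // Sort by cost while preserving the model-name keys.
        uasort($passing, fn (array $a, array $b): int => $a['cost_per_1m'] <=> $b['cost_per_1m']);

        return (string) array_key_first($passing);
    }
}

// Made-up results for three placeholder models.
$results = [
    'model-a' => ['accuracy' => 0.97, 'p99_ms' => 900.0,  'cost_per_1m' => 0.15],
    'model-b' => ['accuracy' => 0.99, 'p99_ms' => 1800.0, 'cost_per_1m' => 2.50],
    'model-c' => ['accuracy' => 0.91, 'p99_ms' => 400.0,  'cost_per_1m' => 0.05],
];

echo ModelPicker::cheapestPassing($results, accuracy: 0.95, p99Ms: 2000), "\n"; // model-a
```

Here `model-c` is the cheapest but misses the accuracy bar, so the picker falls back to the cheapest model that passes both thresholds. Returning `null` when nothing passes forces the caller to decide whether to relax the bar or add candidates.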

Added 21 May 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: Medium
⚡ Quick Fix
Build a 50-200 example evaluation set from real production data, run 3-5 candidate models against it, and pick the cheapest model that meets your quality bar at acceptable latency
📦 Applies To
any web cli queue-worker
🔗 Prerequisites
🔍 Detection Hints
Hardcoded model name (e.g. 'gpt-4o', 'claude-opus') across the codebase with no configuration abstraction and no evaluation harness in the repository
Auto-detectable: ✗ No
⚠ Related Problems
🤖 AI Agent
Confidence: Medium False Positives: Medium ✗ Manual fix Fix: Medium Context: File Tests: Update