AI Model Selection Criteria
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9): detection_hints.automated is no, because no tool flags 'you picked the wrong model'; the problem surfaces as inflated bills or latency complaints in production.
Closest to 'cross-cutting refactor across the codebase' (e7): the quick_fix itself only requires building an eval set and re-testing, but if the codebase hardcodes model names everywhere (per code_pattern) and locks into provider-specific tool-calling schemas, swapping models becomes a cross-cutting change that touches every AI call site.
Closest to 'strong gravitational pull' (b7): applies to web/cli/queue-worker contexts. Model choice shapes cost structure, latency budgets, prompt engineering, and provider lock-in across the entire AI surface of the system, so every new feature is shaped by it.
Closest to 'serious trap' (t7): the misconception that the highest benchmark score means the best choice contradicts the actual economics. Competent developers reach for frontier models by default, and the cheapest-model-that-passes-the-eval heuristic is counterintuitive until you have been burned by a bill or a latency SLA.
Also Known As
TL;DR
Explanation
Choosing an LLM for a production feature is an engineering trade-off, not a benchmark beauty contest. The dimensions that matter in practice are: task-specific capability (does it solve your task on your data, not just on MMLU), cost per million input and output tokens, latency at p50 and p99, context window size, supported modalities (text, vision, audio), structured output reliability (JSON mode, function calling), rate limits and quotas, data residency and privacy (whether you are allowed to send your data to this provider), licensing terms for outputs, fine-tuning availability, and hosting options (managed API vs self-hosted open weights).
A common anti-pattern is defaulting to the largest frontier model for every task; most production workloads (classification, extraction, summarisation, routing) run well on smaller, cheaper, faster models. The opposite mistake also happens: teams pick a tiny model to save money and then spend weeks on prompt engineering to compensate for capability gaps.
The right method is to define an evaluation set drawn from real production data, score three to five candidate models against it on accuracy, cost, and latency, and pick the cheapest model that meets your quality bar. Re-evaluate quarterly: providers frequently ship faster, cheaper variants, and swapping to one can cut your bill for the same quality bar. Also weigh operational concerns: vendor lock-in (proprietary function-calling formats), SLA guarantees, regional availability, and whether the provider trains on your data by default.
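To make the cost dimension concrete, here is a minimal sketch of a monthly spend estimate from per-million-token prices. The function name `estimateMonthlyCost`, the workload figures, and the prices are all assumptions chosen for illustration; substitute your own traffic numbers and your provider's current price list.

```php
<?php
// Illustrative sketch: prices and token counts below are assumptions, not quotes.

/**
 * Estimate monthly spend in USD for one model.
 * Prices are expressed in USD per 1,000,000 tokens.
 */
function estimateMonthlyCost(
    int $requestsPerMonth,
    int $avgInputTokens,
    int $avgOutputTokens,
    float $pricePer1mInput,
    float $pricePer1mOutput
): float {
    $perRequest = ($avgInputTokens / 1_000_000) * $pricePer1mInput
                + ($avgOutputTokens / 1_000_000) * $pricePer1mOutput;

    return $perRequest * $requestsPerMonth;
}

// Hypothetical workload: 1M requests/month, 500 input and 20 output tokens each.
$small = estimateMonthlyCost(1_000_000, 500, 20, 0.15, 0.60);
$large = estimateMonthlyCost(1_000_000, 500, 20, 2.50, 10.00);

printf("small: %.2f USD/month, large: %.2f USD/month\n", $small, $large);
```

Under these assumed numbers the gap is roughly 87 USD versus 1450 USD per month, which is why the comparison is worth running before traffic scales.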
Common Misconception
Why It Matters
Common Mistakes
- Choosing the most powerful frontier model for every task without testing whether a smaller model is sufficient.
- Picking a model based on public benchmarks (MMLU, HumanEval) instead of an evaluation set built from your actual production data.
- Ignoring p99 latency: the average looks fine, but tail responses exceed the request timeout and fail user requests.
- Not checking data privacy terms — some providers train on customer data by default unless you opt out or use an enterprise tier.
- Locking into provider-specific features (proprietary tool-calling schemas) without an abstraction layer, making future migration painful.
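One way to limit the lock-in described in the last item is a thin interface between call sites and providers, with the model name held in a single config entry. The sketch below is hypothetical throughout: `TicketClassifier`, `KeywordStubClassifier`, `makeClassifier`, and the config keys are illustrative names, not part of any SDK. The stub stands in for a real provider adapter so the example runs without network access.

```php
<?php
// Hypothetical interface: none of these names come from a real SDK.
interface TicketClassifier
{
    public function classify(string $ticketText): string;
}

// A real adapter would wrap one provider's SDK and its tool-calling format.
// This stub only matches a keyword so the sketch is runnable offline.
final class KeywordStubClassifier implements TicketClassifier
{
    public function __construct(private string $model) {}

    public function classify(string $ticketText): string
    {
        return str_contains(strtolower($ticketText), 'invoice') ? 'billing' : 'other';
    }
}

// The model name lives in one config entry, not at every call site.
$config = [
    'ticket_classifier' => [
        'adapter' => KeywordStubClassifier::class,
        'model'   => 'small-model-v1', // placeholder model name
    ],
];

function makeClassifier(array $config): TicketClassifier
{
    $entry = $config['ticket_classifier'];
    $adapterClass = $entry['adapter'];

    return new $adapterClass($entry['model']);
}

$classifier = makeClassifier($config);
echo $classifier->classify('My invoice is wrong'), "\n";
```

With this shape, swapping models means changing the config entry and, at most, adding one adapter class, while call sites keep depending only on `TicketClassifier`.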
Avoid When
- Prototyping or early-stage exploration — pick a capable default and defer selection until you have real usage data.
- The task is one-off or low-volume — the engineering effort to compare models exceeds the savings.
When To Use
- Before launching an AI feature to production with meaningful traffic, where model cost or latency will impact the business.
- When AI API spend becomes a noticeable line item — re-run selection quarterly as new model variants ship.
- When latency SLAs are tight and a smaller, faster model could meet quality requirements.
- When data residency, privacy, or licensing constraints rule out certain providers and you need to compare the viable subset.
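For the constraint-driven case in the last item, hard requirements can be applied as a filter before any evaluation is run. This is a rough sketch: the candidate names, the `regions` and `trains_on_data` fields, and the `viableCandidates` helper are assumptions for illustration; real values would come from provider contracts and documentation.

```php
<?php
// Hypothetical candidate metadata; in practice this comes from contracts and provider docs.
$candidates = [
    ['model' => 'model-a', 'regions' => ['eu', 'us'], 'trains_on_data' => false],
    ['model' => 'model-b', 'regions' => ['us'],       'trains_on_data' => false],
    ['model' => 'model-c', 'regions' => ['eu'],       'trains_on_data' => true],
];

/**
 * Keep only candidates hosted in the required region that do not train on customer data.
 */
function viableCandidates(array $candidates, string $requiredRegion): array
{
    return array_values(array_filter(
        $candidates,
        fn (array $c): bool => in_array($requiredRegion, $c['regions'], true)
            && !$c['trains_on_data']
    ));
}

$viable = viableCandidates($candidates, 'eu');
echo implode(', ', array_column($viable, 'model')), "\n";
```

Only the models that survive this filter need to go through the accuracy, cost, and latency comparison.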
Code Examples
// Default to the biggest frontier model for every task
$response = $llm->complete([
    'model' => 'gpt-4o', // expensive, slow, overkill for classification
    'prompt' => "Classify this support ticket as: billing, technical, account, other.\n\n{$ticket}",
]);
// No evaluation, no cost tracking, no comparison to cheaper alternatives

// Evaluate candidates against real data before committing.
// Prices are USD per 1M input tokens and are illustrative; check current provider pricing.
$candidates = [
    ['model' => 'gpt-4o-mini', 'cost_per_1m_in' => 0.15],
    ['model' => 'claude-3-5-haiku-latest', 'cost_per_1m_in' => 1.00],
    ['model' => 'gpt-4o', 'cost_per_1m_in' => 2.50],
];
$evalSet = ProductionTickets::sample(200); // real labelled data
$results = []; // keyed by model name
foreach ($candidates as $candidate) {
    $scores = [];
    $latencies = [];
    foreach ($evalSet as $ticket) {
        $start = microtime(true);
        $prediction = $llm->classify($candidate['model'], $ticket->text);
        $latencies[] = (microtime(true) - $start) * 1000; // wall-clock milliseconds
        $scores[] = $prediction === $ticket->label ? 1 : 0;
    }
    $results[$candidate['model']] = [
        'accuracy' => array_sum($scores) / count($scores),
        'p99_ms' => Stats::percentile($latencies, 99),
        'cost_per_1m' => $candidate['cost_per_1m_in'], // input price only; output cost is ignored here
    ];
}
// Pick the cheapest model meeting 95% accuracy and p99 < 2000 ms.
// Stats and ModelPicker are application helpers, not library classes.
$chosen = ModelPicker::cheapestPassing($results, accuracy: 0.95, p99Ms: 2000);
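The evaluation loop calls `Stats::percentile` and `ModelPicker::cheapestPassing` without defining them. The following is one possible shape for those helpers, offered as a sketch rather than as the original implementation: it assumes a nearest-rank percentile and a `$results` array keyed by model name with `accuracy`, `p99_ms`, and `cost_per_1m` fields. The sample latencies and results are made up to show the behaviour.

```php
<?php
// Hypothetical helpers: one possible shape for Stats and ModelPicker, not a published library.
final class Stats
{
    /** Nearest-rank percentile; $values must be non-empty. */
    public static function percentile(array $values, float $percent): float
    {
        if ($values === []) {
            throw new InvalidArgumentException('percentile() needs at least one value');
        }
        sort($values); // sorts the local copy only
        $rank = (int) ceil($percent * count($values) / 100);

        return (float) $values[max($rank, 1) - 1];
    }
}

final class ModelPicker
{
    /** Returns the cheapest model meeting both thresholds, or null if none pass. */
    public static function cheapestPassing(array $results, float $accuracy, float $p99Ms): ?string
    {
        $passing = array_filter(
            $results,
            fn (array $r): bool => $r['accuracy'] >= $accuracy && $r['p99_ms'] < $p99Ms
        );
        if ($passing === []) {
            return null;
        }
        // Sort by price ascending while keeping the model-name keys.
        uasort($passing, fn (array $a, array $b): int => $a['cost_per_1m'] <=> $b['cost_per_1m']);

        return array_key_first($passing);
    }
}

// Made-up latencies: 98 fast responses and 2 slow ones.
$latencies = array_merge(array_fill(0, 98, 300.0), [4000.0, 5000.0]);
$meanMs = array_sum($latencies) / count($latencies);
$p99Ms = Stats::percentile($latencies, 99);
printf("mean: %.0f ms, p99: %.0f ms\n", $meanMs, $p99Ms);

// Made-up evaluation results.
$results = [
    'model-large' => ['accuracy' => 0.98, 'p99_ms' => 1800.0, 'cost_per_1m' => 2.50],
    'model-small' => ['accuracy' => 0.96, 'p99_ms' => 900.0,  'cost_per_1m' => 0.15],
    'model-tiny'  => ['accuracy' => 0.91, 'p99_ms' => 400.0,  'cost_per_1m' => 0.05],
];
echo ModelPicker::cheapestPassing($results, accuracy: 0.95, p99Ms: 2000), "\n";
```

In the sample latencies the mean is 384 ms while the p99 is 4000 ms, which illustrates why averages hide tail timeouts. The picker selects `model-small`: `model-tiny` is cheaper but misses the accuracy bar. With an eval set of only 200 items, the p99 rests on about two observations, so a larger sample gives a steadier tail estimate.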