Reasoning Models & Test-Time Compute
debt(d8/e2/b3/t6)
Closest to 'silent in production until users hit it' (d8), slightly better than d9 because cost/latency spikes show up in billing and monitoring dashboards. detection_hints.automated is 'no' and there's no linter that flags 'reasoning model used for trivial prompt' — it surfaces as bloated token bills or slow responses.
Closest to 'one-line patch or single-call swap' (e2), slightly worse than e1 because the fix per quick_fix is changing the model parameter / disabling thinking flag, plus possibly adjusting max_tokens. Routing logic may need a small conditional, but it's localized to the LLM call site.
Closest to 'localised tax' (b3). Model selection applies at the LLM call site; while applies_to spans web/cli/queue, the choice is encapsulated in a routing layer and doesn't shape system architecture. Wrong default creates ongoing cost drag but is reversible without rewrites.
Closest to 'serious trap' (t7), nudged to t6 because the misconception (reasoning models = CoT prompting baked in) leads developers to underestimate token costs and misuse low max_tokens, and streaming behavior contradicts standard-model intuition. Multiple documented gotchas in common_mistakes (truncated answers, hidden reasoning tokens, latency profile) all contradict standard-LLM mental models.
Also Known As
TL;DR
Explanation
Reasoning models (OpenAI o1/o3, DeepSeek R1, Claude with extended thinking, Gemini 2.0 Flash Thinking) differ from standard LLMs in that they are trained, typically with reinforcement learning on verifiable problems, to produce long internal reasoning chains before their visible answer. The 'test-time compute' framing treats inference as a budget-allocation problem: spend more tokens reasoning on each attempt so that fewer attempts are needed overall, much as a person thinks before speaking. This differs in both training and behavior from chain-of-thought (CoT) prompting, a runtime technique that can be applied to any model. CoT relies on the prompt to elicit reasoning; reasoning models reason natively and either hide or summarize the reasoning tokens (OpenAI o-series) or expose them as a structured 'thinking' block (Claude extended thinking). In practice this yields significant accuracy gains on tasks with verifiable answers (math benchmarks, code generation, logic puzzles), modest or negligible gains on creative or open-ended tasks, and substantially higher cost and latency per call. Choosing a reasoning model is therefore a cost/latency/quality trade-off: wrong for chat UX with tight latency budgets, right for backend code analysis or planning.
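The cost side of that trade-off can be made concrete with a little arithmetic. The sketch below assumes, as is common among providers, that reasoning tokens are billed at the output-token rate; the prices and token counts are hypothetical placeholders, not any provider's actual figures.

```php
<?php
// Hypothetical per-million-token prices in USD; real prices vary by provider and model.
const PRICE_PER_M_INPUT = 3.00;
const PRICE_PER_M_OUTPUT = 15.00;

/**
 * Estimate the cost of one call in USD.
 * Assumption: reasoning tokens are billed like ordinary output tokens.
 */
function estimateCallCost(int $inputTokens, int $reasoningTokens, int $answerTokens): float
{
    $outputTokens = $reasoningTokens + $answerTokens;
    return ($inputTokens * PRICE_PER_M_INPUT + $outputTokens * PRICE_PER_M_OUTPUT) / 1000000.0;
}

// Same prompt and same visible answer length, with and without 6000 reasoning tokens.
$standard = estimateCallCost(500, 0, 400);
$reasoning = estimateCallCost(500, 6000, 400);

printf("standard: %.4f USD, reasoning: %.4f USD (%.1fx)\n", $standard, $reasoning, $reasoning / $standard);
```

With these made-up numbers the reasoning call costs roughly 13 times as much for an answer of the same visible length, which is why the reasoning premium is easy to miss when only the visible output is inspected.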
Common Misconception
Why It Matters
Common Mistakes
- Routing every request to a reasoning model — wastes tokens and adds latency on tasks that don't benefit (lookups, classification, simple Q&A).
- Setting low max_tokens: reasoning models need a budget large enough for both the reasoning and the answer; if the budget runs out during reasoning, the response is truncated before the visible answer appears.
- Trying to inspect or rely on hidden reasoning tokens — most providers redact or summarize them; build on the visible answer.
- Comparing reasoning-model benchmarks against standard-model benchmarks without controlling for inference compute — reasoning models can use 10–100× more tokens per response.
- Streaming reasoning models the same way as standard models — first-token latency is much higher because reasoning happens before any visible output.
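Two of the mistakes above (undersized max_tokens and silently truncated answers) can be guarded against at the call site. The following is a minimal sketch, assuming the request shape used in this entry's Claude examples and a response decoded into a PHP array with a `stop_reason` key; `buildThinkingParams`, `wasTruncated`, and the `$answerReserve` parameter are hypothetical helpers, not part of any SDK.

```php
<?php
/**
 * Build request parameters so that the thinking budget and the visible
 * answer both fit inside max_tokens.
 * $answerReserve (hypothetical) is the number of tokens kept free for the answer.
 */
function buildThinkingParams(string $model, string $prompt, int $thinkingBudget, int $answerReserve): array
{
    if ($thinkingBudget <= 0 || $answerReserve <= 0) {
        throw new InvalidArgumentException('Budgets must be positive.');
    }

    return [
        'model' => $model,
        // max_tokens always exceeds the thinking budget by the answer reserve.
        'max_tokens' => $thinkingBudget + $answerReserve,
        'thinking' => ['type' => 'enabled', 'budget_tokens' => $thinkingBudget],
        'messages' => [['role' => 'user', 'content' => $prompt]],
    ];
}

/**
 * True when generation stopped because max_tokens was reached.
 * Assumes $response is the decoded response body as an associative array.
 */
function wasTruncated(array $response): bool
{
    return ($response['stop_reason'] ?? null) === 'max_tokens';
}
```

A caller might build parameters with `buildThinkingParams('claude-opus-4-7', $prompt, 6000, 2000)` and, when `wasTruncated($response)` returns true, retry with a larger reserve or log the prompt for review rather than passing a cut-off answer downstream.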
Avoid When
- Latency-sensitive chat UX where time-to-first-token matters more than answer depth.
- Simple lookups, classification, or short-form generation where reasoning adds cost without benefit.
- Tasks where the answer quality is subjective (creative writing, brainstorming) — reasoning training is grounded in verifiable rewards and offers little advantage there.
When To Use
- Tasks with verifiable correctness: math, code generation, structured planning, debugging.
- Backend or batch workloads where latency is tolerable and quality matters more than throughput.
- When standard-model outputs are routinely wrong on a class of problem and prompt engineering has plateaued.
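When correctness can be checked mechanically, one way to act on the criteria above is to try the cheaper standard call first and escalate only on failure. This is an illustrative pattern rather than something the entry prescribes; the three callables are hypothetical stand-ins for a standard-model call, a reasoning-model call, and a task-specific verifier (for example, running a test suite).

```php
<?php
/**
 * Try the standard call first; fall back to the reasoning call only when
 * the verifier rejects the first answer.
 * All three callables are hypothetical and supplied by the caller.
 */
function answerWithEscalation(
    string $prompt,
    callable $standardCall,
    callable $reasoningCall,
    callable $verify
): array {
    $answer = $standardCall($prompt);
    if ($verify($answer)) {
        return ['answer' => $answer, 'escalated' => false];
    }

    // The escalated answer is returned as-is; a real system might verify it too.
    return ['answer' => $reasoningCall($prompt), 'escalated' => true];
}

// Stubs standing in for real model calls, so the sketch runs without network access.
$standardCall = fn(string $p): string => '41';
$reasoningCall = fn(string $p): string => '42';
$verify = fn(string $a): bool => $a === '42';

$result = answerWithEscalation('What is 6 * 7?', $standardCall, $reasoningCall, $verify);
```

Tracking how often `escalated` is true for a class of prompts gives a rough signal for whether that class should be routed to the reasoning model by default.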
Code Examples
// ❌ Routing every request to a reasoning model regardless of task
foreach ($requests as $req) {
    $response = $client->messages->create([
        'model' => 'claude-opus-4-7',
        // Too low: smaller than the thinking budget below, so reasoning plus
        // answer cannot fit (the API may reject the request outright).
        'max_tokens' => 200,
        'thinking' => ['type' => 'enabled', 'budget_tokens' => 10000],
        'messages' => [['role' => 'user', 'content' => $req->prompt]],
    ]);
    // Simple lookups pay the full reasoning premium for no benefit;
    // an undersized max_tokens may cut off the answer entirely.
}
// ✅ Route by task complexity; size token budgets for the chosen mode
function needsReasoning(string $prompt): bool {
    $signals = ['debug', 'analyze', 'prove', 'derive', 'plan', 'why does', 'step by step'];
    foreach ($signals as $s) {
        if (stripos($prompt, $s) !== false) return true;
    }
    return false;
}

foreach ($requests as $req) {
    $useReasoning = needsReasoning($req->prompt);
    $params = [
        'model' => 'claude-opus-4-7',
        'max_tokens' => $useReasoning ? 8000 : 1000,
        'messages' => [['role' => 'user', 'content' => $req->prompt]]
    ];
    if ($useReasoning) {
        $params['thinking'] = ['type' => 'enabled', 'budget_tokens' => 6000];
    }
    $response = $client->messages->create($params);
}