Reasoning Models & Test-Time Compute
Also Known As
Test-time compute scaling; inference-time scaling; thinking models; large reasoning models (LRMs).
TL;DR
Models trained to spend extra inference tokens "thinking" before they answer. The extra reasoning buys real accuracy on verifiable tasks (math, code, planning) at a substantial cost and latency premium, so route requests to them selectively rather than by default.
Explanation
Reasoning models (OpenAI o1/o3, DeepSeek R1, Claude with extended thinking, Gemini 2 Thinking) differ from standard LLMs in that they are trained, typically with reinforcement learning on verifiable problems, to produce long internal reasoning chains before their visible answer. The "test-time compute" framing treats inference as a budget-allocation problem: spend more tokens reasoning per call so you spend fewer retries overall, much as a person thinks before speaking.
This is a property of the model's training, not a prompting trick, which is what separates it from chain-of-thought (CoT) prompting. CoT is a runtime technique that relies on the prompt to elicit reasoning from any model; reasoning models reason natively and either hide or summarize the reasoning tokens (OpenAI o-series) or expose them as a structured "thinking" block (Claude extended thinking).
Practical impact: significant accuracy gains on tasks with verifiable answers (math benchmarks, code generation, logic puzzles), modest or negligible gains on creative or open-ended tasks, and substantially higher cost and latency per call. Choosing a reasoning model is a cost/latency/quality trade: wrong for chat UX with tight latency budgets, right for backend code analysis or planning.
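To make the "thinking block" concrete, here is a minimal sketch of requesting extended thinking and reading the structured blocks back, assuming the same array-style client used in the Code Examples section below. The content-block types ("thinking", "text") follow the Messages API, but the exact response object shape depends on your SDK.
// Sketch: enable extended thinking and separate reasoning from the visible answer.
$response = $client->messages->create([
    'model' => 'claude-opus-4-7',
    'max_tokens' => 8000, // must exceed the thinking budget
    'thinking' => ['type' => 'enabled', 'budget_tokens' => 4000],
    'messages' => [['role' => 'user', 'content' => 'Why does this scheduler deadlock under load?']]
]);
foreach ($response->content as $block) {
    if ($block->type === 'thinking') {
        error_log('[thinking] ' . $block->thinking); // useful for debugging, not a contract to build on
    } elseif ($block->type === 'text') {
        echo $block->text; // the visible answer
    }
}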
Common Misconception
That a reasoning model is just chain-of-thought prompting baked in. CoT prompting is a runtime technique that works on any model; reasoning models are trained (typically with RL on verifiable problems) to reason natively, and most providers hide or summarize those reasoning tokens rather than exposing a prompt-style chain.
Why It Matters
Picking a reasoning model is a cost/latency/quality decision, not a free upgrade. Routing everything to one can multiply token spend by 10–100× and blow latency budgets on requests that never needed the depth, while avoiding one leaves accuracy on the table for hard, verifiable problems.
Common Mistakes
- Routing every request to a reasoning model — wastes tokens and adds latency on tasks that don't benefit (lookups, classification, simple Q&A).
- Setting max_tokens too low — reasoning models need a generous budget for both the reasoning and the answer (with extended thinking, the thinking budget counts against max_tokens); a truncated output can hide the answer entirely.
- Trying to inspect or rely on hidden reasoning tokens — most providers redact or summarize them; build on the visible answer.
- Comparing reasoning-model benchmarks against standard-model benchmarks without controlling for inference compute — reasoning models can use 10–100× more tokens per response.
- Streaming reasoning models the same way as standard models — first-token latency is much higher because reasoning happens before any visible output (see the streaming sketch after this list).
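A sketch of the streaming difference, assuming a hypothetical createStreamed method on the same client (substitute your SDK's actual streaming call); the event and delta types (content_block_delta, thinking_delta, text_delta) follow the Messages API streaming format.
// Sketch: measure how long the user waits before any visible text arrives.
$prompt = 'Debug this intermittent race condition in the job queue.';
$start = microtime(true);
$firstVisibleAfter = null;

$stream = $client->messages->createStreamed([ // placeholder method name
    'model' => 'claude-opus-4-7',
    'max_tokens' => 8000,
    'thinking' => ['type' => 'enabled', 'budget_tokens' => 6000],
    'messages' => [['role' => 'user', 'content' => $prompt]]
]);

foreach ($stream as $event) {
    if ($event->type !== 'content_block_delta') continue;
    if ($event->delta->type === 'thinking_delta') {
        // Reasoning is streaming but nothing user-visible yet; show a "thinking…" indicator.
    } elseif ($event->delta->type === 'text_delta') {
        $firstVisibleAfter ??= microtime(true) - $start; // often seconds, not milliseconds
        echo $event->delta->text;
    }
}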
Avoid When
- Latency-sensitive chat UX where time-to-first-token matters more than answer depth.
- Simple lookups, classification, or short-form generation where reasoning adds cost without benefit.
- Tasks where the answer quality is subjective (creative writing, brainstorming) — reasoning training is grounded in verifiable rewards and offers little advantage there.
When To Use
- Tasks with verifiable correctness: math, code generation, structured planning, debugging.
- Backend or batch workloads where latency is tolerable and quality matters more than throughput.
- When standard-model outputs are routinely wrong on a class of problem and prompt engineering has plateaued.
Code Examples
// ❌ Routing every request to a reasoning model regardless of task
foreach ($requests as $req) {
    $response = $client->messages->create([
        'model' => 'claude-opus-4-7',
        'max_tokens' => 200, // too low: reasoning + answer won't fit, and max_tokens
                             // normally has to exceed the thinking budget
        'thinking' => ['type' => 'enabled', 'budget_tokens' => 10000],
        'messages' => [['role' => 'user', 'content' => $req->prompt]]
    ]);
    // Simple lookups pay the full reasoning premium for no benefit;
    // the undersized max_tokens can truncate or reject the answer entirely.
}
// ✅ Route by task complexity; size token budgets for the chosen mode
function needsReasoning(string $prompt): bool {
    $signals = ['debug', 'analyze', 'prove', 'derive', 'plan', 'why does', 'step by step'];
    foreach ($signals as $s) {
        if (stripos($prompt, $s) !== false) return true;
    }
    return false;
}

foreach ($requests as $req) {
    $useReasoning = needsReasoning($req->prompt);
    $params = [
        'model' => 'claude-opus-4-7',
        'max_tokens' => $useReasoning ? 8000 : 1000,
        'messages' => [['role' => 'user', 'content' => $req->prompt]]
    ];
    if ($useReasoning) {
        $params['thinking'] = ['type' => 'enabled', 'budget_tokens' => 6000];
    }
    $response = $client->messages->create($params);
}
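To verify the routing actually saves tokens (and to compare modes on equal footing, per the benchmark caveat above), it helps to log usage per mode inside the loop. A sketch, assuming the response exposes the Messages API usage fields; exact property names depend on the SDK.
// Sketch: record token spend per routing decision (place inside the loop above).
$mode = $useReasoning ? 'reasoning' : 'standard';
error_log(sprintf(
    '[%s] input=%d output=%d',
    $mode,
    $response->usage->input_tokens,
    $response->usage->output_tokens // includes reasoning tokens when thinking is enabled
));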