{
    "slug": "reasoning_models",
    "term": "Reasoning Models & Test-Time Compute",
    "category": "ai_ml",
    "difficulty": "intermediate",
    "short": "A class of LLMs trained to allocate extra inference-time compute to internal reasoning before answering, achieving large gains on math, code, and logic at the cost of latency and tokens.",
    "long": "Reasoning models (OpenAI o1/o3, DeepSeek R1, Claude with extended thinking, Gemini 2.0 Flash Thinking) differ from standard LLMs in that they are trained, typically with reinforcement learning on problems with verifiable answers, to produce long internal reasoning chains before their visible answer. The 'test-time compute' framing treats inference as a budget-allocation problem: spending more tokens reasoning on a single call can substitute for repeated attempts, much as a person thinks before speaking. Because this behavior comes from training rather than from the prompt, it is distinct from chain-of-thought (CoT) prompting, a runtime technique applicable to any model. CoT relies on the prompt to elicit reasoning; reasoning models reason natively and either hide or summarize their reasoning tokens (OpenAI o-series) or expose them, in full or summarized, as a structured 'thinking' block (Claude extended thinking). In practice this yields significant accuracy gains on tasks with verifiable answers (math benchmarks, code generation, logic puzzles), modest or negligible gains on creative or open-ended tasks, and substantially higher cost and latency per call. Choosing a reasoning model is therefore a cost/latency/quality trade-off: a poor fit for chat UX with tight latency budgets, a good fit for backend code analysis or planning.",
    "aliases": [
        "thinking models",
        "test-time compute",
        "extended thinking",
        "o1-style models",
        "inference-time reasoning",
        "deliberate reasoning models"
    ],
    "tags": [
        "llm",
        "reasoning",
        "inference",
        "ai",
        "models",
        "claude",
        "openai"
    ],
    "misconception": "Reasoning models are just chain-of-thought prompting baked into the model. CoT is a prompting technique applied to standard models: output quality depends on the prompt, and the model may ignore the instruction. Reasoning models are *trained* to allocate test-time compute via reinforcement learning, often on problems with verifiable rewards (math, code), so their reasoning chains are optimized to improve the final answer rather than merely imitating the form of step-by-step reasoning.",
    "why_it_matters": "For developers building LLM features, reasoning models change the cost equation: a single reasoning-model call often uses an order of magnitude more tokens than a standard call, but may produce results that previously required multiple attempts or human review. Knowing when to route to a reasoning model versus a fast standard model is a primary lever for both quality and budget.",
    "common_mistakes": [
        "Routing every request to a reasoning model — wastes tokens and adds latency on tasks that don't benefit (lookups, classification, simple Q&A).",
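        "Setting max_tokens too low: the limit must cover both the reasoning and the answer. Depending on the provider, a tight limit either truncates the response, sometimes before any visible answer appears (OpenAI counts reasoning tokens against max_completion_tokens), or is rejected outright (Claude requires max_tokens to exceed budget_tokens).",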
        "Trying to inspect or rely on hidden reasoning tokens — most providers redact or summarize them; build on the visible answer.",
        "Comparing reasoning-model benchmarks against standard-model benchmarks without controlling for inference compute: reasoning models can spend an order of magnitude or more tokens per response.",
        "Assuming reasoning models stream like standard models: time to first visible token is much higher because reasoning happens before any answer text is emitted."
    ],
    "when_to_use": [
        "Tasks with verifiable correctness: math, code generation, structured planning, debugging.",
        "Backend or batch workloads where latency is tolerable and quality matters more than throughput.",
        "When standard-model outputs are routinely wrong on a class of problem and prompt engineering has plateaued."
    ],
    "avoid_when": [
        "Latency-sensitive chat UX where time-to-first-token matters more than answer depth.",
        "Simple lookups, classification, or short-form generation where reasoning adds cost without benefit.",
        "Tasks where answer quality is subjective (creative writing, brainstorming): reasoning training typically relies on verifiable rewards, so gains there tend to be small."
    ],
    "related": [
        "chain_of_thought",
        "large_language_models",
        "ai_cost_management",
        "rlhf"
    ],
    "prerequisites": [
        "large_language_models",
        "chain_of_thought"
    ],
    "refs": [
        "https://openai.com/index/learning-to-reason-with-llms/",
        "https://docs.claude.com/en/docs/build-with-claude/extended-thinking"
    ],
    "bad_code": "// ❌ Routing every request to a reasoning model regardless of task\nforeach ($requests as $req) {\n    $response = $client->messages->create([\n        'model'      => 'claude-opus-4-7',\n        'max_tokens' => 200,  // too low: must exceed budget_tokens, so this request is rejected\n        'thinking'   => ['type' => 'enabled', 'budget_tokens' => 10000],\n        'messages'   => [['role' => 'user', 'content' => $req->prompt]]\n    ]);\n    // Simple lookups pay the full reasoning premium for no benefit,\n    // and a max_tokens at or below budget_tokens fails validation outright.\n}",
    "good_code": "// ✅ Route by task complexity; size token budgets for the chosen mode\nfunction needsReasoning(string $prompt): bool {\n    $signals = ['debug', 'analyze', 'prove', 'derive', 'plan', 'why does', 'step by step'];\n    foreach ($signals as $s) {\n        if (stripos($prompt, $s) !== false) return true;\n    }\n    return false;\n}\n\nforeach ($requests as $req) {\n    $useReasoning = needsReasoning($req->prompt);\n\n    $params = [\n        'model'      => 'claude-opus-4-7',\n        'max_tokens' => $useReasoning ? 8000 : 1000,\n        'messages'   => [['role' => 'user', 'content' => $req->prompt]]\n    ];\n\n    if ($useReasoning) {\n        $params['thinking'] = ['type' => 'enabled', 'budget_tokens' => 6000];\n    }\n\n    $response = $client->messages->create($params);\n}",
    "example_note": "The simplest router is a keyword heuristic; many production systems instead route via a cheap classifier model that decides whether a task warrants reasoning compute.",
    "quick_fix": "Don't enable reasoning by default. Route to reasoning models only for tasks that involve multi-step logic, debugging, or planning, and set max_tokens high enough to cover both the thinking and the answer.",
    "severity": "info",
    "effort": "low",
    "created": "2026-04-28",
    "updated": "2026-04-28",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/reasoning_models",
        "html_url": "https://codeclaritylab.com/glossary/reasoning_models",
        "json_url": "https://codeclaritylab.com/glossary/reasoning_models.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Reasoning Models & Test-Time Compute](https://codeclaritylab.com/glossary/reasoning_models) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/reasoning_models"
            }
        }
    }
}