LLM Temperature & Sampling Strategies
debt(d9/e1/b3/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints field explicitly states automated=no, and the only code pattern hint is an LLM API call with a missing or too-high temperature parameter. There is no linter, compiler, or SAST tool that catches this — degraded output quality only manifests when users interact with the deployed system, and even then it may be attributed to prompting rather than sampling settings.
Closest to 'one-line patch or single-call swap' (e1). The quick_fix is explicit: set temperature=0 for structured tasks, 0.7–0.9 for creative tasks. This is a single parameter change in one API call, requiring no refactoring of surrounding code.
Closest to 'localised tax' (b3). The sampling setting applies at the LLM API call site. If a codebase has many LLM call sites (web, cli, queue-worker as listed in applies_to), each independently needs the right temperature, creating a localised but repeating tax. The rest of the codebase is unaffected, but every LLM call site carries this obligation, pushing slightly above b1.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states that developers believe 'higher temperature always produces better creative output,' when in fact T > ~1.2 produces incoherence. This is a serious cognitive trap because the intuitive mapping (more randomness = more creativity) is wrong beyond a moderate range, and the failure mode (plausible-looking but broken or incoherent output) is subtle enough to pass casual review.
Also Known As
TL;DR
Explanation
When an LLM generates the next token, it produces a probability distribution over its entire vocabulary. Sampling strategies determine how a token is chosen from that distribution.
Temperature (T) divides the logits before softmax. T < 1 sharpens the distribution: the most probable tokens become even more dominant, and output is more predictable and repetitive. T > 1 flattens it: lower-probability tokens become more likely, and output is more varied but less coherent. T = 0 is treated as greedy decoding: the highest-probability token is always picked, so output is deterministic in principle (hosted APIs may still show small run-to-run differences).
Top-p (nucleus sampling) keeps only the smallest set of tokens whose cumulative probability reaches p and discards the rest. For example, top-p 0.9 considers only the most probable tokens that together account for at least 90% of the probability mass, so the candidate pool size adjusts dynamically. Top-k hard-limits the candidate pool to the k most probable tokens regardless of their probabilities.
In practice: use T ≈ 0 with top-p 1.0 for factual extraction, code generation, or structured output where correctness matters; use T 0.7–1.0 for creative writing or brainstorming; avoid T > 1.0 in production, because it frequently produces incoherent output. Temperature and top-p interact: setting both low is doubly restrictive and often unnecessary.
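The mechanics described above can be made concrete with a small numeric sketch. This is an illustration only, not how any particular model or API implements sampling: the function names (`softmaxWithTemperature`, `topK`, `topP`, `greedy`, `sample`) and the four-token vocabulary are invented for the example.

```php
<?php
// Illustrative helpers only; names and the toy vocabulary are assumptions.

/** Temperature-scaled softmax over a token => logit map (requires T > 0). */
function softmaxWithTemperature(array $logits, float $temperature): array
{
    if ($temperature <= 0.0) {
        throw new InvalidArgumentException('Use greedy() for T = 0.');
    }
    $scaled = array_map(fn ($l) => $l / $temperature, $logits);
    $max = max($scaled); // subtract the max for numerical stability
    $exp = array_map(fn ($s) => exp($s - $max), $scaled);
    $sum = array_sum($exp);
    return array_map(fn ($e) => $e / $sum, $exp);
}

/** Rescale the kept probabilities so they sum to 1 again. */
function renormalise(array $probs): array
{
    $sum = array_sum($probs);
    return array_map(fn ($p) => $p / $sum, $probs);
}

/** Top-k: keep the k most probable tokens. */
function topK(array $probs, int $k): array
{
    arsort($probs); // descending by probability, keys preserved
    return renormalise(array_slice($probs, 0, $k, true));
}

/** Top-p: keep the smallest prefix whose cumulative probability reaches p. */
function topP(array $probs, float $p): array
{
    arsort($probs);
    $kept = [];
    $cumulative = 0.0;
    foreach ($probs as $token => $prob) {
        $kept[$token] = $prob;
        $cumulative += $prob;
        if ($cumulative >= $p) {
            break;
        }
    }
    return renormalise($kept);
}

/** Greedy decoding (the T = 0 case): pick the highest-logit token. */
function greedy(array $logits): int|string
{
    return array_search(max($logits), $logits, true);
}

/** Draw one token from a probability map. */
function sample(array $probs): int|string
{
    $r = mt_rand() / mt_getrandmax();
    $cumulative = 0.0;
    foreach ($probs as $token => $prob) {
        $cumulative += $prob;
        if ($r <= $cumulative) {
            return $token;
        }
    }
    return array_key_last($probs); // guard against floating-point rounding
}

$logits = ['Paris' => 4.0, 'Lyon' => 2.0, 'Nice' => 1.0, 'banana' => -1.0];
$sharp  = softmaxWithTemperature($logits, 0.5); // 'Paris' dominates even more
$flat   = softmaxWithTemperature($logits, 1.5); // mass shifts toward the tail
$token  = sample(topP(softmaxWithTemperature($logits, 0.8), 0.9));
```

Comparing `$sharp['Paris']` with `$flat['Paris']` shows the sharpening and flattening effect directly, and `topP` shows why the candidate pool shrinks when the distribution is already peaked.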
Diagram
flowchart LR
    LOGITS[Raw Logits from Model] --> TEMP[Apply Temperature T]
    TEMP -->|T=0 greedy| GREEDY[Pick max token<br/>deterministic]
    TEMP -->|T=0.7 balanced| SOFT[Softmax distribution]
    SOFT --> TOPP[Top-p filter<br/>keep 90% probability mass]
    TOPP --> SAMPLE[Sample token]
    SAMPLE --> OUTPUT[Generated token]
    GREEDY --> OUTPUT
    subgraph Settings
        LOW[T close to 0<br/>factual structured]
        MED[T 0.7-0.9<br/>creative coherent]
        HIGH[T above 1.0<br/>avoid in production]
    end
    style GREEDY fill:#0d419d,color:#fff
    style HIGH fill:#f85149,color:#fff
    style LOW fill:#238636,color:#fff
Common Misconception
Higher temperature always produces better creative output. In practice, the benefit holds only over a moderate range: above roughly T = 1.2 the output becomes incoherent.
Why It Matters
Common Mistakes
- Using the API default temperature for every use-case — defaults are a compromise; structured output tasks need T≈0, creative tasks need T≈0.8.
- Setting top-p and top-k simultaneously without understanding they compound — use one or the other, not both.
- Using T > 1.0 in production believing it adds creativity — it primarily adds incoherence.
- Not testing at the target temperature — a prompt that works at T=0 may fail badly at T=0.9.
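One way to act on the last point is to run the same prompt repeatedly at the production temperature and measure how often the output is still usable. The sketch below assumes a structured-output task and counts responses that decode to JSON with the expected keys. `parseSuccessRate` and `$fakeComplete` are hypothetical names; the stub stands in for a real client call made at the target temperature.

```php
<?php
/**
 * Runs a prompt $runs times and returns the fraction of responses that
 * decode to JSON containing every key in $requiredKeys.
 *
 * $complete is a placeholder for whatever calls your LLM at the target
 * temperature: it takes the prompt and returns the raw response text.
 */
function parseSuccessRate(callable $complete, string $prompt, array $requiredKeys, int $runs = 20): float
{
    if ($runs < 1) {
        throw new InvalidArgumentException('runs must be at least 1');
    }
    $ok = 0;
    for ($i = 0; $i < $runs; $i++) {
        $data = json_decode($complete($prompt), true);
        if (!is_array($data)) {
            continue; // not valid JSON, or not an object/array
        }
        if (array_diff($requiredKeys, array_keys($data)) === []) {
            $ok++;
        }
    }
    return $ok / $runs;
}

// Stand-in for a real client: every fourth response is free text, not JSON.
$calls = 0;
$fakeComplete = function (string $prompt) use (&$calls): string {
    $calls++;
    return $calls % 4 === 0
        ? 'The total is 42.50 EUR'
        : '{"total": 42.5, "currency": "EUR"}';
};

$rate = parseSuccessRate($fakeComplete, 'Extract invoice data as JSON', ['total', 'currency'], 20);
```

With a real client in place of the stub, a rate that drops noticeably between T = 0 and the production temperature is a signal to lower the temperature or tighten the prompt before shipping.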
Avoid When
- Setting temperature above 1.0 in any production context — it degrades coherence without meaningfully improving creativity.
- Setting both top-p and top-k simultaneously — they compound in ways that are hard to reason about; pick one.
When To Use
- Use temperature 0 (or close to it) for structured data extraction, code generation, classification, and any task where output format consistency matters.
- Use temperature 0.7–0.9 with top-p 0.9–0.95 for creative writing, brainstorming, and summarisation tasks.
- Always test your prompt at the exact temperature you will use in production — behaviour changes significantly.
- Document the temperature setting alongside your prompt in version control so future changes are deliberate.
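The guidance above can be kept in one version-controlled place rather than scattered across call sites. The following is a minimal sketch under that assumption: `SAMPLING_PRESETS` and `samplingFor` are hypothetical names, and the values simply mirror the ranges recommended in this entry.

```php
<?php
// Hypothetical per-task presets, committed next to the prompts they belong to.
const SAMPLING_PRESETS = [
    'extraction'     => ['temperature' => 0.0, 'top_p' => 1.0],
    'classification' => ['temperature' => 0.0, 'top_p' => 1.0],
    'code'           => ['temperature' => 0.0, 'top_p' => 1.0],
    'creative'       => ['temperature' => 0.8, 'top_p' => 0.95],
    'brainstorm'     => ['temperature' => 0.9, 'top_p' => 0.95],
];

/** Returns the sampling settings for a task, failing loudly on gaps. */
function samplingFor(string $task): array
{
    if (!array_key_exists($task, SAMPLING_PRESETS)) {
        // Force an explicit decision instead of silently using the API default.
        throw new InvalidArgumentException("No sampling preset for task '$task'.");
    }
    $settings = SAMPLING_PRESETS[$task];
    if ($settings['temperature'] > 1.0) {
        throw new RangeException("Temperature above 1.0 is not allowed for '$task'.");
    }
    return $settings;
}

$settings = samplingFor('extraction');
```

Because unknown tasks throw instead of falling back to a default, every new LLM call site has to pick a preset deliberately, and any change to a temperature shows up as a reviewable diff.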
Code Examples
// Anti-pattern: using the default temperature for structured data extraction
$response = $llm->complete(
    prompt: 'Extract the invoice total from this text: ' . $invoiceText
    // No temperature set: the API default is often 0.7-1.0
    // Risk: model may output varied, non-parseable formats
);
// Better: match temperature to the task
// Structured extraction: deterministic, parseable
$structured = $llm->complete(
    // "\n\n" must be double-quoted; inside single quotes PHP sends a literal backslash-n
    prompt: 'Extract invoice data as JSON: {"total": number, "currency": string}' . "\n\n" . $invoiceText,
    temperature: 0.0,
    top_p: 1.0
);
// Creative generation: diverse but coherent
$creative = $llm->complete(
    prompt: 'Write a product description for: ' . $productName,
    temperature: 0.8,
    top_p: 0.95
    // Do NOT also set top_k: compounding restrictions hurt quality
);