LLM Temperature & Sampling Strategies
Also Known As
Sampling parameters; decoding strategies; temperature, top-p (nucleus sampling), and top-k.
TL;DR
Sampling parameters control how an LLM picks each token: temperature reshapes the probability distribution, while top-p and top-k trim the candidate pool. Use T ≈ 0 for structured or factual output, 0.7–0.9 for creative work, never above 1.0 in production, and set top-p or top-k, not both.
Explanation
When an LLM generates the next token it produces a probability distribution over its entire vocabulary. Sampling strategies determine how a token is chosen from that distribution.

Temperature (T) scales the logits before the softmax. T < 1 sharpens the distribution: the most probable tokens become even more dominant, and output is more predictable and repetitive. T > 1 flattens it: lower-probability tokens become more likely, and output is more varied but less coherent. T = 0 is greedy decoding: the highest-probability token is always picked, so the sampling step is deterministic.

Top-p (nucleus sampling) discards all tokens outside the smallest set whose cumulative probability exceeds p. For example, top-p 0.9 considers only the tokens that together account for 90% of the probability mass, so the size of the candidate pool adjusts dynamically. Top-k hard-limits the candidate pool to the k most probable tokens, regardless of their probabilities.

In practice: use T ≈ 0 with top-p 1.0 for factual extraction, code generation, or structured output where correctness matters; use T 0.7–1.0 for creative writing or brainstorming; never use T > 1.0 in production, where it frequently produces incoherent output. Temperature and top-p interact: setting both low is doubly restrictive and often unnecessary.
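To make the mechanics concrete, here is a minimal sketch of the pipeline in plain PHP. The vocabulary and logit values are invented, and real inference engines do this on the GPU over vocabularies of ~100k tokens; only the arithmetic is the point.

/** Scale logits by temperature, then softmax into probabilities. */
function softmaxWithTemperature(array $logits, float $temperature): array
{
    if ($temperature <= 0.0) {
        // T = 0 degenerates to greedy decoding: all mass on the argmax token.
        $probs = array_fill_keys(array_keys($logits), 0.0);
        $probs[array_search(max($logits), $logits, true)] = 1.0;
        return $probs;
    }
    $scaled = array_map(fn ($l) => $l / $temperature, $logits);
    $max = max($scaled); // subtract the max for numerical stability
    $exps = array_map(fn ($l) => exp($l - $max), $scaled);
    $sum = array_sum($exps);
    return array_map(fn ($e) => $e / $sum, $exps);
}

/** Keep the smallest set of tokens whose cumulative probability exceeds p. */
function topPFilter(array $probs, float $p): array
{
    arsort($probs); // most probable first, keys preserved
    $kept = [];
    $cumulative = 0.0;
    foreach ($probs as $token => $prob) {
        $kept[$token] = $prob;
        $cumulative += $prob;
        if ($cumulative >= $p) {
            break; // nucleus reached; everything after is discarded
        }
    }
    $norm = array_sum($kept); // renormalise within the nucleus
    return array_map(fn ($pr) => $pr / $norm, $kept);
}

$logits = ['the' => 5.1, 'a' => 4.7, 'cat' => 2.0, 'zyx' => -3.0];

// T = 0.7 sharpens the distribution, then top-p 0.9 drops the long tail;
// the final token is drawn at random, weighted by the surviving probabilities.
$pool = topPFilter(softmaxWithTemperature($logits, 0.7), 0.9);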
Diagram
flowchart LR
LOGITS[Raw Logits from Model] --> TEMP[Apply Temperature T]
TEMP -->|T=0 greedy| GREEDY[Pick max token<br/>deterministic]
TEMP -->|T=0.7 balanced| SOFT[Softmax distribution]
SOFT --> TOPP[Top-p filter<br/>keep 90% probability mass]
TOPP --> SAMPLE[Sample token]
SAMPLE --> OUTPUT[Generated token]
subgraph Settings
LOW[T close to 0<br/>factual structured]
MED[T 0.7-0.9<br/>creative coherent]
HIGH[T above 1.0<br/>avoid in production]
end
style GREEDY fill:#0d419d,color:#fff
style HIGH fill:#f85149,color:#fff
style LOW fill:#238636,color:#fff
Common Misconception
That turning the temperature up makes the model more creative. Beyond roughly 1.0, flattening the distribution mostly promotes low-probability tokens, so output gains incoherence rather than creativity.
Why It Matters
Sampling settings change behaviour as much as the prompt text does: an extraction prompt that returns clean JSON at T = 0 can produce varied, non-parseable output at an API default of 0.7–1.0. Treating temperature as part of the prompt, and testing and versioning it accordingly, prevents a whole class of flaky production failures.
Common Mistakes
- Using the API default temperature for every use case — defaults are a compromise; structured output tasks need T≈0, creative tasks need T≈0.8.
- Setting top-p and top-k simultaneously without understanding they compound — use one or the other, not both.
- Using T > 1.0 in production believing it adds creativity — it primarily adds incoherence.
- Not testing at the target temperature — a prompt that works at T=0 may fail badly at T=0.9 (a test-loop sketch follows this list).
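A cheap way to catch that last mistake is to sample the prompt repeatedly at the exact temperature production will use and check that every completion still parses. A sketch, reusing the hypothetical $llm client from the Code Examples section; $extractionPrompt is assumed to hold the prompt under test.

$failures = 0;
for ($i = 0; $i < 20; $i++) {
    $response = $llm->complete(
        prompt: $extractionPrompt,  // the prompt under test
        temperature: 0.9,           // the exact setting production will use
    );
    // json_decode() returns null when the completion is not valid JSON.
    if (json_decode($response, true) === null) {
        $failures++;
    }
}
if ($failures > 0) {
    throw new RuntimeException("Prompt failed {$failures}/20 samples at the production temperature");
}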
Avoid When
- Setting temperature above 1.0 in any production context — it degrades coherence without meaningfully improving creativity.
- Setting both top-p and top-k simultaneously — they compound in ways that are hard to reason about; pick one.
When To Use
- Use temperature 0 (or close to it) for structured data extraction, code generation, classification, and any task where output format consistency matters.
- Use temperature 0.7–0.9 with top-p 0.9–0.95 for creative writing, brainstorming, and summarisation tasks.
- Always test your prompt at the exact temperature you will use in production — behaviour changes significantly.
- Document the temperature setting alongside your prompt in version control so future changes are deliberate (one possible layout is sketched below).
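One way to satisfy the last two points is to keep the prompt and its sampling settings in a single version-controlled definition, so a temperature change shows up in review. The class layout and names below are illustrative, not a convention.

// Illustrative: prompt and sampling settings live together, so a diff on
// this file captures any change to either.
final class InvoiceExtractionPrompt
{
    public const TEMPLATE    = 'Extract invoice data as JSON: {"total": number, "currency": string}';
    public const TEMPERATURE = 0.0; // structured output: deterministic
    public const TOP_P       = 1.0; // leave the nucleus wide open
}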
Code Examples
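The examples below use a generic $llm->complete() client; the method name and named parameters mirror common completion APIs rather than any specific SDK.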
// Using default temperature for structured data extraction
$response = $llm->complete(
    prompt: 'Extract the invoice total from this text: ' . $invoiceText,
    // No temperature set — API default is often 0.7-1.0
    // Risk: model may output varied, non-parseable formats
);
// Match temperature to the task

// Structured extraction — deterministic, parseable
$structured = $llm->complete(
    // Single-quoted PHP strings do not expand \n, so splice in a
    // double-quoted "\n\n" for the blank line before the invoice text.
    prompt: 'Extract invoice data as JSON: {"total": number, "currency": string}' . "\n\n" . $invoiceText,
    temperature: 0.0,
    top_p: 1.0,
);

// Creative generation — diverse but coherent
$creative = $llm->complete(
    prompt: 'Write a product description for: ' . $productName,
    temperature: 0.8,
    top_p: 0.95,
    // Do NOT also set top_k — compounding restrictions hurt quality
);