Temperature & Sampling in LLMs

ai_ml Beginner

Also Known As

temperature, sampling temperature, top-p, top-k sampling, nucleus sampling

TL;DR

Temperature controls how random an LLM's output is — low values (0–0.3) produce predictable, conservative responses; high values (0.7–1.0) produce creative but less reliable outputs.

Explanation

LLMs generate text by computing a probability distribution over possible next tokens, then sampling from that distribution. Temperature is a scaling factor applied before sampling: the model's raw scores (logits) are divided by the temperature before they are converted to probabilities. Temperature 1.0 leaves the distribution unchanged, values below 1.0 sharpen it (making the highest-probability tokens more likely), and values above 1.0 flatten it (making low-probability tokens more likely).

Top-P sampling (nucleus sampling) is a separate control that samples only from the smallest set of tokens whose cumulative probability reaches P, so the number of candidate tokens adapts to how confident the model is at each step. Top-K limits sampling to the K most probable tokens.

In practice: temperature 0 for tasks that need consistent output (code generation, structured extraction), temperature 0.3–0.7 for conversational responses, temperature 0.7–1.0 for creative writing.
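The mechanics described above can be sketched in a few lines of PHP. This is a toy illustration on a three-token vocabulary, not how any provider implements sampling; the function names (`softmaxWithTemperature`, `topKFilter`, `topPFilter`) and the example logits are assumptions made up for this sketch. It assumes PHP 7.4 or later for arrow functions.

```php
<?php
declare(strict_types=1);

/**
 * Divide logits by the temperature, then apply softmax.
 * Temperature must be > 0; temperature 0 is usually treated as "pick the argmax".
 *
 * @param array<string, float> $logits token => raw score
 * @return array<string, float> token => probability
 */
function softmaxWithTemperature(array $logits, float $temperature): array
{
    if ($temperature <= 0.0) {
        throw new InvalidArgumentException('Temperature must be greater than 0.');
    }
    $scaled = array_map(fn (float $l): float => $l / $temperature, $logits);
    $max = max($scaled); // subtract the max for numerical stability
    $exp = array_map(fn (float $s): float => exp($s - $max), $scaled);
    $sum = array_sum($exp);

    return array_map(fn (float $e): float => $e / $sum, $exp);
}

/** Keep the K most probable tokens and renormalise so they sum to 1. */
function topKFilter(array $probs, int $k): array
{
    arsort($probs); // highest probability first, keys preserved
    $kept = array_slice($probs, 0, max(1, $k), true);
    $sum = array_sum($kept);

    return array_map(fn (float $p): float => $p / $sum, $kept);
}

/** Keep the smallest set of tokens whose cumulative probability reaches $p. */
function topPFilter(array $probs, float $p): array
{
    arsort($probs);
    $kept = [];
    $cumulative = 0.0;
    foreach ($probs as $token => $prob) {
        $kept[$token] = $prob;
        $cumulative += $prob;
        if ($cumulative >= $p) {
            break; // the nucleus is complete
        }
    }
    $sum = array_sum($kept);

    return array_map(fn (float $q): float => $q / $sum, $kept);
}

// Hypothetical logits for three candidate next tokens.
$logits = ['the' => 2.0, 'a' => 1.0, 'zebra' => 0.0];

$sharp = softmaxWithTemperature($logits, 0.5); // 'the' dominates more strongly
$base  = softmaxWithTemperature($logits, 1.0); // unchanged distribution
$flat  = softmaxWithTemperature($logits, 2.0); // 'zebra' gains probability

$nucleus = topPFilter($base, 0.9); // drops the unlikely tail
$topTwo  = topKFilter($base, 2);   // keeps exactly two candidates
```

A real sampler would then draw one token at random from the filtered distribution; that step is omitted here to keep the sketch deterministic.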

Common Misconception

✗ Temperature 0 makes the model deterministic and always produces the same output.

Temperature 0 makes the model take the highest-probability token at each step, which is nearly deterministic, but infrastructure-level factors (floating-point rounding, batching, hardware differences) mean outputs can still vary slightly across runs. Some providers offer a seed parameter that improves reproducibility, though it is typically best-effort rather than a guarantee of identical output.

Why It Matters

Setting temperature correctly is one of the highest-impact and lowest-effort LLM configuration decisions. Too high on a code generation task and the model invents function names; too low on a creative writing task and the output is repetitive and generic. PHP developers integrating LLMs should set temperature per use-case in their configuration rather than using a single global value — extraction and classification tasks should be near 0, conversational features around 0.5.
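One way to keep temperature per use-case rather than global is a small lookup that every request passes through. The sketch below is only an illustration: the task names, the values, and the helpers `temperatureFor` and `buildRequest` are hypothetical, and the returned array simply mirrors the request shape used in the examples further down.

```php
<?php
declare(strict_types=1);

// Hypothetical per-task temperatures, following the ranges suggested in this entry.
const LLM_TEMPERATURES = [
    'extraction'     => 0.0,
    'classification' => 0.0,
    'conversation'   => 0.5,
    'creative'       => 0.9,
];

/** Look up the configured temperature, failing loudly for unknown tasks. */
function temperatureFor(string $task): float
{
    if (!array_key_exists($task, LLM_TEMPERATURES)) {
        throw new InvalidArgumentException("No temperature configured for task: {$task}");
    }

    return LLM_TEMPERATURES[$task];
}

/** Build request parameters with the temperature chosen by task, not by the caller. */
function buildRequest(string $task, string $prompt, string $model, int $maxTokens = 500): array
{
    return [
        'model'       => $model,
        'max_tokens'  => $maxTokens,
        'temperature' => temperatureFor($task),
        'messages'    => [['role' => 'user', 'content' => $prompt]],
    ];
}
```

Throwing on an unknown task, instead of falling back to a default, is a deliberate choice here: a new feature has to declare whether it needs consistency or creativity before it can call the model.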

Common Mistakes

Using the same temperature for all tasks — set it per use-case based on whether the task needs consistency or creativity.
Setting temperature very high to get 'more creative' outputs and then being surprised by factual errors.
Confusing temperature and top-p. They interact: setting both to high values at once produces very random outputs.
Not logging temperature settings alongside LLM outputs — makes debugging inconsistent outputs much harder.
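For the logging point, a minimal sketch is to serialise the sampling parameters next to each output as one JSON line. The function name `llmLogLine`, the field names, and the log path in the comment are assumptions for illustration; adapt them to whatever logger the application already uses.

```php
<?php
declare(strict_types=1);

/**
 * Build one JSON log line that records the sampling settings with the output.
 *
 * @param array<string, mixed> $params request parameters sent to the model
 */
function llmLogLine(array $params, string $output): string
{
    $record = [
        'timestamp'     => date(DATE_ATOM),
        'model'         => $params['model'] ?? null,
        'temperature'   => $params['temperature'] ?? null,
        'top_p'         => $params['top_p'] ?? null,
        'output_sha256' => hash('sha256', $output), // quick way to spot differing outputs
        'output'        => $output,
    ];

    return json_encode($record, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
}

// Example use (hypothetical path):
// file_put_contents('/var/log/app/llm.log', llmLogLine($params, $text) . PHP_EOL, FILE_APPEND | LOCK_EX);
```

With the hash in place, two runs that used the same prompt and temperature but produced different text show up immediately when the log is grouped by parameters.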

Code Examples

✗ Vulnerable

// ❌ High temperature for tasks that need consistent output (code gen, data extraction)
// May return differently structured (and sometimes wrong) JSON from call to call
$response = $client->messages->create([
    'model' => 'claude-sonnet-4-20250514',
    'max_tokens' => 500,
    'temperature' => 1.0, // high randomness: samples from the unscaled distribution
    'messages' => [[
        'role' => 'user',
        'content' => 'Extract the order ID and total from this invoice: ...'
    ]]
]);

✓ Fixed

// ✅ Low temperature for extraction tasks that need consistent output
$response = $client->messages->create([
    'model'       => 'claude-sonnet-4-20250514',
    'max_tokens'  => 500,
    'temperature' => 0.0, // Most consistent setting; output can still vary slightly across runs
    'messages'    => [[
        'role'    => 'user',
        'content' => 'Extract the order ID and total from this invoice '
                   . 'and return JSON only: {"order_id": ..., "total": ...}'
                   . "\n\n" // double quotes so PHP emits real newlines, not a literal \n
                   . $invoiceText
    ]]
]);

// Rule of thumb:
// 0.0–0.2 — extraction, classification, structured output, code generation
// 0.3–0.6 — Q&A, summarisation, analysis
// 0.7–1.0 — creative writing, brainstorming, variation generation

References

↗ https://docs.anthropic.com/en/api/getting-started