RLHF — Reinforcement Learning from Human Feedback
Also Known As
Reinforcement learning from human preferences; preference tuning.
TL;DR
A three-stage post-training pipeline (SFT, reward modelling, RL optimization) that aligns a pretrained LLM with human preferences. It shapes tone, refusals, and instruction-following but adds no new knowledge.
Explanation
RLHF is the technique that turned raw pretrained LLMs into the helpful, instruction-following models we use today (ChatGPT, Claude, Gemini). The pipeline runs in three stages:
- Supervised fine-tuning (SFT) on instruction-response pairs teaches basic instruction-following.
- A reward model is trained on human-ranked output pairs: given a prompt and two responses, humans pick the better one, and the reward model learns to predict those preferences.
- The SFT model is optimized against the reward model using reinforcement learning, typically PPO. DPO is an increasingly common alternative that skips the explicit reward model and the RL loop, optimizing the policy directly on the preference pairs.
The result is a model that produces outputs humans rate as helpful, harmless, and honest. RLHF does not add new knowledge; capability comes from pretraining. It shapes how the model uses what it already knows: tone, refusal patterns, formatting, willingness to follow instructions. RLHF is also why same-architecture base models behave so differently: a Llama base model and Llama-Instruct share identical pretraining but show radically different surface behaviour because of RLHF.
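To make the reward-modelling stage concrete, here is a minimal sketch of the pairwise (Bradley-Terry) loss such models are commonly trained with: the reward model should score the human-preferred response above the rejected one. The scalar scores below are hypothetical stand-ins for the reward model's output; this illustrates the objective, not a real training loop. DPO folds the same preference signal directly into the policy update, which is how it avoids the separate reward model.
// Pairwise preference loss used to train a reward model (Bradley-Terry).
// $scoreChosen / $scoreRejected are the reward model's scalar outputs
// for the human-preferred and the rejected response to the same prompt.
function pairwiseLoss(float $scoreChosen, float $scoreRejected): float
{
    // loss = -log(sigmoid(scoreChosen - scoreRejected));
    // near zero when the chosen response clearly outscores the rejected one.
    $margin = $scoreChosen - $scoreRejected;
    return -log(1.0 / (1.0 + exp(-$margin)));
}
echo pairwiseLoss(2.1, 0.4), PHP_EOL; // ≈ 0.168: correct ranking, low loss
echo pairwiseLoss(0.4, 2.1), PHP_EOL; // ≈ 1.868: inverted ranking, high loss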
Common Misconception
That RLHF makes a model smarter or teaches it new facts. It doesn't: capability comes from pretraining, and RLHF only shapes how that capability is expressed.
Why It Matters
Every mainstream assistant model is RLHF-tuned, so its tone, refusal patterns, and instruction-following are trained behaviours rather than intrinsic properties. Understanding this explains why base and instruct variants of the same model behave so differently, and sets realistic expectations for what post-training can change.
Common Mistakes
- Believing RLHF eliminates hallucination — it reduces obvious confabulation but cannot teach the model what it doesn't know.
- Conflating RLHF with the fine-tuning APIs offered to end users — those are SFT or LoRA, not full RLHF.
- Assuming RLHF can be cheaply replicated for a private model — full RLHF requires preference datasets, reward model training, and PPO infrastructure.
- Treating refusals as model bugs rather than trained behaviours — refusal patterns are deliberately induced by RLHF and policy training.
Avoid When
- You expect RLHF to fix factual accuracy issues — it doesn't add knowledge.
- Building safety-critical systems on RLHF refusals alone. Combine them with explicit guardrails and output validation; a minimal layering sketch follows this list.
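To illustrate that layering, here is a minimal sketch, assuming the same Anthropic PHP client as the examples below. containsDisallowedContent is a hypothetical application-side validator (regex rules, a moderation classifier, a policy service), not an SDK feature, and the response access assumes the Messages API shape of a list of text content blocks.
// Defence in depth: the model's trained refusals are one layer,
// an explicit output check is another.
function safeCompletion($client, string $prompt): string
{
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 500,
        'messages' => [['role' => 'user', 'content' => $prompt]],
    ]);
    // Assumes the Messages API response shape: a list of content
    // blocks whose first entry is a text block.
    $text = $response->content[0]->text;

    // Explicit guardrail on top of RLHF refusals: validate before returning.
    if (containsDisallowedContent($text)) { // hypothetical validator
        return 'Response withheld by output validation.';
    }
    return $text;
}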
When To Use
- Choosing between base and instruction-tuned models for a task.
- Explaining to stakeholders why models refuse certain requests or default to specific tones.
- Designing prompts that work with — not against — a model's trained assistant behaviour.
Code Examples
// ❌ Treating an instruction-tuned model like a raw text completer,
// fighting against RLHF instead of working with it.
// ($client is assumed to be a configured Anthropic PHP API client.)
$response = $client->messages->create([
    'model' => 'claude-sonnet-4-20250514',
    'max_tokens' => 500,
    'messages' => [[
        'role' => 'user',
        'content' => 'The cat sat on the', // expecting raw next-token completion
    ]],
]);
// Instruction-tuned models will respond conversationally rather than
// autocomplete: RLHF trained them to act as assistants.
// ✅ Working with the RLHF'd assistant behaviour
$response = $client->messages->create([
    'model' => 'claude-sonnet-4-20250514',
    'max_tokens' => 500,
    'system' => 'You are a creative writing assistant. Continue the story naturally.',
    'messages' => [[
        'role' => 'user',
        'content' => 'Continue this sentence in a vivid, sensory style: "The cat sat on the"',
    ]],
]);
// Frame the task explicitly. RLHF'd models perform best when
// addressed as assistants with a clear job, not as text completers.