Knowledge Distillation
Also Known As
TL;DR
Explanation
Knowledge distillation was introduced by Hinton et al. (2015) as a way to transfer knowledge from a large, expensive model (the teacher) into a small, fast model (the student). The key insight is that the teacher's soft probability outputs — the full distribution over all classes, not just the winning label — carry richer information than hard labels. A dog image might score 0.85 dog, 0.10 wolf, 0.04 fox: the near-misses tell the student about visual similarity that a hard label (dog) discards.

In the LLM era, distillation is used to create smaller models that approximate a frontier model's capabilities: the student is trained on the teacher's output token distributions (or sampled completions) rather than on raw human-labelled data. Examples: DistilBERT is a 40% smaller, 60% faster version of BERT that retains 97% of its performance; the Mistral and Phi model families use distillation-inspired techniques.

For software engineers, the practical question is usually 'should I call a large expensive model or a small distilled one?' — the trade-off is cost and latency versus quality. Distillation is also used for task-specific fine-tuning: generate thousands of examples with GPT-4, then fine-tune a smaller model on them to create a cheap specialist.
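To make the soft-label idea concrete, here is a minimal sketch of the classic Hinton-style distillation loss, assuming PyTorch; the temperature, the alpha weighting, and the tensor shapes are illustrative choices, not values taken from any particular paper or codebase.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Hinton-style KD loss: a weighted sum of
    (1) KL divergence between temperature-softened teacher and student
        distributions (the soft-label term), and
    (2) ordinary cross-entropy against the hard labels."""
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # The T^2 factor keeps the soft-label gradient magnitude comparable to
    # the hard-label term as the temperature grows (Hinton et al., 2015).
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)
    ce_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss

# Illustrative usage with random logits for a 10-class problem.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

Both alpha and the temperature are hyperparameters to tune per task; when only sampled completions are available from the teacher, the soft-label term degenerates to cross-entropy on those samples.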
Diagram
flowchart TD
TEACHER[Large Teacher Model<br/>e.g. GPT-4-class frontier model] -->|soft probability outputs| DISTILL[Distillation Training]
HARDLABEL[Human Labels] -->|optional supplement| DISTILL
DISTILL --> STUDENT[Small Student Model<br/>e.g. 7B]
subgraph Tradeoffs
QUAL[~95 percent of teacher quality]
COST[10-50x cheaper inference]
LAT[Lower latency]
end
STUDENT --> QUAL & COST & LAT
style TEACHER fill:#f85149,color:#fff
style STUDENT fill:#238636,color:#fff
style COST fill:#238636,color:#fff
Common Misconception
Why It Matters
Common Mistakes
- Distilling on the wrong task distribution — a student trained on the teacher's general outputs will not inherit specialist performance on domain-specific tasks not well-represented in the training data.
- Using hard labels (sampled completions only) instead of soft labels (full token probability distributions) — soft labels transfer significantly more information.
- Evaluating the student only on benchmark metrics without testing on real production inputs — benchmark gains may not transfer to your specific workload.
- Skipping temperature calibration — a teacher's soft outputs at T=1 may be too peaked or too flat for effective distillation; T=4–10 is common to soften distributions (illustrated in the sketch after this list).
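A quick illustration of the temperature point above, assuming PyTorch; the logits are made-up numbers chosen only to show how higher temperatures spread probability mass onto the near-miss classes.

```python
import torch
import torch.nn.functional as F

# Made-up logits for a confidently predicted class (illustrative only).
logits = torch.tensor([8.0, 3.0, 1.0, 0.5])

for temperature in (1.0, 4.0, 10.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature:>4}: {[round(p, 3) for p in probs.tolist()]}")

# At T=1 nearly all probability mass sits on the top class; higher temperatures
# expose the relative ordering of the near-miss classes, which is the signal
# the student learns from.
```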
Avoid When
- When task diversity is high and unpredictable — distilled specialists underperform on out-of-distribution inputs.
- When you lack sufficient representative training examples — distillation quality depends heavily on coverage of the target task distribution.
When To Use
- Use a large model to generate thousands of high-quality examples for your specific task, then fine-tune a smaller model — you get specialist performance at commodity cost (a minimal pipeline is sketched after this list).
- Route narrow, repetitive tasks (classification, extraction, summarisation of a fixed schema) to a distilled specialist rather than a frontier model.
- Evaluate the distilled model on your actual production inputs before committing — benchmark numbers are necessary but not sufficient.
- Consider distillation when inference cost or latency is a bottleneck and the task is well-defined enough to collect representative training data.
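For the generate-then-fine-tune workflow in the first bullet, a minimal sketch follows. `call_teacher` and `load_representative_inputs` are hypothetical stand-ins rather than real library functions, the invoice-extraction prompt is only an example task, and the prompt/completion JSONL layout is one common fine-tuning format, not a required schema.

```python
import json

def call_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a frontier-model API call; replace with your
    provider's real client. Returns a canned completion so the sketch runs."""
    return '{"vendor": "Acme Corp", "total": "199.00 EUR", "due": "2024-07-01"}'

def load_representative_inputs() -> list[str]:
    """Hypothetical loader; in practice read real production-like inputs
    from your own logs or data store."""
    return ["Invoice #1234 from Acme Corp, total 199.00 EUR, due 2024-07-01"]

# 1. Have the large teacher model complete/label your real task inputs.
records = []
for text in load_representative_inputs():
    completion = call_teacher(f"Extract the invoice fields as JSON:\n{text}")
    records.append({"prompt": text, "completion": completion})

# 2. Write prompt/completion pairs as JSONL, a format most supervised
#    fine-tuning tooling accepts.
with open("distillation_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# 3. Fine-tune the small student model on distillation_train.jsonl with your
#    preferred SFT trainer, then evaluate it on held-out production inputs
#    before routing real traffic to it.
```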