Knowledge Distillation
Also Known As
TL;DR
Explanation
Knowledge distillation was introduced by Hinton et al. (2015) as a way to transfer knowledge from a large, expensive model (the teacher) into a small, fast model (the student). The key insight is that the teacher's soft probability outputs — the full distribution over all classes, not just the winning label — carry richer information than hard labels. A dog image might score 0.85 dog, 0.10 wolf, 0.04 fox: the near-misses tell the student about visual similarity that a hard label (dog) discards.

In the LLM era, distillation is used to create smaller models that approximate a frontier model's capabilities: the student is trained on the teacher's output token distributions (or sampled completions) rather than on raw human-labelled data. Examples: DistilBERT is a 40% smaller, 60% faster version of BERT that retains 97% of its performance; the Mistral and Phi model families use distillation-inspired techniques.

For software engineers, the practical question is usually 'should I call a large expensive model or a small distilled one?' — the trade-off is cost and latency versus quality. Distillation is also used for task-specific fine-tuning: generate thousands of examples with GPT-4, then fine-tune a smaller model on them to create a cheap specialist.
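To make the soft-label idea concrete, here is a minimal sketch of the classic Hinton-style distillation loss, assuming PyTorch; the temperature, the alpha weighting, and the tensor shapes are illustrative choices, not values taken from any particular paper or codebase.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Hinton-style KD loss: a weighted sum of
    (1) KL divergence between temperature-softened teacher and student
        distributions (the soft-label term), and
    (2) ordinary cross-entropy against the hard labels."""
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # The T^2 factor keeps the soft-label gradient magnitude comparable to
    # the hard-label term as the temperature grows (Hinton et al., 2015).
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)
    ce_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss

# Illustrative usage with random logits for a 10-class problem.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

Both alpha and the temperature are hyperparameters to tune per task; when only sampled completions are available from the teacher, the soft-label term degenerates to cross-entropy on those samples.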
Diagram
flowchart TD
TEACHER[Large Teacher Model<br/>e.g. GPT-4-class frontier model] -->|soft probability outputs| DISTILL[Distillation Training]
HARDLABEL[Human Labels] -->|optional supplement| DISTILL
DISTILL --> STUDENT[Small Student Model<br/>e.g. 7B]
subgraph Tradeoffs
QUAL[~95 percent of teacher quality]
COST[10-50x cheaper inference]
LAT[Lower latency]
end
STUDENT --> QUAL & COST & LAT
style TEACHER fill:#f85149,color:#fff
style STUDENT fill:#238636,color:#fff
style COST fill:#238636,color:#fff
Common Misconception
Why It Matters
Common Mistakes
- Distilling on the wrong task distribution — a student trained on the teacher's general outputs will not inherit specialist performance on domain-specific tasks not well-represented in the training data.
- Using hard labels (sampled completions only) instead of soft labels (full token probability distributions) — soft labels transfer significantly more information.
- Evaluating the student only on benchmark metrics without testing on real production inputs — benchmark gains may not transfer to your specific workload.
- Skipping temperature calibration — a teacher's soft outputs at T=1 may be too peaked or too flat for effective distillation; T=4–10 is common to soften distributions (illustrated in the sketch after this list).
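A quick illustration of the temperature point above, assuming PyTorch; the logits are made-up numbers chosen only to show how higher temperatures spread probability mass onto the near-miss classes.

```python
import torch
import torch.nn.functional as F

# Made-up logits for a confidently predicted class (illustrative only).
logits = torch.tensor([8.0, 3.0, 1.0, 0.5])

for temperature in (1.0, 4.0, 10.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature:>4}: {[round(p, 3) for p in probs.tolist()]}")

# At T=1 nearly all probability mass sits on the top class; higher temperatures
# expose the relative ordering of the near-miss classes, which is the signal
# the student learns from.
```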
Avoid When
- When task diversity is high and unpredictable — distilled specialists underperform on out-of-distribution inputs.
- When you lack sufficient representative training examples — distillation quality depends heavily on coverage of the target task distribution.
When To Use
- Use a large model to generate thousands of high-quality examples for your specific task, then fine-tune a smaller model — you get specialist performance at commodity cost (a minimal pipeline is sketched after this list).
- Route narrow, repetitive tasks (classification, extraction, summarisation of a fixed schema) to a distilled specialist rather than a frontier model.
- Evaluate the distilled model on your actual production inputs before committing — benchmark numbers are necessary but not sufficient.
- Consider distillation when inference cost or latency is a bottleneck and the task is well-defined enough to collect representative training data.
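For the generate-then-fine-tune workflow in the first bullet, a minimal sketch follows. `call_teacher` and `load_representative_inputs` are hypothetical stand-ins rather than real library functions, the invoice-extraction prompt is only an example task, and the prompt/completion JSONL layout is one common fine-tuning format, not a required schema.

```python
import json

def call_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a frontier-model API call; replace with your
    provider's real client. Returns a canned completion so the sketch runs."""
    return '{"vendor": "Acme Corp", "total": "199.00 EUR", "due": "2024-07-01"}'

def load_representative_inputs() -> list[str]:
    """Hypothetical loader; in practice read real production-like inputs
    from your own logs or data store."""
    return ["Invoice #1234 from Acme Corp, total 199.00 EUR, due 2024-07-01"]

# 1. Have the large teacher model complete/label your real task inputs.
records = []
for text in load_representative_inputs():
    completion = call_teacher(f"Extract the invoice fields as JSON:\n{text}")
    records.append({"prompt": text, "completion": completion})

# 2. Write prompt/completion pairs as JSONL, a format most supervised
#    fine-tuning tooling accepts.
with open("distillation_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# 3. Fine-tune the small student model on distillation_train.jsonl with your
#    preferred SFT trainer, then evaluate it on held-out production inputs
#    before routing real traffic to it.
```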