
Knowledge Distillation

ai_ml · Advanced

Also Known As

knowledge distillation · model compression · teacher-student training · model distillation

TL;DR

A compression technique where a smaller 'student' model is trained to mimic the outputs of a larger 'teacher' model, achieving comparable performance at a fraction of the inference cost.

Explanation

Knowledge distillation was introduced by Hinton et al. (2015) as a way to transfer knowledge from a large, expensive model (the teacher) into a small, fast model (the student). The key insight is that the teacher's soft probability outputs — the full distribution over all classes, not just the winning label — carry richer information than hard labels. A dog image might score 0.85 dog, 0.10 wolf, 0.04 fox: the near-misses tell the student about visual similarity that a hard label (dog) discards.

In the LLM era, distillation is used to create smaller models that approximate a frontier model's capabilities: the student is trained on the teacher's output token distributions (or sampled completions) rather than on raw human-labelled data. DistilBERT, for example, is a 40% smaller, 60% faster version of BERT that retains 97% of its performance, and the Mistral and Phi model families are reported to use distillation-inspired techniques.

For software engineers, the practical question is usually 'should I call a large expensive model or a small distilled one?' — the trade-off is cost and latency versus quality. Distillation is also used for task-specific fine-tuning: generate thousands of examples with GPT-4, then fine-tune a smaller model on them to create a cheap specialist.
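The soft-label objective is straightforward to write down. Below is a minimal sketch of the Hinton-style distillation loss, assuming PyTorch; the function name and the temperature/alpha defaults are illustrative choices, not a reference implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # Soften both distributions at temperature T so that near-miss
    # classes carry visible probability mass.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between student and teacher, scaled by T^2 to keep
    # gradient magnitudes comparable across temperatures (per the paper).
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction='batchmean') * temperature ** 2
    # Standard cross-entropy against the hard labels, blended in
    # with weight (1 - alpha).
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

The T² scaling matters: dividing logits by T shrinks their gradients by roughly a factor of T², so without it the soft and hard terms would sit on different scales.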

Diagram

flowchart TD
    TEACHER[Large Teacher Model<br/>e.g. GPT-4 or a 70B+ model] -->|soft probability outputs| DISTILL[Distillation Training]
    HARDLABEL[Human Labels] -->|optional supplement| DISTILL
    DISTILL --> STUDENT[Small Student Model<br/>e.g. 7B]
    subgraph Tradeoffs
        QUAL[95 percent teacher quality]
        COST[10-50x cheaper inference]
        LAT[Lower latency]
    end
    STUDENT --> QUAL & COST & LAT
style TEACHER fill:#f85149,color:#fff
style STUDENT fill:#238636,color:#fff
style COST fill:#238636,color:#fff

Common Misconception

'Distillation always produces a significantly worse model.' In practice, well-executed distillation on the right task often achieves 95%+ of teacher performance at 10–50x lower inference cost.

Why It Matters

Running a 70B parameter model for every user request is prohibitively expensive at scale — distilled models make AI features economically viable for high-traffic applications.

Common Mistakes

  • Distilling on the wrong task distribution — a student trained on the teacher's general outputs will not inherit specialist performance on domain-specific tasks not well-represented in the training data.
  • Using hard labels (sampled completions only) instead of soft labels (full token probability distributions) — soft labels transfer significantly more information.
  • Evaluating the student only on benchmark metrics without testing on real production inputs — benchmark gains may not transfer to your specific workload.
  • Skipping temperature calibration — a teacher's soft outputs at T=1 may be too peaked or too flat for effective distillation; T=4–10 is common to soften distributions (a small illustration follows this list).
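To illustrate the temperature point, here is a tiny demo (assuming PyTorch; the logits are made-up numbers for the dog/wolf/fox example above):

import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.5, 0.5])  # dog, wolf, fox (illustrative)
for T in (1.0, 4.0, 10.0):
    probs = F.softmax(logits / T, dim=-1)
    print(T, [round(p, 3) for p in probs.tolist()])
# T=1.0  -> [0.899, 0.074, 0.027]  (peaked: near-misses nearly invisible)
# T=4.0  -> [0.512, 0.274, 0.214]  (softened: similarity structure exposed)
# T=10.0 -> [0.403, 0.314, 0.284]  (very flat: classes barely distinguished)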

Avoid When

  • When task diversity is high and unpredictable — distilled specialists underperform on out-of-distribution inputs.
  • When you lack sufficient representative training examples — distillation quality depends heavily on coverage of the target task distribution.

When To Use

  • Use a large model to generate thousands of high-quality examples for your specific task, then fine-tune a smaller model — you get specialist performance at commodity cost (a sketch of this workflow follows the list).
  • Route narrow, repetitive tasks (classification, extraction, summarisation of a fixed schema) to a distilled specialist rather than a frontier model.
  • Evaluate the distilled model on your actual production inputs before committing — benchmark numbers are necessary but not sufficient.
  • Consider distillation when inference cost or latency is a bottleneck and the task is well-defined enough to collect representative training data.
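A hedged sketch of the generate-then-fine-tune workflow from the first bullet, in Python. Here `call_teacher` and `raw_inputs` are hypothetical placeholders for your teacher-model API client and task data, and the JSONL prompt/completion layout is one common fine-tuning format, not a requirement.

import json

def build_distillation_set(raw_inputs, call_teacher, out_path):
    # Write one JSON object per line: the raw input plus the
    # teacher's completion, ready for a fine-tuning pipeline.
    with open(out_path, 'w') as f:
        for text in raw_inputs:
            completion = call_teacher(text)  # e.g. a frontier-model API call
            f.write(json.dumps({'prompt': text,
                                'completion': completion}) + '\n')

Note that this is hard-label distillation (sampled completions only); as the Common Mistakes list warns, it transfers less information than full token distributions, but it is often the only option when the teacher sits behind a closed API.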
