Knowledge Distillation
debt(d9/e7/b5/t3)
Closest to 'silent in production until users hit it' (d9). The detection_hints state 'automated: no' and describe the pattern as 'large frontier model API calls on high-volume, narrow tasks where a fine-tuned smaller model would suffice.' There is no tool that flags this — it only surfaces when inference costs or latency become painful at scale in production. No linter, SAST, or type checker can detect this suboptimal architectural choice.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix describes generating a labelled dataset and fine-tuning a smaller model — this is not a one-line patch. It requires: collecting representative production data, running teacher model inference at scale, handling soft labels and temperature calibration, training and validating a student model, swapping out model-serving infrastructure, and re-evaluating on real workloads. This touches ML pipelines, infrastructure, and application code across multiple components.
Closest to 'persistent productivity tax' (b5). The choice to run large frontier models at inference time imposes ongoing cost and latency overhead across every request in web, CLI, and queue-worker contexts (all listed in applies_to). However, once a distilled model is deployed, the burden is largely lifted — it does not permanently shape every future code change the way a foundational architectural choice would, placing it at b5 rather than b7.
Closest to 'minor surprise (one edge case)' (t3). The misconception field states developers wrongly believe 'distillation always produces a significantly worse model,' when well-executed distillation often achieves 95%+ of teacher performance. This is a real but relatively contained misconception — it causes under-adoption rather than catastrophic misuse. Common mistakes (wrong task distribution, hard vs soft labels) are documented gotchas but not deeply counterintuitive to an ML-familiar developer.
Also Known As
TL;DR
Explanation
Knowledge distillation was introduced by Hinton et al. (2015) as a way to transfer knowledge from a large, expensive model (the teacher) into a small, fast model (the student). The key insight is that the teacher's soft probability outputs (the full distribution over all classes, not just the winning label) carry richer information than hard labels. A dog image might score 0.85 dog, 0.10 wolf, 0.04 fox, with the remainder spread over other classes: the near-misses tell the student about visual similarity that a hard label (dog) discards. In the LLM era, distillation is used to create smaller models that approximate a frontier model's capabilities: the student is trained on the teacher's output token distributions, or on sampled completions when only text output is available, rather than on raw human-labelled data. Examples: DistilBERT is about 40% smaller and 60% faster than BERT while retaining roughly 97% of its language-understanding performance, as reported by its authors; the Mistral and Phi model families are often described as using distillation-inspired techniques, such as training on synthetic data generated by larger models. For software engineers, the practical question is usually 'should I call a large expensive model or a small distilled one?', and the trade-off is cost and latency versus quality. Distillation is also used for task-specific fine-tuning: generate thousands of examples with GPT-4, then fine-tune a smaller model on them to create a cheap specialist.
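To make the soft-label idea concrete, here is a minimal sketch of a distillation loss in plain Python. It assumes the common formulation of a temperature-softened KL term blended with an ordinary hard-label cross-entropy term; the logits, the class order, and the `alpha` weighting are hypothetical values chosen for illustration, not taken from any particular model.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, true_index,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with a hard-label term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft term's scale comparable across temperatures.
    soft_term = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    # Ordinary cross-entropy against the true class, computed at T=1.
    hard_term = -math.log(softmax(student_logits)[true_index])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for the classes [dog, wolf, fox, cat]; the true class is dog.
teacher = [5.0, 2.9, 2.0, -1.0]
good_student = [4.0, 2.5, 1.5, -1.0]   # similar near-misses to the teacher
poor_student = [4.0, -1.0, -1.0, 2.5]  # right top class, implausible near-misses

print("good student:", distillation_loss(teacher, good_student, 0))
print("poor student:", distillation_loss(teacher, poor_student, 0))
```

Both students pick "dog", so the hard-label term alone does not penalise the second student's implausible near-misses. The soft term does, which is the extra signal distillation relies on.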
Diagram
flowchart TD
TEACHER[Large Teacher Model<br/>e.g. 70B parameters] -->|soft probability outputs| DISTILL[Distillation Training]
HARDLABEL[Human Labels] -->|optional supplement| DISTILL
DISTILL --> STUDENT[Small Student Model<br/>e.g. 7B]
subgraph Tradeoffs
QUAL[95 percent teacher quality]
COST[10-50x cheaper inference]
LAT[Lower latency]
end
STUDENT --> QUAL & COST & LAT
style TEACHER fill:#f85149,color:#fff
style STUDENT fill:#238636,color:#fff
style COST fill:#238636,color:#fff
Common Misconception
Why It Matters
Common Mistakes
- Distilling on the wrong task distribution — a student trained on the teacher's general outputs will not inherit specialist performance on domain-specific tasks not well-represented in the training data.
- Using hard labels (sampled completions only) when soft labels (full token probability distributions) are available: soft labels transfer significantly more information, so prefer them whenever the teacher exposes logits or token probabilities.
- Evaluating the student only on benchmark metrics without testing on real production inputs — benchmark gains may not transfer to your specific workload.
- Skipping temperature tuning: a teacher's soft outputs at T=1 may be too peaked or too flat for effective distillation. Temperatures in roughly the T=2–10 range are commonly used to soften distributions; treat T as a hyperparameter and tune it on validation data.
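The effect of temperature on a teacher's output can be checked directly. The sketch below uses hypothetical teacher logits and prints the top-class probability and the entropy of the softened distribution at a few temperatures; the specific values of T are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means the distribution is more spread out."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits for four classes (illustrative values only).
teacher_logits = [8.0, 4.0, 3.0, -2.0]

for t in (1.0, 4.0, 10.0):
    probs = softmax(teacher_logits, t)
    print(f"T={t:>4}: top-class prob={probs[0]:.3f}  entropy={entropy(probs):.3f}")
```

At T=1 almost all of the probability mass sits on the top class, so the student sees little of the near-miss structure. Raising T spreads the mass out, but a very high T flattens everything towards uniform, which is why the value is worth tuning rather than fixing.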
Avoid When
- When task diversity is high and unpredictable — distilled specialists underperform on out-of-distribution inputs.
- When you lack sufficient representative training examples — distillation quality depends heavily on coverage of the target task distribution.
When To Use
- Use a large model to generate thousands of high-quality examples for your specific task, then fine-tune a smaller model — you get specialist performance at commodity cost.
- Route narrow, repetitive tasks (classification, extraction, summarisation of a fixed schema) to a distilled specialist rather than a frontier model.
- Evaluate the distilled model on your actual production inputs before committing — benchmark numbers are necessary but not sufficient.
- Consider distillation when inference cost or latency is a bottleneck and the task is well-defined enough to collect representative training data.
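The workflow in the list above can be sketched as a small harness: label real inputs with the teacher, measure how often the student agrees on held-out production inputs, and only serve the student if agreement clears a threshold. Everything here is an assumption for illustration: the `Model` interface, the function names, the 0.95 threshold, and the toy keyword "models" that stand in for real teacher and student calls.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: text in, label out

def build_dataset(inputs: list[str], teacher: Model) -> list[dict[str, str]]:
    """Label real task inputs with the teacher to create fine-tuning records."""
    return [{"input": text, "label": teacher(text)} for text in inputs]

def agreement_rate(student: Model, held_out: list[dict[str, str]]) -> float:
    """Fraction of held-out records where the student matches the teacher."""
    if not held_out:
        raise ValueError("held_out must not be empty")
    matches = sum(student(r["input"]) == r["label"] for r in held_out)
    return matches / len(held_out)

def pick_model(student: Model, teacher: Model,
               held_out: list[dict[str, str]], threshold: float = 0.95) -> Model:
    """Serve the student only if it agrees with the teacher often enough."""
    return student if agreement_rate(student, held_out) >= threshold else teacher

# Toy stand-ins so the sketch runs without any API; replace with real model calls.
def toy_teacher(text: str) -> str:
    lowered = text.lower()
    return "refund" if "refund" in lowered or "money back" in lowered else "other"

def toy_student(text: str) -> str:
    return "refund" if "refund" in text.lower() else "other"

production_inputs = [
    "I want a refund for my order",
    "Where is my parcel?",
    "Can I get my money back?",
    "Refund please",
]
held_out = build_dataset(production_inputs, toy_teacher)
rate = agreement_rate(toy_student, held_out)
chosen = pick_model(toy_student, toy_teacher, held_out)
print(f"agreement={rate:.2f}, serving={chosen.__name__}")
```

Agreement with the teacher is only a proxy for quality. Where ground-truth labels or task-specific metrics exist, they should be part of the evaluation as well, and the threshold should reflect how costly a wrong answer is for the workload.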