{
    "slug": "model_distillation",
    "term": "Knowledge Distillation",
    "category": "ai_ml",
    "difficulty": "advanced",
    "short": "A compression technique where a smaller 'student' model is trained to mimic the outputs of a larger 'teacher' model, achieving comparable performance at a fraction of the inference cost.",
    "long": "Knowledge distillation was introduced by Hinton et al. (2015) as a way to transfer knowledge from a large, expensive model (the teacher) into a small, fast model (the student). The key insight is that the teacher's soft probability outputs — the full distribution over all classes, not just the winning label — carry richer information than hard labels. A dog image might score 0.85 dog, 0.10 wolf, 0.04 fox: the near-misses tell the student about visual similarity that a hard label (dog) discards. In the LLM era, distillation is used to create smaller models that approximate a frontier model's capabilities: the student is trained on the teacher's output token distributions (or sampled completions) rather than on raw human-labelled data. Examples: DistilBERT is a 40% smaller, 60% faster version of BERT retaining 97% of performance; Mistral and Phi model families use distillation-inspired techniques. For software engineers, the practical question is usually 'should I call a large expensive model or a small distilled one?' — the trade-off is cost/latency vs quality. Distillation is also used for task-specific fine-tuning: generate thousands of examples with GPT-4, then fine-tune a smaller model on them to create a cheap specialist.",
    "aliases": [
        "knowledge distillation",
        "model compression",
        "teacher-student training",
        "model distillation"
    ],
    "tags": [
        "ai",
        "llm",
        "inference",
        "optimisation",
        "machine-learning"
    ],
    "misconception": "Distillation always produces a significantly worse model — well-executed distillation on the right task often achieves 95%+ of teacher performance at 10–50x lower inference cost.",
    "why_it_matters": "Running a 70B parameter model for every user request is prohibitively expensive at scale — distilled models make AI features economically viable for high-traffic applications.",
    "common_mistakes": [
        "Distilling on the wrong task distribution — a student trained on the teacher's general outputs will not inherit specialist performance on domain-specific tasks not well-represented in the training data.",
        "Using hard labels (sampled completions only) instead of soft labels (full token probability distributions) — soft labels transfer significantly more information.",
        "Evaluating the student only on benchmark metrics without testing on real production inputs — benchmark gains may not transfer to your specific workload.",
        "Skipping temperature calibration — a teacher's soft outputs at T=1 may be too peaked or too flat for effective distillation; T=4–10 is common to soften distributions."
    ],
    "when_to_use": [
        "Use a large model to generate thousands of high-quality examples for your specific task, then fine-tune a smaller model — you get specialist performance at commodity cost.",
        "Route narrow, repetitive tasks (classification, extraction, summarisation of a fixed schema) to a distilled specialist rather than a frontier model.",
        "Evaluate the distilled model on your actual production inputs before committing — benchmark numbers are necessary but not sufficient.",
        "Consider distillation when inference cost or latency is a bottleneck and the task is well-defined enough to collect representative training data."
    ],
    "avoid_when": [
        "When task diversity is high and unpredictable — distilled specialists underperform on out-of-distribution inputs.",
        "When you lack sufficient representative training examples — distillation quality depends heavily on coverage of the target task distribution."
    ],
    "related": [
        "large_language_models",
        "fine_tuning",
        "ai_cost_management",
        "llm_temperature_sampling",
        "machine_learning_types"
    ],
    "prerequisites": [
        "large_language_models",
        "machine_learning_types",
        "fine_tuning"
    ],
    "refs": [
        "https://arxiv.org/abs/1503.02531",
        "https://huggingface.co/docs/transformers/model_doc/distilbert"
    ],
    "quick_fix": "Use a large model to generate a labelled dataset for your specific task, then fine-tune a smaller model on it — this is practical distillation without needing access to model internals",
    "severity": "info",
    "effort": "high",
    "created": "2026-03-29",
    "updated": "2026-03-29",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/model_distillation",
        "html_url": "https://codeclaritylab.com/glossary/model_distillation",
        "json_url": "https://codeclaritylab.com/glossary/model_distillation.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Knowledge Distillation](https://codeclaritylab.com/glossary/model_distillation) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/model_distillation"
            }
        }
    }
}