13  Distillation

Here is a paradox: GPT-4 can solve complex math problems, generate working code, and write poetry, but it requires a data center to run. Meanwhile, your phone has a neural engine that can run a 3B model in real time. The question is: can you transfer GPT-4's “knowledge” into a model small enough to fit on your phone? This is the promise of knowledge distillation, and the answer, remarkably, is “partially yes.”

Knowledge distillation is one of the most elegant ideas in machine learning. Proposed by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean (Hinton et al. 2015), it has become a cornerstone of practical AI deployment. The core insight is deceptively simple: a larger model's outputs carry more information than raw training labels, so a smaller model can learn more effectively by imitating the larger model than by learning from the original data alone.

13.1 Classical Knowledge Distillation

In the original formulation, a large pre-trained teacher model generates “soft labels”: probability distributions over all possible outputs. A smaller student model is then trained to match these soft labels rather than (or in addition to) the hard ground-truth labels.

Why are soft labels better? Consider a digit classifier. Given an image of a “7”, the hard label says “this is a 7, nothing else.” But the teacher's soft output might say “95% chance of 7, 3% chance of 1, 1% chance of 9, 0.5% chance of 4.” This soft distribution reveals the teacher's knowledge about which mistakes are more reasonable: a 7 looks more like a 1 than a 3. These “dark knowledge” relationships are invisible in hard labels.

Tip: Hinton's Metaphor

Imagine a master craftsman training an apprentice. The master does not just show the correct answer; they demonstrate the process, reveal common pitfalls, and share intuitions about which approaches are promising and which are dead ends. The apprentice learns far more from watching the master work than from just seeing correct examples. Soft labels play the same role: they are the teacher model's “body language,” conveying knowledge that goes beyond the visible answer.

The training loss combines the standard hard-label loss with a distillation loss: \[\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{hard}} + (1-\alpha) \cdot T^2 \cdot \text{KL}\!\left(\sigma\!\left(\frac{z_T}{T}\right) \;\Big\|\; \sigma\!\left(\frac{z_S}{T}\right)\right)\] where \(z_T\) and \(z_S\) are the teacher and student logits, \(T\) is the temperature parameter, \(\sigma\) is the softmax function, and \(\alpha\) balances the two objectives.
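The loss above can be sketched for a single training example in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `log_softmax` and `kd_loss` and the default values of `alpha` and `temperature` are chosen here for illustration, and a real training loop would use a tensor library with batching and automatic differentiation.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log of softmax(logits / temperature), computed stably via log-sum-exp."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def kd_loss(student_logits, teacher_logits, hard_label, alpha=0.5, temperature=4.0):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label term: cross-entropy of the student's T=1 distribution.
    hard = -log_softmax(student_logits)[hard_label]

    # Soft term: KL divergence between the temperature-softened distributions.
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return alpha * hard + (1 - alpha) * temperature ** 2 * kl

# Hypothetical logits for a 3-class problem; the true class is index 0.
print(kd_loss([2.0, 1.0, 0.1], [4.0, 2.5, 0.2], hard_label=0))
```

Setting `alpha=1.0` recovers ordinary supervised training, and `alpha=0.0` trains on the teacher's soft labels alone.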

The temperature \(T\) is crucial. At \(T=1\), the teacher's distribution is typically sharply peaked: the correct class dominates. At higher temperatures (\(T=4, 10, 20\)), the distribution is “softened,” making the relationships between classes more visible. The \(T^2\) factor compensates for the soft-target gradients shrinking roughly as \(1/T^2\), so the balance set by \(\alpha\) stays comparable when \(T\) changes.
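One way to see where the \(T^2\) factor comes from is to differentiate the soft term with respect to a single student logit \(z_{S,i}\). The sketch below follows the argument in Hinton et al. (2015), rewritten in this chapter's notation. The symbol \(K\) (the number of classes) is introduced here for illustration, and the approximation assumes logits that are small relative to \(T\) and zero-mean across classes.

\[\frac{\partial}{\partial z_{S,i}} \, \text{KL}\!\left(\sigma\!\left(\frac{z_T}{T}\right) \;\Big\|\; \sigma\!\left(\frac{z_S}{T}\right)\right) = \frac{1}{T}\left(\sigma\!\left(\frac{z_S}{T}\right)_i - \sigma\!\left(\frac{z_T}{T}\right)_i\right) \approx \frac{1}{K\,T^2}\left(z_{S,i} - z_{T,i}\right)\]

Under these assumptions the gradient of the soft term shrinks as \(1/T^2\), so multiplying it by \(T^2\) keeps its contribution on a scale comparable to the hard-label term.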

Note: Choosing the Temperature

A common rule of thumb: start with \(T=4\) and experiment. Too low (close to 1), and the soft labels are too similar to hard labels, so distillation adds little value. Too high (above 20), and the distribution is nearly uniform, washing out the useful information. The optimal temperature depends on the teacher's confidence: a less certain teacher benefits from lower temperatures, while a highly confident teacher benefits from higher ones.
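To get a feel for these regimes, it can help to print one teacher distribution at several temperatures. The snippet below is a small sketch: the logits are hypothetical values for a digit classifier, and entropy is used here only as a rough measure of how "soft" the distribution has become.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means closer to uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits for the classes 7, 1, 9, 4, 3 (in that order).
teacher_logits = [9.0, 5.5, 4.4, 3.7, 0.0]

for T in (1, 4, 20):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T:>2}  probs={[round(p, 3) for p in probs]}  entropy={entropy(probs):.3f}")
```

As \(T\) grows, the top probability falls and the entropy approaches that of a uniform distribution, which is the "washed out" regime described above.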

13.2 Distilling Large Language Models

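Distilling LLMs presents unique challenges compared to classical distillation (Xu et al. 2024). Three approaches are common: matching the teacher's logits, training on teacher-generated data, and on-policy distillation.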

13.2.1 Logit-Based LLM Distillation

The direct approach: run the teacher on the training corpus and have the student match the teacher's next-token probability distribution at every position. This transfers the teacher's full predictive distribution, including its uncertainty about ambiguous continuations.

The practical challenge: storing logits for an entire training corpus is expensive. With a 32K-token vocabulary, every token position needs roughly 32,000 values where the text itself needs a single token ID, so the stored logits are orders of magnitude larger than the training data. Solutions include caching only the top-\(k\) logits, using online distillation (generating logits on-the-fly), or compressing the logits.
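A back-of-the-envelope calculation makes the storage problem concrete. The sketch below assumes a corpus of one billion tokens, 2-byte (fp16) logit values, and 2-byte token indices; these figures and the function names are illustrative assumptions, not numbers from a specific system.

```python
import heapq

def full_logit_bytes(num_tokens, vocab_size, bytes_per_value=2):
    """Storage for one full logit vector per token position."""
    return num_tokens * vocab_size * bytes_per_value

def topk_logit_bytes(num_tokens, k, bytes_per_value=2, bytes_per_index=2):
    """Storage when only the k largest logits and their token ids are kept."""
    return num_tokens * k * (bytes_per_value + bytes_per_index)

def top_k(logits, k):
    """Return the k largest logits as (token_id, logit) pairs, largest first."""
    return heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])

num_tokens = 10**9      # assumed corpus size
vocab_size = 32_000     # the 32K vocabulary from the text

print(full_logit_bytes(num_tokens, vocab_size) / 1e12, "TB for full logits")  # 64.0
print(topk_logit_bytes(num_tokens, k=50) / 1e9, "GB for top-50 logits")       # 200.0
print(top_k([0.1, 3.0, -1.0, 2.0], k=2))                                      # [(1, 3.0), (3, 2.0)]
```

Under these assumptions the same corpus stored as 2-byte token IDs takes about 2 GB, so full logits are larger by a factor equal to the vocabulary size. Truncating to the top \(k\) discards the tail of the distribution, so the student's loss has to renormalize or otherwise account for the missing mass.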

13.2.2 Data-Based Distillation (API Distillation)

What if you do not have access to the teacher's weights or logits? You can still distill by using the teacher as a data generator. Feed prompts to the teacher's API, collect the responses, and train the student on this synthetic data via standard supervised fine-tuning.

This is how many of the most successful open models were created:

  • Alpaca (Taori et al. 2023): Stanford fine-tuned LLaMA 7B on 52K instruction-response pairs generated by OpenAI's text-davinci-003, a GPT-3.5-series model.
  • Vicuna: Fine-tuned LLaMA on 70K conversations shared by ChatGPT users.
  • Microsoft's Phi series: Used carefully curated synthetic data generated by OpenAI models (GPT-3.5 and GPT-4 class) to train small but remarkably capable models (1.3B to 14B parameters).
  • DeepSeek-R1 distillation: DeepSeek distilled their RL-trained reasoning model into smaller students (1.5B to 70B parameters), transferring chain-of-thought reasoning capabilities.
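The data-generation step behind these projects can be sketched in a few lines. Everything named here is hypothetical: `teacher_generate` stands for whatever function wraps your teacher (an API client or a local model), `fake_teacher` is a stand-in so the example runs offline, and the `prompt`/`response` record layout is one common convention rather than a required format.

```python
import json

def build_sft_records(prompts, teacher_generate):
    """Query the teacher once per prompt and keep the non-empty responses."""
    records = []
    for prompt in prompts:
        response = teacher_generate(prompt)
        if response and response.strip():          # drop empty generations
            records.append({"prompt": prompt, "response": response.strip()})
    return records

def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def fake_teacher(prompt):
    # Hypothetical stand-in for a real teacher call.
    return f"(teacher answer to: {prompt})"

records = build_sft_records(
    ["Explain overfitting in one sentence.", "What is a learning rate?"],
    fake_teacher,
)
print(to_jsonl(records))
```

The resulting file feeds a standard supervised fine-tuning run. In practice most of the effort goes into prompt diversity and into filtering low-quality responses, which this sketch reduces to a single empty-string check.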

Caution: The Licensing Catch

Many frontier model terms of service explicitly prohibit using their outputs to train competing models. OpenAI's terms, for example, prohibit using GPT-4 outputs to “develop any artificial intelligence models that compete with our products and services.” Always check licensing before distilling from commercial APIs. Open-weight models (LLaMA, Mistral, Qwen) with permissive licenses are generally safer choices as teachers, but read each license carefully, since some also place conditions on how model outputs may be used.

13.2.3 On-Policy Distillation

In on-policy distillation, the student generates its own outputs, and the teacher provides feedback. This is closer to how reinforcement learning from human feedback (RLHF) works: instead of a human labeler, the teacher model scores the student's outputs. The teacher's log-probabilities serve as a reward signal, guiding the student toward the teacher's behavior.

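This approach has an advantage over offline distillation: the student learns from its own distribution of outputs, not the teacher's. At inference time the student must continue from its own, possibly flawed, prefixes, and training only on teacher-written text never shows it those states. Learning on its own samples reduces this train-test distribution mismatch and often produces better results, especially for tasks where the student's failure modes differ from the teacher's.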
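The core quantity can be illustrated on a single next-token distribution. The sketch below is a simplification under stated assumptions: it samples tokens from the student, scores them with the teacher's log-probabilities, and averages the difference, which is a Monte Carlo estimate of the reverse KL divergence \(\text{KL}(\text{student} \,\|\, \text{teacher})\). Reverse KL is one common choice of on-policy objective, not the only one. Real systems sample whole sequences and backpropagate through the student; the function names and probabilities here are illustrative.

```python
import math
import random

def sample_token(probs, rng):
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def on_policy_reverse_kl(student_probs, teacher_probs, num_samples=10_000, seed=0):
    """Estimate KL(student || teacher) from samples drawn from the student."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        token = sample_token(student_probs, rng)       # the student acts
        teacher_score = math.log(teacher_probs[token]) # the teacher scores the action
        total += math.log(student_probs[token]) - teacher_score
    return total / num_samples

# Hypothetical next-token distributions over a 3-token vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]
print(on_policy_reverse_kl(student, teacher))
```

Minimizing this estimate pushes the student away from tokens that it favors but the teacher considers unlikely, which is the feedback loop described above.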

13.3 Self-Distillation

In self-distillation, the teacher and student share the same architecture. The model is trained, then its outputs are used as soft labels to train a fresh copy of the same model. This process can be repeated across multiple “generations”: \[\text{Model}_0 \xrightarrow{\text{distill}} \text{Model}_1 \xrightarrow{\text{distill}} \text{Model}_2 \xrightarrow{\text{distill}} \cdots\]

Counterintuitively, this can improve performance even without a larger teacher. The likely explanation: self-distillation acts as a regularizer, smoothing the model's learned representations and reducing overfitting to noisy training labels. The soft labels from the previous generation “average out” some of the noise in the original hard labels.
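The generational loop itself is simple, as the toy sketch below shows. It uses a one-feature logistic regression in plain Python purely to make the data flow runnable; the dataset, hyperparameters, and function names are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_logistic(xs, targets, epochs=500, lr=0.5):
    """Fit p(y=1|x) = sigmoid(w*x + b) to targets in [0, 1] by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, t in zip(xs, targets):
            err = sigmoid(w * x + b) - t   # gradient of cross-entropy w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def self_distill(xs, hard_labels, generations=3):
    """Train generation 0 on hard labels, then each new model on its predecessor's soft labels."""
    targets = list(hard_labels)
    models = []
    for _ in range(generations + 1):
        w, b = train_logistic(xs, targets)             # always a fresh model
        models.append((w, b))
        targets = [sigmoid(w * x + b) for x in xs]     # soft labels for the next generation
    return models

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
noisy_labels = [0, 0, 1, 0, 1, 1]                      # two labels flipped on purpose
for generation, (w, b) in enumerate(self_distill(xs, noisy_labels)):
    print(generation, round(w, 3), round(b, 3))
```

In this convex toy each generation can at best reproduce its predecessor, so do not expect accuracy gains here; the regularization effect described above is reported for deep networks, where the soft labels change what a freshly initialized model learns.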

Tip: Born-Again Neural Networks

Furlanello et al. coined the term “Born-Again Networks” for models trained through self-distillation. In their experiments the student often outperformed the teacher, even though the two share the exact same architecture. The student benefits from the softer, more informative training signal. It is like giving a student a textbook written by a slightly more experienced version of themselves.

13.4 Dataset Distillation

A different take on distillation: instead of compressing the model, compress the dataset. Dataset distillation learns a small set of synthetic training examples such that a model trained only on them performs nearly as well as one trained on the full dataset.

The idea is appealing: if you can represent the essence of ImageNet (1.2M images) in just 10,000 synthetic images, you can train new models much faster. Current methods include gradient matching (the synthetic data should produce similar gradients to the real data) and trajectory matching (the training trajectory on synthetic data should match that on real data).
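Gradient matching can be illustrated with a one-parameter model. The sketch below only evaluates the matching objective for two candidate synthetic sets; an actual dataset-distillation method would optimize the synthetic examples to minimize it, usually with a neural network in place of the model \(y = w x\). The data, probe weights, and function names are illustrative assumptions.

```python
def mse_gradient(w, data):
    """Gradient of mean squared error for the model y = w * x, with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def gradient_matching_loss(real_data, synthetic_data, probe_weights):
    """Mean squared gap between real and synthetic gradients at several weights."""
    gaps = [
        (mse_gradient(w, real_data) - mse_gradient(w, synthetic_data)) ** 2
        for w in probe_weights
    ]
    return sum(gaps) / len(gaps)

# "Real" data drawn from y = 3x, and two single-example synthetic candidates.
real = [(x, 3.0 * x) for x in (-1.5, -1.0, -0.5, 0.5, 1.0, 1.5)]
good_candidate = [(1.0, 3.0)]     # consistent with y = 3x
bad_candidate = [(1.0, -3.0)]     # contradicts the real data

probes = [0.0, 1.0, 2.0, 4.0]
print(gradient_matching_loss(real, good_candidate, probes))
print(gradient_matching_loss(real, bad_candidate, probes))
```

The candidate that agrees with the real data yields the smaller loss, meaning that training on it would push the model in nearly the same direction as training on the full set.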

13.5 Progressive and Multi-Stage Distillation

Distilling a 70B model directly into a 1B student often fails: the capacity gap is simply too large. Progressive distillation solves this by distilling in stages:

\[\text{70B} \xrightarrow{\text{distill}} \text{13B} \xrightarrow{\text{distill}} \text{7B} \xrightarrow{\text{distill}} \text{1B}\]

Each stage transfers knowledge to a model that is just small enough to benefit from the teacher but close enough in capacity to absorb most of the information. The intermediate models serve as “teaching assistants” that translate the teacher's knowledge into a form the final student can learn.

Note: The Teaching Assistant Analogy

A Nobel laureate explaining quantum mechanics to a first-year undergraduate often fails, not because the laureate lacks knowledge, but because the gap in understanding is too large. A graduate student (the teaching assistant) bridges the gap: they understand both the laureate's insights and the undergraduate's confusion. Multi-stage distillation works the same way.

Layer-wise distillation is a related technique where the student matches not just the final output, but also the intermediate representations at selected layers. The student's layer \(i\) is trained to produce activations similar to the teacher's layer \(j\), where \(j\) is chosen by a simple mapping such as \(j = i \times \text{depth\_ratio}\), with \(\text{depth\_ratio}\) the teacher's layer count divided by the student's. This provides a much richer training signal than output-only distillation.
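The layer mapping and the matching loss can be sketched as follows. This is a minimal illustration with several assumptions: layers are numbered from 1, the product is rounded to the nearest teacher layer, hidden states are plain lists of floats, and the teacher and student share the same hidden width (otherwise a learned projection would be needed before comparing them). Mean squared error is one common choice of distance, not the only one.

```python
def map_layers(num_student_layers, num_teacher_layers):
    """Map student layer i (1-based) to teacher layer j = round(i * depth_ratio)."""
    depth_ratio = num_teacher_layers / num_student_layers
    return {
        i: max(1, round(i * depth_ratio))   # clamp so j is always a valid layer
        for i in range(1, num_student_layers + 1)
    }

def hidden_state_mse(student_hidden, teacher_hidden):
    """Mean squared error between two hidden-state vectors of equal length."""
    return sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)) / len(student_hidden)

def layerwise_loss(student_layers, teacher_layers):
    """Average hidden-state MSE over all mapped (student, teacher) layer pairs."""
    mapping = map_layers(len(student_layers), len(teacher_layers))
    losses = [
        hidden_state_mse(student_layers[i - 1], teacher_layers[j - 1])
        for i, j in mapping.items()
    ]
    return sum(losses) / len(losses)

print(map_layers(6, 12))   # {1: 2, 2: 4, 3: 6, 4: 8, 5: 10, 6: 12}
```

With this mapping the student's last layer always aligns with the teacher's last layer, and the remaining layers are spread evenly through the teacher's depth.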

Attention transfer specifically asks the student to match the teacher's attention patterns. Since attention maps encode syntactic and semantic relationships, transferring them helps the student learn how to process information, not just what answers to produce.

13.6 Evaluating Distilled Models

How do you know if distillation worked? The evaluation must go beyond a single accuracy number:

Perplexity on held-out data: The most basic metric. Compare the student's perplexity against (a) the teacher, (b) a model of the same size trained from scratch, and (c) a model of the same size fine-tuned on the same data without a teacher. The distilled student should reach lower perplexity than the from-scratch model.
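Perplexity is the exponential of the average negative log-likelihood per token, so it can be computed directly from the log-probabilities a model assigns to the held-out tokens. The sketch below assumes natural-log probabilities; the per-model numbers are invented placeholders to show the comparison, not measured results.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token log-probabilities on the same held-out text.
held_out_log_probs = {
    "teacher":            [-1.9, -2.1, -1.7, -2.0],
    "distilled student":  [-2.4, -2.6, -2.2, -2.5],
    "from-scratch model": [-2.9, -3.1, -2.8, -3.0],
}

for name, log_probs in held_out_log_probs.items():
    print(f"{name:<20} perplexity = {perplexity(log_probs):.2f}")
```

Perplexities are only comparable between models that share a tokenizer and are scored on the same text, which is worth checking before comparing a student against a teacher from a different model family.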

Task-specific benchmarks: Evaluate on downstream tasks (MMLU, HumanEval, GSM8K) to measure whether the teacher's capabilities transferred. Some capabilities distill more easily than others: factual recall transfers well, but complex multi-step reasoning often does not.

Calibration: A well-distilled model should be calibrated: when it says it is 80% confident, it should be right about 80% of the time. Distillation can either improve or degrade calibration depending on how it is done.
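One widely used summary of calibration is the expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence to its accuracy. The sketch below is a minimal version with equal-width bins; the bin count and function name are choices made here for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)   # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

# 80% confidence and 4 of 5 correct: well calibrated, ECE close to 0.
print(expected_calibration_error([0.8] * 5, [True, True, True, True, False]))
# 90% confidence and always wrong: badly overconfident, ECE close to 0.9.
print(expected_calibration_error([0.9] * 4, [False] * 4))
```

Computing ECE for both teacher and student on the same evaluation set shows whether distillation preserved, improved, or degraded calibration.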

Edge cases and robustness: Test the student on adversarial inputs, out-of-distribution data, and edge cases. Distilled models sometimes inherit the teacher's strengths but not its robustness.

Tip: The Gap Analysis

The most useful evaluation is a gap analysis: for each task category, measure the performance gap between teacher and student. You will typically find that some capabilities transfer almost perfectly (e.g., basic QA, summarization) while others do not (e.g., complex reasoning, code generation). This gap analysis tells you exactly where the student's weaknesses lie and whether additional fine-tuning is needed.
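A gap analysis needs little more than a per-category score table. The sketch below assumes you already have teacher and student scores on the same benchmarks, with higher being better and teacher scores above zero. The category names and numbers are invented placeholders, not measured results.

```python
def gap_analysis(teacher_scores, student_scores):
    """Return (category, teacher, student, gap, retained fraction), largest gap first."""
    rows = []
    for category, teacher in teacher_scores.items():
        student = student_scores[category]
        rows.append((category, teacher, student, teacher - student, student / teacher))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Hypothetical accuracy per task category.
teacher_scores = {"basic QA": 0.92, "summarization": 0.88, "reasoning": 0.81, "code": 0.74}
student_scores = {"basic QA": 0.90, "summarization": 0.85, "reasoning": 0.58, "code": 0.49}

for category, teacher, student, gap, retained in gap_analysis(teacher_scores, student_scores):
    print(f"{category:<14} teacher={teacher:.2f} student={student:.2f} "
          f"gap={gap:.2f} retained={retained:.0%}")
```

Sorting by gap puts the categories that most need additional fine-tuning or extra distillation data at the top of the report.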

13.7 Distillation vs. Fine-Tuning

Both distillation and fine-tuning adapt a model's behavior, but they differ in important ways:

  • Fine-tuning optimizes against ground-truth labels. It requires labeled data specific to your task.
  • Distillation optimizes against a teacher's distribution. It requires a capable teacher but not necessarily task-specific labels.
  • Distillation transfers richer information: The teacher's soft labels encode uncertainty, inter-class relationships, and implicit world knowledge that hard labels cannot capture.
  • They are often combined: Distill a teacher into a student, then fine-tune the student on task-specific data. This gives you both broad knowledge transfer and task-specific optimization.

13.8 Practical Considerations

Teacher-student capacity gap: If the student is too small relative to the teacher, distillation fails. A 124M student cannot absorb all the knowledge of a 175B teacher. The solution: use an intermediate-sized “teaching assistant” to bridge the gap, distilling in stages (175B \(\to\) 13B \(\to\) 1.3B).

Which layers to match: Beyond matching output logits, you can match intermediate representations (hidden states, attention patterns). This “hint-based” distillation provides additional training signal but requires the teacher and student to have compatible architectures, or a learned projection that maps the student's hidden size onto the teacher's.

Curriculum: Start with easy examples and gradually increase difficulty. The student learns the basics from easy examples and refines its understanding on hard ones. This mirrors how human education works.
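A simple curriculum can be built by sorting examples with a difficulty score and splitting them into stages. The sketch below is one possible arrangement under stated assumptions: the `difficulty` function is hypothetical (the teacher's loss on each example and sequence length are plausible proxies), and the number of stages is arbitrary.

```python
def curriculum_stages(examples, difficulty, num_stages=3):
    """Sort examples from easy to hard and split them into roughly equal stages."""
    ordered = sorted(examples, key=difficulty)
    if not ordered:
        return []
    stage_size = -(-len(ordered) // num_stages)   # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

# Hypothetical examples paired with a precomputed difficulty score.
examples = [("ex_a", 2.3), ("ex_b", 0.4), ("ex_c", 1.1), ("ex_d", 3.0), ("ex_e", 0.7), ("ex_f", 1.8)]

for stage, batch in enumerate(curriculum_stages(examples, difficulty=lambda ex: ex[1]), start=1):
    print(f"stage {stage}: {[name for name, _ in batch]}")
```

Training then proceeds stage by stage. A common variant keeps earlier stages in the mix when moving to harder ones, so the student does not forget the easy cases.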

13.9 Exercises

  1. Distill GPT-2 Large (774M) into GPT-2 Small (124M) using logit-based KD. Compare the student's perplexity on a held-out set against (a) a GPT-2 Small trained from scratch on the same data, and (b) a GPT-2 Small fine-tuned on the data. Does the distilled student outperform both?
  2. Generate 50,000 instruction-response pairs using a large open model (e.g., LLaMA 3 70B or Qwen 2.5 72B). Fine-tune a 7B model on these synthetic examples and evaluate on MMLU (Hendrycks et al. 2021). Compare with the same 7B model fine-tuned on an equivalent amount of human-written data.
  3. Experiment with different temperature values (\(T = 1, 2, 4, 10, 20\)). Plot the student's final perplexity as a function of temperature. What is the optimal value for your setup?
  4. Perform self-distillation: take a trained GPT-2 Small, use its outputs as soft labels, and train a new GPT-2 Small on these soft labels. Repeat for three generations. Does performance improve with each generation? When does it plateau?
  5. Find a case where distillation fails: try to distill a 70B model into a 0.5B model in one step. Measure the quality gap. Then try two-stage distillation (70B \(\to\) 7B \(\to\) 0.5B). Does staging help?

References

Hendrycks, Dan, Collin Burns, Steven Basart, et al. 2021. “Measuring Massive Multitask Language Understanding.” arXiv Preprint arXiv:2009.03300.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv Preprint arXiv:1503.02531.
Taori, Rohan, Ishaan Gulrajani, Tianyi Zhang, et al. 2023. “Stanford Alpaca: An Instruction-Following LLaMA Model.” GitHub.
Xu, Xiaohan, et al. 2024. “A Survey on Knowledge Distillation of Large Language Models.” arXiv Preprint arXiv:2402.13116.