Modern large language models are no longer trained solely on raw web text. Increasingly, companies are using powerful "teacher" models to help train smaller or more efficient "student" models. This process, broadly known as LLM distillation or model-to-model training, has become a key technique for building high-performing models at lower computational cost. Meta used its huge Llama 4 Behemoth model to help train Llama 4 Scout and Maverick, while Google leveraged Gemini models during the development of Gemma 2 and Gemma 3. Similarly, DeepSeek distilled reasoning capabilities from DeepSeek-R1 into smaller Qwen- and Llama-based models.
The core idea is simple: instead of learning only from human-written text, a student model can learn from the outputs, probabilities, reasoning traces, or behaviors of another LLM. This allows smaller models to inherit capabilities such as reasoning, instruction following, and structured generation from much larger systems. Distillation can happen during pre-training, where teacher and student models are trained together, or during post-training, where a fully trained teacher transfers knowledge to a separate student model.
In this article, we'll explore three main approaches for training one LLM with another: soft-label distillation, where the student learns from the teacher's probability distributions; hard-label distillation, where the student imitates the teacher's generated outputs; and co-distillation, where multiple models learn collaboratively by sharing predictions and behaviors during training.


Soft-Label Distillation
Soft-label distillation is a training technique in which a smaller student LLM learns by imitating the output probability distribution of a larger teacher LLM. Instead of training only on the correct next token, the student is trained to match the teacher's softmax probabilities across the entire vocabulary. For example, if the teacher predicts the next token with probabilities like "cat" = 70%, "dog" = 20%, and "animal" = 10%, the student learns not just the final answer, but also the relationships and uncertainty between different tokens. This richer signal is often called the teacher's "dark knowledge" because it contains hidden information about reasoning patterns and semantic understanding.
The biggest advantage of soft-label distillation is that it allows smaller models to inherit capabilities from much larger models while remaining faster and cheaper to deploy. Because the student learns from the teacher's full probability distribution, training is more stable and informative than learning from hard one-hot targets alone. However, this method also comes with practical challenges. To generate soft labels, you need access to the teacher model's logits or weights, which is often not possible with closed-source models. In addition, storing probability distributions for every token across vocabularies of 100k+ entries becomes extremely memory-intensive at LLM scale, making pure soft-label distillation expensive for trillion-token datasets.
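To make this concrete, here is a minimal PyTorch-style sketch of a soft-label distillation loss, assuming you have access to both models' logits. The temperature scaling and the function name are illustrative additions for this sketch, not details from any particular implementation.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened token distributions."""
    # Soften both distributions with the same temperature so low-probability
    # tokens (the "dark knowledge") contribute more to the gradient.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy example: 4 token positions over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = soft_label_distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In a real training loop, this term is usually mixed with the ordinary next-token cross-entropy loss on the ground-truth data.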
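A quick back-of-the-envelope calculation (illustrative numbers, assuming the full distribution is stored densely in fp16) shows why this becomes impractical at pre-training scale:

```python
vocab_size = 128_000                # a typical modern LLM vocabulary (illustrative)
bytes_per_prob = 2                  # fp16
corpus_tokens = 1_000_000_000_000   # a trillion-token pre-training corpus

total_bytes = vocab_size * bytes_per_prob * corpus_tokens
print(f"{total_bytes / 1e15:,.0f} PB")  # ~256 PB if every distribution is stored densely
```

In practice, teams typically compute teacher logits on the fly or keep only the top-k probabilities per position to make this manageable.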


Hard-Label Distillation
Hard-label distillation is a simpler approach in which the student LLM learns only from the teacher model's final predicted output tokens instead of its full probability distribution. In this setup, a pre-trained teacher model generates the most likely next token or response, and the student model is trained with standard supervised learning to reproduce that output. The teacher essentially acts as a high-quality annotator that creates synthetic training data for the student. DeepSeek used this approach to distill reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama 3.1 models.
Unlike soft-label distillation, the student does not see the teacher's internal confidence scores or token relationships; it only learns the final answer. This makes hard-label distillation computationally much cheaper and easier to implement, since there is no need to store huge probability distributions for every token. It is also especially useful when working with proprietary "black-box" models like the GPT-4 APIs, where developers only have access to generated text and not the underlying logits. While hard labels carry less information than soft labels, they remain highly effective for instruction tuning, reasoning datasets, synthetic data generation, and domain-specific fine-tuning tasks.
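Conceptually the pipeline is just "generate with the teacher, then fine-tune the student on the generated text." A minimal sketch with the Hugging Face Transformers API might look like the following; the model names, prompt, and the assumption of a shared tokenizer are placeholders, not the actual DeepSeek or Meta setups.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "teacher-model"   # placeholder for a large instruction-tuned model
student_name = "student-model"   # placeholder for a smaller model to fine-tune

tokenizer = AutoTokenizer.from_pretrained(teacher_name)  # assumes a shared tokenizer
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name)

prompt = "Explain why the sky is blue in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

# 1. The teacher acts as an annotator and writes the synthetic target response.
with torch.no_grad():
    generated = teacher.generate(**inputs, max_new_tokens=64, do_sample=False)

# 2. Standard supervised fine-tuning: cross-entropy on the teacher's tokens only.
#    Masking the prompt positions with -100 excludes them from the loss.
labels = generated.clone()
labels[:, : inputs["input_ids"].shape[1]] = -100
loss = student(input_ids=generated, labels=labels).loss
loss.backward()  # in practice this sits inside a full training loop with an optimizer
```

The same recipe works with a closed API as the teacher: instead of calling generate() locally, you collect responses over the API and fine-tune the student on the resulting prompt-response pairs.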


Co-Distillation
Co-distillation is a training approach in which the teacher and student models are trained together instead of using a fixed pre-trained teacher. In this setup, the teacher LLM and student LLM process the same training data simultaneously and each generate their own softmax probability distributions. The teacher is trained normally on the ground-truth hard labels, while the student learns by matching the teacher's soft labels alongside the actual correct answers. Meta used a form of this approach while training Llama 4 Scout and Maverick alongside the larger Llama 4 Behemoth model.
One challenge with co-distillation is that the teacher model is not fully trained during the early stages, so its predictions may initially be noisy or inaccurate. To compensate, the student is typically trained with a combination of soft-label distillation loss and standard hard-label cross-entropy loss. This creates a more stable learning signal while still allowing knowledge transfer between models. Unlike traditional one-way distillation, co-distillation lets both models improve together during training, often leading to better performance, stronger reasoning transfer, and smaller performance gaps between the teacher and student models.
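Assuming the teacher and student share a vocabulary and are Hugging Face-style causal LMs that return logits, one joint training step could be sketched roughly as follows. The loss weight alpha and the temperature are illustrative choices, not values reported for Llama 4.

```python
import torch.nn.functional as F

def co_distillation_step(teacher, student, input_ids, labels, alpha=0.5, temperature=2.0):
    """One joint step: the teacher learns from ground truth, the student from both signals."""
    # Both models process the same batch and produce their own logits.
    teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    vocab = student_logits.size(-1)

    # The teacher is trained normally on the ground-truth hard labels
    # (labels are assumed to be already shifted for next-token prediction).
    teacher_loss = F.cross_entropy(teacher_logits.view(-1, vocab), labels.view(-1))

    # The student mixes hard-label cross-entropy with a soft-label KL term against
    # the still-improving teacher; detach() keeps the student's gradient from
    # flowing back into the teacher through the soft targets.
    hard_loss = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab),
        F.softmax(teacher_logits.detach() / temperature, dim=-1).view(-1, vocab),
        reduction="batchmean",
    ) * temperature**2
    student_loss = alpha * soft_loss + (1 - alpha) * hard_loss

    # Backpropagating the sum updates both models in the same optimizer step.
    return teacher_loss + student_loss
```

One common way to handle the noisy early-stage teacher mentioned above is to keep alpha small at the start of training and increase it as the teacher stabilizes.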


Comparing the Three Distillation Methods
Soft-label distillation transfers the richest form of knowledge because the student learns from the teacher's full probability distribution rather than only the final answer. This helps smaller models capture reasoning patterns, uncertainty, and relationships between tokens, often leading to stronger overall performance. However, it is computationally expensive, requires access to the teacher's logits or weights, and is difficult to scale because storing probability distributions over huge vocabularies consumes enormous memory.
Hard-label distillation is simpler and more practical. The student learns only from the teacher's final generated outputs, making it much cheaper and easier to implement. It works especially well with proprietary black-box models like the GPT-4 APIs, where internal probabilities are unavailable. While this approach loses some of the deeper "dark knowledge" present in soft labels, it remains highly effective for instruction tuning, synthetic data generation, and task-specific fine-tuning.
Co-distillation takes a collaborative approach in which teacher and student models learn together during training. The teacher improves while simultaneously guiding the student, allowing both models to benefit from shared learning signals. This can narrow the performance gap seen in traditional one-way distillation, but it also makes training more complex, since the teacher's predictions are initially unstable. In practice, soft-label distillation is preferred for maximum knowledge transfer, hard-label distillation for scalability and practicality, and co-distillation for large-scale joint training setups.










