Understanding LLM Distillation Techniques

Modern large language models are no longer trained only on raw Internet text. Increasingly, companies use powerful "teacher" models to help train smaller or more efficient "student" models. This process, widely known as LLM distillation or model-to-model training, has become a primary method for building high-performance models at lower computational cost. Meta used its large Llama 4 Behemoth model to help train Llama 4 Scout and Maverick, while Google used Gemini models during the development of Gemma 2 and Gemma 3. Similarly, DeepSeek distilled the reasoning capabilities of DeepSeek-R1 into smaller models based on Qwen and Llama.
The main idea is simple: instead of learning only from human-written text, the student model can also learn from the outputs, probabilities, reasoning traces, or behavior of another LLM. This allows small models to acquire skills such as reasoning, instruction following, and structured generation from much larger systems. Distillation can take place during pre-training, when teacher and student are trained together, or after training, when a fully trained teacher transfers its knowledge to a separate student model.
In this article, we will examine the three main LLM distillation methods one by one: soft-label distillation, where the student learns from the teacher's probability distribution; hard-label distillation, where the student imitates the outputs produced by the teacher; and co-distillation, where multiple models learn cooperatively by sharing predictions and behaviors during training.

Soft-Label Distillation
Soft-label distillation is a training method in which a smaller student LLM learns by mimicking the output probability distribution of a larger teacher LLM. Instead of training only on the correct next token, the student is trained to match the teacher's softmax probabilities at every position. For example, if the teacher predicts the next token with probabilities "cat" = 70%, "dog" = 20%, and "animal" = 10%, the student learns not only the final answer but also the relationships and uncertainty among the candidate tokens. This richer signal is often called the teacher's "dark knowledge" because it carries hidden information about the teacher's reasoning patterns and semantic understanding.
A major advantage of soft-label distillation is that it allows smaller models to inherit capabilities from larger models while remaining faster and cheaper to deploy. Because the student learns from the teacher's full probability distribution, training is more stable and more informative than learning from a single hard target token. However, the approach comes with practical challenges. Generating soft labels requires access to the teacher model's logits or weights, which is often impossible with closed-source models. Furthermore, storing a probability distribution over a 100k+-token vocabulary at every position becomes memory-intensive at LLM scale, making soft-label distillation very expensive for datasets of billions of tokens.
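To make this concrete, below is a minimal PyTorch sketch of the soft-label objective (not the exact recipe used by any of the labs mentioned above). It assumes you already have teacher and student logits for the same batch of token positions; the temperature value is an illustrative hyperparameter.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student token distributions.

    student_logits, teacher_logits: tensors of shape (batch, seq_len, vocab_size).
    Softening both distributions with a temperature > 1 exposes more of the
    teacher's "dark knowledge" about near-miss tokens.
    """
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # KL(teacher || student), averaged over the batch.
    # The T^2 factor keeps gradient magnitudes comparable to the hard-label loss.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)
```

In practice this term is usually mixed with the ordinary next-token cross-entropy loss rather than used on its own.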


Hard-Label Distillation
Hard-label distillation is a simpler method in which the student LLM learns only from the teacher model's final generated tokens rather than its full probability distribution. In this setup, a pre-trained teacher model generates the most likely next token or complete response, and the student model is trained with supervised learning to reproduce that output. The teacher essentially acts as a high-quality annotator that creates synthetic data for the student's training. DeepSeek used this method to distill the reasoning capabilities of DeepSeek-R1 into the smaller Qwen and Llama 3.1 models.
Unlike soft-label distillation, the student does not see the teacher's internal confidence scores or token relationships; it only sees the final answer. This makes hard-label distillation cheaper and easier to implement, since there is no need to store a large probability distribution for every token. It is also especially useful when working with "black-box" models such as the GPT-4 APIs, where developers can access only the generated text and not the underlying logits. Although hard labels contain less information than soft labels, they remain very effective for instruction tuning, reasoning datasets, synthetic data generation, and domain-specific fine-tuning.
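As a rough illustration, the sketch below uses the Hugging Face transformers library: the teacher writes a synthetic completion, and the student is fine-tuned on that text with ordinary cross-entropy. The model names and prompt are placeholders; with a black-box API you would simply substitute the text returned by the API for the locally generated one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model names, used only for illustration.
teacher_name = "large-teacher-llm"
student_name = "small-student-llm"

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
student_tok = AutoTokenizer.from_pretrained(student_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

prompt = "Explain why the sky is blue."

# 1) The teacher acts as a high-quality annotator and writes the target text.
with torch.no_grad():
    teacher_ids = teacher.generate(
        **teacher_tok(prompt, return_tensors="pt"), max_new_tokens=128
    )
teacher_text = teacher_tok.decode(teacher_ids[0], skip_special_tokens=True)

# 2) The student is fine-tuned with standard next-token cross-entropy on the
#    teacher-written text; no teacher logits or probabilities are required.
batch = student_tok(teacher_text, return_tensors="pt")
loss = student(**batch, labels=batch["input_ids"]).loss
loss.backward()  # hook this into an optimizer / training loop
```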


Co-distillation
Co-distillation is a training method in which the teacher and the student are trained together rather than starting from a pre-trained teacher. In this setup, the teacher LLM and the student LLM process the same training data simultaneously and each produce their own softmax probability distributions. The teacher is typically trained on hard ground-truth labels, while the student learns by matching the teacher's soft labels in addition to the ground-truth targets. Meta used a version of this method when training Llama 4 Scout and Maverick alongside the much larger Llama 4 Behemoth.
The main challenge with co-distillation is that the teacher model is not fully trained at the start, so its predictions can initially be noisy or inaccurate. To compensate, the student is usually trained with a combination of a soft-label distillation loss and the standard hard-label cross-entropy loss. This creates a stable learning signal while still allowing knowledge to flow between the models. Unlike traditional one-way distillation, co-distillation lets both models improve together during training, often resulting in better performance, stronger transfer of reasoning ability, and a smaller performance gap between teacher and student.
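A minimal sketch of one such combined update is shown below, assuming both models are Hugging Face-style causal LMs that return `.loss` and `.logits`. The mixing weight `alpha` and the temperature are illustrative hyperparameters, not values reported by Meta or any other lab.

```python
import torch.nn.functional as F

def co_distillation_step(teacher, student, input_ids, labels,
                         alpha=0.5, temperature=2.0):
    """One co-distillation update on a shared batch.

    The teacher trains on ground-truth labels only; the student mixes
    ground-truth cross-entropy with a KL term against the teacher's
    (still-improving, possibly noisy) soft labels.
    """
    # Teacher: standard cross-entropy on the hard ground-truth labels.
    teacher_out = teacher(input_ids=input_ids, labels=labels)
    teacher_out.loss.backward()

    # Student: hard-label cross-entropy on the same batch ...
    student_out = student(input_ids=input_ids, labels=labels)

    # ... plus a soft-label KL term (same form as the soft-label sketch above).
    s_logp = F.log_softmax(student_out.logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_out.logits.detach() / temperature, dim=-1)
    soft_loss = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

    student_loss = alpha * student_out.loss + (1 - alpha) * soft_loss
    student_loss.backward()
    return teacher_out.loss.item(), student_loss.item()
```

Weighting the two terms keeps the student's learning signal stable early in training, when the teacher's soft labels are still noisy.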


Comparing the Three Distillation Techniques
Soft-label distillation transfers the richest form of knowledge because the student learns from the teacher's full probability distribution instead of just the final response. This helps smaller models capture reasoning patterns, uncertainty, and relationships between tokens, often resulting in stronger overall performance. However, it is computationally expensive, requires access to the teacher's logits or weights, and is hard to scale, since storing probability distributions over large vocabularies consumes a lot of memory.
Hard-label distillation is simple and very efficient. The student learns only from the final outputs produced by the teacher, which makes it cheap and easy to apply. It works especially well with black-box models such as the GPT-4 APIs, where internal probabilities are not available. Although this method loses the deeper "dark knowledge" present in soft labels, it remains highly effective for instruction tuning, synthetic data generation, and task-specific fine-tuning.
Co-distillation takes a collaborative approach in which teacher and student models learn together during training. The teacher improves while simultaneously guiding the student, allowing both models to benefit from shared learning signals. This can narrow the performance gap seen in traditional one-way distillation, but it also makes training harder to manage, since the teacher's predictions are unstable early on. In practice, soft-label distillation is chosen when the richest knowledge transfer is needed, hard-label distillation for simplicity and efficiency, and co-distillation for large-scale joint training.






