Xiaohan Xu1 Ming Li2 Chongyang Tao3 Tao Shen4 Reynold Cheng1 Jinyang Li1 Can Xu5 Dacheng Tao6 Tianyi Zhou2
1 The University of Hong Kong 2 University of Maryland 3 Microsoft 4 University of Technology Sydney 5 Peking University 6 The University of Sydney
A collection of papers related to knowledge distillation of large language models (LLMs). If you want to use LLMs to improve the training of your own smaller models, or to use self-generated knowledge for self-improvement, take a look at this collection.
We will update this collection every week. Welcome to star ⭐️ this repo to keep track of the updates.
❗️Legal Consideration: It's crucial to note the legal implications of utilizing LLM outputs, such as those from ChatGPT (Restrictions), Llama (License), etc. We strongly advise users to adhere to the terms of use specified by the model providers, such as the restrictions on developing competitive products.
- 2024-2-20: 📃 We released our survey paper "A Survey on Knowledge Distillation of Large Language Models". You are welcome to read and cite it; we look forward to your feedback and suggestions.
Update Log
- 2024-3-19: Add 14 papers.
Feel free to open an issue/PR or e-mail shawnxxh@gmail.com, minglii@umd.edu, hishentao@gmail.com and chongyangtao@gmail.com if you find any missing taxonomies or papers. We will keep updating this collection and survey.
KD of LLMs: This survey delves into knowledge distillation (KD) techniques in Large Language Models (LLMs), highlighting KD's crucial role in transferring advanced capabilities from proprietary LLMs like GPT-4 to open-source counterparts such as LLaMA and Mistral. We also explore how KD enables the compression and self-improvement of open-source LLMs by using them as teachers.
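To make the compression and self-improvement settings more concrete, where the teacher's token-level distributions are accessible, below is a minimal PyTorch sketch of the classic white-box distillation objective: a KL divergence between temperature-softened teacher and student next-token distributions. The tensor shapes and hyperparameters are illustrative assumptions, not the recipe of any particular paper.

```python
# Minimal white-box distillation objective: the student matches the
# teacher's next-token distribution. Shapes and values are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean gives the mathematically correct KL; T^2 rescales gradients.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Toy example: batch of 2 sequences, 5 positions, vocabulary of 100 tokens.
student_logits = torch.randn(2, 5, 100, requires_grad=True)
teacher_logits = torch.randn(2, 5, 100)  # frozen teacher, no gradient

loss = distillation_loss(student_logits.view(-1, 100),
                         teacher_logits.view(-1, 100))
loss.backward()  # gradients flow only into the student
print(f"KD loss: {loss.item():.4f}")
```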
KD and Data Augmentation: Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts.
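As a sketch of how DA typically plugs into KD, the loop below asks a teacher model to expand a few seed instructions into skill-specific (instruction, response) pairs. `query_teacher`, the seed list, and the prompt template are hypothetical placeholders for whatever teacher API and target skill you have in mind.

```python
# Hypothetical teacher-driven data augmentation loop; `query_teacher`
# stands in for a real proprietary-LLM API call.
import json

def query_teacher(prompt: str) -> str:
    """Placeholder wrapper around a proprietary teacher LLM endpoint."""
    raise NotImplementedError("plug in your teacher API call here")

SEED_INSTRUCTIONS = [
    "Explain the difference between a list and a tuple in Python.",
    "Summarize the plot of 'Hamlet' in three sentences.",
]

AUGMENT_TEMPLATE = (
    "You are generating training data for a smaller model.\n"
    "Write one new instruction exercising the same skill as: '{seed}'\n"
    "Then answer it thoroughly. Respond as JSON with keys "
    "'instruction' and 'response'."
)

def augment(seeds, rounds=2):
    """Expand seed instructions into (instruction, response) pairs."""
    examples = []
    for seed in seeds:
        for _ in range(rounds):
            raw = query_teacher(AUGMENT_TEMPLATE.format(seed=seed))
            try:
                examples.append(json.loads(raw))
            except json.JSONDecodeError:
                continue  # a simple quality filter: drop malformed outputs
    return examples

# The resulting pairs become the student's supervised fine-tuning set.
```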
Taxonomy: Our analysis is meticulously structured around three foundational pillars: algorithm, skill, and verticalization -- providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields.
KD Algorithms: For KD algorithms, we categorize them into two principal steps: "Knowledge Elicitation", which focuses on eliciting knowledge from teacher LLMs, and "Distillation Algorithms", which focus on injecting this knowledge into student models (a minimal sketch of the second step appears after this overview).
Skill Distillation: We delve into the enhancement of specific cognitive abilities, such as context following, alignment, agent, NLP task specialization, and multi-modality.
Verticalization Distillation: We explore the practical implications of KD across diverse fields, including law, medical & healthcare, finance, science, and miscellaneous domains.
Note that both Skill Distillation and Verticalization Distillation employ the Knowledge Elicitation and Distillation Algorithms described under KD Algorithms. As a result, the same paper may appear under more than one branch; this overlap simply offers different perspectives on the work.
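As a minimal illustration of the second step, the sketch below injects teacher-elicited pairs into a student via supervised fine-tuning with the Hugging Face transformers Trainer. The gpt2 student, the prompt format, and all hyperparameters are stand-in assumptions, not recommendations.

```python
# Minimal sketch: distill teacher-elicited pairs into a student via SFT.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# (instruction, response) pairs elicited from a teacher, e.g. by the
# augmentation loop sketched earlier; one toy pair shown here.
pairs = [{"instruction": "Explain what knowledge distillation is.",
          "response": "Knowledge distillation trains a small student model "
                      "to reproduce the behavior of a larger teacher."}]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def encode(pair):
    text = (f"### Instruction:\n{pair['instruction']}\n"
            f"### Response:\n{pair['response']}")
    return tokenizer(text, truncation=True, max_length=512)

train_dataset = [encode(p) for p in pairs]  # a plain list works as a dataset

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student-sft", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train_dataset,
    # mlm=False -> causal LM collation: labels mirror the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```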
In the era of LLMs, KD of LLMs plays the following crucial roles:
Role | Description | Trend |
---|---|---|
① Advancing Smaller Models | Transferring advanced capabilities from proprietary LLMs to smaller models, such as open-source LLMs or other smaller models. | Most common |
② Compression | Compressing open-source LLMs to make them more efficient and practical. | More popular with the prosperity of open-source LLMs |
③ Self-Improvement | Refining open-source LLMs' performance by leveraging their own knowledge, i.e., self-knowledge (a minimal loop is sketched below). | New trend to make open-source LLMs more competitive |
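For role ③, a minimal self-improvement loop might look like the sketch below: the model samples candidate responses, a judge (the model itself or a reward model) scores them, and only high-scoring pairs feed the next round of fine-tuning. `generate`, `judge_score`, and `fine_tune` are hypothetical stand-ins rather than any specific paper's method; works such as Self-Rewarding Language Models (listed below) instantiate variants of this idea.

```python
# Hypothetical self-improvement loop: the model doubles as its own teacher.
def generate(model, prompt: str, n: int) -> list[str]:
    """Placeholder: sample n candidate responses from the model."""
    raise NotImplementedError

def judge_score(model, prompt: str, response: str) -> float:
    """Placeholder: the model (or a reward model) scores a response."""
    raise NotImplementedError

def fine_tune(model, dataset):
    """Placeholder: one round of supervised fine-tuning."""
    raise NotImplementedError

def self_improve(model, prompts, iterations=3, n_candidates=4, threshold=0.8):
    """Iteratively fine-tune a model on its own best-scoring outputs."""
    for _ in range(iterations):
        dataset = []
        for prompt in prompts:
            candidates = generate(model, prompt, n_candidates)
            scored = [(judge_score(model, prompt, c), c) for c in candidates]
            score, best = max(scored, key=lambda t: t[0])
            if score >= threshold:  # keep only confident self-knowledge
                dataset.append({"instruction": prompt, "response": best})
        model = fine_tune(model, dataset)
    return model
```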
Due to the large number of works applying supervised fine-tuning, we only list the most representative ones here.
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering | arXiv | 2024-03 | ||
KnowTuning: Knowledge-aware Fine-tuning for Large Language Models | arXiv | 2024-02 | Github | |
Self-Rewarding Language Models | arXiv | 2024-01 | Github | |
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models | arXiv | 2024-01 | Github | Data |
Zephyr: Direct Distillation of LM Alignment | arXiv | 2023-10 | Github | Data |
CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment | arXiv | 2023-10 |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Zephyr: Direct Distillation of LM Alignment | arXiv | 2023-10 | Github | Data |
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ICLR | 2023-09 | Github | Data |
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations | arXiv | 2023-05 | Github | Data |
Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data | EMNLP | 2023-04 | Github | Data |
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality | - | 2023-03 | Github | Data |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | NeurIPS | 2023-10 | Github | Data |
SAIL: Search-Augmented Instruction Learning | arXiv | 2023-05 | Github | Data |
Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks | NeurIPS | 2023-05 | Github | Data |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Aligning Large and Small Language Models via Chain-of-Thought Reasoning | EACL | 2024-03 | Github | |
Divide-or-Conquer? Which Part Should You Distill Your LLM? | arXiv | 2024-02 | ||
Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning | arXiv | 2024-02 | Github | Data |
Can LLMs Speak For Diverse People? Tuning LLMs via Debate to Generate Controllable Controversial Statements | arXiv | 2024-02 | Github | Data |
Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering | arXiv | 2023-11 | Github | |
Orca 2: Teaching Small Language Models How to Reason | arXiv | 2023-11 | ||
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning | NeurIPS Workshop | 2023-10 | Github | Data |
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 | arXiv | 2023-06 | ||
SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation | arXiv | 2023-05 |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
UltraFeedback: Boosting Language Models with High-quality Feedback | arXiv | 2023-10 | Github | Data |
Zephyr: Direct Distillation of LM Alignment | arXiv | 2023-10 | Github | Data |
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | arXiv | 2023-09 | ||
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ICLR | 2023-09 | Github | Data |
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment | arXiv | 2023-07 | Github | |
Aligning Large Language Models through Synthetic Feedback | EMNLP | 2023-05 | Github | Data |
Reward Design with Language Models | ICLR | 2023-03 | Github | |
Training Language Models with Language Feedback at Scale | arXiv | 2023-03 | ||
Constitutional AI: Harmlessness from AI Feedback | arXiv | 2022-12 |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
UltraFeedback: Boosting Language Models with High-quality Feedback | arXiv | 2023-10 | Github | Data |
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment | arXiv | 2023-07 | Github | |
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision | NeurIPS | 2023-05 | Github | Data |
Training Socially Aligned Language Models on Simulated Social Interactions | arXiv | 2023-05 | ||
Constitutional AI: Harmlessness from AI Feedback | arXiv | 2022-12 |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Tailoring Self-Rationalizers with Multi-Reward Distillation | arXiv | 2023-11 | Github | Data |
RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation | arXiv | 2023-10 | Github | |
Neural Machine Translation Data Generation and Augmentation using ChatGPT | arXiv | 2023-07 | ||
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes | ICLR | 2023-06 | ||
Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations? | arXiv | 2023-06 | Github | Data |
InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT | EMNLP | 2023-05 | ||
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing | arXiv | 2023-05 | Github | |
Data Augmentation for Radiology Report Simplification | Findings of EACL | 2023-04 | Github | |
Want To Reduce Labeling Cost? GPT-3 Can Help | Findings of EMNLP | 2021-08 |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Can Small Language Models be Good Reasoners for Sequential Recommendation? | arXiv | 2024-03 | ||
Large Language Model Augmented Narrative Driven Recommendations | arXiv | 2023-06 | ||
Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach | arXiv | 2023-05 | ||
ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models | WSDM | 2023-05 | Github | Data |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models | ICLR | 2023-10 | Github | Data |
TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks | arXiv | 2023-10 | Github | Data |
Generative Judge for Evaluating Alignment | ICLR | 2023-10 | Github | Data |
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization | arXiv | 2023-06 | Github | Data |
INSTRUCTSCORE: Explainable Text Generation Evaluation with Fine-grained Feedback | EMNLP | 2023-05 | Github | Data |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Magicoder: Source Code Is All You Need | arXiv | 2023-12 | Github | Data, Data |
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation | arXiv | 2023-12 | ||
Instruction Fusion: Advancing Prompt Evolution through Hybridization | arXiv | 2023-12 | ||
MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning | arXiv | 2023-11 | Github | Data, Data |
LLM-Assisted Code Cleaning For Training Accurate Code Generators | arXiv | 2023-11 | ||
Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation | EMNLP | 2023-10 | Github | |
Code Llama: Open Foundation Models for Code | arXiv | 2023-08 | Github | |
Distilled GPT for Source Code Summarization | arXiv | 2023-08 | Github | Data |
Textbooks Are All You Need | arXiv | 2023-06 | ||
Code Alpaca: An Instruction-following LLaMA model for code generation | - | 2023-03 | Github | Data |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Fuzi | - | 2023-08 | Github | |
ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases | arXiv | 2023-06 | Github | |
Lawyer LLaMA Technical Report | arXiv | 2023-05 | Github | Data |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters | CIKM | 2023-05 |
Title | Venue | Date | Code | Data |
---|---|---|---|---|
OWL: A Large Language Model for IT Operations | arXiv | 2023-09 | Github | Data |
EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education | arXiv | 2023-08 | Github | Data |
Note: Our survey mainly focuses on generative LLMs, so encoder-based KD is not included in it. However, we are also interested in this topic and will continue to update this list with the latest works in the area.
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling | Findings of ACL | 2023-08 | ||
Better Together: Jointly Using Masked Latent Semantic Modeling and Masked Language Modeling for Sample Efficient Pre-training | CoNLL | 2023-08 |
If you find this repository helpful, please consider citing the following paper:
@misc{xu2024survey,
title={A Survey on Knowledge Distillation of Large Language Models},
author={Xiaohan Xu and Ming Li and Chongyang Tao and Tao Shen and Reynold Cheng and Jinyang Li and Can Xu and Dacheng Tao and Tianyi Zhou},
year={2024},
eprint={2402.13116},
archivePrefix={arXiv},
primaryClass={cs.CL}
}