Welcome to the LLM Compression Papers repository! This repository is dedicated to collecting and discussing academic and industry papers on compression techniques for large language models (LLMs). As the capabilities of LLMs continue to expand, so do their size and complexity. Consequently, efficient compression methods have become crucial for making these models more accessible and practical for real-world applications.

## Survey

Title | Introduction | Links | Conference | Year | Code |
---|---|---|---|---|---|
Efficient Large Language Models: A Survey | | paper | arxiv | 2023 | Code |
A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations | | paper | arxiv | 2023 | Code |
A Survey on Model Compression for Large Language Models | | paper | arxiv | 2023 | |
The Efficiency Spectrum of Large Language Models: An Algorithmic Survey | | paper | arxiv | 2023 | Code |
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | | paper | arxiv | 2023 | |
Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models | | paper | arxiv | 2023 | Code |
A Survey on Transformer Compression | | paper | arxiv | 2024 | |

## Pruning

Title | Introduction | Links | Conference | Year | Code |
---|---|---|---|---|---|
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | SparseGPT | paper | ICML | 2023 | Code |
A Simple and Effective Pruning Approach for Large Language Models | Wanda | paper | arxiv | 2023 | Code |
LLM-Pruner: On the Structural Pruning of Large Language Models | LLM-Pruner | paper | NeurIPS | 2023 | Code |
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter | | paper | NeurIPS | 2023 | Code |
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | Flash-LLM | paper | VLDB | 2024 | Code |
NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models | NASH | paper | EMNLP | 2023 | Code |
Pruning Large Language Models via Accuracy Predictor | | paper | arxiv | 2023 | |
Compressing LLMs: The Truth is Rarely Pure and Never Simple | Compressing LLMs | paper | arxiv | 2023 | |
Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity | | paper | arxiv | 2023 | Code |
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | | paper | arxiv | 2023 | Code |
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | Compresso | paper | arxiv | 2023 | Code |
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | Sheared LLaMA | paper | arxiv | 2023 | Code |
Sparse Finetuning for Inference Acceleration of Large Language Models | | paper | arxiv | 2023 | Code |
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | | paper | arxiv | 2023 | |
The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning | | paper | arxiv | 2023 | |
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models | | paper | arxiv | 2023 | |
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery | LoRAShear | paper | arxiv | 2023 | Code |
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization | | paper | arxiv | 2023 | Code |
Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models | | paper | arxiv | 2023 | Code |
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | | paper | arxiv | 2023 | Code |
E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity | E-Sparse | paper | arxiv | 2023 | |
PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs | PERP | paper | arxiv | 2023 | Code |
Fast and Optimal Weight Update for Pruned Large Language Models | | paper | arxiv | 2024 | Code |
SliceGPT: Compress Large Language Models by Deleting Rows and Columns | | paper | arxiv | 2024 | Code |
Shortened LLaMA: A Simple Depth Pruning for Large Language Models | | paper | arxiv | 2024 | |
SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks | | paper | arxiv | 2024 | Code |
HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference | | paper | arxiv | 2024 | |
LaCo: Large Language Model Pruning via Layer Collapse | | paper | arxiv | 2024 | |
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models | | paper | arxiv | 2024 | Code |
EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs | | paper | arxiv | 2024 | Code |
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | | paper | arxiv | 2024 | Code |
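
For orientation, the sketch below shows the baseline that most papers in this table refine: one-shot unstructured magnitude pruning, which simply zeros the smallest-magnitude weights of a matrix. It is an illustrative sketch only; the `magnitude_prune` helper and the 50% sparsity target are ours, and the listed methods (e.g., SparseGPT, Wanda) replace the plain magnitude criterion with calibrated saliency scores, weight updates, or structured patterns.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of a weight matrix.

    Plain magnitude pruning, shown only as a reference point; it is not
    the method of any specific paper listed above.
    """
    k = int(weight.numel() * sparsity)        # number of weights to drop
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold           # keep weights above the k-th smallest magnitude
    return weight * mask

# Toy usage: prune a random projection matrix to roughly 50% sparsity.
w = torch.randn(1024, 1024)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(f"achieved sparsity: {(w_sparse == 0).float().mean().item():.2%}")
```

Note that zeroed weights only translate into real speedups when the sparsity pattern is supported by the kernel or hardware (e.g., N:M sparsity, or structured removal of rows, heads, or layers), which is why several entries above focus on structured or N:M sparsity.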

## Quantization

Title | Introduction | Links | Conference | Year | Code |
---|---|---|---|---|---|
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | | paper | NeurIPS | 2022 | |
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | GPTQ | paper | ICLR | 2023 | Code |
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | SmoothQuant | paper | ICML | 2023 | Code |
GPT-Zip: Deep Compression of Finetuned Large Language Models | GPT-Zip | paper | ICML | 2023 | |
SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM | | paper | ICML | 2023 | Code |
QLoRA: Efficient Finetuning of Quantized LLMs | QLoRA | paper | NeurIPS | 2023 | Code |
QuIP: 2-Bit Quantization of Large Language Models With Guarantees | QuIP | paper | NeurIPS | 2023 | Code |
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | | paper | NeurIPS | 2023 | |
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | | paper | NeurIPS | 2023 | Code |
LLM-FP4: 4-Bit Floating-Point Quantized Transformers | LLM-FP4 | paper | EMNLP | 2023 | Code |
Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization | | paper | EMNLP | 2023 | |
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge | Agile-Quant | paper | AAAI | 2024 | |
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks | | paper | arxiv | 2023 | Code |
Watermarking LLMs with Weight Quantization | | paper | arxiv | 2023 | Code |
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | AWQ | paper | arxiv | 2023 | Code |
RPTQ: Reorder-based Post-training Quantization for Large Language Models | RPTQ | paper | arxiv | 2023 | Code |
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | ZeroQuant-V2 | paper | arxiv | 2023 | |
SqueezeLLM: Dense-and-Sparse Quantization | SqueezeLLM | paper | arxiv | 2023 | Code |
Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | | paper | arxiv | 2023 | |
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models | | paper | arxiv | 2023 | |
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | LLM-QAT | paper | arxiv | 2023 | |
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | SpQR | paper | arxiv | 2023 | Code |
OWQ: Lessons learned from activation outliers for weight quantization in large language models | OWQ | paper | arxiv | 2023 | Code |
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study | | paper | arxiv | 2023 | Code |
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | ZeroQuant-FP | paper | arxiv | 2023 | |
FPTQ: Fine-grained Post-Training Quantization for Large Language Models | FPTQ | paper | arxiv | 2023 | |
QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm | QuantEase | paper | arxiv | 2023 | |
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models | | paper | arxiv | 2023 | |
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs | | paper | arxiv | 2023 | Code |
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | QA-LoRA | paper | arxiv | 2023 | Code |
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers | ModuLoRA | paper | arxiv | 2023 | |
Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM | | paper | arxiv | 2023 | |
QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources | QFT | paper | arxiv | 2023 | |
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models | QLLM | paper | arxiv | 2023 | |
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | LoftQ | paper | ICLR | 2024 | Code |
TEQ: Trainable Equivalent Transformation for Quantization of LLMs | TEQ | paper | arxiv | 2023 | Code |
BitNet: Scaling 1-bit Transformers for Large Language Models | | paper | arxiv | 2023 | |
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | Atom | paper | arxiv | 2023 | |
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models | | paper | arxiv | 2023 | |
AFPQ: Asymmetric Floating Point Quantization for LLMs | | paper | arxiv | 2023 | Code |
A Speed Odyssey for Deployable Quantization of LLMs | | paper | arxiv | 2023 | |
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning | LQ-LoRA | paper | arxiv | 2023 | Code |
Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization | | paper | arxiv | 2023 | |
Extreme Compression of Large Language Models via Additive Quantization | AQLM | paper | arxiv | 2024 | Code |
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models | QMoE | paper | arxiv | 2023 | Code |
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | | paper | arxiv | 2023 | Code |
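
As a baseline for the methods in this table, the sketch below implements round-to-nearest, symmetric, per-output-channel INT8 weight quantization. It is an illustrative sketch only (the function names are ours); the papers above improve on naive rounding with calibration data, error compensation (e.g., GPTQ), activation-aware scaling (e.g., AWQ, SmoothQuant), outlier handling, or lower bit-widths.

```python
import torch

def quantize_weight_int8(weight: torch.Tensor):
    """Round-to-nearest symmetric INT8 quantization with one scale per output channel."""
    # Choose each row's scale so that its largest-magnitude weight maps to 127.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)                       # guard against all-zero rows
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map INT8 codes back to floating point for (or during) the matmul."""
    return q.float() * scale

# Toy usage: quantize a random weight matrix and measure the rounding error.
w = torch.randn(4096, 4096)
q, scale = quantize_weight_int8(w)
error = (dequantize_int8(q, scale) - w).abs().mean().item()
print(f"mean absolute quantization error: {error:.5f}")
```

Weight-only schemes like this keep activations in higher precision and dequantize on the fly; weight-and-activation schemes (e.g., the W8A8 and W4A8 entries above) also quantize activations, which is where outlier handling becomes critical.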

## Knowledge Distillation

Title | Introduction | Links | Conference | Year | Code |
---|---|---|---|---|---|
Specializing Smaller Language Models towards Multi-Step Reasoning | | paper | ICML | 2023 | Code |
Distilling Script Knowledge from Large Language Models for Constrained Language Planning | | paper | ACL | 2023 | Code |
SCOTT: Self-Consistent Chain-of-Thought Distillation | SCOTT | paper | ACL | 2023 | Code |
DISCO: Distilling Counterfactuals with Large Language Models | DISCO | paper | ACL | 2023 | Code |
I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation | I2D2 | paper | ACL | 2023 | |
Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step | | paper | ACL | 2023 | |
GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model | GKD | paper | ACL | 2023 | Code |
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | | paper | ACL | 2023 | Code |
Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind | | paper | NeurIPS | 2023 | Code |
Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents | | paper | EMNLP | 2023 | Code |
PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation | PromptMix | paper | EMNLP | 2023 | Code |
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression | | paper | EMNLP | 2023 | |
Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models | | paper | EMNLP | 2023 | Code |
Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data | | paper | AAAI | 2024 | Code |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | LaMini-LM | paper | arxiv | 2023 | Code |
Knowledge Distillation of Large Language Models | | paper | arxiv | 2023 | Code |
Teaching Small Language Models to Reason | | paper | arxiv | 2023 | |
Large Language Model Distillation Doesn't Need a Teacher | | paper | arxiv | 2023 | Code |
The False Promise of Imitating Proprietary LLMs | | paper | arxiv | 2023 | |
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing | | paper | arxiv | 2023 | Code |
PaD: Program-aided Distillation Specializes Large Models in Reasoning | PaD | paper | arxiv | 2023 | |
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment | RLCD | paper | arxiv | 2023 | |
Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA | Sci-CoT | paper | arxiv | 2023 | |
UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition | | paper | arxiv | 2023 | Code |
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty | BabyLlama | paper | arxiv | 2023 | Code |
DistillSpec: Improving Speculative Decoding via Knowledge Distillation | | paper | arxiv | 2023 | Code |
Zephyr: Direct Distillation of LM Alignment | | paper | arxiv | 2023 | Code |
Towards the Law of Capacity Gap in Distilling Language Models | | paper | arxiv | 2023 | Code |
Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models | | paper | arxiv | 2023 | |
Mixed Distillation Helps Smaller Language Model Better Reasoning | | paper | arxiv | 2023 | |
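
Most entries in this table distill LLMs by training a smaller student on teacher-generated rationales, chains of thought, or instruction data. For contrast, the sketch below shows the classic logit-distillation loss (Hinton-style soft targets plus standard cross-entropy); it is an illustrative baseline only, not the objective of any specific paper above, and the function name and hyperparameters are ours.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft KL term against the teacher with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                    # standard temperature scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits over a 32k-token vocabulary.
student_logits = torch.randn(8, 32000, requires_grad=True)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In the LLM setting this loss is applied per token over the vocabulary, and many of the papers above drop the logit term entirely in favor of supervised fine-tuning on teacher outputs.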

## Other

Title | Introduction | Links | Conference | Year | Code |
---|---|---|---|---|---|
TinySAM: Pushing the Envelope for Efficient Segment Anything Model | | paper | arxiv | 2023 | Code |