# LLM Compression Papers

## Introduction

Welcome to the LLM Compression Papers repository! This repository collects and discusses academic and industry papers on compression techniques for large language models (LLMs). As the capabilities of LLMs continue to expand, so do their size and complexity; efficient compression methods have therefore become crucial for making these models accessible and practical for real-world applications.

## Survey

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| Efficient Large Language Models: A Survey | | paper | arxiv | 2023 | Code |
| A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations | | paper | arxiv | 2023 | Code |
| A Survey on Model Compression for Large Language Models | | paper | arxiv | 2023 | |
| The Efficiency Spectrum of Large Language Models: An Algorithmic Survey | | paper | arxiv | 2023 | Code |
| Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | | paper | arxiv | 2023 | |
| Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models | | paper | arxiv | 2023 | Code |
| A Survey on Transformer Compression | | paper | arxiv | 2024 | |

## Network Pruning

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | SparseGPT | paper | ICML | 2023 | Code |
| A Simple and Effective Pruning Approach for Large Language Models | Wanda | paper | arxiv | 2023 | Code |
| LLM-Pruner: On the Structural Pruning of Large Language Models | LLM-Pruner | paper | NeurIPS | 2023 | Code |
| The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter | | paper | NeurIPS | 2023 | Code |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | Flash-LLM | paper | VLDB | 2024 | Code |
| NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models | NASH | paper | EMNLP | 2023 | Code |
| Pruning Large Language Models via Accuracy Predictor | | paper | arxiv | 2023 | |
| Compressing LLMs: The Truth is Rarely Pure and Never Simple | Compressing LLMs | paper | arxiv | 2023 | |
| Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity | | paper | arxiv | 2023 | Code |
| Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | | paper | arxiv | 2023 | Code |
| Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | Compresso | paper | arxiv | 2023 | Code |
| Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | Sheared LLaMA | paper | arxiv | 2023 | Code |
| Sparse Finetuning for Inference Acceleration of Large Language Models | | paper | arxiv | 2023 | Code |
| ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | | paper | arxiv | 2023 | |
| The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning | | paper | arxiv | 2023 | |
| One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models | | paper | arxiv | 2023 | |
| LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery | LoRAShear | paper | arxiv | 2023 | Code |
| Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization | | paper | arxiv | 2023 | Code |
| Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models | | paper | arxiv | 2023 | Code |
| Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | | paper | arxiv | 2023 | Code |
| E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity | E-Sparse | paper | arxiv | 2023 | |
| PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs | PERP | paper | arxiv | 2023 | Code |
| Fast and Optimal Weight Update for Pruned Large Language Models | | paper | arxiv | 2024 | Code |
| SliceGPT: Compress Large Language Models by Deleting Rows and Columns | | paper | arxiv | 2024 | Code |
| Shortened LLaMA: A Simple Depth Pruning for Large Language Models | | paper | arxiv | 2024 | |
| SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks | | paper | arxiv | 2024 | Code |
| HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference | | paper | arxiv | 2024 | |
| LaCo: Large Language Model Pruning via Layer Collapse | | paper | arxiv | 2024 | |
| ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models | | paper | arxiv | 2024 | Code |
| EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs | | paper | arxiv | 2024 | Code |
| BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | | paper | arxiv | 2024 | Code |
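
For orientation, the one-shot criteria in this table are simple to state in code. Below is a minimal, illustrative sketch of a Wanda-style score (weight magnitude times input-activation norm, pruned per output row), as described in "A Simple and Effective Pruning Approach for Large Language Models"; the function name and the calibration stand-in are ours, not the authors' reference implementation.

```python
import torch

def wanda_prune_(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float) -> None:
    """Zero out the lowest-scoring weights in each output row, in place.

    weight:   (out_features, in_features) linear-layer weight
    act_norm: (in_features,) L2 norm of each input feature over a calibration set
    sparsity: fraction of weights to remove, e.g. 0.5
    """
    score = weight.abs() * act_norm               # importance of each weight
    k = int(weight.shape[1] * sparsity)           # weights to drop per row
    _, idx = torch.topk(score, k, dim=1, largest=False)  # k smallest per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, False)                  # False where we prune
    weight.mul_(mask.to(weight.dtype))            # apply the sparsity mask

# usage: prune a linear layer to 50% unstructured sparsity
layer = torch.nn.Linear(4096, 4096)
act_norm = torch.rand(4096)                       # stand-in for calibration stats
with torch.no_grad():
    wanda_prune_(layer.weight, act_norm, sparsity=0.5)
```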

## Quantization

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | | paper | NeurIPS | 2022 | |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | GPTQ | paper | ICLR | 2023 | Code |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | SmoothQuant | paper | ICML | 2023 | Code |
| GPT-Zip: Deep Compression of Finetuned Large Language Models | GPT-Zip | paper | ICML | 2023 | |
| SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM | | paper | arxiv | 2023 | Code |
| QLoRA: Efficient Finetuning of Quantized LLMs | QLoRA | paper | NeurIPS | 2023 | Code |
| QuIP: 2-Bit Quantization of Large Language Models With Guarantees | QuIP | paper | NeurIPS | 2023 | Code |
| Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | | paper | NeurIPS | 2023 | |
| Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | | paper | NeurIPS | 2023 | Code |
| LLM-FP4: 4-Bit Floating-Point Quantized Transformers | LLM-FP4 | paper | EMNLP | 2023 | Code |
| Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization | | paper | EMNLP | 2023 | |
| Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge | Agile-Quant | paper | AAAI | 2024 | |
| ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks | | paper | arxiv | 2023 | Code |
| Watermarking LLMs with Weight Quantization | | paper | arxiv | 2023 | Code |
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | AWQ | paper | arxiv | 2023 | Code |
| RPTQ: Reorder-based Post-training Quantization for Large Language Models | RPTQ | paper | arxiv | 2023 | Code |
| ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | ZeroQuant-V2 | paper | arxiv | 2023 | |
| SqueezeLLM: Dense-and-Sparse Quantization | SqueezeLLM | paper | arxiv | 2023 | Code |
| Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | | paper | arxiv | 2023 | |
| Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models | | paper | arxiv | 2023 | |
| LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | LLM-QAT | paper | arxiv | 2023 | |
| SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | SpQR | paper | arxiv | 2023 | Code |
| OWQ: Lessons learned from activation outliers for weight quantization in large language models | OWQ | paper | arxiv | 2023 | Code |
| Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study | | paper | arxiv | 2023 | Code |
| ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | ZeroQuant-FP | paper | arxiv | 2023 | |
| FPTQ: Fine-grained Post-Training Quantization for Large Language Models | FPTQ | paper | arxiv | 2023 | |
| QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm | QuantEase | paper | arxiv | 2023 | |
| Norm Tweaking: High-performance Low-bit Quantization of Large Language Models | | paper | arxiv | 2023 | |
| Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs | | paper | arxiv | 2023 | Code |
| QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | QA-LoRA | paper | arxiv | 2023 | Code |
| ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers | ModuLoRA | paper | arxiv | 2023 | |
| Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM | | paper | arxiv | 2023 | |
| QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources | QFT | paper | arxiv | 2023 | |
| QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models | QLLM | paper | arxiv | 2023 | |
| LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | LoftQ | paper | ICLR | 2024 | Code |
| TEQ: Trainable Equivalent Transformation for Quantization of LLMs | TEQ | paper | arxiv | 2023 | Code |
| BitNet: Scaling 1-bit Transformers for Large Language Models | | paper | arxiv | 2023 | |
| Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | Atom | paper | arxiv | 2023 | |
| AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models | | paper | arxiv | 2023 | |
| AFPQ: Asymmetric Floating Point Quantization for LLMs | | paper | arxiv | 2023 | Code |
| A Speed Odyssey for Deployable Quantization of LLMs | | paper | arxiv | 2023 | |
| LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning | LQ-LoRA | paper | arxiv | 2023 | Code |
| Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization | | paper | arxiv | 2023 | |
| Extreme Compression of Large Language Models via Additive Quantization | AQLM | paper | arxiv | 2024 | Code |
| QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models | QMoE | paper | arxiv | 2023 | Code |
| OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | | paper | arxiv | 2023 | Code |
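
Most of the post-training methods above start from, and improve on, plain round-to-nearest (RTN) group-wise weight quantization; GPTQ, AWQ, and the like add calibration on top. The following is a minimal sketch of that baseline under assumed shapes and a group size of 128; names are illustrative, and a real kernel would store the integer codes plus scales rather than dequantizing.

```python
import torch

def quantize_rtn(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Fake-quantize a (out, in) weight matrix per group along the input dim.

    Returns dequantized weights so the quantization error is easy to inspect.
    Assumes in_features is divisible by group_size.
    """
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    qmax = 2 ** n_bits - 1
    lo = wg.amin(dim=-1, keepdim=True)             # per-group min
    hi = wg.amax(dim=-1, keepdim=True)             # per-group max
    scale = (hi - lo).clamp(min=1e-8) / qmax       # one scale per group
    zero = (-lo / scale).round()                   # asymmetric zero point
    q = (wg / scale + zero).round().clamp(0, qmax) # integer codes in [0, qmax]
    return ((q - zero) * scale).reshape(out_f, in_f)

w = torch.randn(4096, 4096)
w_q = quantize_rtn(w)
print((w - w_q).abs().mean())                      # mean quantization error
```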

## Knowledge Distillation

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| Specializing Smaller Language Models towards Multi-Step Reasoning | | paper | ICML | 2023 | Code |
| Distilling Script Knowledge from Large Language Models for Constrained Language Planning | | paper | ACL | 2023 | Code |
| SCOTT: Self-Consistent Chain-of-Thought Distillation | SCOTT | paper | ACL | 2023 | Code |
| DISCO: Distilling Counterfactuals with Large Language Models | DISCO | paper | ACL | 2023 | Code |
| I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation | I2D2 | paper | ACL | 2023 | |
| Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step | | paper | ACL | 2023 | |
| GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model | GKD | paper | ACL | 2023 | Code |
| Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | | paper | ACL | 2023 | Code |
| Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind | | paper | NeurIPS | 2023 | Code |
| Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents | | paper | EMNLP | 2023 | Code |
| PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation | PromptMix | paper | EMNLP | 2023 | Code |
| Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression | | paper | EMNLP | 2023 | |
| Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models | | paper | EMNLP | 2023 | Code |
| Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data | | paper | AAAI | 2024 | Code |
| LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | LaMini-LM | paper | arxiv | 2023 | Code |
| Knowledge Distillation of Large Language Models | | paper | arxiv | 2023 | Code |
| Teaching Small Language Models to Reason | | paper | arxiv | 2023 | |
| Large Language Model Distillation Doesn't Need a Teacher | | paper | arxiv | 2023 | Code |
| The False Promise of Imitating Proprietary LLMs | | paper | arxiv | 2023 | |
| Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing | | paper | arxiv | 2023 | Code |
| PaD: Program-aided Distillation Specializes Large Models in Reasoning | PaD | paper | arxiv | 2023 | |
| RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment | RLCD | paper | arxiv | 2023 | |
| Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA | Sci-CoT | paper | arxiv | 2023 | |
| UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition | | paper | arxiv | 2023 | Code |
| Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty | BabyLlama | paper | arxiv | 2023 | Code |
| DistillSpec: Improving Speculative Decoding via Knowledge Distillation | | paper | arxiv | 2023 | Code |
| Zephyr: Direct Distillation of LM Alignment | | paper | arxiv | 2023 | Code |
| Towards the Law of Capacity Gap in Distilling Language Models | | paper | arxiv | 2023 | Code |
| Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models | | paper | arxiv | 2023 | |
| Mixed Distillation Helps Smaller Language Model Better Reasoning | | paper | arxiv | 2023 | |
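
Many of the works above extend the classic soft-label distillation objective of Hinton et al.: a KL divergence between temperature-softened teacher and student distributions, mixed with the usual cross-entropy on labels. A minimal sketch follows; the shapes and hyperparameter values are illustrative, not taken from any listed paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # soften both distributions; KL is computed in log space for stability
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2            # rescale gradients (standard trick)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# usage with a toy batch of 8 tokens over a 32k vocabulary
student_logits = torch.randn(8, 32000, requires_grad=True)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```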

## Fusion

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| TinySAM: Pushing the Envelope for Efficient Segment Anything Model | | paper | arxiv | 2023 | Code |
