# LLM Compression Papers

## Introduction

Welcome to the LLM Compression Papers repository! This repository collects and discusses academic and industry papers on compression techniques for large language models (LLMs). As the capabilities of LLMs continue to expand, so do their size and complexity; efficient compression methods have therefore become crucial for making these models accessible and practical for real-world applications.

## Survey

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| Efficient Large Language Models: A Survey | | paper | arxiv | 2023 | Code |
| A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations | | paper | arxiv | 2023 | Code |
| A Survey on Model Compression for Large Language Models | | paper | arxiv | 2023 | |
| The Efficiency Spectrum of Large Language Models: An Algorithmic Survey | | paper | arxiv | 2023 | Code |
| Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | | paper | arxiv | 2023 | |
| Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models | | paper | arxiv | 2023 | Code |
| A Survey on Transformer Compression | | paper | arxiv | 2024 | |

## Network Pruning

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | SparseGPT | paper | ICML | 2023 | Code |
| A Simple and Effective Pruning Approach for Large Language Models | Wanda | paper | arxiv | 2023 | Code |
| LLM-Pruner: On the Structural Pruning of Large Language Models | LLM-Pruner | paper | NeurIPS | 2023 | Code |
| The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter | | paper | NeurIPS | 2023 | Code |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | Flash-LLM | paper | VLDB | 2024 | Code |
| NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models | NASH | paper | EMNLP | 2023 | Code |
| Pruning Large Language Models via Accuracy Predictor | | paper | arxiv | 2023 | |
| Compressing LLMs: The Truth is Rarely Pure and Never Simple | Compressing LLMs | paper | arxiv | 2023 | |
| Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity | | paper | arxiv | 2023 | Code |
| Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | | paper | arxiv | 2023 | Code |
| Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | Compresso | paper | arxiv | 2023 | Code |
| Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | Sheared LLaMA | paper | arxiv | 2023 | Code |
| Sparse Finetuning for Inference Acceleration of Large Language Models | | paper | arxiv | 2023 | Code |
| ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | | paper | arxiv | 2023 | |
| The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning | | paper | arxiv | 2023 | |
| One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models | | paper | arxiv | 2023 | |
| LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery | LoRAShear | paper | arxiv | 2023 | Code |
| Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization | | paper | arxiv | 2023 | Code |
| Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models | | paper | arxiv | 2023 | Code |
| Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | | paper | arxiv | 2023 | Code |
| E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity | E-Sparse | paper | arxiv | 2023 | |
| PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs | PERP | paper | arxiv | 2023 | Code |
| Fast and Optimal Weight Update for Pruned Large Language Models | | paper | arxiv | 2024 | Code |
| SliceGPT: Compress Large Language Models by Deleting Rows and Columns | | paper | arxiv | 2024 | Code |
| Shortened LLaMA: A Simple Depth Pruning for Large Language Models | | paper | arxiv | 2024 | |
| SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks | | paper | arxiv | 2024 | Code |
| HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference | | paper | arxiv | 2024 | |
| LaCo: Large Language Model Pruning via Layer Collapse | | paper | arxiv | 2024 | |
| ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models | | paper | arxiv | 2024 | Code |
| EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs | | paper | arxiv | 2024 | Code |
| BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | | paper | arxiv | 2024 | Code |
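
For orientation, the one-shot criteria in this table are simple to state in code. Below is a minimal, illustrative sketch of a Wanda-style score (weight magnitude times input-activation norm, pruned per output row), as described in "A Simple and Effective Pruning Approach for Large Language Models"; the function name and the calibration stand-in are ours, not the authors' reference implementation.

```python
import torch

def wanda_prune_(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float) -> None:
    """Zero out the lowest-scoring weights in each output row, in place.

    weight:   (out_features, in_features) linear-layer weight
    act_norm: (in_features,) L2 norm of each input feature over a calibration set
    sparsity: fraction of weights to remove, e.g. 0.5
    """
    score = weight.abs() * act_norm               # importance of each weight
    k = int(weight.shape[1] * sparsity)           # weights to drop per row
    _, idx = torch.topk(score, k, dim=1, largest=False)  # k smallest per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, False)                  # False where we prune
    weight.mul_(mask.to(weight.dtype))            # apply the sparsity mask

# usage: prune a linear layer to 50% unstructured sparsity
layer = torch.nn.Linear(4096, 4096)
act_norm = torch.rand(4096)                       # stand-in for calibration stats
with torch.no_grad():
    wanda_prune_(layer.weight, act_norm, sparsity=0.5)
```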

## Quantization

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | | paper | NeurIPS | 2022 | |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | GPTQ | paper | ICLR | 2023 | Code |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | SmoothQuant | paper | ICML | 2023 | Code |
| GPT-Zip: Deep Compression of Finetuned Large Language Models | GPT-Zip | paper | ICML | 2023 | |
| SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM | | paper | arxiv | 2023 | Code |
| QLoRA: Efficient Finetuning of Quantized LLMs | QLoRA | paper | NeurIPS | 2023 | Code |
| QuIP: 2-Bit Quantization of Large Language Models With Guarantees | QuIP | paper | NeurIPS | 2023 | Code |
| Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | | paper | NeurIPS | 2023 | |
| Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | | paper | NeurIPS | 2023 | Code |
| LLM-FP4: 4-Bit Floating-Point Quantized Transformers | LLM-FP4 | paper | EMNLP | 2023 | Code |
| Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization | | paper | EMNLP | 2023 | |
| Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge | Agile-Quant | paper | AAAI | 2024 | |
| ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks | | paper | arxiv | 2023 | Code |
| Watermarking LLMs with Weight Quantization | | paper | arxiv | 2023 | Code |
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | AWQ | paper | arxiv | 2023 | Code |
| RPTQ: Reorder-based Post-training Quantization for Large Language Models | RPTQ | paper | arxiv | 2023 | Code |
| ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | ZeroQuant-V2 | paper | arxiv | 2023 | |
| SqueezeLLM: Dense-and-Sparse Quantization | SqueezeLLM | paper | arxiv | 2023 | Code |
| Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | | paper | arxiv | 2023 | |
| Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models | | paper | arxiv | 2023 | |
| LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | LLM-QAT | paper | arxiv | 2023 | |
| SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | SpQR | paper | arxiv | 2023 | Code |
| OWQ: Lessons learned from activation outliers for weight quantization in large language models | OWQ | paper | arxiv | 2023 | Code |
| Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study | | paper | arxiv | 2023 | Code |
| ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | ZeroQuant-FP | paper | arxiv | 2023 | |
| FPTQ: Fine-grained Post-Training Quantization for Large Language Models | FPTQ | paper | arxiv | 2023 | |
| QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm | QuantEase | paper | arxiv | 2023 | |
| Norm Tweaking: High-performance Low-bit Quantization of Large Language Models | | paper | arxiv | 2023 | |
| Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs | | paper | arxiv | 2023 | Code |
| QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | QA-LoRA | paper | arxiv | 2023 | Code |
| ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers | ModuLoRA | paper | arxiv | 2023 | |
| Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM | | paper | arxiv | 2023 | |
| QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources | QFT | paper | arxiv | 2023 | |
| QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models | QLLM | paper | arxiv | 2023 | |
| LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | LoftQ | paper | ICLR | 2024 | Code |
| TEQ: Trainable Equivalent Transformation for Quantization of LLMs | TEQ | paper | arxiv | 2023 | Code |
| BitNet: Scaling 1-bit Transformers for Large Language Models | | paper | arxiv | 2023 | |
| Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | Atom | paper | arxiv | 2023 | |
| AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models | | paper | arxiv | 2023 | |
| AFPQ: Asymmetric Floating Point Quantization for LLMs | | paper | arxiv | 2023 | Code |
| A Speed Odyssey for Deployable Quantization of LLMs | | paper | arxiv | 2023 | |
| LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning | LQ-LoRA | paper | arxiv | 2023 | Code |
| Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization | | paper | arxiv | 2023 | |
| Extreme Compression of Large Language Models via Additive Quantization | AQLM | paper | arxiv | 2024 | Code |
| QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models | QMoE | paper | arxiv | 2023 | Code |
| OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | | paper | arxiv | 2023 | Code |
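
Most of the post-training methods above start from, and improve on, plain round-to-nearest (RTN) group-wise weight quantization; GPTQ, AWQ, and the like add calibration on top. The following is a minimal sketch of that baseline under assumed shapes and a group size of 128; names are illustrative, and a real kernel would store the integer codes plus scales rather than dequantizing.

```python
import torch

def quantize_rtn(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Fake-quantize a (out, in) weight matrix per group along the input dim.

    Returns dequantized weights so the quantization error is easy to inspect.
    Assumes in_features is divisible by group_size.
    """
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    qmax = 2 ** n_bits - 1
    lo = wg.amin(dim=-1, keepdim=True)             # per-group min
    hi = wg.amax(dim=-1, keepdim=True)             # per-group max
    scale = (hi - lo).clamp(min=1e-8) / qmax       # one scale per group
    zero = (-lo / scale).round()                   # asymmetric zero point
    q = (wg / scale + zero).round().clamp(0, qmax) # integer codes in [0, qmax]
    return ((q - zero) * scale).reshape(out_f, in_f)

w = torch.randn(4096, 4096)
w_q = quantize_rtn(w)
print((w - w_q).abs().mean())                      # mean quantization error
```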

## Knowledge Distillation

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| Specializing Smaller Language Models towards Multi-Step Reasoning | | paper | ICML | 2023 | Code |
| Distilling Script Knowledge from Large Language Models for Constrained Language Planning | | paper | ACL | 2023 | Code |
| SCOTT: Self-Consistent Chain-of-Thought Distillation | SCOTT | paper | ACL | 2023 | Code |
| DISCO: Distilling Counterfactuals with Large Language Models | DISCO | paper | ACL | 2023 | Code |
| I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation | I2D2 | paper | ACL | 2023 | |
| Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step | | paper | ACL | 2023 | |
| GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model | GKD | paper | ACL | 2023 | Code |
| Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | | paper | ACL | 2023 | Code |
| Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind | | paper | NeurIPS | 2023 | Code |
| Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents | | paper | EMNLP | 2023 | Code |
| PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation | PromptMix | paper | EMNLP | 2023 | Code |
| Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression | | paper | EMNLP | 2023 | |
| Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models | | paper | EMNLP | 2023 | Code |
| Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data | | paper | AAAI | 2024 | Code |
| LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | LaMini-LM | paper | arxiv | 2023 | Code |
| Knowledge Distillation of Large Language Models | | paper | arxiv | 2023 | Code |
| Teaching Small Language Models to Reason | | paper | arxiv | 2023 | |
| Large Language Model Distillation Doesn't Need a Teacher | | paper | arxiv | 2023 | Code |
| The False Promise of Imitating Proprietary LLMs | | paper | arxiv | 2023 | |
| Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing | | paper | arxiv | 2023 | Code |
| PaD: Program-aided Distillation Specializes Large Models in Reasoning | PaD | paper | arxiv | 2023 | |
| RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment | RLCD | paper | arxiv | 2023 | |
| Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA | Sci-CoT | paper | arxiv | 2023 | |
| UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition | | paper | arxiv | 2023 | Code |
| Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty | BabyLlama | paper | arxiv | 2023 | Code |
| DistillSpec: Improving Speculative Decoding via Knowledge Distillation | | paper | arxiv | 2023 | Code |
| Zephyr: Direct Distillation of LM Alignment | | paper | arxiv | 2023 | Code |
| Towards the Law of Capacity Gap in Distilling Language Models | | paper | arxiv | 2023 | Code |
| Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models | | paper | arxiv | 2023 | |
| Mixed Distillation Helps Smaller Language Model Better Reasoning | | paper | arxiv | 2023 | |
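
Many of the works above extend the classic soft-label distillation objective of Hinton et al.: a KL divergence between temperature-softened teacher and student distributions, mixed with the usual cross-entropy on labels. A minimal sketch follows; the shapes and hyperparameter values are illustrative, not taken from any listed paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # soften both distributions; KL is computed in log space for stability
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2            # rescale gradients (standard trick)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# usage with a toy batch of 8 tokens over a 32k vocabulary
student_logits = torch.randn(8, 32000, requires_grad=True)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```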

## Fusion

| Title | Introduction | Links | Conference | Year | Code |
| --- | --- | --- | --- | --- | --- |
| TinySAM: Pushing the Envelope for Efficient Segment Anything Model | | paper | arxiv | 2023 | Code |
