Awesome-LLM-Inference: A curated list of 📙Awesome LLM Inference Papers with Code. ❤️Star🌟👆🏻 this repo to support me if you find it helpful~
@misc{Awesome-LLM-Inference@2023,
title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
url={https://github.com/DefTruth/Awesome-LLM-Inference},
note={Open-source software available at https://github.com/DefTruth/Awesome-LLM-Inference},
author={Yanjun Qiu},
year={2023}
}
@Awesome-LLM-Inference-v0.3.pdf: 500 pages, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ etc.
- LLM Algorithmic/Eval Survey
- LLM Train/Inference Framework
- Weight/Activation Quantize/Compress
- Continuous/In-flight Batching
- IO/FLOPs-Aware/Sparse Attention
- KV Cache Scheduling/Quantize/Dropping
- Early-Exit/Intermediate Layer Decoding
- Parallel Decoding/Sampling
- Structured Pruning/Knowledge Distillation
- CPU/Single GPU/Mobile Inference
- Non Transformer Architecture
- GEMM、Tensor Cores、WMMA
- Position Embed、Others
📖LLM Algorithmic/Eval Survey (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2023.10 | [Evaluating] Evaluating Large Language Models: A Comprehensive Survey(@tju.edu.cn) | [pdf] | [Awesome-LLMs-Evaluation] | ⭐️ |
2023.11 | 🔥[Runtime Performance] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models(@hkust-gz.edu.cn) | [pdf] | | ⭐️⭐️ |
2023.11 | [ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?(@e.ntu.edu.sg) | [pdf] | | ⭐️ |
2023.12 | [Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey(@Microsoft) | [pdf] | | ⭐️ |
2023.12 | [Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly(@Drexel University) | [pdf] | | ⭐️ |
2023.12 | 🔥[LLMCompass] A Hardware Evaluation Framework for Large Language Model Inference(@princeton.edu) | [pdf] | | ⭐️⭐️ |
2023.12 | 🔥[Efficient LLMs] Efficient Large Language Models: A Survey(@Ohio State University etc) | [pdf] | [Efficient-LLMs-Survey] | ⭐️⭐️ |
📖LLM Train/Inference Framework (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2020.05 | 🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA) | [pdf] | [Megatron-LM] | ⭐️⭐️ |
2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc) | [pdf] | [FlexGen] | ⭐️ |
2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification(@Peking University etc) | [pdf] | [FlexFlow] | ⭐️ |
2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models(@Peking University etc) | [pdf] | | ⭐️ |
2023.09 | 🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) | [pdf] | [vllm] | ⭐️⭐️ |
2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks(@Meta AI etc) | [pdf] | [streaming-llm] | ⭐️ |
2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc) | [blog] | [Medusa] | ⭐️ |
2023.10 | 🔥[TensorRT-LLM] NVIDIA TensorRT LLM(@NVIDIA) | [docs] | [TensorRT-LLM] | ⭐️⭐️ |
2023.11 | 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft) | [blog] | [deepspeed-fastgen] | ⭐️⭐️ |
2023.12 | 🔥[PETALS] Distributed Inference and Fine-tuning of Large Language Models Over The Internet(@HSE University etc) | [pdf] | [petals] | ⭐️⭐️ |
📖Continuous/In-flight Batching (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2022.07 | 🔥[Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models(@Seoul National University etc) | [pdf] | | ⭐️⭐️ |
2023.10 | 🔥[In-flight Batching] NVIDIA TensorRT LLM Batch Manager(@NVIDIA) | [docs] | [TensorRT-LLM] | ⭐️⭐️ |
2023.11 | 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft) | [blog] | [deepspeed-fastgen] | ⭐️⭐️ |
2023.11 | [Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting(@Microsoft etc) | [pdf] | | ⭐️ |
2023.12 | [SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances(@cmu.edu etc) | [pdf] | [SpotServe] | ⭐️ |
📖Weight/Activation Quantize/Compress (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2022.06 | 🔥[ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers(@Microsoft) | [pdf] | [DeepSpeed] | ⭐️⭐️ |
2022.08 | [FP8-Quantization] FP8 Quantization: The Power of the Exponent(@Qualcomm AI Research) | [pdf] | | ⭐️ |
2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale(@Facebook AI Research etc) | [pdf] | [bitsandbytes] | ⭐️ |
2022.10 | 🔥[GPTQ] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers(@IST Austria etc) | [pdf] | [gptq] | ⭐️⭐️ |
2022.11 | 🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft) | [pdf] | [FasterTransformer] | ⭐️⭐️ |
2022.11 | 🔥[SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models(@MIT etc) | [pdf] | [smoothquant] | ⭐️⭐️ |
2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation(@Microsoft) | [pdf] | [DeepSpeed] | ⭐️ |
2023.06 | 🔥[AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration(@MIT etc) | [pdf] | [llm-awq] | ⭐️⭐️ |
2023.06 | [SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression(@University of Washington etc) | [pdf] | [SpQR] | ⭐️ |
2023.06 | [SqueezeLLM] SqueezeLLM: Dense-and-Sparse Quantization(@berkeley.edu) | [pdf] | [SqueezeLLM] | ⭐️ |
2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats(@Microsoft) | [pdf] | [DeepSpeed] | ⭐️ |
2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) | [blog] | | ⭐️ |
2023.10 | [FP8-LM] FP8-LM: Training FP8 Large Language Models(@Microsoft etc) | [pdf] | [MS-AMP] | ⭐️ |
2023.10 | [LLM-Shearing] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning(@cs.princeton.edu etc) | [pdf] | [LLM-Shearing] | ⭐️ |
2023.10 | [LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers(@ust.hk&meta etc) | [pdf] | [LLM-FP4] | ⭐️ |
2023.11 | [2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization(@Shanghai Jiao Tong University etc) | [pdf] | | ⭐️ |
2023.12 | [SmoothQuant+] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM(@ZTE Corporation) | [pdf] | [smoothquantplus] | ⭐️ |
2023.11 | [OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs(@meituan.com) | [pdf] | | ⭐️ |
2023.12 | 🔥[SparQ] SparQ Attention: Bandwidth-Efficient LLM Inference(@graphcore.ai) | [pdf] | | ⭐️⭐️ |
2023.12 | [Agile-Quant] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge(@Northeastern University&Oracle) | [pdf] | | ⭐️ |
2023.12 | [CBQ] CBQ: Cross-Block Quantization for Large Language Models(@ustc.edu.cn) | [pdf] | | ⭐️ |
2023.10 | [QLLM] QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models(@ZIP Lab&SenseTime Research etc) | [pdf] | | ⭐️ |
📖IO/FLOPs-Aware/Sparse Attention (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2018.05 | [Online Softmax] Online normalizer calculation for softmax(@NVIDIA) | [pdf] | | ⭐️ |
2019.11 | 🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google) | [pdf] | | ⭐️⭐️ |
2020.10 | [Hash Attention] Reformer: The Efficient Transformer(@Google) | [pdf] | [reformer] | ⭐️⭐️ |
2022.05 | 🔥[FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness(@Stanford University etc) | [pdf] | [flash-attention] | ⭐️⭐️ |
2022.10 | [Online Softmax] Self-attention Does Not Need O(n^2) Memory(@Google) | [pdf] | | ⭐️ |
2023.05 | [FlashAttention] From Online Softmax to FlashAttention(@cs.washington.edu) | [pdf] | | ⭐️⭐️ |
2023.05 | [FLOP, I/O] Dissecting Batching Effects in GPT Inference(@Lequn Chen) | [blog] | | ⭐️ |
2023.05 | 🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) | [pdf] | [flaxformer] | ⭐️⭐️ |
2023.06 | [Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) | [pdf] | [dynamic-sparse-flash-attention] | ⭐️ |
2023.07 | 🔥[FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning(@Stanford University etc) | [pdf] | [flash-attention] | ⭐️⭐️ |
2023.10 | 🔥[Flash-Decoding] Flash-Decoding for long-context inference(@Stanford University etc) | [blog] | [flash-attention] | ⭐️⭐️ |
2023.11 | [Flash-Decoding++] FlashDecoding++: Faster Large Language Model Inference on GPUs(@Tsinghua University&Infinigence-AI) | [pdf] | | ⭐️ |
2023.01 | [SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot(@ISTA etc) | [pdf] | [sparsegpt] | ⭐️ |
2023.11 | 🔥[HyperAttention] HyperAttention: Long-context Attention in Near-Linear Time(@yale&Google) | [pdf] | [hyper-attn] | ⭐️⭐️ |
2023.11 | [Streaming Attention Approximation] One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space(@Adobe Research etc) | [pdf] | | ⭐️ |
2023.12 | 🔥[GLA] Gated Linear Attention Transformers with Hardware-Efficient Training(@MIT-IBM Watson AI) | [pdf] | [gated_linear_attention] | ⭐️⭐️ |
2023.12 | [SCCA] SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion(@Beihang University) | [pdf] | | ⭐️ |
2023.05 | [Landmark Attention] Random-Access Infinite Context Length for Transformers(@epfl.ch) | [pdf] | [landmark-attention] | ⭐️⭐️ |
2023.12 | 🔥[FlashLLM] LLM in a flash: Efficient Large Language Model Inference with Limited Memory(@Apple) | [pdf] | | ⭐️⭐️ |
📖KV Cache Scheduling/Quantize/Dropping (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2019.11 | 🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google) | [pdf] | | ⭐️⭐️ |
2022.06 | [LTP] Learned Token Pruning for Transformers(@UC Berkeley etc) | [pdf] | [LTP] | ⭐️ |
2023.05 | 🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) | [pdf] | [flaxformer] | ⭐️⭐️ |
2023.05 | [KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time(@) | [pdf] | | ⭐️⭐️ |
2023.06 | [H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(@Rice University etc) | [pdf] | [H2O] | ⭐️ |
2023.06 | [QK-Sparse/Dropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) | [pdf] | [dynamic-sparse-flash-attention] | ⭐️ |
2023.09 | 🔥🔥[PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) | [pdf] | [vllm] | ⭐️⭐️ |
2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) | [blog] | | ⭐️ |
2023.10 | 🔥[TensorRT-LLM KV Cache FP8] NVIDIA TensorRT LLM(@NVIDIA) | [docs] | [TensorRT-LLM] | ⭐️⭐️ |
2023.10 | 🔥[Adaptive KV Cache Compress] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs(@illinois.edu&microsoft) | [pdf] | | ⭐️⭐️ |
2023.10 | [CacheGen] CacheGen: Fast Context Loading for Language Model Applications(@Chicago University&Microsoft) | [pdf] | | ⭐️ |
2023.12 | [KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc) | [pdf] | | ⭐️ |
📖Early-Exit/Intermediate Layer Decoding (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2020.04 | [DeeBERT] DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference(@uwaterloo.ca) | [pdf] | | ⭐️ |
2021.06 | [BERxiT] BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression(@uwaterloo.ca) | [pdf] | [berxit] | ⭐️ |
2023.10 | 🔥[LITE] Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE(@Arizona State University) | [pdf] | | ⭐️⭐️ |
2023.12 | 🔥🔥[EE-LLM] EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism(@alibaba-inc.com) | [pdf] | [EE-LLM] | ⭐️⭐️ |
2023.10 | 🔥[FREE] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding(@KAIST AI&AWS AI) | [pdf] | [fast_robust_early_exit] | ⭐️⭐️ |
📖Parallel Decoding/Sampling (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2018.11 | 🔥[Parallel Decoding] Blockwise Parallel Decoding for Deep Autoregressive Models(@Berkeley&Google) | [pdf] | | ⭐️⭐️ |
2023.02 | 🔥[Speculative Sampling] Accelerating Large Language Model Decoding with Speculative Sampling(@DeepMind) | [pdf] | [LLMSpeculativeSampling] | ⭐️⭐️ |
2023.05 | 🔥[Speculative Sampling] Fast Inference from Transformers via Speculative Decoding(@Google Research etc) | [pdf] | [LLMSpeculativeSampling] | ⭐️⭐️ |
2023.09 | 🔥[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc) | [blog] | [Medusa] | ⭐️⭐️ |
2023.10 | [OSD] Online Speculative Decoding(@UC Berkeley etc) | [pdf] | | ⭐️⭐️ |
2023.12 | [Cascade Speculative] Cascade Speculative Drafting for Even Faster LLM Inference(@illinois.edu) | [pdf] | | ⭐️ |
📖Structured Pruning/Knowledge Distillation (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2023.12 | [FLAP] Fluctuation-based Adaptive Structured Pruning for Large Language Models(@Chinese Academy of Sciences etc) | [pdf] | [FLAP] | ⭐️⭐️ |
📖CPU/Single GPU/Mobile Inference (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc) | [pdf] | [FlexGen] | ⭐️ |
2023.11 | [LLM CPU Inference] Efficient LLM Inference on CPUs(@intel) | [pdf] | [intel-extension-for-transformers] | ⭐️ |
2023.12 | [LinguaLinked] LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices(@University of California Irvine) | [pdf] | | ⭐️ |
2023.12 | [OpenVINO] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc) | [pdf] | | ⭐️ |
📖Non Transformer Architecture (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2023.05 | 🔥🔥[RWKV] RWKV: Reinventing RNNs for the Transformer Era(@Bo Peng etc) | [pdf] | [RWKV-LM] | ⭐️⭐️ |
2023.12 | 🔥🔥[Mamba] Mamba: Linear-Time Sequence Modeling with Selective State Spaces(@cs.cmu.edu etc) | [pdf] | [mamba] | ⭐️⭐️ |
📖GEMM、Tensor Cores、WMMA (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision(@KTH Royal etc) | [pdf] | | ⭐️ |
2022.09 | [FP8] FP8 Formats for Deep Learning(@NVIDIA) | [pdf] | | ⭐️ |
2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library(@Tokyo Institute etc) | [pdf] | [wmma_extension] | ⭐️ |
📖Position Embed、Others (©️back👆🏻)
Date | Title | Paper | Code | Recom |
---|---|---|---|---|
2021.04 | 🔥[RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding(@Zhuiyi Technology Co., Ltd.) | [pdf] | [transformers] | ⭐️ |
2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs(@ByteDance&NVIDIA) | [pdf] | [ByteTransformer] | ⭐️ |
GNU General Public License v3.0
You are welcome to submit a PR to this repo!