{% hint style="info" %} I am actively maintaining this list. {% endhint %}
- Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (ATC 2024) [Paper] [Code]
- Kuaishou
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [Paper] [Code] [Docs]
- UC Berkeley & AWS & Google & SJTU & CMU & Duke
- Generalize the search over inter- and intra-operator parallelism strategies.
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 2023) [Paper] [arXiv] [Code]
- UMich SymbioticLab & AWS & PKU
- Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 2023) [Paper]
- Rice & AWS
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI 2023) [Paper] [Code]
- UCLA & CMU & MSR & Princeton
- Resilient distributed training
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]
- UChicago & Microsoft & Stanford
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (OSDI 2024) [Paper] [Code] [arXiv]
- MSR India & GaTech
- Sarathi-Serve
- ServerlessLLM: Low-Latency Serverless Inference for Large Language Models (OSDI 2024) [Paper] [Code] [arXiv]
- Edinburgh
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (OSDI 2024) [Paper] [Code]
- SJTU & MSRA
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)
- Efficiently Programming Large Language Models using SGLang (arXiv:2312.07104) [Personal Notes] [arXiv] [Code]
- UC Berkeley & Stanford
- Co-design the front-end programming interface and back-end serving runtime
- SGLang; SGVM w/ RadixAttention
- Reuse KV cache across multiple calls and programs
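
A rough illustration of the KV-cache reuse idea: the sketch below keeps a per-token prefix tree that maps already-computed prefixes to cached KV entries, so a new call sharing a prefix can skip part of its prefill. This is not SGLang's implementation (RadixAttention uses a compressed radix tree with eviction); the names `PrefixCache` and `match_prefix` are invented for illustration.

```python
# Minimal sketch of prefix-based KV cache reuse (the idea behind RadixAttention).
# Illustrative only: a per-token trie instead of SGLang's compressed radix tree.

class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_block = None   # handle to the cached KV entry for this position

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in tokens:
            if t in node.children and node.children[t].kv_block is not None:
                node = node.children[t]
                matched += 1
            else:
                break
        return matched

    def insert(self, tokens, kv_blocks):
        """Record KV entries for a full token sequence."""
        node = self.root
        for t, blk in zip(tokens, kv_blocks):
            node = node.children.setdefault(t, RadixNode())
            node.kv_block = blk

cache = PrefixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
print(cache.match_prefix([1, 2, 3, 9]))  # 3: prefill can skip the first 3 tokens
```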
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv:2312.12456) [arXiv]
- SJTU
- A GPU-CPU hybrid inference engine
- Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU
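
A minimal sketch of the hot/cold split, assuming an offline profiler supplies the hot-neuron mask and a runtime predictor supplies the per-input activation mask. Devices are only labeled in comments and all sizes are made up; this is not PowerInfer's code.

```python
# Illustrative sketch of the GPU-CPU hot/cold neuron split.
import numpy as np

hidden, ffn = 16, 64
W_up = np.random.randn(ffn, hidden)

# Assume an offline profiler marked the most frequently activated ("hot")
# neurons; in PowerInfer their weights are preloaded onto the GPU.
hot_mask = np.zeros(ffn, dtype=bool)
hot_mask[: ffn // 4] = True

def ffn_forward(x, predicted_active):
    """Compute only the neurons a runtime predictor expects to activate."""
    out = np.zeros(ffn)
    act_hot = predicted_active & hot_mask     # rows resident on the GPU
    act_cold = predicted_active & ~hot_mask   # rows resident in CPU memory
    out[act_hot] = W_up[act_hot] @ x          # fast GPU path
    out[act_cold] = W_up[act_cold] @ x        # computed on the CPU in PowerInfer
    return out

y = ffn_forward(np.random.randn(hidden), np.random.rand(ffn) > 0.5)
```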
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514) [arXiv]
- Apple
- SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]
- CMU & PKU & CUHK
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]
- Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU
- A system to predict contextual sparsity (small, input-dependent sets of attention heads and MLP neurons that yield approximately the same output as the dense model).
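
A rough sketch of the contextual-sparsity prediction idea: a cheap low-rank predictor scores neurons from the current hidden state and only the predicted-active ones are computed. The sizes, predictor shape, and top-k selection rule here are assumptions for illustration, not Deja Vu's configuration.

```python
# Sketch: predict which MLP neurons to compute from the current hidden state.
import numpy as np

hidden, ffn, rank, k = 32, 128, 8, 32
A = np.random.randn(rank, hidden)   # low-rank predictor (trained offline in Deja Vu)
B = np.random.randn(ffn, rank)
W_up = np.random.randn(ffn, hidden)

def sparse_ffn(x):
    scores = B @ (A @ x)                 # cheap: O(rank * (hidden + ffn))
    active = np.argsort(scores)[-k:]     # keep only the predicted-active neurons
    out = np.zeros(ffn)
    out[active] = W_up[active] @ x       # skip the rest of the dense matmul
    return out

y = sparse_ffn(np.random.randn(hidden))
```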
- Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920) [Paper]
- PKU
- Skip-join multi-level feedback queue scheduling instead of first-come-first-served (see the sketch after this entry).
- Proactive KV cache swapping.
- Compared to Orca
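
A toy sketch of skip-join MLFQ scheduling: a new request enters the queue level matching its expected prefill cost instead of always starting at the top. The queue quanta and the prompt-length-to-level mapping are invented; preemption details and KV-cache swapping are omitted.

```python
# Sketch of skip-join multi-level feedback queue (MLFQ) scheduling.
from collections import deque

NUM_LEVELS = 4
QUANTUM = [1, 2, 4, 8]                  # time slices (in iterations) per level
queues = [deque() for _ in range(NUM_LEVELS)]

def skip_join_level(prompt_len):
    """New requests skip to the level whose quantum covers their prefill cost,
    instead of always joining the highest-priority queue."""
    for lvl, q in enumerate(QUANTUM):
        if prompt_len <= q * 128:       # assumed tokens-per-quantum constant
            return lvl
    return NUM_LEVELS - 1

def submit(req_id, prompt_len):
    queues[skip_join_level(prompt_len)].append((req_id, prompt_len))

def schedule_one():
    """Run the head of the highest non-empty queue for one quantum, then
    demote it (the finished-request check is omitted here)."""
    for lvl in range(NUM_LEVELS):
        if queues[lvl]:
            req = queues[lvl].popleft()
            queues[min(lvl + 1, NUM_LEVELS - 1)].append(req)  # demote after quantum
            return req, lvl
    return None, None

submit("r1", 100); submit("r2", 900)
print(schedule_one())   # r1 runs first from a higher-priority queue
```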
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 2023) [Paper] [Code]
- UC Berkeley & PKU & UPenn & Stanford & Google
- Trade off the overhead of model parallelism against the serving latency reduction gained from statistical multiplexing.
- Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]
- Outstanding Paper Award
- Model partitioning; PaLM; TPUv4
- DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 2022) [Paper] [Code] [Homepage]
- Microsoft DeepSpeed
- Leverage CPU/NVMe/GPU memory.
- Llumnix: Dynamic Scheduling for Large Language Model Serving (OSDI 2024) [Paper] [Code]
- Alibaba
- Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]
- Seoul National University & FriendliAI
- Iteration-level scheduling; selective batching.
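
A minimal sketch of iteration-level (continuous) batching: the batch is re-formed every decoding iteration, admitting waiting requests and retiring finished ones immediately instead of waiting for the whole batch to drain. `model_step` is a placeholder and a random stop stands in for EOS; Orca's selective batching inside the model is not shown.

```python
# Sketch of an iteration-level scheduling loop.
import random

waiting = [f"req{i}" for i in range(5)]   # requests not yet admitted
running = {}                              # req id -> tokens generated so far
MAX_BATCH = 3

def model_step(batch):
    # Placeholder for one forward pass producing one token per request.
    return {r: "tok" for r in batch}

while waiting or running:
    # Re-form the batch every iteration: admit new requests if there is room.
    while waiting and len(running) < MAX_BATCH:
        running[waiting.pop(0)] = []
    outputs = model_step(list(running))
    for req, tok in outputs.items():
        running[req].append(tok)
        # Retire a request as soon as it finishes (random stop stands in for EOS).
        if random.random() < 0.3 or len(running[req]) >= 16:
            running.pop(req)
```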
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024) [Paper]
- Seoul National University
- Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) [Paper] [arXiv] [Code] [Homepage]
- UC Berkeley & Stanford & UCSD
- vLLM, PagedAttention
- Partition the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens
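
A small sketch of the block-table bookkeeping behind this idea: each sequence holds a table mapping logical block indices to physical blocks, and a new block is allocated only when the current one fills up. The block size and free-list allocator are illustrative; vLLM's block sharing and copy-on-write are omitted.

```python
# Sketch of paged KV-cache bookkeeping in the spirit of PagedAttention.
BLOCK_SIZE = 16                      # tokens per KV block
free_blocks = list(range(1024))      # physical block ids

class SequenceKVCache:
    def __init__(self):
        self.block_table = []        # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Reserve KV space for one more token, allocating a block on demand."""
        if self.num_tokens % BLOCK_SIZE == 0:    # current block full (or none yet)
            self.block_table.append(free_blocks.pop())
        self.num_tokens += 1

    def free(self):
        free_blocks.extend(self.block_table)
        self.block_table.clear()
        self.num_tokens = 0

seq = SequenceKVCache()
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))          # 3 blocks for 40 tokens with BLOCK_SIZE=16
```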
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (arXiv:2407.00079) [arXiv] [Code]
- Moonshot AI & Tsinghua
- Separate the prefill and decoding clusters; prediction-based early rejection.
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) [arXiv]
- ICT, CAS & Huawei Cloud
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 2024) [Paper] [Code]
- PKU & UCSD
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 2024) [arXiv] [Blog]
- UW & Microsoft
- Best Paper Award
- Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines
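
A toy sketch of the prefill/decode disaggregation idea shared by Mooncake, DistServe, and Splitwise above: a prefill worker processes the prompt once and hands the KV cache to a decode worker. Here the "workers" are plain functions and a queue stands in for the network transfer; scheduling and cluster-level details are not modeled.

```python
# Toy sketch of prefill/decode disaggregation.
from queue import Queue

kv_transfer = Queue()   # stands in for the prefill -> decode KV-cache channel

def prefill_worker(req_id, prompt_tokens):
    # Compute-bound phase: process the whole prompt once, produce the KV cache.
    kv_cache = [("k", "v")] * len(prompt_tokens)
    kv_transfer.put((req_id, kv_cache))

def decode_worker(max_new_tokens=4):
    # Memory-bandwidth-bound phase: generate tokens one by one from the KV cache.
    req_id, kv_cache = kv_transfer.get()
    out = []
    for _ in range(max_new_tokens):
        out.append("tok")
        kv_cache.append(("k", "v"))    # decode extends the cache it received
    return req_id, out

prefill_worker("r1", list(range(128)))
print(decode_worker())
```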
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (OSDI 2024) [Paper]
- PKU & Shanghai AI Lab
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) [arXiv]
- HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]
- UC Berkeley
- Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]
- UW & Duke
- Online Speculative Decoding (ICML 2024) [arXiv]
- UC Berkeley & UCSD & Sisu Data & SJTU
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]
- CMU
- Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]
- UC Berkeley & ICSI & LBNL
- Fast Inference from Transformers via Speculative Decoding (ICML 2023) [Paper]
- Google Research
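
A simplified greedy sketch of the draft-then-verify loop behind speculative decoding: a small draft model proposes several tokens, the large model checks them in a single pass, and the longest agreeing prefix is accepted. The paper's acceptance rule is probabilistic (it preserves the target model's output distribution) and also emits a bonus token when every draft is accepted; both are omitted here, and the two "models" are placeholder functions.

```python
# Greedy sketch of speculative decoding (draft, then verify in one pass).
def draft_model(prefix, k):
    # Placeholder: propose k cheap guesses.
    return [(prefix[-1] + 1 + i) % 50 for i in range(k)]

def target_model(tokens):
    # Placeholder for one verification pass of the large model: its own
    # next-token choice after every input position.
    return [(t + 1) % 50 for t in tokens]

def speculative_step(prefix, k=4):
    drafted = draft_model(prefix, k)
    # Target's choices at the k drafted slots, obtained in a single pass.
    verified = target_model(prefix + drafted[:-1])[-k:]
    accepted = []
    for d, v in zip(drafted, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)   # take the target's token at the first mismatch
            break
    return prefix + accepted     # 1..k tokens accepted per large-model pass

print(speculative_step([3, 7, 11]))
```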
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]
- Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
- High-throughput generative inference using only a single GPU.
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]
- HKUST & ETH & CMU
- Support asymmetric tensor model parallelism and pipeline parallelism in heterogeneous settings (i.e., each pipeline stage can be assigned a different number of layers and a different tensor model parallel degree)
- Propose a heuristic-based evolutionary algorithm to search for the optimal layout
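
A toy sketch of what such an asymmetric layout looks like and how one might search over it: each pipeline stage gets its own layer count and tensor-parallel degree. The cost model is a placeholder and random sampling stands in for the paper's heuristic evolutionary algorithm.

```python
# Toy sketch: represent an asymmetric layout and search it by random sampling.
import random

TOTAL_LAYERS = 32

def random_layout(num_stages=3):
    """A layout = per-stage (num_layers, tensor-parallel degree)."""
    cuts = sorted(random.sample(range(1, TOTAL_LAYERS), num_stages - 1))
    layers = [b - a for a, b in zip([0] + cuts, cuts + [TOTAL_LAYERS])]
    return [(n, random.choice([1, 2, 4])) for n in layers]

def cost(layout):
    # Placeholder cost: pipeline latency follows the slowest stage; a higher TP
    # degree speeds a stage up but adds (made-up) communication overhead.
    return max(n / tp + 0.1 * tp for n, tp in layout)

best = min((random_layout() for _ in range(200)), key=cost)
print(best, round(cost(best), 2))
```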
- PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch (ATC 2024) [Paper]
- THU
- LLM: Large Language Model
- LoRA: Low-Rank Adaptation