
Large Language Model (LLM)

{% hint style="info" %} I am actively maintaining this list. {% endhint %}

LLM Training

Hybrid parallelism

  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (ATC 2024) [Paper] [Code]
    • Kuaishou
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [Paper] [Code] [Docs]
    • UC Berkeley & AWS & Google & SJTU & CMU & Duke
    • Generalize the search over inter- and intra-operator parallelism strategies.

Fault tolerance

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 2023) [Paper] [arXiv] [Code]
    • UMich SymbioticLab & AWS & PKU
  • Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 2023) [Paper]
    • Rice & AWS
  • Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI 2023) [Paper] [Code]
    • UCLA & CMU & MSR & Princeton
    • Resilient distributed training

LLM Inference

  • CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]
    • UChicago & Microsoft & Stanford
  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (OSDI 2024) [Paper] [Code] [arXiv]
    • MSR India & GaTech
    • Sarathi-Serve
  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models (OSDI 2024) [Paper] [Code] [arXiv]
    • Edinburgh
  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (OSDI 2024) [Paper] [Code]
    • SJTU & MSRA
  • Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)
  • ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)
  • Efficiently Programming Large Language Models using SGLang (arXiv:2312.07104) [Personal Notes] [arXiv] [Code]
    • UC Berkeley & Stanford
    • Co-design the front-end programming interface and back-end serving runtime
    • SGLang; SGVM w/ RadixAttention
    • Reuse the KV cache across multiple calls and programs (see the prefix-reuse sketch after this list)
  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv:2312.12456) [arXiv]
    • SJTU
    • A GPU-CPU hybrid inference engine
    • Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU
  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514) [arXiv]
    • Apple
  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]
    • CMU & PKU & CUHK
  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]
    • Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU
    • A system to predict contextual sparsity (small, input-dependent sets that yield approximately the same output).
  • Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920) [Paper]
    • PKU
    • Skip-join multi-level feedback queue scheduling instead of first-come-first-served.
    • Proactive KV cache swapping.
    • Compared to Orca
  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 2023) [Paper] [Code]
    • UC Berkeley & PKU & UPenn & Stanford & Google
    • Trade off the overhead of model parallelism against the latency reduction gained by statistically multiplexing devices across models.
  • Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]
    • Google
    • Outstanding Paper Award
    • Model partitioning; PaLM; TPUv4
  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 2022) [Paper] [Code] [Homepage]
    • Microsoft DeepSpeed
    • Leverage CPU/NVMe/GPU memory.
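
A minimal sketch of the KV-cache prefix reuse idea behind SGLang's RadixAttention, referenced in the SGLang entry above. It assumes a plain (uncompressed) trie keyed on token IDs; the node layout and the string KV "handles" are illustrative placeholders, not SGLang's actual data structures.

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_handle = None  # placeholder for this token's cached KV block

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return (matched_len, handles) for the longest cached prefix."""
        node, handles = self.root, []
        for i, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None or child.kv_handle is None:
                return i, handles
            node = child
            handles.append(child.kv_handle)
        return len(tokens), handles

    def insert(self, tokens, handles):
        """Record the KV handle produced for each token of a finished prefill."""
        node = self.root
        for tok, handle in zip(tokens, handles):
            node = node.children.setdefault(tok, RadixNode())
            node.kv_handle = handle

cache = PrefixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
# A second call sharing the prefix [1, 2, 3] only needs to prefill token 9.
matched, reused = cache.match_prefix([1, 2, 3, 9])
print(matched, reused)  # 3 ['kv1', 'kv2', 'kv3']
```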

Request Scheduling

  • Llumnix: Dynamic Scheduling for Large Language Model Serving (OSDI 2024) [Paper] [Code]
    • Alibaba
  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]
    • Seoul National University & FriendliAI
    • Iteration-level scheduling; selective batching (see the sketch after this list).
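
A toy sketch of Orca-style iteration-level scheduling (continuous batching), assuming a stand-in Request record and a dummy single-token decode step: the scheduler re-forms the batch at every decoding iteration, so finished requests leave and waiting requests join immediately instead of batching whole requests end-to-end.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: int = 0

def decode_one_token(req):
    req.generated += 1  # stand-in for one forward pass emitting one token

def serve(requests, max_batch_size=4):
    waiting, running = deque(requests), []
    while waiting or running:
        # Admit waiting requests at iteration granularity, up to the batch limit.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One model iteration over the current batch (selective batching would
        # batch the shared matmuls and run attention per sequence).
        for req in running:
            decode_one_token(req)
        # Finished requests exit immediately instead of blocking the batch.
        for req in [r for r in running if r.generated >= r.max_new_tokens]:
            print(f"request {req.rid} finished after {req.generated} tokens")
        running = [r for r in running if r.generated < r.max_new_tokens]

serve([Request(rid=i, max_new_tokens=2 + i) for i in range(6)])
```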

KV Cache Management

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024) [Paper]
    • Seoul National University
  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) [Paper] [arXiv] [Code] [Homepage]
    • UC Berkeley & Stanford & UCSD
    • vLLM, PagedAttention
    • Partition the KV cache of each sequence into blocks, each holding the keys and values for a fixed number of tokens (see the sketch after this list).
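
A minimal sketch of PagedAttention-style KV-cache paging: the cache is a pool of fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks allocated on demand. The block size, pool size, and class names are illustrative, not vLLM's implementation.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; preempt or swap a sequence")
        return self.free_blocks.pop()

    def free(self, blocks):
        self.free_blocks.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last block is full,
        # so memory waste is bounded by one partially filled block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)       # e.g. [7, 6, 5]
```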

Phase Disaggregation

  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (arXiv:2407.00079) [arXiv] [Code]
    • Moonshot AI & Tsinghua
    • Separate the prefill and decoding clusters; prediction-based early rejection.
  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) [arXiv]
    • ICT, CAS & Huawei Cloud
  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 2024) [Paper] [Code]
    • PKU & UCSD
  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 2024) [arXiv] [Blog]
    • UW & Microsoft
    • Best Paper Award
    • Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines (see the sketch after this list).
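
A toy sketch of prefill/decode disaggregation in the spirit of Splitwise, DistServe, and Mooncake: a request is prefilled on one pool of workers, its KV cache is handed off, and decoding continues on a separate pool. The pool classes, round-robin placement, and dictionary KV "payload" are illustrative stand-ins for real placement policies and KV transfer.

```python
import itertools

class PrefillPool:
    def __init__(self, workers):
        self._rr = itertools.cycle(workers)

    def prefill(self, request_id, prompt_tokens):
        worker = next(self._rr)  # round-robin; real systems use load- and KV-aware policies
        kv_cache = {"request": request_id, "num_tokens": len(prompt_tokens)}  # stand-in for KV tensors
        return worker, kv_cache

class DecodePool:
    def __init__(self, workers):
        self._rr = itertools.cycle(workers)

    def decode(self, kv_cache, max_new_tokens):
        worker = next(self._rr)
        # The KV cache produced by prefill is shipped here before decoding starts.
        return worker, [f"tok{i}" for i in range(max_new_tokens)]

prefill_pool = PrefillPool(["prefill-0", "prefill-1"])
decode_pool = DecodePool(["decode-0"])

p_worker, kv = prefill_pool.prefill("req-1", prompt_tokens=list(range(128)))
d_worker, output = decode_pool.decode(kv, max_new_tokens=4)
print(p_worker, "->", d_worker, output)
```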

LoRA Serving

  • dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (OSDI 2024) [Paper]
    • PKU & Shanghai AI Lab
  • CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) [arXiv]
    • HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud
  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]
    • UC Berkeley
  • Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]
    • UW & Duke
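
Since the entries above do not spell out the serving mechanism, here is a minimal sketch of multi-adapter LoRA batching in the spirit of S-LoRA and Punica, assuming NumPy and illustrative shapes: every request shares the base weight W, and each adds its own low-rank update x @ A_i @ B_i selected by an adapter id. Real systems fuse the gather and the low-rank matmuls into custom batched kernels.

```python
import numpy as np

d_in, d_out, rank, num_adapters = 8, 8, 2, 3
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))               # shared base weight
A = rng.standard_normal((num_adapters, d_in, rank))  # per-adapter LoRA A matrices
B = rng.standard_normal((num_adapters, rank, d_out)) # per-adapter LoRA B matrices

def lora_batch_forward(x, adapter_ids, scaling=1.0):
    """x: (batch, d_in); adapter_ids: (batch,) indices into the adapter pool."""
    base = x @ W  # one shared GEMM for the whole batch, regardless of adapters
    # Gather each request's adapter and apply its low-rank update.
    delta = np.einsum("bi,bir,bro->bo", x, A[adapter_ids], B[adapter_ids])
    return base + scaling * delta

x = rng.standard_normal((4, d_in))
adapter_ids = np.array([0, 2, 1, 0])                 # each request picks its own adapter
print(lora_batch_forward(x, adapter_ids).shape)      # (4, 8)
```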

Speculative Decoding

  • Online Speculative Decoding (ICML 2024) [arXiv]
    • UC Berkeley & UCSD & Sisu Data & SJTU
  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]
    • CMU
  • Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]
    • UC Berkeley & ICSI & LBNL
  • Fast Inference from Transformers via Speculative Decoding (ICML 2023) [Paper]
    • Google Research
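
A minimal sketch of the generic draft-then-verify loop these papers build on, assuming toy deterministic stand-ins for the draft and target models. For clarity the sketch checks draft tokens one at a time and accepts on exact match; the real speedup comes from the target model scoring all draft positions in a single forward pass, with the papers' rejection-sampling acceptance rule.

```python
def draft_model(prefix, k):
    # Cheap proposer: guess that the sequence keeps counting upward.
    return [prefix[-1] + i + 1 for i in range(k)]

def target_model(prefix):
    # Expensive "ground truth": also counts upward, but skips multiples of 5.
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_decode(prompt, num_tokens, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + num_tokens:
        proposal = draft_model(out, k)
        accepted = []
        for tok in proposal:
            expected = target_model(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token verified, keep it
            else:
                accepted.append(expected)  # first mismatch: take the target token and stop
                break
        out.extend(accepted)               # several tokens may be committed per target step
    return out[: len(prompt) + num_tokens]

print(speculative_decode([1], num_tokens=10))
```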

Offloading

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]
    • Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
    • High-throughput generative inference using only a single GPU (see the sketch after this list).
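
A minimal sketch of layer-wise weight offloading in the spirit of FlexGen and DeepSpeed-Inference, assuming PyTorch is available. It only illustrates streaming one layer's weights onto the accelerator at a time; the papers also offload activations and the KV cache and overlap transfers with compute.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"    # sketch falls back to CPU-only
layers = [torch.nn.Linear(1024, 1024) for _ in range(8)]    # all weights start in CPU memory

def offloaded_forward(x):
    x = x.to(device)
    for layer in layers:
        layer.to(device)   # stream this layer's weights into accelerator memory
        x = layer(x)
        layer.to("cpu")    # evict the weights so only one layer is resident at a time
    return x.cpu()

out = offloaded_forward(torch.randn(4, 1024))
print(out.shape)  # torch.Size([4, 1024])
```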

Heterogeneous Environment

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]
    • HKUST & ETH & CMU
    • Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline-parallel stage can be assigned a different number of layers and a different tensor model parallel degree)
    • Propose a heuristic-based evolutionary algorithm to search for the optimal layout (see the sketch after this list)
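
A toy sketch of the kind of layout search the HexGen entry describes: a candidate assigns each pipeline stage its own number of layers and tensor-parallel degree, and a simple evolutionary loop mutates candidates under a cost model. The layer count, per-stage GPU speeds, and max-stage cost function are made-up placeholders, not HexGen's formulation.

```python
import random

NUM_LAYERS = 32
GPU_SPEEDS = [1.0, 1.0, 0.5, 0.5]  # heterogeneous per-stage throughput (toy numbers)

def random_layout(num_stages=4):
    # Split the layers into contiguous chunks and pick a TP degree per stage.
    cuts = sorted(random.sample(range(1, NUM_LAYERS), num_stages - 1))
    chunks = [b - a for a, b in zip([0] + cuts, cuts + [NUM_LAYERS])]
    return [(layers, random.choice([1, 2, 4])) for layers in chunks]

def cost(layout):
    # Toy cost model: pipeline throughput is bounded by the slowest stage.
    return max(layers / (GPU_SPEEDS[i] * tp) for i, (layers, tp) in enumerate(layout))

def mutate(layout):
    child = list(layout)
    i = random.randrange(len(child))
    layers, _ = child[i]
    child[i] = (layers, random.choice([1, 2, 4]))  # re-pick this stage's TP degree
    return child

def evolve(generations=200, population=16):
    pop = [random_layout() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=cost)
        pop = pop[: population // 2]                            # keep the fittest half
        pop += [mutate(random.choice(pop)) for _ in range(population - len(pop))]
    return min(pop, key=cost)

best = evolve()
print(best, cost(best))
```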

Fairness

LLM Alignment

  • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch (ATC 2024) [Paper]
    • THU

Acronyms

  • LLM: Large Language Model
  • LoRA: Low-Rank Adaptation