
Large Language Model (LLM)

{% hint style="info" %} I am actively maintaining this list. {% endhint %}

LLM Training

Hybrid parallelism

  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (ATC 2024) [Paper] [Code]
    • Kuaishou
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [Paper] [Code] [Docs]
    • UC Berkeley & AWS & Google & SJTU & CMU & Duke
    • Generalize the search over inter- and intra-operator parallelism strategies.

Fault tolerance

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 2023) [Paper] [arXiv] [Code]
    • UMich SymbioticLab & AWS & PKU
  • Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 2023) [Paper]
    • Rice & AWS
  • Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI 2023) [Paper] [Code]
    • UCLA & CMU & MSR & Princeton
    • Resilient distributed training

LLM Inference

  • CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]
    • UChicago & Microsoft & Stanford
  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (OSDI 2024) [Paper] [Code] [arXiv]
    • MSR India & GaTech
    • Sarathi-Serve
  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models (OSDI 2024) [Paper] [Code] [arXiv]
    • Edinburgh
  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (OSDI 2024) [Paper] [Code]
    • SJTU & MSRA
  • Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)
  • ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)
  • Efficiently Programming Large Language Models using SGLang (arXiv:2312.07104) [Personal Notes] [arXiv] [Code]
    • UC Berkeley & Stanford
    • Co-design the front-end programming interface and back-end serving runtime
    • SGLang; SGVM w/ RadixAttention
    • Reuse the KV cache across multiple calls and programs (see the prefix-reuse sketch after this list)
  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv:2312.12456) [arXiv]
    • SJTU
    • A GPU-CPU hybrid inference engine
    • Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU
  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514) [arXiv]
    • Apple
  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]
    • CMU & PKU & CUHK
  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]
    • Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU
    • A system to predict contextual sparsity (small, input-dependent sets that yield approximately the same output).
  • Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920) [Paper]
    • PKU
    • Skip-join multi-level feedback queue scheduling instead of first-come-first-served.
    • Proactive KV cache swapping.
    • Compared to Orca
  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 2023) [Paper] [Code]
    • UC Berkeley & PKU & UPenn & Stanford & Google
    • Trade off the overhead of model parallelism against the latency reduction gained by statistically multiplexing devices across models.
  • Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]
    • Google
    • Outstanding Paper Award
    • Model partitioning; PaLM; TPUv4
  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 2022) [Paper] [Code] [Homepage]
    • Microsoft DeepSpeed
    • Leverage CPU/NVMe/GPU memory.
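
A minimal sketch of the KV-cache prefix reuse idea behind SGLang's RadixAttention, referenced in the SGLang entry above. It assumes a plain (uncompressed) trie keyed on token IDs; the node layout and the string KV "handles" are illustrative placeholders, not SGLang's actual data structures.

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_handle = None  # placeholder for this token's cached KV block

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return (matched_len, handles) for the longest cached prefix."""
        node, handles = self.root, []
        for i, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None or child.kv_handle is None:
                return i, handles
            node = child
            handles.append(child.kv_handle)
        return len(tokens), handles

    def insert(self, tokens, handles):
        """Record the KV handle produced for each token of a finished prefill."""
        node = self.root
        for tok, handle in zip(tokens, handles):
            node = node.children.setdefault(tok, RadixNode())
            node.kv_handle = handle

cache = PrefixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
# A second call sharing the prefix [1, 2, 3] only needs to prefill token 9.
matched, reused = cache.match_prefix([1, 2, 3, 9])
print(matched, reused)  # 3 ['kv1', 'kv2', 'kv3']
```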

Request Scheduling

  • Llumnix: Dynamic Scheduling for Large Language Model Serving (OSDI 2024) [Paper] [Code]
    • Alibaba
  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]
    • Seoul National University & FriendliAI
    • Iteration-level scheduling; selective batching (see the sketch after this list).
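
A toy sketch of Orca-style iteration-level scheduling (continuous batching), assuming a stand-in Request record and a dummy single-token decode step: the scheduler re-forms the batch at every decoding iteration, so finished requests leave and waiting requests join immediately instead of batching whole requests end-to-end.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: int = 0

def decode_one_token(req):
    req.generated += 1  # stand-in for one forward pass emitting one token

def serve(requests, max_batch_size=4):
    waiting, running = deque(requests), []
    while waiting or running:
        # Admit waiting requests at iteration granularity, up to the batch limit.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One model iteration over the current batch (selective batching would
        # batch the shared matmuls and run attention per sequence).
        for req in running:
            decode_one_token(req)
        # Finished requests exit immediately instead of blocking the batch.
        for req in [r for r in running if r.generated >= r.max_new_tokens]:
            print(f"request {req.rid} finished after {req.generated} tokens")
        running = [r for r in running if r.generated < r.max_new_tokens]

serve([Request(rid=i, max_new_tokens=2 + i) for i in range(6)])
```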

KV Cache Management

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024) [Paper]
    • Seoul National University
  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) [Paper] [arXiv] [Code] [Homepage]
    • UC Berkeley & Stanford & UCSD
    • vLLM, PagedAttention
    • Partition the KV cache of each sequence into blocks, each holding the keys and values for a fixed number of tokens (see the sketch after this list).
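
A minimal sketch of PagedAttention-style KV-cache paging: the cache is a pool of fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks allocated on demand. The block size, pool size, and class names are illustrative, not vLLM's implementation.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; preempt or swap a sequence")
        return self.free_blocks.pop()

    def free(self, blocks):
        self.free_blocks.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last block is full,
        # so memory waste is bounded by one partially filled block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)       # e.g. [7, 6, 5]
```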

Phase Disaggregation

  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (arXiv:2407.00079) [arXiv] [Code]
    • Moonshot AI & Tsinghua
    • Separate the prefill and decoding clusters; prediction-based early rejection.
  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) [arXiv]
    • ICT, CAS & Huawei Cloud
  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 2024) [Paper] [Code]
    • PKU & UCSD
  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 2024) [arXiv] [Blog]
    • UW & Microsoft
    • Best Paper Award
    • Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines (see the sketch after this list).
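
A toy sketch of prefill/decode disaggregation in the spirit of Splitwise, DistServe, and Mooncake: a request is prefilled on one pool of workers, its KV cache is handed off, and decoding continues on a separate pool. The pool classes, round-robin placement, and dictionary KV "payload" are illustrative stand-ins for real placement policies and KV transfer.

```python
import itertools

class PrefillPool:
    def __init__(self, workers):
        self._rr = itertools.cycle(workers)

    def prefill(self, request_id, prompt_tokens):
        worker = next(self._rr)  # round-robin; real systems use load- and KV-aware policies
        kv_cache = {"request": request_id, "num_tokens": len(prompt_tokens)}  # stand-in for KV tensors
        return worker, kv_cache

class DecodePool:
    def __init__(self, workers):
        self._rr = itertools.cycle(workers)

    def decode(self, kv_cache, max_new_tokens):
        worker = next(self._rr)
        # The KV cache produced by prefill is shipped here before decoding starts.
        return worker, [f"tok{i}" for i in range(max_new_tokens)]

prefill_pool = PrefillPool(["prefill-0", "prefill-1"])
decode_pool = DecodePool(["decode-0"])

p_worker, kv = prefill_pool.prefill("req-1", prompt_tokens=list(range(128)))
d_worker, output = decode_pool.decode(kv, max_new_tokens=4)
print(p_worker, "->", d_worker, output)
```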

LoRA Serving

  • dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (OSDI 2024) [Paper]
    • PKU & Shanghai AI Lab
  • CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) [arXiv]
    • HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud
  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]
    • UC Berkeley
  • Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]
    • UW & Duke
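
Since the entries above do not spell out the serving mechanism, here is a minimal sketch of multi-adapter LoRA batching in the spirit of S-LoRA and Punica, assuming NumPy and illustrative shapes: every request shares the base weight W, and each adds its own low-rank update x @ A_i @ B_i selected by an adapter id. Real systems fuse the gather and the low-rank matmuls into custom batched kernels.

```python
import numpy as np

d_in, d_out, rank, num_adapters = 8, 8, 2, 3
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))               # shared base weight
A = rng.standard_normal((num_adapters, d_in, rank))  # per-adapter LoRA A matrices
B = rng.standard_normal((num_adapters, rank, d_out)) # per-adapter LoRA B matrices

def lora_batch_forward(x, adapter_ids, scaling=1.0):
    """x: (batch, d_in); adapter_ids: (batch,) indices into the adapter pool."""
    base = x @ W  # one shared GEMM for the whole batch, regardless of adapters
    # Gather each request's adapter and apply its low-rank update.
    delta = np.einsum("bi,bir,bro->bo", x, A[adapter_ids], B[adapter_ids])
    return base + scaling * delta

x = rng.standard_normal((4, d_in))
adapter_ids = np.array([0, 2, 1, 0])                 # each request picks its own adapter
print(lora_batch_forward(x, adapter_ids).shape)      # (4, 8)
```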

Speculative Decoding

  • Online Speculative Decoding (ICML 2024) [arXiv]
    • UC Berkeley & UCSD & Sisu Data & SJTU
  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]
    • CMU
  • Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]
    • UC Berkeley & ICSI & LBNL
  • Fast Inference from Transformers via Speculative Decoding (ICML 2023) [Paper]
    • Google Research
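
A minimal sketch of the generic draft-then-verify loop these papers build on, assuming toy deterministic stand-ins for the draft and target models. For clarity the sketch checks draft tokens one at a time and accepts on exact match; the real speedup comes from the target model scoring all draft positions in a single forward pass, with the papers' rejection-sampling acceptance rule.

```python
def draft_model(prefix, k):
    # Cheap proposer: guess that the sequence keeps counting upward.
    return [prefix[-1] + i + 1 for i in range(k)]

def target_model(prefix):
    # Expensive "ground truth": also counts upward, but skips multiples of 5.
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_decode(prompt, num_tokens, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + num_tokens:
        proposal = draft_model(out, k)
        accepted = []
        for tok in proposal:
            expected = target_model(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token verified, keep it
            else:
                accepted.append(expected)  # first mismatch: take the target token and stop
                break
        out.extend(accepted)               # several tokens may be committed per target step
    return out[: len(prompt) + num_tokens]

print(speculative_decode([1], num_tokens=10))
```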

Offloading

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]
    • Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
    • High-throughput generative inference using only a single GPU (see the sketch after this list).
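
A minimal sketch of layer-wise weight offloading in the spirit of FlexGen and DeepSpeed-Inference, assuming PyTorch is available. It only illustrates streaming one layer's weights onto the accelerator at a time; the papers also offload activations and the KV cache and overlap transfers with compute.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"    # sketch falls back to CPU-only
layers = [torch.nn.Linear(1024, 1024) for _ in range(8)]    # all weights start in CPU memory

def offloaded_forward(x):
    x = x.to(device)
    for layer in layers:
        layer.to(device)   # stream this layer's weights into accelerator memory
        x = layer(x)
        layer.to("cpu")    # evict the weights so only one layer is resident at a time
    return x.cpu()

out = offloaded_forward(torch.randn(4, 1024))
print(out.shape)  # torch.Size([4, 1024])
```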

Heterogeneous Environment

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]
    • HKUST & ETH & CMU
    • Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline-parallel stage can be assigned a different number of layers and a different tensor model parallel degree)
    • Propose a heuristic-based evolutionary algorithm to search for the optimal layout (see the sketch after this list)
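
A toy sketch of the kind of layout search the HexGen entry describes: a candidate assigns each pipeline stage its own number of layers and tensor-parallel degree, and a simple evolutionary loop mutates candidates under a cost model. The layer count, per-stage GPU speeds, and max-stage cost function are made-up placeholders, not HexGen's formulation.

```python
import random

NUM_LAYERS = 32
GPU_SPEEDS = [1.0, 1.0, 0.5, 0.5]  # heterogeneous per-stage throughput (toy numbers)

def random_layout(num_stages=4):
    # Split the layers into contiguous chunks and pick a TP degree per stage.
    cuts = sorted(random.sample(range(1, NUM_LAYERS), num_stages - 1))
    chunks = [b - a for a, b in zip([0] + cuts, cuts + [NUM_LAYERS])]
    return [(layers, random.choice([1, 2, 4])) for layers in chunks]

def cost(layout):
    # Toy cost model: pipeline throughput is bounded by the slowest stage.
    return max(layers / (GPU_SPEEDS[i] * tp) for i, (layers, tp) in enumerate(layout))

def mutate(layout):
    child = list(layout)
    i = random.randrange(len(child))
    layers, _ = child[i]
    child[i] = (layers, random.choice([1, 2, 4]))  # re-pick this stage's TP degree
    return child

def evolve(generations=200, population=16):
    pop = [random_layout() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=cost)
        pop = pop[: population // 2]                            # keep the fittest half
        pop += [mutate(random.choice(pop)) for _ in range(population - len(pop))]
    return min(pop, key=cost)

best = evolve()
print(best, cost(best))
```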

Fairness

LLM Alignment

  • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch (ATC 2024) [Paper]
    • THU

Acronyms

  • LLM: Large Language Model
  • LoRA: Low-Rank Adaptation