Homepage: https://www.usenix.org/conference/osdi24
Paper list: https://www.usenix.org/conference/osdi24/technical-sessions
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [Paper] [Code]
- MSR India & GaTech
- Sarathi-Serve
- Chunked-prefills: split a prefill request into near-equal-sized chunks so that prefill work can be interleaved with ongoing decodes.
- Stall-free scheduling: add new requests to a running batch without pausing ongoing decodes; improves throughput with large batch sizes while minimizing the effect of batching on latency (a minimal sketch follows below).
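A minimal sketch of the chunked-prefill, stall-free batching idea, assuming a fixed per-iteration token budget; the `Request` and `build_batch` names are illustrative, not from the Sarathi-Serve code.

```python
# Illustrative sketch (not Sarathi-Serve's implementation): each iteration,
# ongoing decodes always get a slot, and leftover token budget is filled
# with prefill chunks so new requests never stall the decodes.
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: int
    prompt_len: int
    prefilled: int = 0                      # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len

def build_batch(running: list, waiting: deque, token_budget: int):
    batch = []
    # 1. Every ongoing decode contributes exactly one token (never paused).
    for r in running:
        if r.in_decode:
            batch.append((r, 1))
            token_budget -= 1
    # 2. Spend the remaining budget on near-equal prefill chunks.
    for r in list(running) + list(waiting):
        if token_budget <= 0:
            break
        if not r.in_decode:
            chunk = min(token_budget, r.prompt_len - r.prefilled)
            batch.append((r, chunk))
            r.prefilled += chunk
            token_budget -= chunk
            if r in waiting:                # admit the new request into the running set
                waiting.remove(r)
                running.append(r)
    return batch

running = [Request(0, prompt_len=100, prefilled=100)]   # one request already decoding
waiting = deque([Request(1, prompt_len=700)])           # one newly arrived prefill
print(build_batch(running, waiting, token_budget=512))  # 1 decode token + a 511-token prefill chunk
```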
- ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [Paper] [Code]
- Edinburgh
- Multi-tier checkpoint loading.
- Live migration of LLM inference: the source server migrates only the tokens; a re-computation of the KV-cache is triggered at the destination server.
- Use cost models to estimate the time to load a checkpoint from each tier of the storage hierarchy and the time to migrate an ongoing LLM inference to another server; pick the server that minimizes model startup latency (see the sketch below).
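A rough sketch of such a cost model, with assumed tier bandwidths, per-token migration cost, and field names (not ServerlessLLM's actual API).

```python
# Illustrative cost model: estimated startup latency on a server is the time to
# load the checkpoint from its fastest available tier plus the time to migrate
# (re-compute) the tokens of the inference that has to move. All numbers are made up.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    checkpoint_tier_bw_gbps: float   # bandwidth of the fastest tier holding the checkpoint (GB/s)
    tokens_to_migrate: int           # tokens whose KV cache must be re-computed after migration

def startup_latency(server: Server, ckpt_size_gb: float,
                    recompute_cost_per_token_s: float = 1e-4) -> float:
    load_time = ckpt_size_gb / server.checkpoint_tier_bw_gbps
    migration_time = server.tokens_to_migrate * recompute_cost_per_token_s
    return load_time + migration_time

def pick_server(servers, ckpt_size_gb: float) -> Server:
    return min(servers, key=lambda s: startup_latency(s, ckpt_size_gb))

servers = [
    Server("node-a", checkpoint_tier_bw_gbps=12.0, tokens_to_migrate=2048),  # checkpoint in host DRAM, busy
    Server("node-b", checkpoint_tier_bw_gbps=3.0, tokens_to_migrate=0),      # checkpoint on local SSD, idle
]
print(pick_server(servers, ckpt_size_gb=26.0).name)   # node-a: fast load outweighs the migration cost
```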
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [Paper]
- Seoul National University
- InfiniGen: a KV cache management framework for long-text generation.
- Key insight: a few important tokens can be identified speculatively by performing a minimal rehearsal with the inputs of the current layer and a part of the query weight and key cache of the subsequent layer.
- Prefetch only the essential KV cache entries instead of fetching them all, which mitigates the fetch overhead from host memory (a toy sketch follows below).
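A toy NumPy sketch of the rehearsal idea, with assumed shapes and a plain top-k selection; InfiniGen's actual mechanism (e.g., how the partial weights are chosen) is more elaborate.

```python
# Illustrative only: rehearse next-layer attention with a slice of the query
# weights and key cache, then prefetch only the KV entries of the top-scoring tokens.
import numpy as np

def speculate_important_tokens(x, w_q_partial, k_cache_partial, top_k):
    q = x @ w_q_partial                        # (1, d_partial) approximate query
    scores = (q @ k_cache_partial.T).ravel()   # approximate attention logits per cached token
    return np.argsort(scores)[-top_k:]         # indices of tokens worth prefetching

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))                    # current-layer input (hidden dim 64)
w_q_partial = rng.standard_normal((64, 16))         # part of the next layer's query weight
k_cache_partial = rng.standard_normal((1024, 16))   # part of the key cache (1024 cached tokens)

prefetch_idx = speculate_important_tokens(x, w_q_partial, k_cache_partial, top_k=128)
# Only these ~128 KV cache entries are fetched from host memory for the next layer.
print(prefetch_idx.shape)
```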
- Llumnix: Dynamic Scheduling for Large Language Model Serving [Paper] [Code]
- Alibaba
- Reschedule requests to improve load-balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.
- Live migration of requests and their in-memory state (tokens).
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Paper] [Code]
- PKU & UCSD
- Disaggregate the prefill and decoding computation.
- Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving [Paper]
- PKU & Shanghai AI Lab
- A credit-based batching algorithm to decide when to merge and unmerge LoRA adapters with the base model (a simplified sketch appears after this entry).
- A request-adapter co-migration algorithm to decide when to migrate between different worker replicas.
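A highly simplified, hypothetical sketch of a credit-based merge/unmerge decision; the thresholds, the credit decay, and the dominance test below are assumptions, not dLoRA's actual algorithm.

```python
# Illustrative idea: merged inference is cheapest when one adapter dominates traffic,
# while unmerged inference allows batching requests of different adapters together.
class MergeController:
    def __init__(self, merge_threshold=32, unmerge_threshold=4):
        self.credits = {}                 # adapter_id -> accumulated credit
        self.merged_adapter = None        # adapter currently merged into the base model, if any
        self.merge_threshold = merge_threshold
        self.unmerge_threshold = unmerge_threshold

    def observe_batch(self, adapter_counts):
        """adapter_counts: {adapter_id: number of requests in the current batch}."""
        for aid, n in adapter_counts.items():
            self.credits[aid] = self.credits.get(aid, 0) + n
        for aid in list(self.credits):            # decay credits of adapters absent from this batch
            if aid not in adapter_counts:
                self.credits[aid] //= 2

    def decide(self):
        if not self.credits:
            return self.merged_adapter
        total = sum(self.credits.values())
        top, top_credit = max(self.credits.items(), key=lambda kv: kv[1])
        dominant = top_credit >= self.merge_threshold and top_credit > 0.8 * total
        if self.merged_adapter is None and dominant:
            self.merged_adapter = top             # merge: serve this adapter fused with the base model
        elif self.merged_adapter is not None and self.credits.get(self.merged_adapter, 0) < self.unmerge_threshold:
            self.merged_adapter = None            # unmerge: batch across adapters again
        return self.merged_adapter

ctrl = MergeController()
ctrl.observe_batch({"adapter-a": 40})             # traffic dominated by one adapter
print(ctrl.decide())                              # adapter-a: merge it
for _ in range(5):
    ctrl.observe_batch({"adapter-b": 8, "adapter-c": 8})   # traffic spreads out
print(ctrl.decide())                              # None: unmerge and batch across adapters
```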
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable [Paper] [Code]
- SJTU & MSRA
- Semantic Variable: a unified abstraction to expose application-level knowledge to public LLM services.
- Annotate an input/output variable in the prompt of a request.
- Create a data pipeline when multiple LLM requests are connected.
- Allow conventional data-flow analysis to uncover correlations across multiple LLM requests.
- Implemented in Python (a conceptual sketch follows below).
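A conceptual sketch of the Semantic Variable idea; this is not Parrot's actual API, only an illustration of how exposing prompt variables as first-class objects reveals the data flow between requests.

```python
# Illustrative only: prompt placeholders become objects, so the serving system can
# build the request DAG and run data-flow analysis instead of seeing opaque strings.
class SemanticVariable:
    def __init__(self, name, value=None):
        self.name, self.value = name, value
        self.consumers = []                  # requests whose prompts reference this variable

class LLMRequest:
    def __init__(self, template, inputs, output):
        self.template = template             # e.g. "Summarize {article} into {summary}"
        self.inputs = inputs                 # input SemanticVariables
        self.output = output                 # output SemanticVariable (filled by the LLM)
        for var in inputs:
            var.consumers.append(self)

article = SemanticVariable("article", value="...long document...")
summary = SemanticVariable("summary")
review = SemanticVariable("review")

r1 = LLMRequest("Summarize {article} into {summary}", [article], summary)
r2 = LLMRequest("Write a review based on {summary}: {review}", [summary], review)

# Data-flow analysis: r2 depends on r1 because it consumes r1's output, so the
# two requests form a pipeline the service can co-schedule (e.g., stream r1 into r2).
assert r1.output in r2.inputs
print([c.template for c in summary.consumers])
```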
- Fairness in Serving Large Language Models [Paper] [Code]
- UC Berkeley
- This is the first work to discuss the fair serving of LLMs.
- Propose a fair-serving algorithm called Virtual Token Counter (VTC).
- Track the service received by each client.
- Prioritize the clients that have received the least service.
- Only manipulate the dispatch order; never reject a request if it can fit in the batch (a minimal sketch follows below).
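A minimal sketch of a Virtual Token Counter-style dispatcher; it charges a single undifferentiated token count per request, whereas the paper weights input and output tokens and handles counter catch-up more carefully.

```python
# Illustrative only: serve the least-served client first, charge its counter by the
# tokens it consumes, and never reject a request that still fits in the batch budget.
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class Request:
    num_tokens: int

class VTCScheduler:
    def __init__(self):
        self.counters = defaultdict(float)   # client -> virtual tokens received so far
        self.queues = defaultdict(deque)     # client -> pending requests

    def submit(self, client, request):
        self.queues[client].append(request)

    def dispatch(self, batch_budget_tokens):
        batch = []
        while batch_budget_tokens > 0:
            candidates = [c for c, q in self.queues.items() if q]
            if not candidates:
                break
            client = min(candidates, key=lambda c: self.counters[c])   # least service first
            req = self.queues[client][0]
            if req.num_tokens > batch_budget_tokens:
                break
            self.queues[client].popleft()
            batch.append((client, req))
            batch_budget_tokens -= req.num_tokens
            self.counters[client] += req.num_tokens                    # charge the service received
        return batch

sched = VTCScheduler()
sched.submit("heavy-client", Request(100))
sched.submit("heavy-client", Request(100))
sched.submit("light-client", Request(100))
print(sched.dispatch(batch_budget_tokens=200))   # one request from each client, not two from heavy-client
```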
- Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences [Paper]
- Meta Platforms
- Main challenges for a resource-allocation framework.
- Usability: how to translate real-life policies into precise mathematical formulas.
- Scalability: the underlying problems are NP-hard and cannot be solved efficiently by commercial solvers.
- Rebalancer: Meta's resource-allocation framework.
- An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
- A high-level specification language to lower the barrier for adoption by system practitioners (for usability).
- When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling [Paper] [Code]
- Tufts
- PCS: Predictability-Centric Scheduling
- Use Weighted Fair Queueing (WFQ) and find a suitable configuration of the WFQ parameters (e.g., queue weights).
- Use a simulation-aided search strategy to discover suitable WFQ configurations (a minimal WFQ sketch follows below).
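A minimal sketch of the WFQ building block whose parameters PCS searches over; the queue weights and the simplified virtual-time bookkeeping are illustrative.

```python
# Illustrative weighted fair queueing by virtual finish tags: a job's tag grows with
# its service demand divided by its queue's weight, and smaller tags are served first.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    finish_tag: float
    name: str = field(compare=False)
    service: float = field(compare=False)      # e.g., estimated GPU-hours

class WFQ:
    def __init__(self, weights):
        self.weights = weights                  # queue name -> weight (the knob PCS tunes)
        self.virtual_start = {q: 0.0 for q in weights}
        self.heap = []

    def enqueue(self, queue, name, service):
        start = self.virtual_start[queue]
        finish = start + service / self.weights[queue]
        self.virtual_start[queue] = finish
        heapq.heappush(self.heap, Job(finish, name, service))

    def dequeue(self):
        return heapq.heappop(self.heap) if self.heap else None

wfq = WFQ({"research": 1.0, "production": 4.0})
wfq.enqueue("research", "job-r1", service=8.0)
wfq.enqueue("production", "job-p1", service=8.0)
print(wfq.dequeue().name)   # job-p1: the higher-weight queue gets the earlier finish tag
```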
- MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale [Paper]
- Meta Platforms
- MAST: ML Application Scheduler on Twine
- Provide a global-scheduling abstraction to all ML training workloads.
- Three design principles: temporal decoupling, scope decoupling, and exhaustive search.
- nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training [Paper] [Code]
- USTC & MSRA & xAI & BaseBit Technologies
- Empower domain experts to construct their own parallelization search space through three primitives: `op-trans`, `op-assign`, and `op-order`.
- Allow constraints to be applied to those primitives during space construction (a schematic sketch follows below).
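A hypothetical illustration of the roles of the three primitives; the function names mirror the paper's terminology, but the API, data structures, and constraint are assumptions.

```python
# Illustrative only:
#   op_trans  - how an operator is partitioned into sub-operators
#   op_assign - which device each sub-operator runs on
#   op_order  - the relative execution order of sub-operators
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubOp:
    op: str
    shard: int
    device: Optional[int] = None
    order: Optional[int] = None

def op_trans(op: str, num_shards: int):
    """Partition one operator into sub-operators (e.g., split a matmul by rows)."""
    return [SubOp(op, s) for s in range(num_shards)]

def op_assign(subops, placement):
    """Bind each sub-operator to a device."""
    for sub, dev in zip(subops, placement):
        sub.device = dev
    return subops

def op_order(subops, order):
    """Fix the relative execution order of the sub-operators."""
    for sub, pos in zip(subops, order):
        sub.order = pos
    return subops

# A constraint a domain expert might impose while constructing the space:
# keep all shards of this operator on the same device, shrinking the search space.
plan = op_order(op_assign(op_trans("matmul", 2), placement=[0, 0]), order=[0, 1])
assert len({s.device for s in plan}) == 1
print(plan)
```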
- Usher: Holistic Interference Avoidance for Resource Optimized ML Inference [Paper] [Code]
- UVA & GaTech
- Usher: an interference-aware ML serving system to maximize resource utilization (GPU spatial multiplexing).
- A GPU kernel-based estimator of each model's resource requirements.
- A heuristic, interference-aware scheduler that maximizes resource utilization by deciding batch sizes, model replication degrees, and model placement.
- An operator graph merger that merges multiple models' operator graphs to minimize interference in the GPU cache (a toy placement sketch follows below).
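A toy, hypothetical greedy placement heuristic in the spirit of Usher's scheduler; the demand numbers and the quadratic interference score are assumptions, not from the paper.

```python
# Illustrative only: pack model instances onto GPUs without exceeding compute or
# memory-bandwidth capacity, preferring placements where demands do not collide.
from dataclasses import dataclass, field

@dataclass
class GPU:
    name: str
    compute_capacity: float = 1.0
    mem_bw_capacity: float = 1.0
    compute_used: float = 0.0
    mem_bw_used: float = 0.0
    models: list = field(default_factory=list)

@dataclass
class ModelInstance:
    name: str
    compute_demand: float     # fraction of GPU compute needed at the chosen batch size
    mem_bw_demand: float      # fraction of memory bandwidth needed

def interference_score(gpu: GPU, m: ModelInstance) -> float:
    """Lower is better: discourage stacking models that stress the same resource."""
    return (gpu.compute_used + m.compute_demand) ** 2 + (gpu.mem_bw_used + m.mem_bw_demand) ** 2

def place(models, gpus):
    for m in sorted(models, key=lambda m: -(m.compute_demand + m.mem_bw_demand)):
        feasible = [g for g in gpus
                    if g.compute_used + m.compute_demand <= g.compute_capacity
                    and g.mem_bw_used + m.mem_bw_demand <= g.mem_bw_capacity]
        if not feasible:
            raise RuntimeError(f"no GPU can host {m.name}")
        best = min(feasible, key=lambda g: interference_score(g, m))
        best.compute_used += m.compute_demand
        best.mem_bw_used += m.mem_bw_demand
        best.models.append(m.name)
    return gpus

gpus = place([ModelInstance("resnet", 0.6, 0.2),
              ModelInstance("bert", 0.3, 0.5),
              ModelInstance("gpt-small", 0.5, 0.4)],
             [GPU("gpu0"), GPU("gpu1")])
print([(g.name, g.models) for g in gpus])
```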
- Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning [Paper] [Code]
- USTC & Huawei & ByteDance & Hunan University
- Tensor Language Model (TLM)
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation [Paper] [Code]
- MSRA
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures [Paper] [Code]
- Sydney & Alibaba
- The code is currently not available.
- ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications [Paper] [Code]
- UChicago & ECNU & MSR
- Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents [Paper] [Code]
- Stanford & Princeton & Sapienza University of Rome & UMich
- Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel [Paper]
- Huawei Central Software Institute & SJTU
- HongMeng kernel (HM)
- Managing Memory Tiers with CXL in Virtualized Environments [Paper]
- Columbia & Microsoft Azure & UW & Carl Waldspurger Consulting & Intel & UW-Madison & UMich
- Beaver: Practical Partial Snapshots for Distributed Cloud Services [Paper] [Code]
- UPenn & SJTU & Princeton & Microsoft & UW
- High-throughput and Flexible Host Networking for Accelerated Computing [Paper] [Code]
- Stanford & Cornell & Enfabrica
- ACCL+: an FPGA-Based Collective Engine for Distributed Applications [Paper]
- ETH & Amsterdam & AMD
- Burstable Cloud Block Storage with Data Processing Units [Paper]
- PKU & Alibaba Cloud
- Anvil: Verifying Liveness of Cluster Management Controllers [Paper] [Code]
- UIUC & UW-Madison & VMware Research & Feldera
- Best Paper Award
- Notes from SJTU IPADS (in Chinese)
- OSDI 2024 Paper Reviews, Day 1 Session 1: Memory Management - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 1 Session 2: Low-Latency LLM Serving - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 1 Session 3: Distributed Systems - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 2 Session 4: Deep Learning - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 2 Session 5: Operating Systems - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 2 Session 6: Cloud Computing - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 2 Session 7: Formal Verification - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 3 Session 8: Cloud Security - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 3 Session 9: Data Management - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 3 Session 10: Analysis of Correctness - by IPADS-SYS on Zhihu
- OSDI 2024 Paper Reviews, Day 3 Session 11: ML Scheduling - by IPADS-SYS on Zhihu