Homepage: https://sigops.org/s/conferences/sosp/2024/
- LLM Training
- Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- PKU
- Perseus: Removing Energy Bloat from Large Model Training [arXiv]
- UMich
    - Use a graph cut-based algorithm to obtain the iteration-time-energy Pareto frontier; then schedule energy consumption across time.
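      The Pareto-frontier idea can be illustrated with a toy sketch (data points are made up; Perseus derives candidate configurations with a graph cut-based algorithm over the training computation graph, not by enumeration as here):

      ```python
      # Toy sketch: keep only Pareto-optimal (iteration_time, energy) configs.
      # Hypothetical data; illustrative of the frontier concept only.
      def pareto_frontier(points):
          """Keep points not dominated in both iteration time and energy."""
          frontier = []
          for t, e in sorted(points):  # sort by time, then energy
              # a point survives only if it beats every kept point on energy
              if all(e < fe for _, fe in frontier):
                  frontier.append((t, e))
          return frontier

      configs = [(1.0, 900), (1.1, 700), (1.2, 750), (1.3, 600), (1.4, 650)]
      print(pareto_frontier(configs))  # [(1.0, 900), (1.1, 700), (1.3, 600)]
      ```

      Given such a frontier, a scheduler can trade a small increase in iteration time for a large energy saving.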
- LLM Inference
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism [arXiv]
- PKU
- ESP: Elastic Sequence Parallelism
    - Elastically adjust the degree of parallelism in real time; reduce key-value cache migration overhead and overlap partial decoding communication with computation; reduce key-value cache fragmentation across instances.
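      A minimal sketch of the elastic idea: pick the sequence-parallelism degree per request from its context length, so short requests leave GPUs free for others. Thresholds and names here are hypothetical; LoongServe additionally handles KV cache migration and communication overlap, which this ignores:

      ```python
      # Illustrative only: choose the smallest power-of-two sequence-parallelism
      # degree whose per-GPU KV cache shard fits a (made-up) token budget.
      def sp_degree(context_len, max_degree=8, tokens_per_gpu=32_000):
          degree = 1
          while degree < max_degree and context_len / degree > tokens_per_gpu:
              degree *= 2
          return degree

      print(sp_degree(8_000))    # 1  (short request: one GPU suffices)
      print(sp_degree(100_000))  # 4  (long context spread over 4 GPUs)
      print(sp_degree(500_000))  # 8  (capped at max_degree)
      ```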
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [arXiv]
- SJTU IPADS
- Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation
- GaTech & Princeton
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving [arXiv]
- Princeton & GaTech
    - Automatically apply and manage early exits (letting certain inputs return results at intermediate layers) in ML models.
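      The early-exit mechanism can be sketched as a confidence check after an intermediate head; names and the threshold are illustrative, and Apparate's contribution is adapting exit placement and thresholds at runtime, which this omits:

      ```python
      import math

      def softmax(logits):
          exps = [math.exp(x) for x in logits]
          s = sum(exps)
          return [x / s for x in exps]

      # Sketch: return early when the intermediate head is confident enough,
      # otherwise fall through to the full model. Heads are toy stand-ins.
      def infer(x, early_head, final_head, threshold=0.9):
          probs = softmax(early_head(x))
          if max(probs) >= threshold:       # "easy" input: exit early
              return probs.index(max(probs)), "early"
          probs = softmax(final_head(x))    # "hard" input: run full model
          return probs.index(max(probs)), "full"

      easy = lambda x: [5.0, 0.0]   # high-margin logits -> confident
      hard = lambda x: [0.1, 0.0]   # low-margin logits -> not confident
      print(infer(None, easy, easy))  # (0, 'early')
      print(infer(None, hard, easy))  # (0, 'full')
      ```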
- SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures [arXiv]
- Stanford
    - Dynamically re-route the work of a failed server to data-parallel peers; execute it within bubbles of the original pipeline schedule.
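      A toy sketch of the re-routing step: greedily assign the failed replica's microbatches to free bubble slots on data-parallel peers. This is only the assignment idea; SlipStream plans the placement against the actual pipeline schedule:

      ```python
      # Illustrative greedy fill of bubble capacity; names are hypothetical.
      def reroute(failed_microbatches, bubbles):
          """bubbles: {peer: free_slots}; returns {peer: [microbatch ids]}."""
          assignment = {peer: [] for peer in bubbles}
          queue = list(failed_microbatches)
          for peer, slots in bubbles.items():
              for _ in range(slots):
                  if not queue:
                      break
                  assignment[peer].append(queue.pop(0))
          if queue:
              raise RuntimeError("not enough bubble capacity for rerouted work")
          return assignment

      print(reroute([0, 1, 2, 3], {"peer0": 2, "peer1": 3}))
      # {'peer0': [0, 1], 'peer1': [2, 3]}
      ```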
- Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections [arXiv]
- ICL
- Tenplex — a state management library.
- Enable jobs to change the parallelism dynamically.
- PTC: Parallelizable Tensor Collection
- Dataset state
      - Model state
    - Execute PTC transformations in parallel with minimal data movement between workers.
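      The minimal-data-movement goal can be sketched for a 1-D sharded tensor: when the worker count changes, each new worker fetches only the slice ranges it is missing from the old layout. This is a simplification; Tenplex's PTC abstraction also covers dataset state and multi-dimensional layouts:

      ```python
      # Sketch of re-sharding a 1-D tensor when parallelism changes.
      def shard_ranges(length, workers):
          """Even 1-D partition: list of (start, end) per worker."""
          base, rem = divmod(length, workers)
          ranges, start = [], 0
          for w in range(workers):
              end = start + base + (1 if w < rem else 0)
              ranges.append((start, end))
              start = end
          return ranges

      def migration_plan(length, old_workers, new_workers):
          """For each new worker: (old_worker, start, end) slices to fetch."""
          old = shard_ranges(length, old_workers)
          plan = {}
          for nw, (ns, ne) in enumerate(shard_ranges(length, new_workers)):
              for ow, (os_, oe) in enumerate(old):
                  lo, hi = max(ns, os_), min(ne, oe)
                  if lo < hi:  # overlap between new shard and old shard
                      plan.setdefault(nw, []).append((ow, lo, hi))
          return plan

      # Scale a 12-element tensor from 3 workers to 4:
      print(migration_plan(12, 3, 4))
      # {0: [(0, 0, 3)], 1: [(0, 3, 4), (1, 4, 6)],
      #  2: [(1, 6, 8), (2, 8, 9)], 3: [(2, 9, 12)]}
      ```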
- Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor [arXiv]
- UIUC & MSRA
    - T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory of inter-core connected AI chips (e.g., the Graphcore IPU).
- SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference
- IISc
- Dirigent: Lightweight Serverless Orchestration [arXiv]
- ETH
    - Simplify the state management of existing orchestration systems (e.g., Kubernetes); eliminate persistent state updates; run monolithic control and data planes to minimize internal communication overheads.