Themes
As before, we categorized our roadmap into 6 broad themes: broad model support, wide hardware coverage, state-of-the-art performance optimization, a production-level engine, a strong OSS community, and an extensible architecture.
Broad Model Support
- Enhance LLM Support
- Hybrid/Interleaved Attention ([Feature]: Alternating local-global attention layers #9464)
- Enhance Multi-Modality in vLLM ([RFC]: Multi-modality Support on vLLM #4194)
- Enhance Support for State Space Models (Mamba)
- Reward Model API ([RFC]: Reward Modelling in OpenAI Compatible Server #8967)
- Arbitrary HF model (a collaboration with Hugging Face!)
- Whisper
Help wanted:
- Expand coverage for encoder-decoder models (BERT, XLM-RoBERTa, BGE, T5)
- API for streaming input (in particular for audio)
Hardware Support
- A feature matrix for all the hardware that vLLM supports, and their maturity level
- Expanding feature support across hardware backends
- Fast PagedAttention and Chunked Prefill on Inferentia
- Upstreaming of Intel Gaudi support
- Enhancements in TPU Support
- Upstream enhancements in AMD MI300x
- Performance enhancement and measurement for NVIDIA H200
- New accelerator support: IBM Spyre
Help wanted:
- Design for pluggable, out-of-tree hardware backend similar to PyTorch’s PrivateUse API
- Prototype JAX support
Performance Optimizations
- Turn on chunked prefill, prefix caching, and speculative decoding by default (see the configuration sketch after this list)
- Optimizations for structured outputs
- Fused GEMM/all-reduce leveraging Flux and AsyncTP
- Enhancements and overhead removal for offline LLM use cases
- Better kernels (FA3, FlashInfer, FlexAttention, Triton)
- Native integration with torch.compile
Help wanted:
- A fast n-gram speculator
- Sparse KV cache framework ([RFC]: Support sparse KV cache framework #5751)
- Long context optimizations: context parallelism, etc.
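To make the default-on item above concrete, here is a minimal offline-inference sketch that opts into chunked prefill and prefix caching explicitly. The `enable_prefix_caching` and `enable_chunked_prefill` keyword arguments exist in current vLLM releases; the model name is only an example, and flag names or defaults may change as this roadmap item lands.

```python
from vllm import LLM, SamplingParams

# Explicitly enable the features this roadmap item proposes turning on by default.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model only
    enable_prefix_caching=True,   # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,  # split long prefills into schedulable chunks
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize the vLLM roadmap in one sentence."], params)
print(outputs[0].outputs[0].text)
```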
Production Features
- KV cache offload to CPU and disk
- Disaggregated Prefill
- More control in prefix caching, and scheduler policies
- Automated speculative decoding policy, see Dynamic Speculative Decoding (a manual-configuration sketch follows this list)
Help wanted:
- Support multiple models in the same server
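For context on the automated speculative decoding item, the sketch below shows today's manual configuration, where a small draft model and a fixed proposal length are chosen up front; an automated policy would tune or replace these choices at runtime. The kwargs follow vLLM's existing speculative decoding interface, and the model names are examples only.

```python
from vllm import LLM, SamplingParams

# Manual speculative decoding: a small draft model proposes tokens that the
# target model verifies. An automated policy would decide when to speculate
# and how many tokens to propose instead of this fixed setup.
llm = LLM(
    model="facebook/opt-6.7b",               # target model (example)
    speculative_model="facebook/opt-125m",   # draft model (example)
    num_speculative_tokens=5,                # fixed proposal length today
    # Note: some vLLM versions also require use_v2_block_manager=True here.
)

outputs = llm.generate(["The future of AI is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```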
OSS Community
- Enhancements to the performance benchmarks: more realistic workloads, more hardware backends (H200s)
- Better developer documentation for getting started with contributions and research
Help wanted:
- Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc)
Extensible Architecture
- Full support for torch.compile (see the plain-PyTorch sketch after this list)
- vLLM Engine V2: Asynchronous Scheduling and Prefix Caching Centric Design (vLLM's V1 Engine Architecture #8779)
- A generic memory manager supporting multi-modality, sparsity, and others
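As background for the torch.compile items above, here is a plain-PyTorch sketch of what compilation provides: graph capture and kernel fusion of a module's forward pass. This is not vLLM's integration, which applies compilation to the engine's model execution path rather than a toy module.

```python
import torch

class MLP(torch.nn.Module):
    """Toy module standing in for a model forward pass."""
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = torch.nn.Linear(512, 2048)
        self.fc2 = torch.nn.Linear(2048, 512)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

model = MLP()
compiled = torch.compile(model)    # capture the forward graph and fuse kernels
y = compiled(torch.randn(4, 512))  # first call triggers compilation
print(y.shape)
```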
If an item you want is not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.