Themes
As before, we categorized our roadmap into 6 broad themes: broad model support, wide hardware coverage, state-of-the-art performance optimization, a production-level engine, a strong OSS community, and an extensible architecture.
Broad Model Support
- Enhance LLM Support
- Hybrid/Interleaved Attention ([Feature]: Alternating local-global attention layers #9464)
- Enhance Multi-Modality in vLLM ([RFC]: Multi-modality Support on vLLM #4194)
- Enhance Support for State Space Models (Mamba)
- Reward Model API ([RFC]: Reward Modelling in OpenAI Compatible Server #8967)
- Arbitrary HF model (a collaboration with Hugging Face!)
- Whisper
Help wanted:
- Expand coverage for encoder-decoder models (BERT, XLM-RoBERTa, BGE, T5)
- API for streaming input (in particular for audio)
Hardware Support
- A feature matrix for all the hardware that vLLM supports, and their maturity level
- Expanding feature support across hardware backends
- Fast PagedAttention and Chunked Prefill on Inferentia
- Upstreaming of Intel Gaudi support
- Enhancements in TPU Support
- Upstream enhancements in AMD MI300x
- Performance enhancement and measurement for NVIDIA H200
- New accelerator support: IBM Spyre
Help wanted:
- Design for pluggable, out-of-tree hardware backend similar to PyTorch’s PrivateUse API
- Prototype JAX support
Performance Optimizations
- Turn on chunked prefill, prefix caching, and speculative decoding by default (see the configuration sketch after this list)
- Optimizations for structured outputs
- Fused GEMM/all-reduce leveraging Flux and AsyncTP
- Enhancements and overhead removal for offline LLM use cases
- Better kernels (FA3, FlashInfer, FlexAttention, Triton)
- Native integration with torch.compile
Help wanted:
- A fast n-gram speculator
- Sparse KV cache framework ([RFC]: Support sparse KV cache framework #5751)
- Long context optimizations: context parallelism, etc.
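To make the default-on item above concrete, here is a minimal offline-inference sketch that opts into chunked prefill and prefix caching explicitly. The `enable_prefix_caching` and `enable_chunked_prefill` keyword arguments exist in current vLLM releases; the model name is only an example, and flag names or defaults may change as this roadmap item lands.

```python
from vllm import LLM, SamplingParams

# Explicitly enable the features this roadmap item proposes turning on by default.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model only
    enable_prefix_caching=True,   # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,  # split long prefills into schedulable chunks
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize the vLLM roadmap in one sentence."], params)
print(outputs[0].outputs[0].text)
```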
Production Features
- KV cache offload to CPU and disk
- Disaggregated Prefill
- More control in prefix caching, and scheduler policies
- Automated speculative decoding policy, see Dynamic Speculative Decoding (a manual-configuration sketch follows this list)
Help wanted:
- Support multiple models in the same server
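For context on the automated speculative decoding item, the sketch below shows today's manual configuration, where a small draft model and a fixed proposal length are chosen up front; an automated policy would tune or replace these choices at runtime. The kwargs follow vLLM's existing speculative decoding interface, and the model names are examples only.

```python
from vllm import LLM, SamplingParams

# Manual speculative decoding: a small draft model proposes tokens that the
# target model verifies. An automated policy would decide when to speculate
# and how many tokens to propose instead of this fixed setup.
llm = LLM(
    model="facebook/opt-6.7b",               # target model (example)
    speculative_model="facebook/opt-125m",   # draft model (example)
    num_speculative_tokens=5,                # fixed proposal length today
    # Note: some vLLM versions also require use_v2_block_manager=True here.
)

outputs = llm.generate(["The future of AI is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```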
OSS Community
- Enhancements to the performance benchmarks: more realistic workloads, more hardware backends (H200s)
- Better developer documentation for getting started with contributions and research
Help wanted:
- Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc)
Extensible Architecture
- Full support for torch.compile (see the plain-PyTorch sketch after this list)
- vLLM Engine V2: Asynchronous Scheduling and Prefix Caching Centric Design (vLLM's V1 Engine Architecture #8779)
- A generic memory manager supporting multi-modality, sparsity, and others
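As background for the torch.compile items above, here is a plain-PyTorch sketch of what compilation provides: graph capture and kernel fusion of a module's forward pass. This is not vLLM's integration, which applies compilation to the engine's model execution path rather than a toy module.

```python
import torch

class MLP(torch.nn.Module):
    """Toy module standing in for a model forward pass."""
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = torch.nn.Linear(512, 2048)
        self.fc2 = torch.nn.Linear(2048, 512)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

model = MLP()
compiled = torch.compile(model)    # capture the forward graph and fuse kernels
y = compiled(torch.randn(4, 512))  # first call triggers compilation
print(y.shape)
```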
If an item you want is not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.