Description
This page is accessible via roadmap.vllm.ai
This is a living document! For each item here, we intend to link the RFC as well as the discussion channel in the vLLM Slack.
vLLM Core
These projects will deliver performance enhancements to the majority of workloads running on vLLM, and the core team has assigned priorities to signal what must get done. Help is also wanted here, especially from people who want to get more involved in the core of vLLM.
Ship a performant and modular V1 architecture (#8779, #sig-v1)
- (P0) Optimized default path that is on by default
- (P0) Speculative decoding (n-gram on by default; see the sketch after this list)
- (P0) Efficient memory manager for different shapes of KV cache ([RFC]: Hybrid Memory Allocator #11382)
- (P1) Efficient structured decoding & Jump decoding in V1 ([RFC]: Implement Structured Output support for V1 engine #11908)
- (P1) Full multi-modal support in V1 (encoder-decoder models not supported).
- (P1) Pipeline parallelism
- (P1) LoRA ([V1] LoRA Support #10957)
- (P2) Hardware support: AMD first by Q1, TPU prototype.
- (P2) Extension system: design ready.
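For the speculative decoding item above, here is a minimal, hedged sketch of n-gram (prompt-lookup) drafting using the current engine arguments; the V1 on-by-default path may expose this differently, and the model name is a placeholder.

```python
# Hedged sketch, not the V1 default path: n-gram (prompt-lookup) speculative
# decoding via the existing engine arguments. Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",      # draft tokens by looking them up in the prompt
    num_speculative_tokens=5,         # draft tokens proposed per step
    ngram_prompt_lookup_max=4,        # longest n-gram to match against the prompt
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```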
Support large and long context models
- (P0) Expert Parallelism for MoE
- (P1) Productionize Prefill Disaggregation
- (P1) Productionize KV cache offloading to CPU and disk (see the sketch after this list)
- (P1) Explore Data Parallel for Attention
- (Help Wanted) Investigate context parallelism
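As a point of reference for the KV cache offloading item, the closest mechanism in today's engine is the CPU swap space used when requests are preempted; the roadmap work is about generalizing and productionizing offload to CPU and disk. A minimal sketch, with a placeholder model:

```python
# Hedged sketch: today's CPU swap space, which lets preempted requests' KV
# blocks be swapped to host memory instead of being recomputed.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    max_model_len=32768,   # long contexts increase KV cache pressure
    swap_space=8,          # GiB of host memory per GPU reserved for swapped KV blocks
)
```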
Improved performance in batch mode
- (P0) Optimized vLLM in post training workflow (#sig-post-training)
- (P2) Efficiency in batch inference and long generations (see the sketch below)
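For the batch inference item, a minimal sketch of offline batch generation with the existing LLM.generate API; the model and prompts are placeholders, and the engine schedules all prompts internally with continuous batching.

```python
# Minimal offline batch-inference sketch; model and prompts are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = [f"Summarize document {i}:" for i in range(1024)]
params = SamplingParams(temperature=0.0, max_tokens=256)

# generate() runs all prompts through the continuous-batching scheduler.
for out in llm.generate(prompts, params):
    print(out.request_id, out.outputs[0].text[:80])
```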
Others
- (P0) Blackwell Support
- (P1) Track vLLM Performance
- (Help Wanted) Extensible sampler
Model Support
- Arbitrary HF model ([Model]: Add transformers backend support #11330); see the sketch after this list
- Alternative or private checkpoint format
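For the arbitrary HF model item, a hedged sketch of falling back to Transformers modeling code as proposed in #11330; the model_impl argument and its values are assumptions based on that PR, and the checkpoint id is a placeholder.

```python
# Hedged sketch of the Transformers fallback backend from #11330.
# model_impl and its accepted values are assumptions; check the current docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-custom-hf-model",  # placeholder checkpoint
    model_impl="transformers",              # use HF Transformers modeling code
    trust_remote_code=True,
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```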
Hardware Support
- PagedAttention and Chunked Prefill on Trainium and Inferentia
- Productionize and support large-scale deployment of vLLM on TPU
- Progress in Gaudi Support
- Out-of-tree support for IBM Spyre and Ascend ([RFC]: Hardware pluggable #11162); see the plugin sketch below
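For the out-of-tree hardware item, a hedged sketch of how a platform plugin might register itself under the mechanism proposed in RFC #11162; the entry-point group name (vllm.platform_plugins), package, SDK, and class names here are assumptions for illustration.

```python
# Hedged sketch of an out-of-tree platform plugin (RFC #11162).
# All names below (package, SDK, class) are hypothetical placeholders.
from typing import Optional


def register() -> Optional[str]:
    """Entry point exposed under the (assumed) 'vllm.platform_plugins' group.

    Returns the fully qualified name of the Platform subclass vLLM should
    load for this accelerator, or None if the hardware is not present.
    """
    try:
        import my_accelerator_sdk  # noqa: F401  (hypothetical vendor SDK)
    except ImportError:
        return None
    return "vllm_my_accelerator.platform.MyAcceleratorPlatform"
```

The plugin package would point vLLM at this function from its packaging metadata, so hardware support can live outside the core engine tree.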
Optimizations
- FlashAttention 3 (Flash Attention 3 (FA3) Support #12429)
- AsyncTP
- Design for sparse KV cache framework
CI and Developer Productivity
- Wheel server
- Multi-platform wheels and docker
- Better performance tracker
- Easier installation (optional dependencies, separate kernel packages)
Ecosystem Projects
These are independent projects that we would love to collaborate and integrate with natively!
- Distributed batch inference
- Large scale serving
- Production Stack ([Roadmap] vLLM production stack roadmap for 2025 Q1 production-stack#26)
- Multi-modality output
- Collaboration with HuggingFace
- Collaboration with Ollama
If an item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.