Description
This page is accessible via roadmap.vllm.ai
This is a living document! For each item here, we intend to link the corresponding RFC as well as the discussion channel in the vLLM Slack.
Core Themes
Path to vLLM v1.0.0
We want to fully remove the V0 engine and clean up the codebase by removing unpopular and unsupported features. The v1.0.0 release of vLLM will be performant and easy to maintain, as well as modular and extensible, while preserving backward compatibility.
- V1 core feature set
- Hybrid memory allocators
- Jump decoding
- Redesigned native support for pipeline parallelism
- Redesigned spec decode (a conceptual sketch follows this list)
- Redesigned sampler with modularity support
- Close the feature gaps and fully remove V0
- Attention backends
- Pooling models
- Mamba/Hybrid models
- (TBD) Encoder and encoder-decoder models
- Hardware support
- Performance
- Further lower scheduler overhead
- Further enhance LoRA performance
- API Server Scale-out
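The spec decode item above refers to draft-and-verify speculative decoding: a cheap draft model proposes several tokens, and the large target model checks them so that multiple tokens can be emitted per target step. Below is a minimal toy sketch of the idea only; the stand-in callables, the greedy acceptance rule, and the per-position verification loop are all simplifications and not the V1 design.

```python
# Toy sketch of draft-and-verify speculative decoding. The "models" are stand-in
# callables mapping a token context to the next token; this illustrates the idea
# only and is not vLLM's V1 implementation.

def propose(draft_model, prompt, k):
    """Let the cheap draft model guess the next k tokens autoregressively."""
    tokens = list(prompt)
    for _ in range(k):
        tokens.append(draft_model(tokens))
    return tokens[len(prompt):]

def verify(target_model, prompt, draft_tokens):
    """Keep the longest drafted prefix the target model agrees with, plus one
    token of the target's own. A real engine scores all drafted positions in a
    single batched forward pass; the per-position loop here is only for readability."""
    accepted, context = [], list(prompt)
    for tok in draft_tokens:
        expected = target_model(context)
        if expected != tok:
            return accepted + [expected]  # first disagreement: take the target's token
        accepted.append(tok)
        context.append(tok)
    return accepted + [target_model(context)]  # everything accepted: bonus token

draft = lambda ctx: (sum(ctx) + 1) % 7   # toy draft model
target = lambda ctx: (sum(ctx) + 1) % 5  # toy target model
drafted = propose(draft, prompt=[1, 2, 3], k=4)
print(verify(target, [1, 2, 3], drafted))
```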
Cluster Scale Serving
As models grow in size, serving them with multi-node scale-out and disaggregated prefill and decode becomes the way to go. We are fully committed to making vLLM the best engine for cluster-scale serving; a conceptual sketch of the prefill/decode split follows the list below.
- Data Parallelism
- Single node DP
- API Server and Engine decoupling (any to any communication)
- Expert Parallelism
- DeepEP and pplx integrations
- Transition from fused_moe to CUTLASS-based grouped GEMM.
- Online Reconfiguration (e.g. EPLB)
- Online reconfiguration
- Zero-overhead expert movement
- Prefill Decode Disaggregation
- 1P1D in V1: both symmetric TP/PP and asymmetric TP/PP
- XPYD
- Data Parallel Compatibility
- NIXL integration
- Overhead Reduction & Performance Enhancements
- KV Cache Storage
- Offload KV cache to CPU
- Offload KV cache to disk
- Integration with Mooncake and LMCache
- DeepSeek Specific Enhancements
- MLA enhancements: TP, FlashAttention, FlashInfer, Blackwell Kernels.
- MTP enhancements: V1 support, further overhead reduction.
- Others
- Investigate communication and compute pipelining
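For readers new to disaggregation, the sketch below shows the basic 1P1D flow in deliberately simplified Python. Every name in it (`PrefillWorker`, `DecodeWorker`, the placeholder KV dict) is hypothetical; in the actual design the KV blocks travel through a connector such as NIXL, Mooncake, or LMCache, and the two sides may run with different (symmetric or asymmetric) TP/PP layouts.

```python
# Conceptual 1P1D sketch: one prefill instance computes the prompt's KV cache and
# hands it off; one decode instance continues generation from that cache. All
# names and data structures here are hypothetical placeholders.

class PrefillWorker:
    def prefill(self, prompt_tokens):
        # Run one forward pass over the whole prompt and capture per-layer KV blocks.
        kv_cache = {f"layer_{i}": f"kv-blocks-for-layer-{i}" for i in range(4)}
        first_token = 42  # token sampled at the last prompt position (placeholder)
        return kv_cache, first_token

class DecodeWorker:
    def decode(self, kv_cache, first_token, max_new_tokens):
        # Generate one token at a time, reusing the transferred KV cache instead of
        # re-running prefill locally.
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            tokens.append(tokens[-1] + 1)  # placeholder for a real decode step
        return tokens

# XPYD generalizes this to X prefill and Y decode instances behind a router.
prefill, decode = PrefillWorker(), DecodeWorker()
kv, first = prefill.prefill(prompt_tokens=[1, 2, 3])
print(decode.decode(kv, first, max_new_tokens=4))
```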
vLLM for Production
vLLM is designed for production. We will continue to enhance stability and tune the systems around vLLM for optimal performance.
- Testing:
- Comprehensive performance suite
- Enhance accuracy testing coverage
- Large-scale deployment + testing
- Stress and longevity testing
- Offer tuned recipes and analysis for different model and hardware combinations.
- Multi-platform wheels and containers for production use cases.
Features
Models
- Scaling Omni Modality
- Long Context
- Stable OOT model registration interface (a registration sketch follows this list)
- Attention Sparsity: support sparse attention mechanisms for new models.
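The OOT registration item above is about keeping the external-model plugin interface stable. As a small sketch of how out-of-tree registration looks today: the architecture name and module path below are made up, and the registered string must match the `architectures` field in the model's `config.json`. Such a `register()` function is typically exposed through a vLLM plugin entry point so that it runs automatically on startup.

```python
# Sketch of out-of-tree (OOT) model registration. "MyCustomForCausalLM" and the
# module path are made-up placeholders. Registering with a "module:Class" string
# keeps the import lazy, so the plugin does not pull in the model (or CUDA) eagerly.
from vllm import ModelRegistry

def register():
    ModelRegistry.register_model(
        "MyCustomForCausalLM",
        "my_package.my_model:MyCustomForCausalLM",
    )
```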
Use Cases
- Enhance testing and performance for RLHF workflows (a rollout sketch follows this list)
- Add data parallel routing for large-scale batch inference
- Investigate batch-size invariance and training/inference equivalence.
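As background for the RLHF item above: the common pattern is to use vLLM as the rollout engine inside the training loop and periodically push updated policy weights back into it. Below is a hedged sketch using the stable `LLM`/`SamplingParams` API; the model name is a placeholder, and the reward/update/weight-sync steps are left as comments because the exact mechanism differs per framework (veRL, OpenRLHF, TRL, ...).

```python
# Sketch of an RLHF rollout step with vLLM as the generation engine. The model
# name is a placeholder; how updated policy weights are pushed back into vLLM
# (NCCL broadcast, checkpoint reload, ...) depends on the training framework.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder policy model
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=128)

prompts = ["Summarize the plot of Hamlet.", "Explain KV caching in one sentence."]
rollouts = llm.generate(prompts, params)

for out in rollouts:
    completion = out.outputs[0].text
    # 1. score `completion` with the reward model
    # 2. run the policy-gradient update in the trainer
    # 3. sync the updated weights into `llm` before the next rollout batch
    print(completion[:60])
```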
Hardware
- Stable Plugin Architecture for hardware platforms
- Blackwell Enhancements
- Full production readiness for AMD, TPU, and Neuron.
Optimizations
- EAGLE3
- FP4 enhancements
- FlexAttention (see the standalone example after this list)
- Investigate: fbgemm, torchao, cuTile
- …
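FlexAttention above refers to PyTorch's `torch.nn.attention.flex_attention`, which lets attention variants be written as small mask/score functions and compiled into fused kernels. Below is a standalone example outside vLLM; the shapes, the causal mask, and the CUDA device are arbitrary choices for the demo.

```python
# Standalone FlexAttention example (PyTorch >= 2.5), unrelated to vLLM internals:
# a causal mask expressed as a mask_mod function. Assumes a CUDA device; wrapping
# flex_attention with torch.compile selects the fused-kernel path rather than the
# slower eager reference path.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal(b, h, q_idx, kv_idx):
    # Queries may only attend to positions at or before themselves.
    return q_idx >= kv_idx

B, H, S, D = 1, 8, 1024, 64  # arbitrary batch / heads / sequence / head-dim
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

block_mask = create_block_mask(causal, B, H, S, S, device="cuda")
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```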
Community
- Blogs
- Case Studies
- Website
- Onboarding tasks and new contributors training program
vLLM Ecosystem
- Hardware Plugins
  - vllm-ascend: vLLM Ascend Roadmap Q2 2025 vllm-ascend#448
- AIBrix: v0.3.0 roadmap aibrix#698
- Production Stack: [Roadmap] vLLM Production Stack roadmap for 2025 Q2 production-stack#300
- Ray LLM: [llm] Roadmap for Data and Serve LLM APIs ray-project/ray#51313
- LLM Compressor
- GuideLLM
- Dynamo
- Prioritized Support for RLHF Systems: veRL, OpenRLHF, TRL, OpenInstruct, Fairseq2, ...
If an item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.
Historical Roadmap: #11862, #9006, #5805, #3861, #2681, #244