Description
This page is accessible via roadmap.vllm.ai
This is a living document! For each item here, we intend to link the corresponding RFC as well as the discussion channel in the vLLM Slack.
Core Themes
Path to vLLM v1.0.0
We want to fully remove the V0 engine and clean up the codebase by removing unpopular and unsupported features. The v1.0.0 release of vLLM will be performant and easy to maintain, as well as modular and extensible, while preserving backward compatibility.
- V1 core feature set
- Hybrid memory allocators
- Jump decoding
- Redesigned native support for pipeline parallelism
- Redesigned spec decode (a conceptual sketch follows this list)
- Redesigned sampler with modularity support
- Close the feature gaps and fully remove V0
- Attention backends
- Pooling models
- Mamba/Hybrid models
- (TBD) Encoder and encoder-decoder models
- Hardware support
- Performance
- Further lower scheduler overhead
- Further enhance LoRA performance
- API Server Scale-out
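The spec decode item above refers to draft-and-verify speculative decoding: a cheap draft model proposes several tokens, and the large target model checks them so that multiple tokens can be emitted per target step. Below is a minimal toy sketch of the idea only; the stand-in callables, the greedy acceptance rule, and the per-position verification loop are all simplifications and not the V1 design.

```python
# Toy sketch of draft-and-verify speculative decoding. The "models" are stand-in
# callables mapping a token context to the next token; this illustrates the idea
# only and is not vLLM's V1 implementation.

def propose(draft_model, prompt, k):
    """Let the cheap draft model guess the next k tokens autoregressively."""
    tokens = list(prompt)
    for _ in range(k):
        tokens.append(draft_model(tokens))
    return tokens[len(prompt):]

def verify(target_model, prompt, draft_tokens):
    """Keep the longest drafted prefix the target model agrees with, plus one
    token of the target's own. A real engine scores all drafted positions in a
    single batched forward pass; the per-position loop here is only for readability."""
    accepted, context = [], list(prompt)
    for tok in draft_tokens:
        expected = target_model(context)
        if expected != tok:
            return accepted + [expected]  # first disagreement: take the target's token
        accepted.append(tok)
        context.append(tok)
    return accepted + [target_model(context)]  # everything accepted: bonus token

draft = lambda ctx: (sum(ctx) + 1) % 7   # toy draft model
target = lambda ctx: (sum(ctx) + 1) % 5  # toy target model
drafted = propose(draft, prompt=[1, 2, 3], k=4)
print(verify(target, [1, 2, 3], drafted))
```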
Cluster Scale Serving
As models grow in size, serving them with multi-node scale-out and disaggregated prefill and decode becomes the way to go. We are fully committed to making vLLM the best engine for cluster-scale serving; a conceptual sketch of the prefill/decode split follows the list below.
- Data Parallelism
- Single node DP
- API Server and Engine decoupling (any to any communication)
- Expert Parallelism
- DeepEP and pplx integrations
- Transition from fused_moe to CUTLASS-based grouped GEMM.
- Online Reconfiguration (e.g. EPLB)
- Online reconfiguration
- Zero-overhead expert movement
- Prefill Decode Disaggregation
- 1P1D in V1: both symmetric TP/PP and asymmetric TP/PP
- XPYD
- Data Parallel Compatibility
- NIXL integration
- Overhead Reduction & Performance Enhancements
- KV Cache Storage
- Offload KV cache to CPU
- Offload KV cache to disk
- Integration with Mooncake and LMCache
- DeepSeek Specific Enhancements
- MLA enhancements: TP, FlashAttention, FlashInfer, Blackwell Kernels.
- MTP enhancements: V1 support, further overhead reduction.
- Others
- Investigate communication and compute pipelining
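For readers new to disaggregation, the sketch below shows the basic 1P1D flow in deliberately simplified Python. Every name in it (`PrefillWorker`, `DecodeWorker`, the placeholder KV dict) is hypothetical; in the actual design the KV blocks travel through a connector such as NIXL, Mooncake, or LMCache, and the two sides may run with different (symmetric or asymmetric) TP/PP layouts.

```python
# Conceptual 1P1D sketch: one prefill instance computes the prompt's KV cache and
# hands it off; one decode instance continues generation from that cache. All
# names and data structures here are hypothetical placeholders.

class PrefillWorker:
    def prefill(self, prompt_tokens):
        # Run one forward pass over the whole prompt and capture per-layer KV blocks.
        kv_cache = {f"layer_{i}": f"kv-blocks-for-layer-{i}" for i in range(4)}
        first_token = 42  # token sampled at the last prompt position (placeholder)
        return kv_cache, first_token

class DecodeWorker:
    def decode(self, kv_cache, first_token, max_new_tokens):
        # Generate one token at a time, reusing the transferred KV cache instead of
        # re-running prefill locally.
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            tokens.append(tokens[-1] + 1)  # placeholder for a real decode step
        return tokens

# XPYD generalizes this to X prefill and Y decode instances behind a router.
prefill, decode = PrefillWorker(), DecodeWorker()
kv, first = prefill.prefill(prompt_tokens=[1, 2, 3])
print(decode.decode(kv, first, max_new_tokens=4))
```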
vLLM for Production
vLLM is designed for production. We will continue to enhance stability and tune the systems around vLLM for optimal performance.
- Testing:
- Comprehensive performance suite
- Enhance accuracy testing coverage
- Large-scale deployment + testing
- Stress and longevity testing
- Offer tuned recipes and analysis for different model and hardware combinations.
- Multi-platform wheels and containers for production use cases.
Features
Models
- Scaling Omni Modality
- Long Context
- Stable OOT model registration interface (a registration sketch follows this list)
- Attention Sparsity: support sparse attention mechanisms for new models.
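The OOT registration item above is about keeping the external-model plugin interface stable. As a small sketch of how out-of-tree registration looks today: the architecture name and module path below are made up, and the registered string must match the `architectures` field in the model's `config.json`. Such a `register()` function is typically exposed through a vLLM plugin entry point so that it runs automatically on startup.

```python
# Sketch of out-of-tree (OOT) model registration. "MyCustomForCausalLM" and the
# module path are made-up placeholders. Registering with a "module:Class" string
# keeps the import lazy, so the plugin does not pull in the model (or CUDA) eagerly.
from vllm import ModelRegistry

def register():
    ModelRegistry.register_model(
        "MyCustomForCausalLM",
        "my_package.my_model:MyCustomForCausalLM",
    )
```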
Use Cases
- Enhance testing and performance for RLHF workflows (a rollout sketch follows this list)
- Add data parallel routing for large-scale batch inference
- Investigate batch-size invariance and training/inference equivalence.
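As background for the RLHF item above: the common pattern is to use vLLM as the rollout engine inside the training loop and periodically push updated policy weights back into it. Below is a hedged sketch using the stable `LLM`/`SamplingParams` API; the model name is a placeholder, and the reward/update/weight-sync steps are left as comments because the exact mechanism differs per framework (veRL, OpenRLHF, TRL, ...).

```python
# Sketch of an RLHF rollout step with vLLM as the generation engine. The model
# name is a placeholder; how updated policy weights are pushed back into vLLM
# (NCCL broadcast, checkpoint reload, ...) depends on the training framework.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder policy model
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=128)

prompts = ["Summarize the plot of Hamlet.", "Explain KV caching in one sentence."]
rollouts = llm.generate(prompts, params)

for out in rollouts:
    completion = out.outputs[0].text
    # 1. score `completion` with the reward model
    # 2. run the policy-gradient update in the trainer
    # 3. sync the updated weights into `llm` before the next rollout batch
    print(completion[:60])
```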
Hardware
- Stable Plugin Architecture for hardware platforms
- Blackwell Enhancements
- Full production readiness for AMD, TPU, and Neuron.
Optimizations
- EAGLE3
- FP4 enhancements
- FlexAttention (see the standalone example after this list)
- Investigate: fbgemm, torchao, cuTile
- …
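FlexAttention above refers to PyTorch's `torch.nn.attention.flex_attention`, which lets attention variants be written as small mask/score functions and compiled into fused kernels. Below is a standalone example outside vLLM; the shapes, the causal mask, and the CUDA device are arbitrary choices for the demo.

```python
# Standalone FlexAttention example (PyTorch >= 2.5), unrelated to vLLM internals:
# a causal mask expressed as a mask_mod function. Assumes a CUDA device; wrapping
# flex_attention with torch.compile selects the fused-kernel path rather than the
# slower eager reference path.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal(b, h, q_idx, kv_idx):
    # Queries may only attend to positions at or before themselves.
    return q_idx >= kv_idx

B, H, S, D = 1, 8, 1024, 64  # arbitrary batch / heads / sequence / head-dim
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

block_mask = create_block_mask(causal, B, H, S, S, device="cuda")
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```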
Community
- Blogs
- Case Studies
- Website
- Onboarding tasks and new contributors training program
vLLM Ecosystem
- Hardware Plugins
  - vllm-ascend: vLLM Ascend Roadmap Q2 2025 vllm-ascend#448
- AIBrix: v0.3.0 roadmap aibrix#698
- Production Stack: [Roadmap] vLLM Production Stack roadmap for 2025 Q2 production-stack#300
- Ray LLM: [llm] Roadmap for Data and Serve LLM APIs ray-project/ray#51313
- LLM Compressor
- GuideLLM
- Dynamo
- Prioritized Support for RLHF Systems: veRL, OpenRLHF, TRL, OpenInstruct, Fairseq2, ...
If an item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.
Historical Roadmap: #11862, #9006, #5805, #3861, #2681, #244