Description
This page is accessible via roadmap.vllm.ai
This is a living document! For each item here, we intend to link the RFC as well as the discussion channel in the vLLM Slack.
vLLM Core
These projects will deliver performance enhancements to the majority of workloads running on vLLM, and the core team has assigned priorities to signal what must get done. Help is also wanted here, especially from people who want to get more involved in the core of vLLM.
Ship a performant and modular V1 architecture (#8779, #sig-v1)
- (P0) Optimized default path that is on by default
- (P0) Speculative decoding (n-gram on by default; see the sketch after this list)
- (P0) Efficient memory manager for different shapes of KV cache ([RFC]: Hybrid Memory Allocator #11382)
- (P1) Efficient structured decoding & Jump decoding in V1 ([RFC]: Implement Structured Output support for V1 engine #11908)
- (P1) Full multi-modal support in V1 (encoder-decoder models not supported).
- (P1) Pipeline parallelism
- (P1) LoRA ([V1] LoRA Support #10957)
- (P2) Hardware support: AMD first by Q1, TPU prototype.
- (P2) Extension system: design ready.
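For the speculative decoding item above, here is a minimal, hedged sketch of n-gram (prompt-lookup) drafting using the current engine arguments; the V1 on-by-default path may expose this differently, and the model name is a placeholder.

```python
# Hedged sketch, not the V1 default path: n-gram (prompt-lookup) speculative
# decoding via the existing engine arguments. Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",      # draft tokens by looking them up in the prompt
    num_speculative_tokens=5,         # draft tokens proposed per step
    ngram_prompt_lookup_max=4,        # longest n-gram to match against the prompt
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```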
Support large and long context models
- (P0) Expert Parallelism for MoE
- (P1) Productionize Prefill Disaggregation
- (P1) Productionize KV cache offloading to CPU and disk (see the sketch after this list)
- (P1) Explore Data Parallel for Attention
- (Help Wanted) Investigate context parallelism
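As a point of reference for the KV cache offloading item, the closest mechanism in today's engine is the CPU swap space used when requests are preempted; the roadmap work is about generalizing and productionizing offload to CPU and disk. A minimal sketch, with a placeholder model:

```python
# Hedged sketch: today's CPU swap space, which lets preempted requests' KV
# blocks be swapped to host memory instead of being recomputed.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    max_model_len=32768,   # long contexts increase KV cache pressure
    swap_space=8,          # GiB of host memory per GPU reserved for swapped KV blocks
)
```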
Improved performance in batch mode
- (P0) Optimized vLLM in post training workflow (#sig-post-training)
- (P2) Efficiency in batch inference and long generations (see the sketch below)
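For the batch inference item, a minimal sketch of offline batch generation with the existing LLM.generate API; the model and prompts are placeholders, and the engine schedules all prompts internally with continuous batching.

```python
# Minimal offline batch-inference sketch; model and prompts are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = [f"Summarize document {i}:" for i in range(1024)]
params = SamplingParams(temperature=0.0, max_tokens=256)

# generate() runs all prompts through the continuous-batching scheduler.
for out in llm.generate(prompts, params):
    print(out.request_id, out.outputs[0].text[:80])
```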
Others
- (P0) Blackwell Support
- (P1) Track vLLM Performance
- (Help Wanted) Extensible sampler
Model Support
- Arbitrary HF model ([Model]: Add transformers backend support #11330); see the sketch after this list
- Alternative or private checkpoint format
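For the arbitrary HF model item, a hedged sketch of falling back to Transformers modeling code as proposed in #11330; the model_impl argument and its values are assumptions based on that PR, and the checkpoint id is a placeholder.

```python
# Hedged sketch of the Transformers fallback backend from #11330.
# model_impl and its accepted values are assumptions; check the current docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-custom-hf-model",  # placeholder checkpoint
    model_impl="transformers",              # use HF Transformers modeling code
    trust_remote_code=True,
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```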
Hardware Support
- PagedAttention and Chunked Prefill on Trainium and Inferentia
- Productionize and support large-scale deployment of vLLM on TPU
- Progress in Gaudi Support
- Out-of-tree support for IBM Spyre and Ascend ([RFC]: Hardware pluggable #11162); see the plugin sketch below
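For the out-of-tree hardware item, a hedged sketch of how a platform plugin might register itself under the mechanism proposed in RFC #11162; the entry-point group name (vllm.platform_plugins), package, SDK, and class names here are assumptions for illustration.

```python
# Hedged sketch of an out-of-tree platform plugin (RFC #11162).
# All names below (package, SDK, class) are hypothetical placeholders.
from typing import Optional


def register() -> Optional[str]:
    """Entry point exposed under the (assumed) 'vllm.platform_plugins' group.

    Returns the fully qualified name of the Platform subclass vLLM should
    load for this accelerator, or None if the hardware is not present.
    """
    try:
        import my_accelerator_sdk  # noqa: F401  (hypothetical vendor SDK)
    except ImportError:
        return None
    return "vllm_my_accelerator.platform.MyAcceleratorPlatform"
```

The plugin package would point vLLM at this function from its packaging metadata, so hardware support can live outside the core engine tree.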
Optimizations
- FlashAttention 3 (Flash Attention 3 (FA3) Support #12429)
- AsyncTP
- Design for sparse KV cache framework
CI and Developer Productivity
- Wheel server
- Multi-platform wheels and docker
- Better performance tracker
- Easier installation (optional dependencies, separate kernel packages)
Ecosystem Projects
These are independent projects that we would love to collaborate and integrate with natively!
- Distributed batch inference
- Large scale serving
- Production Stack ([Roadmap] vLLM production stack roadmap for 2025 Q1 production-stack#26)
- Multi-modality output
- Collaboration with HuggingFace
- Collaboration with Ollama
If an item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.