This document lists the features on vLLM's roadmap for Q2 2024. Please feel free to discuss and contribute to specific features in the related RFCs/issues/PRs, and add anything else you'd like to talk about in this issue.
You can find our historical roadmaps at #2681 and #244. This roadmap contains work committed by the vLLM team from UC Berkeley as well as the broader vLLM contributor groups, including but not limited to Anyscale, IBM, NeuralMagic, Roblox, and Oracle Cloud. You can also find help-wanted items in this roadmap! Additionally, this roadmap is shaped by you, our user community!
Themes
We categorized our roadmap into 6 broad themes:
- Broad model support: vLLM should support a wide range of transformer-based models and be kept up to date with new releases as much as possible. This includes new auto-regressive decoder models, encoder-decoder models, hybrid architectures, and models supporting multi-modal inputs.
- Excellent hardware coverage: vLLM should run on a wide range of accelerators for production AI workloads, including GPUs, tensor accelerators, and CPUs. We will work closely with hardware vendors to ensure vLLM gets the best performance out of each chip.
- Performance optimization: vLLM should keep up with the latest performance optimization techniques, so users can trust its performance to be competitive and strong.
- Production-level engine: vLLM should be the go-to choice for a production-level serving engine, with a suite of features bridging the gap from a single forward pass to a 24/7 service.
- Strong OSS product: vLLM is and will be a true community project. We want it to be a healthy project with a regular release cadence, good documentation, and a growing base of reviewers for the codebase.
- Extensible architectures: For vLLM to grow at an even faster pace, it needs good abstractions to support a wide range of scheduling policies, hardware backends, and inference optimizations. We will work on refactoring the codebase to support that.
Broad Model Support
- Encoder Decoder Models
- Hybrid Architecture (Jamba) [New Model]: Jamba (MoE Mamba from AI21) #3690
- Decoder-only embedding models ([Model][Misc] Add e5-mistral-7b-instruct and Embedding API #3734)
- Prefix tuning support
Help Wanted:
- More vision transformers beyond LLaVA
- Support private model registration (How to serve a private HF model? #172)
- Control vector support ([Feature]: Control vectors #3451)
- Fallback support for arbitrary `transformers` text generation models
- Long context investigation of LongRoPE
- RWKV
Excellent Hardware Coverage
- AMD MI300x: enhancing FP8 performance (enable FP8 compute)
- NVIDIA H100: enhancing FP8 performance
- AWS Trainium and Inferentia
- Google TPU
- Intel CPU
- Intel GPU
- Intel Gaudi
Performance Optimization
- Speculative decoding
- Speculative decoding framework for top-1 proposals with a draft model (see the conceptual sketch after this list)
- Proposer improvement: Prompt-lookup n-gram speculations
- Scoring improvement: Make batch expansion optional
- Scoring improvement: dynamic scoring length policy
- Kernels:
- FlashInfer integration (Import FlashInfer: 3x faster PagedAttention than vLLM #2767)
- Sampler optimizations leveraging the Triton compiler
- Quantization:
- FP8 format support for NVIDIA Ammo and AMD Quantizer
- Weight only quantization (Marlin) improvements: act_order, int8, Exllama2 compatibility, fused MoE, AWQ kernels.
- Activation quantization (W8A8, FP8, etc)
- Quantized LoRA support (Add Support for QLORA/QA-QLORA weights which are not merged #3225)
- AQLM quantization
- Constrained decoding performance (batch, async, acceleration) and extensibility (Outlines: [Feature]: Update Outlines Integration from `FSM` to `Guide` #3715; LMFormatEnforcer: [Feature]: Integrate with lm-format-enforcer #3713; AICI: AI Controller Interface (AICI) integration #2888)
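For context, here is a minimal, self-contained sketch of the propose-then-verify loop behind top-1 draft-model speculative decoding. It is a conceptual illustration only (toy stand-in models, greedy acceptance), not vLLM's implementation:

```python
# Illustrative sketch of top-1 draft-model speculative decoding
# (propose-then-verify with greedy acceptance). NOT vLLM code; the toy
# "models" below are stand-ins for real language models.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # greedy next-token function

def speculative_step(target: Model, draft: Model,
                     prefix: List[Token], k: int) -> List[Token]:
    """Propose k tokens with the draft model, then verify with the target."""
    # 1) Draft proposes k tokens autoregressively (cheap model).
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Target verifies each proposed token; a real engine batches these
    #    target evaluations into a single forward pass.
    accepted: List[Token] = []
    ctx = list(prefix)
    for t in proposal:
        expected = target(ctx)
        if expected != t:            # first mismatch: take target's token, stop
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3) All proposals accepted: one bonus token from the target for free.
    accepted.append(target(ctx))
    return accepted

if __name__ == "__main__":
    # Toy deterministic "models": both count upward, but the draft is wrong
    # after multiples of 3, so some proposals get rejected.
    target = lambda ctx: ctx[-1] + 1
    draft = lambda ctx: ctx[-1] + (2 if ctx[-1] % 3 == 0 else 1)
    seq = [0]
    while len(seq) < 12:
        seq.extend(speculative_step(target, draft, seq, k=4))
    print(seq)  # still the target's exact output: 0, 1, 2, 3, ...
```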
Help Wanted:
- Sparse KV cache (H2O, compression, FastDecode)
- Speculative decoding
- Proposer/scoring/verifier improvement: Top-k “tree attention” proposals for Eagle/Medusa/Draft model
- Proposer improvement: RAG n-gram speculations
- Proposer improvement: Eagle/Medusa top-1 proposals
- Proposer improvement: Quantized draft models
- Verifier improvement: Typical acceptance
Production Level Engine
- Scheduling
- Prototype Disaggregated prefill (How to use Splitwise(from microsoft) in vllm? #2370)
- Speculative decoding fully merged in ([WIP] Speculative decoding using a draft model #2188)
- Turn chunked prefill/sarathi/splitfuse on by default ([2/N] Chunked prefill data update #3538)
- Memory management
- Automatic prefix caching enhancement
- TGI feature parity (stop string handling, logging and metrics, test improvements)
- Provide a non-Ray option for single-node inference
- Optimize API server performance
- OpenAI server feature completeness, e.g. function calling (OpenAI Tools / function calling v2 #3237); see the request sketch after this list
- Model Loading
- Optimize model weight loading by loading directly from hub/S3 ([Feature]: Add model loading using CoreWeave's `tensorizer` #3533)
- Fully offline mode
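To make the function-calling item concrete, below is a hypothetical client-side request using the standard OpenAI Python client pointed at a local vLLM OpenAI-compatible server. The tool definition and model name are placeholders; whether the server fully handles the tool-call flow is exactly what the item above tracks:

```python
# Hypothetical request shape for OpenAI-style function calling against a
# local vLLM OpenAI-compatible server (tool and model are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # illustrative tool only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",     # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message)  # expect a tool_calls entry once supported
```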
Help Wanted:
- Logging serving FLOPs for performance analysis
- Dynamic LoRA adapter downloads from hub/S3
Strong OSS Product
- Continuous benchmarks (resource needed!)
- Commit to a two-week release cadence
- Growing reviewer and committer base
- Better docs
- doc: memory and performance tuning guide
- doc: automatic prefix caching documentation
- doc: hardware support levels, feature matrix, and policies
- doc: guide to horizontally scale up vLLM service
- doc: developer guide for adding new draft based models or draft-less optimizations
- Automatic CD of nightly wheels and docker images
Help Wanted:
- ARM aarch64 support for AWS Graviton-based instances and GH200
- Full correctness tests against HuggingFace transformers (resources needed)
- Well-tested support for `lm-eval-harness` (logprobs, get tokenizers)
- Local development workflow without CUDA
Extensible Architecture
- Prototype pipeline parallelism
- Extensible memory manager
- Extensible scheduler
- `torch.compile` investigations (see the sketch after this list)
- use compile for quantization kernel fusion
- use compile for future-proofing graph mode
- use compile for XPU or other accelerators
- Architecture for queue management and request prioritization
- StreamingLLM: prototype it on the new block manager
- Investigate Tensor + Pipeline parallelism (LIGER)
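As a rough illustration of the `torch.compile` item above, here is a generic sketch (plain PyTorch, not vLLM code) of the kind of dequantize-then-matmul fusion the compiler could perform:

```python
# Generic illustration of torch.compile-driven kernel fusion (not vLLM code):
# a weight-only "dequantize then matmul" written in eager PyTorch, compiled
# so the dequantization and the GEMM epilogue become candidates for fusion.
import torch

def dequant_linear(x: torch.Tensor,
                   q_weight: torch.Tensor,   # int8 weights
                   scale: torch.Tensor) -> torch.Tensor:
    w = q_weight.to(x.dtype) * scale          # dequantize (fusion candidate)
    return torch.relu(x @ w.t())              # matmul + activation epilogue

compiled = torch.compile(dequant_linear)

x = torch.randn(4, 16)
q_w = torch.randint(-128, 127, (8, 16), dtype=torch.int8)
scale = torch.full((8, 1), 0.05)

out_eager = dequant_linear(x, q_w, scale)
out_compiled = compiled(x, q_w, scale)
print(torch.allclose(out_eager, out_compiled, atol=1e-5))  # same result
```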