This document lists the features on vLLM's roadmap for Q2 2024. Please feel free to discuss and contribute to specific features in the related RFCs/issues/PRs, and add anything else you'd like to talk about in this issue.
You can find our historical roadmaps at #2681 and #244. This roadmap contains work committed by the vLLM team from UC Berkeley as well as the broader vLLM contributor groups, including but not limited to Anyscale, IBM, NeuralMagic, Roblox, and Oracle Cloud. You can also find help-wanted items in this roadmap! Additionally, this roadmap is shaped by you, our user community!
Themes
We categorized our roadmap into 6 broad themes:
- Broad model support: vLLM should support a wide range of transformer-based models and be kept up to date with new releases as much as possible. This includes new auto-regressive decoder models, encoder-decoder models, hybrid architectures, and models supporting multi-modal inputs.
- Excellent hardware coverage: vLLM should run on a wide range of accelerators for production AI workloads, including GPUs, tensor accelerators, and CPUs. We will work closely with hardware vendors to ensure vLLM gets the best performance out of each chip.
- Performance optimization: vLLM should keep up with the latest performance optimization techniques, so users can trust its performance to be competitive and strong.
- Production-level engine: vLLM should be the go-to choice for a production-level serving engine, with a suite of features bridging the gap from a single forward pass to a 24/7 service.
- Strong OSS product: vLLM is and will be a true community project. We want it to be a healthy project with a regular release cadence, good documentation, and a growing base of reviewers for the codebase.
- Extensible architectures: For vLLM to grow at an even faster pace, it needs good abstractions to support a wide range of scheduling policies, hardware backends, and inference optimizations. We will work on refactoring the codebase to support that.
Broad Model Support
- Encoder Decoder Models
- Hybrid Architecture (Jamba) [New Model]: Jamba (MoE Mamba from AI21) #3690
- Decoder-only embedding models ([Model][Misc] Add e5-mistral-7b-instruct and Embedding API #3734)
- Prefix tuning support
Help Wanted:
- More vision transformers beyond LLaVA
- Support private model registration (How to serve a private HF model? #172)
- Control vector support ([Feature]: Control vectors #3451)
- Fallback support for arbitrary `transformers` text generation models
- Long context investigation of LongRoPE
- RWKV
Excellent Hardware Coverage
- AMD MI300x: enhancing FP8 performance (enable FP8 compute)
- NVIDIA H100: enhancing FP8 performance
- AWS Trainium and Inferentia
- Google TPU
- Intel CPU
- Intel GPU
- Intel Gaudi
Performance Optimization
- Speculative decoding
- Speculative decoding framework for top-1 proposals with a draft model (see the conceptual sketch after this list)
- Proposer improvement: Prompt-lookup n-gram speculations
- Scoring improvement: Make batch expansion optional
- Scoring improvement: dynamic scoring length policy
- Kernels:
- FlashInfer integration (Import FlashInfer: 3x faster PagedAttention than vLLM #2767)
- Sampler optimizations leveraging the Triton compiler
- Quantization:
- FP8 format support for NVIDIA Ammo and AMD Quantizer
- Weight only quantization (Marlin) improvements: act_order, int8, Exllama2 compatibility, fused MoE, AWQ kernels.
- Activation quantization (W8A8, FP8, etc)
- Quantized LoRA support (Add Support for QLORA/QA-QLORA weights which are not merged #3225)
- AQLM quantization
- Constrained decoding performance (batch, async, acceleration) and extensibility (Outlines: [Feature]: Update Outlines Integration from `FSM` to `Guide` #3715; LMFormatEnforcer: [Feature]: Integrate with lm-format-enforcer #3713; AICI: AI Controller Interface (AICI) integration #2888)
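For context, here is a minimal, self-contained sketch of the propose-then-verify loop behind top-1 draft-model speculative decoding. It is a conceptual illustration only (toy stand-in models, greedy acceptance), not vLLM's implementation:

```python
# Illustrative sketch of top-1 draft-model speculative decoding
# (propose-then-verify with greedy acceptance). NOT vLLM code; the toy
# "models" below are stand-ins for real language models.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # greedy next-token function

def speculative_step(target: Model, draft: Model,
                     prefix: List[Token], k: int) -> List[Token]:
    """Propose k tokens with the draft model, then verify with the target."""
    # 1) Draft proposes k tokens autoregressively (cheap model).
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Target verifies each proposed token; a real engine batches these
    #    target evaluations into a single forward pass.
    accepted: List[Token] = []
    ctx = list(prefix)
    for t in proposal:
        expected = target(ctx)
        if expected != t:            # first mismatch: take target's token, stop
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3) All proposals accepted: one bonus token from the target for free.
    accepted.append(target(ctx))
    return accepted

if __name__ == "__main__":
    # Toy deterministic "models": both count upward, but the draft is wrong
    # after multiples of 3, so some proposals get rejected.
    target = lambda ctx: ctx[-1] + 1
    draft = lambda ctx: ctx[-1] + (2 if ctx[-1] % 3 == 0 else 1)
    seq = [0]
    while len(seq) < 12:
        seq.extend(speculative_step(target, draft, seq, k=4))
    print(seq)  # still the target's exact output: 0, 1, 2, 3, ...
```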
Help Wanted:
- Sparse KV cache (H2O, compression, FastDecode)
- Speculative decoding
- Proposer/scoring/verifier improvement: Top-k “tree attention” proposals for Eagle/Medusa/Draft model
- Proposer improvement: RAG n-gram speculations
- Proposer improvement: Eagle/Medusa top-1 proposals
- Proposer improvement: Quantized draft models
- Verifier improvement: Typical acceptance
Production Level Engine
- Scheduling
- Prototype Disaggregated prefill (How to use Splitwise(from microsoft) in vllm? #2370)
- Speculative decoding fully merged in ([WIP] Speculative decoding using a draft model #2188)
- Turn chunked prefill/sarathi/splitfuse on by default ([2/N] Chunked prefill data update #3538)
- Memory management
- Automatic prefix caching enhancement
- TGI feature parity (stop string handling, logging and metrics, test improvements)
- Provide a non-Ray option for single-node inference
- Optimize API server performance
- OpenAI server feature completeness, e.g. function calling (OpenAI Tools / function calling v2 #3237); see the request sketch after this list
- Model Loading
- Optimize model weight loading by loading directly from hub/S3 ([Feature]: Add model loading using CoreWeave's `tensorizer` #3533)
- Fully offline mode
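To make the function-calling item concrete, below is a hypothetical client-side request using the standard OpenAI Python client pointed at a local vLLM OpenAI-compatible server. The tool definition and model name are placeholders; whether the server fully handles the tool-call flow is exactly what the item above tracks:

```python
# Hypothetical request shape for OpenAI-style function calling against a
# local vLLM OpenAI-compatible server (tool and model are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # illustrative tool only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",     # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message)  # expect a tool_calls entry once supported
```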
Help Wanted:
- Logging serving FLOPs for performance analysis
- Dynamic LoRA adapter downloads from hub/S3
Strong OSS Product
- Continuous benchmarks (resource needed!)
- Commit to a two-week release cadence
- Growing reviewer and committer base
- Better docs
- doc: memory and performance tuning guide
- doc: automatic prefix caching documentation
- doc: hardware support levels, feature matrix, and policies
- doc: guide to horizontally scale up vLLM service
- doc: developer guide for adding new draft based models or draft-less optimizations
- Automatic CD of nightly wheels and docker images
Help Wanted:
- ARM aarch64 support for AWS Graviton-based instances and GH200
- Full correctness tests against HuggingFace transformers (resources needed)
- Well-tested support for `lm-eval-harness` (logprobs, get tokenizers)
- Local development workflow without CUDA
Extensible Architecture
- Prototype pipeline parallelism
- Extensible memory manager
- Extensible scheduler
- `torch.compile` investigations (see the sketch after this list)
- use compile for quantization kernel fusion
- use compile for future-proofing graph mode
- use compile for XPU or other accelerators
- Architecture for queue management and request prioritization
- StreamingLLM: prototype it on the new block manager
- Investigate Tensor + Pipeline parallelism (LIGER)
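As a rough illustration of the `torch.compile` item above, here is a generic sketch (plain PyTorch, not vLLM code) of the kind of dequantize-then-matmul fusion the compiler could perform:

```python
# Generic illustration of torch.compile-driven kernel fusion (not vLLM code):
# a weight-only "dequantize then matmul" written in eager PyTorch, compiled
# so the dequantization and the GEMM epilogue become candidates for fusion.
import torch

def dequant_linear(x: torch.Tensor,
                   q_weight: torch.Tensor,   # int8 weights
                   scale: torch.Tensor) -> torch.Tensor:
    w = q_weight.to(x.dtype) * scale          # dequantize (fusion candidate)
    return torch.relu(x @ w.t())              # matmul + activation epilogue

compiled = torch.compile(dequant_linear)

x = torch.randn(4, 16)
q_w = torch.randint(-128, 127, (8, 16), dtype=torch.int8)
scale = torch.full((8, 1), 0.05)

out_eager = dequant_linear(x, q_w, scale)
out_compiled = compiled(x, q_w, scale)
print(torch.allclose(out_eager, out_compiled, atol=1e-5))  # same result
```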