This document lists the features on vLLM's roadmap for Q1 2024. Please feel free to discuss and contribute to specific features in their related RFCs, issues, and PRs, and to raise anything else you'd like to talk about in this issue.
Going forward, we will publish our roadmap quarterly and deprecate the old roadmap (#244).
- OSS General
  - Better benchmark scripts and standards (Serving Benchmark Refactoring #2433)
  - Improve documentation
  - CI/CD testing and release process
    - Make the model and kernel tests work on the current CI
    - Automate the release process
  - Dev experience
    - Explore Apple Silicon via Torch, MLX, or llama.cpp
    - Cached and parallel build system (Call for Help: Proper Build System (CMake, Bazel, etc). #2654)
- Frontend
  - Support structured output (contact: @simon-mo; see the sketch after this section)
  - Optimize the performance of the API server
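
For context on the structured-output item above, here is a minimal sketch of constrained generation using a custom logits processor with vLLM's offline `LLM` API. It assumes `SamplingParams` accepts a `logits_processors` list of callables `(generated_token_ids, logits) -> logits` (check the version you run); the model and the yes/no constraint are arbitrary illustrations, not part of the roadmap item.

```python
# Illustrative sketch only: constrain generation to a small set of allowed
# tokens by masking logits. Assumes SamplingParams exposes `logits_processors`,
# a list of callables (generated_token_ids, logits) -> logits.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
tokenizer = llm.get_tokenizer()

# Token ids we allow the model to emit (single-token answers " yes" / " no").
allowed_ids = {tokenizer.encode(" yes")[-1], tokenizer.encode(" no")[-1]}

def only_yes_no(token_ids, logits):
    # Set every logit except the allowed ones to -inf so sampling can only
    # pick one of the permitted answers.
    mask = logits.new_full(logits.shape, float("-inf"))
    for tok in allowed_ids:
        mask[tok] = logits[tok]
    return mask

params = SamplingParams(temperature=0.0, max_tokens=1,
                        logits_processors=[only_yes_no])
out = llm.generate(["Question: Is the sky blue? Answer:"], params)
print(out[0].outputs[0].text)
```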
- Scheduling
  - Chunked prefill / dynamic splitfuse ([FEATURE] Implement Dynamic SplitFuse #1562)
  - Speculative decoding ([WIP] Speculative decoding using a draft model #2188, Speculative Decoding #2607, merging plan, contact: @LiuXiaoxuanPKU)
  - Automatic prefix caching ([RFC] Automatic Prefix Caching #2614, contact: @zhuohan123; see the sketch after this section)
  - Disaggregated prefill / splitwise (Add Splitwise: prompt and token phase separation #2472)
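
As a rough illustration of the automatic prefix caching idea referenced above, the sketch below keys fixed-size token blocks by a hash of the prompt prefix up to and including each block, so prompts that share a prefix map to the same cached KV blocks. This is a conceptual toy, not vLLM's design; see RFC #2614 for the real proposal.

```python
# Conceptual sketch of prefix-caching bookkeeping (illustrative only).
# Each block of BLOCK_SIZE tokens is keyed by a hash of all tokens from the
# start of the prompt through that block, so two prompts sharing a prefix
# resolve to the same physical KV blocks.
from typing import Dict, List, Tuple

BLOCK_SIZE = 16
kv_block_table: Dict[int, int] = {}  # prefix hash -> physical KV block id
next_block_id = 0

def lookup_or_allocate(prompt_tokens: List[int]) -> Tuple[List[int], int]:
    """Return (physical block ids, number of blocks reused from the cache)."""
    global next_block_id
    blocks, reused = [], 0
    for end in range(BLOCK_SIZE, len(prompt_tokens) + 1, BLOCK_SIZE):
        key = hash(tuple(prompt_tokens[:end]))  # prefix-inclusive hash
        if key in kv_block_table:
            reused += 1
        else:
            kv_block_table[key] = next_block_id
            next_block_id += 1
        blocks.append(kv_block_table[key])
    return blocks, reused

# Two prompts sharing a 32-token system prefix reuse the first two KV blocks.
shared = list(range(32))
print(lookup_or_allocate(shared + [101] * 16))  # allocates 3 new blocks
print(lookup_or_allocate(shared + [202] * 16))  # reuses 2, allocates 1
```

A real implementation also needs reference counting and eviction for cached blocks; the sketch only shows the hash-and-reuse lookup.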
- Kernel performance optimization
  - Quantization kernel optimization
    - Support FP8 (RFC: FP8 in vLLM #2461; see the sketch after this section)
  - MoE kernel optimization (DeepseekMoE support with Fused MoE kernel #2453, Fused MOE for Mixtral #2542, and more)
  - H100 performance (Any optimization options for H100? #2107)
  - AMD MI300x performance
  - MQA kernel ([Performance] Use optimized kernels for MQA/GQA #1880)
  - Port FlashInfer to vLLM (Import FlashInfer: 3x faster PagedAttention than vLLM #2767)
  - Kernel for the sampler
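
To make the FP8 item above concrete, here is a small, self-contained sketch of per-tensor symmetric FP8 (E4M3) weight quantization in PyTorch. It only illustrates the numerics and assumes a PyTorch build with `torch.float8_e4m3fn`; it is not vLLM's kernel or the design proposed in RFC #2461.

```python
# Illustrative per-tensor FP8 (E4M3) quantization round-trip.
# Requires a recent PyTorch that provides torch.float8_e4m3fn.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(w: torch.Tensor):
    # Scale so the largest |weight| maps to the top of the FP8 range,
    # then cast down to the 1-byte format.
    scale = w.abs().max().clamp(min=1e-12) / E4M3_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_fp8(w)
err = (dequantize_fp8(w_fp8, scale) - w).abs().mean()
print(f"storage: {w_fp8.element_size()} byte/elem, mean abs error: {err.item():.4f}")
```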
- Hardware support
  The vLLM team is working with the following hardware vendors:
  - AWS Inferentia ([RFC] Initial Support for AWS Inferentia #1866, Support inference with transformers-neuronx #2569)
  - Google TPU
  - Intel Gaudi
  - Intel GPU/CPU ([Feature][WIP] Prototype of vLLM execution on Intel GPU devices via SYCL. #2378)
- Model support
  - Multi-modal models (Add LLaVA support #775, Support generation from input embedding #1265, add llava model support #2153, feat: Input embeddings #2563)
  - Encoder-decoder models (Adding support for encoder-decoder models, like T5 or BART #187, T5 model support #404)
  - Embedding models (Support embedding models #458, Feature request: Support for embedding models #742)
- Future-proofing vLLM
  - torch.compile support (see the sketch after this section)
  - Implement an extensible scheduler and memory manager
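
As background for the torch.compile item, a tiny example of compiling a toy decoder block's forward pass with the stock PyTorch 2.x API. This only shows what `torch.compile` does in isolation; how vLLM would integrate it is exactly what the roadmap item is about.

```python
# Minimal torch.compile example: compile a toy transformer block's forward and
# check it matches eager execution. Requires PyTorch 2.x; illustrative only.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

block = ToyBlock().eval()
compiled = torch.compile(block)  # TorchDynamo capture + Inductor backend
x = torch.randn(1, 128, 256)
with torch.no_grad():
    torch.testing.assert_close(compiled(x), block(x), rtol=1e-4, atol=1e-4)
print("compiled forward matches eager")
```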