
[Roadmap] vLLM Roadmap Q2 2024 #3861

Closed
@simon-mo

Description


This document lists the features on vLLM's roadmap for Q2 2024. Please feel free to discuss and contribute to specific features in their related RFCs/issues/PRs, and add anything else you'd like to talk about in this issue.

You can see our historical roadmaps at #2681 and #244. This roadmap contains work committed by the vLLM team from UC Berkeley, as well as the broader vLLM contributor groups including but not limited to Anyscale, IBM, NeuralMagic, Roblox, and Oracle Cloud. You can also find help-wanted items in this roadmap. Additionally, this roadmap is shaped by you, our user community!

Themes

We have categorized our roadmap into six broad themes:

  • Broad model support: vLLM should support a wide range of transformer-based models and be kept up to date as much as possible. This includes new auto-regressive decoder models, encoder-decoder models, hybrid architectures, and models supporting multi-modal inputs.
  • Excellent hardware coverage: vLLM should run on a wide range of accelerators for production AI workloads, including GPUs, tensor accelerators, and CPUs. We will work closely with hardware vendors to ensure vLLM gets the best performance out of each chip.
  • Performance optimization: vLLM should be kept up to date with the latest performance optimization techniques, so users can trust its performance to be competitive and strong.
  • Production level engine: vLLM should be the go-to serving engine for production, with a suite of features bridging the gap from a single forward pass to a 24/7 service.
  • Strong OSS product: vLLM is and will be a true community project. We want it to be a healthy project with a regular release cadence, good documentation, and a growing base of reviewers for the codebase.
  • Extensible architectures: For vLLM to grow at an even faster pace, it needs good abstractions to support a wide range of scheduling policies, hardware backends, and inference optimizations. We will work on refactoring the codebase to support this.

Broad Model Support

Help Wanted:

Excellent Hardware Coverage

  • AMD MI300X: enhancing FP8 performance [enable FP8 compute]
  • NVIDIA H100: enhancing FP8 performance
  • AWS Trainium and Inferentia
  • Google TPU
  • Intel CPU
  • Intel GPU
  • Intel Gaudi

Performance Optimization

  • Speculative decoding
    • Speculative decoding framework for top-1 proposals w/ draft model
    • Proposer improvement: Prompt-lookup n-gram speculations (see the sketch after this list)
    • Scoring improvement: Make batch expansion optional
    • Scoring improvement: dynamic scoring length policy
  • Kernels:
  • Quantization:
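
For illustration, here is a minimal, self-contained sketch of the prompt-lookup idea referenced above: propose draft tokens by matching the trailing n-gram of the current context against earlier positions and reusing the tokens that followed the match. This is a simplified illustration, not vLLM's implementation; `propose_ngram` and its default parameters are hypothetical.

```python
# Illustrative sketch of prompt-lookup n-gram speculation (not vLLM's actual code).
# Idea: find the most recent earlier occurrence of the context's trailing n-gram and
# propose the tokens that followed it as draft tokens for the target model to verify.

def propose_ngram(token_ids: list[int], ngram_size: int = 3, num_draft: int = 5) -> list[int]:
    """Return up to `num_draft` proposed tokens by matching the trailing n-gram."""
    if len(token_ids) < ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan earlier positions for the same n-gram, most recent match first.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = token_ids[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow
    return []

# Example: the context repeats "1 2 3", so the tokens after the earlier match are proposed.
print(propose_ngram([9, 1, 2, 3, 7, 8, 4, 1, 2, 3]))  # -> [7, 8, 4, 1, 2]
```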

Help Wanted:

  • Sparse KV cache (H2O, compression, FastDecode)
  • Speculative decoding
    • Proposer/scoring/verifier improvement: Top-k “tree attention” proposals for Eagle/Medusa/draft models
    • Proposer improvement: RAG n-gram speculations
    • Proposer improvement: Eagle/Medusa top-1 proposals
    • Proposer improvement: Quantized draft models
    • Verifier improvement: Typical acceptance (see the sketch below)
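
For reference, a minimal sketch of the typical-acceptance idea as described in the Medusa line of work (to my understanding): a draft token is accepted when its probability under the target model clears an entropy-dependent threshold, rather than going through exact rejection sampling. The function name and the epsilon/delta constants below are placeholders, not vLLM's implementation.

```python
# Illustrative sketch of "typical acceptance" verification (not vLLM's actual code).
# A draft token is accepted when its probability under the target model is not
# unusually low relative to the entropy of the target distribution.
import math

def accept_token(target_probs: list[float], draft_token: int,
                 epsilon: float = 0.3, delta: float = 0.09) -> bool:
    """Accept `draft_token` if p_target(token) >= min(epsilon, delta * exp(-entropy))."""
    entropy = -sum(p * math.log(p) for p in target_probs if p > 0.0)
    threshold = min(epsilon, delta * math.exp(-entropy))
    return target_probs[draft_token] >= threshold

# A peaked target distribution: only the dominant token clears the threshold.
probs = [0.90, 0.05, 0.03, 0.02]
print(accept_token(probs, 0), accept_token(probs, 1))  # True False
```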

Production Level Engine

Help Wanted:

  • Logging serving FLOPs for performance analysis (a rough estimate is sketched after this list)
  • Dynamic LoRA adapter downloads from hub/S3
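
As a rough illustration of what FLOPs logging could report, a common rule of thumb is that a dense decoder-only transformer spends about 2 × (number of parameters) FLOPs per token in the forward pass, ignoring the attention context-length term. The helper below is a hypothetical sketch of that estimate, not a vLLM API.

```python
# Back-of-the-envelope serving FLOPs estimate (rule of thumb, not vLLM code):
# forward-pass FLOPs per token ~= 2 * num_parameters for a dense decoder-only model.

def estimate_serving_flops(num_parameters: float, tokens_processed: int) -> float:
    """Approximate forward-pass FLOPs for serving `tokens_processed` tokens."""
    return 2.0 * num_parameters * tokens_processed

# Example: a hypothetical 7B-parameter model processing 1M tokens.
flops = estimate_serving_flops(7e9, 1_000_000)
print(f"{flops:.3e} FLOPs")  # 1.400e+16 FLOPs
```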

Strong OSS Product

  • Continuous benchmarks (resources needed!)
  • Commit to a two-week release cadence
  • Growing reviewer and committer base
  • Better docs
    • doc: memory and performance tuning guide
    • doc: automatic prefix caching (APC) documentation
    • doc: hardware support levels, feature matrix, and policies
    • doc: guide to horizontally scaling a vLLM service
    • doc: developer guide for adding new draft-based models or draft-less optimizations
  • Automatic CD of nightly wheels and Docker images

Help Wanted:

  • ARM aarch64 support for AWS Graviton-based instances and GH200
  • Full correctness tests against HuggingFace Transformers (resources needed)
  • Well-tested support for lm-eval-harness (logprobs, tokenizer access)
  • Local development workflow without CUDA

Extensible Architecture

  • Prototype pipeline parallelism
  • Extensible memory manager
  • Extensible scheduler
  • torch.compile investigations (see the sketch after this list)
    • use compile for quantization kernel fusion
    • use compile for future proofing graph mode
    • use compile for xpu or other accelerators
  • Architecture for queue management and request prioritization
  • StreamingLLM: prototype it on the new block manager
  • Investigate Tensor + Pipeline parallelism (LIGER)
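
As a minimal illustration of the torch.compile direction (not vLLM's actual integration), PyTorch 2.x lets you wrap a module with torch.compile so repeated forward passes run through a compiled graph where kernel fusions, e.g. around quantization, could be captured:

```python
# Minimal torch.compile illustration (PyTorch 2.x); not vLLM's integration.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """A small stand-in module to show the compile-then-reuse pattern."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(hidden, 4 * hidden)
        self.fc2 = nn.Linear(4 * hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

model = TinyMLP().eval()
compiled = torch.compile(model)   # default backend ("inductor")
x = torch.randn(8, 256)
with torch.no_grad():
    out = compiled(x)             # first call triggers compilation; later calls reuse it
print(out.shape)                  # torch.Size([8, 256])
```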
