[Roadmap] vLLM Roadmap Q4 2024 #9006

Open
@simon-mo

Themes.

As before, we categorize our roadmap into 6 broad themes: broad model support, wide hardware coverage, state-of-the-art performance optimization, production-level engine, strong OSS community, and extensible architectures. As we are seeing more

Broad Model Support

Help wanted:

Hardware Support

  • A feature matrix for all hardware that vLLM supports, with the maturity level of each
  • Expanding feature support across hardware backends
    • Fast PagedAttention and Chunked Prefill on Inferentia
    • Upstreaming of Intel Gaudi support
    • Enhancements in TPU Support
    • Upstream enhancements for AMD MI300X
    • Performance enhancement and measurement for NVIDIA H200
    • New accelerator support: IBM Spyre

Help wanted:

  • Design for a pluggable, out-of-tree hardware backend, similar to PyTorch’s PrivateUse1 API
  • Prototype JAX support

Performance Optimizations

  • Turn on chunked prefill, prefix caching, speculative decoding by default
  • Optimizations for structured outputs
  • Fused GEMM/all-reduce leveraging Flux and AsyncTP
  • Enhancements and overhead removal for offline LLM use cases
  • Better kernels (FA3, FlashInfer, FlexAttention, Triton)
  • Native integration with torch.compile

Help wanted:

Production Features

  • KV cache offload to CPU and disk
  • Disaggregated Prefill
  • More control over prefix caching and scheduler policies
  • Automated speculative decoding policy, see Dynamic Speculative Decoding

Help wanted:

  • Support multiple models in the same server

OSS Community

  • Enhancements to performance benchmarks: more realistic workloads, more hardware backends (H200s)
  • Better developer documentation for getting started with contributions and research

Help wanted:

  • Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc.)

Extensible Architecture

  • Full support for torch.compile
  • vLLM Engine V2: Asynchronous Scheduling and Prefix Caching Centric Design (vLLM's V1 Engine Architecture #8779)
  • A generic memory manager supporting multi-modality, sparsity, and others

If an item you wanted is not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #5805, #3861, #2681, #244
