
v0.8.5

Released 28 Apr 21:13

This release contains 310 commits from 143 contributors (55 new contributors!).

Highlights

This release features important multi-modal bug fixes, Day 0 support for Qwen3, and xgrammar's structural tag feature for tool calling.

Model Support

  • Day 0 support for Qwen3 and Qwen3MoE (see the serving example after this list). This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
  • Add ModernBERT (#16648)
  • Add Granite Speech Support (#16246)
  • Add PLaMo2 (#14323)
  • Add Kimi-VL model support (#16387)
  • Add Qwen2.5-Omni model support (thinker only) (#15130)
  • Add the Snowflake Arctic Embed model family (#16649)
  • Llama 4: accuracy fixes for Int4 (#16801), an updated chat template (#16428), and enhanced AMD support (#16674, #16847)
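
Day 0 support means the new checkpoints can be served without code changes; a minimal sketch (model IDs are from the public Qwen3 release, pick a size that fits your hardware):

```bash
vllm serve Qwen/Qwen3-8B         # dense Qwen3
vllm serve Qwen/Qwen3-30B-A3B    # Qwen3 MoE
```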

V1 Engine

  • Add structural_tag support using xgrammar (#17085); a request sketch follows this list
  • Disaggregated serving:
    • KV Connector API V1 (#15960)
    • Add LMCache KV connector for V1 (#16625); a config sketch follows this list
  • Clean up: Remove Sampler from Model Code (#17084)
  • MLA: Simplification to batch P/D reordering (#16673)
  • Move usage stats to worker and start logging TPU hardware (#16211)
  • Support FlashInfer Attention (#16684)
  • Faster incremental detokenization (#15137)
  • EAGLE-3 Support (#16937)
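
A hedged sketch of a structural_tag request (#17085) through the OpenAI-compatible server: the response_format shape below follows xgrammar's structural-tag format, and the field names and model ID are illustrative rather than a confirmed API surface.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "response_format": {
      "type": "structural_tag",
      "structures": [{
        "begin": "<function=get_weather>",
        "schema": {"type": "object", "properties": {"city": {"type": "string"}}},
        "end": "</function>"
      }],
      "triggers": ["<function="]
    }
  }'
```

And for the V1 disaggregated-serving work, a one-line config sketch; the connector name and role value are assumptions based on #15960 and #16625, so consult those PRs for the supported values:

```bash
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```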

Features

  • Validate urls object for multimodal content parts (#16990)
  • Prototype support for sequence parallelism via a compilation pass (#16155)
  • Add sampling params to v1/audio/transcriptions endpoint (#16591)
  • Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546); a request sketch follows this list
  • Add vllm bench [latency, throughput] CLI commands (#16508)
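
The bench subcommands wrap the existing benchmark scripts behind the vllm CLI. A minimal sketch, assuming the flags mirror the standalone benchmark_latency.py and benchmark_throughput.py scripts (check vllm bench latency --help for the authoritative set):

```bash
vllm bench latency --model meta-llama/Llama-3.1-8B-Instruct --input-len 128 --output-len 256
vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 500
```

For dynamic LoRA loading, a hedged sketch assuming the server was started with runtime LoRA updating enabled; the adapter name and remote path are illustrative:

```bash
# Server side (assumed): VLLM_ALLOW_RUNTIME_LORA_UPDATING=True vllm serve <model> --enable-lora
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my_adapter", "lora_path": "s3://my-bucket/adapters/my_adapter"}'
```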

Performance

  • Attention:
    • FA3 decode perf improvement - single mma warp group support for head dim 128 (#16864)
    • Update to latest FA3 code (#13111)
    • Support Cutlass MLA for Blackwell GPUs (#16032)
  • MoE:
    • Add expert_map support to Cutlass FP8 MoE (#16861)
    • Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#16753)
  • Support BitBLAS, Microsoft's runtime kernel library, for low-precision computation (#6036)
  • Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)

Hardware

  • TPU:
    • Enable structured decoding on TPU V1 (#16499)
    • Capture multimodal encoder during model compilation (#15051)
    • Enable Top-P (#16843)
  • AMD:
    • AITER Fused MoE V1 Support (#16752)
    • Integrate Paged Attention Kernel from AITER (#15001)
    • Support AITER MLA (#15893)
    • Upstream prefix prefill speed up for vLLM V1 (#13305)
    • Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
    • Add skinny GEMMs for unquantized linear on ROCm (#15830)
    • Follow-ups for skinny GEMMs on ROCm (#17011)

Documentation

  • Add open-webui example (#16747)
  • Document Matryoshka Representation Learning support (#16770)
  • Add a security guide (#17230)
  • Add example to run DeepSeek with Ray Serve LLM (#17134)
  • Benchmarks for audio models (#16505)

Security and Dependency Updates

  • Don't bind tcp zmq socket to all interfaces (#17197)
  • Use safe serialization and fix zmq setup for mooncake pipe (#17192)
  • Bump Transformers to 4.51.3 (#17116)

Build and Testing

  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)

Breaking changes 🚨

  • --enable-chunked-prefill, --multi-step-stream-outputs, and --disable-chunked-mm-input can no longer be explicitly set to False. Instead, pass the bare flag to enable and prefix it with no- to disable (e.g. --enable-chunked-prefill vs. --no-enable-chunked-prefill) (#16533); see the example below
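
For example, with vllm serve (the same renaming applies to all three flags):

```bash
# Previously, an explicit boolean value was accepted:
vllm serve <model> --enable-chunked-prefill False    # no longer valid
# From this release, use the bare flag or its no- form:
vllm serve <model> --enable-chunked-prefill          # enable
vllm serve <model> --no-enable-chunked-prefill       # disable
```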

Full Changelog: v0.8.4...v0.8.5