
v0.8.5

Released 28 Apr 21:13

This release contains 310 commits from 143 contributors (55 new contributors!).

Highlights

This release features important multi-modal bug fixes, Day 0 support for Qwen3, and xgrammar's structural tag feature for tool calling.

Model Support

  • Day 0 support for Qwen3 and Qwen3MoE (see the serving example after this list). This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
  • Add ModernBERT (#16648)
  • Add Granite Speech Support (#16246)
  • Add PLaMo2 (#14323)
  • Add Kimi-VL model support (#16387)
  • Add Qwen2.5-Omni model support (thinker only) (#15130)
  • Add the Snowflake Arctic Embed model family (#16649)
  • Llama 4: accuracy fixes for Int4 (#16801), an updated chat template (#16428), and enhanced AMD support (#16674, #16847)
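
Day 0 support means the new checkpoints can be served without code changes; a minimal sketch (model IDs are from the public Qwen3 release, pick a size that fits your hardware):

```bash
vllm serve Qwen/Qwen3-8B         # dense Qwen3
vllm serve Qwen/Qwen3-30B-A3B    # Qwen3 MoE
```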

V1 Engine

  • Add structural_tag support using xgrammar (#17085); a request sketch follows this list
  • Disaggregated serving:
    • KV Connector API V1 (#15960)
    • Add LMCache KV connector for V1 (#16625); a config sketch follows this list
  • Clean up: Remove Sampler from Model Code (#17084)
  • MLA: Simplification to batch P/D reordering (#16673)
  • Move usage stats to worker and start logging TPU hardware (#16211)
  • Support FlashInfer Attention (#16684)
  • Faster incremental detokenization (#15137)
  • EAGLE-3 Support (#16937)
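
A hedged sketch of a structural_tag request (#17085) through the OpenAI-compatible server: the response_format shape below follows xgrammar's structural-tag format, and the field names and model ID are illustrative rather than a confirmed API surface.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "response_format": {
      "type": "structural_tag",
      "structures": [{
        "begin": "<function=get_weather>",
        "schema": {"type": "object", "properties": {"city": {"type": "string"}}},
        "end": "</function>"
      }],
      "triggers": ["<function="]
    }
  }'
```

And for the V1 disaggregated-serving work, a one-line config sketch; the connector name and role value are assumptions based on #15960 and #16625, so consult those PRs for the supported values:

```bash
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```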

Features

  • Validate urls object for multimodal content parts (#16990)
  • Prototype support for sequence parallelism via a compilation pass (#16155)
  • Add sampling params to v1/audio/transcriptions endpoint (#16591)
  • Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546); a request sketch follows this list
  • Add vllm bench [latency, throughput] CLI commands (#16508)
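
The bench subcommands wrap the existing benchmark scripts behind the vllm CLI. A minimal sketch, assuming the flags mirror the standalone benchmark_latency.py and benchmark_throughput.py scripts (check vllm bench latency --help for the authoritative set):

```bash
vllm bench latency --model meta-llama/Llama-3.1-8B-Instruct --input-len 128 --output-len 256
vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 500
```

For dynamic LoRA loading, a hedged sketch assuming the server was started with runtime LoRA updating enabled; the adapter name and remote path are illustrative:

```bash
# Server side (assumed): VLLM_ALLOW_RUNTIME_LORA_UPDATING=True vllm serve <model> --enable-lora
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my_adapter", "lora_path": "s3://my-bucket/adapters/my_adapter"}'
```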

Performance

  • Attention:
    • FA3 decode perf improvement - single mma warp group support for head dim 128 (#16864)
    • Update to latest FA3 code (#13111)
    • Support Cutlass MLA for Blackwell GPUs (#16032)
  • MoE:
    • Add expert_map support to Cutlass FP8 MoE (#16861)
    • Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#16753)
  • Support BitBLAS, Microsoft's runtime kernel library, for low-precision computation (#6036)
  • Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)

Hardware

  • TPU:
    • Enable structured decoding on TPU V1 (#16499)
    • Capture multimodal encoder during model compilation (#15051)
    • Enable Top-P (#16843)
  • AMD:
    • AITER Fused MoE V1 Support (#16752)
    • Integrate Paged Attention Kernel from AITER (#15001)
    • Support AITER MLA (#15893)
    • Upstream prefix prefill speed up for vLLM V1 (#13305)
    • Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
    • Add skinny GEMMs for unquantized linear on ROCm (#15830)
    • Follow-ups for skinny GEMMs on ROCm (#17011)

Documentation

  • Add open-webui example (#16747)
  • Document Matryoshka Representation Learning support (#16770)
  • Add a security guide (#17230)
  • Add example to run DeepSeek with Ray Serve LLM (#17134)
  • Benchmarks for audio models (#16505)

Security and Dependency Updates

  • Don't bind tcp zmq socket to all interfaces (#17197)
  • Use safe serialization and fix zmq setup for mooncake pipe (#17192)
  • Bump Transformers to 4.51.3 (#17116)

Build and Testing

  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)

Breaking changes 🚨

  • --enable-chunked-prefill, --multi-step-stream-outputs, and --disable-chunked-mm-input can no longer be explicitly set to False. Instead, pass the bare flag to enable and prefix it with no- to disable (e.g. --enable-chunked-prefill vs. --no-enable-chunked-prefill) (#16533); see the example below
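
For example, with vllm serve (the same renaming applies to all three flags):

```bash
# Previously, an explicit boolean value was accepted:
vllm serve <model> --enable-chunked-prefill False    # no longer valid
# From this release, use the bare flag or its no- form:
vllm serve <model> --enable-chunked-prefill          # enable
vllm serve <model> --no-enable-chunked-prefill       # disable
```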

Full Changelog: v0.8.4...v0.8.5