# v0.8.5
This release contains 310 commits from 143 contributors (55 new contributors!).
## Highlights
This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structural tag feature for tool calling.
## Model Support
- Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
- Add ModernBERT (#16648)
- Add Granite Speech Support (#16246)
- Add PLaMo2 (#14323)
- Add Kimi-VL model support (#16387)
- Add Qwen2.5-Omni model support (thinker only) (#15130)
- Snowflake Arctic Embed (Family) (#16649)
- Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)
## V1 Engine
- Add `structural_tag` support using xgrammar (#17085); a usage sketch follows this list.
- Disaggregated serving:
  - KV Connector API V1 (#15960)
  - LMCache KV connector for V1 (#16625)
- Clean up: Remove Sampler from Model Code (#17084)
- MLA: Simplification to batch P/D reordering (#16673)
- Move usage stats to worker and start logging TPU hardware (#16211)
- Support FlashInfer Attention (#16684)
- Faster incremental detokenization (#15137)
- EAGLE-3 Support (#16937)
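The snippet below is a minimal offline sketch of the new `structural_tag` support, not code from the release: the payload layout follows xgrammar's structural-tag format, and the model name and weather-tool schema are illustrative assumptions.

```python
# Hedged sketch of structural_tag-constrained generation (#17085), not code
# from the release. The payload layout follows xgrammar's structural-tag
# format; the model name and the weather-tool schema are assumptions.
import json

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

tag = json.dumps({
    "type": "structural_tag",
    "structures": [{
        "begin": "<function=get_weather>",  # schema enforcement starts here
        "schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
        "end": "</function>",               # and stops here
    }],
    "triggers": ["<function="],  # output is free-form until a trigger appears
})

llm = LLM(model="Qwen/Qwen3-8B")  # assumed model; any chat model should do
params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(structural_tag=tag),
)
out = llm.generate("What is the weather in Paris?", params)
print(out[0].outputs[0].text)
```

Unlike a plain JSON-schema constraint, the tag only kicks in when the model emits a trigger, so tool calls stay schema-valid while surrounding text remains unconstrained.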
## Features
- Validate urls object for multimodal content parts (#16990)
- Prototype support for sequence parallelism using a compilation pass (#16155)
- Add sampling params to `v1/audio/transcriptions` endpoint (#16591); a usage sketch follows this list.
- Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
- Add `vllm bench [latency, throughput]` CLI commands (#16508)
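As a quick illustration of the transcriptions change, here is a hedged sketch using the OpenAI Python client; the served model name and the `extra_body` passthrough key are assumptions, so check the endpoint docs for the exact set of accepted sampling params.

```python
# Hedged sketch: passing sampling params to /v1/audio/transcriptions
# (#16591) via the OpenAI client. The served model name and the
# extra_body passthrough key are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # assumed served ASR model
        file=audio,
        temperature=0.2,                  # standard transcription field
        extra_body={"top_p": 0.9},        # assumed vLLM-specific passthrough
    )
print(result.text)
```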
## Performance
- Attention:
  - FA3 decode perf improvement: single mma warp group support for head dim 128 (#16864)
- MoE:
  - moe wna16 marlin kernel (#14447)
  - Add expert_map support to Cutlass FP8 MoE (#16861)
- Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
- Optimize `rotary_emb` implementation to use a Triton operator for improved performance (#16457)
## Hardware
- TPU:
  - Enable structured decoding on TPU V1 (#16499)
  - Capture multimodal encoder during model compilation (#15051)
  - Enable Top-K (#15489) and Top-P (#16843) sampling
- AMD:
  - AITER Fused MOE V1 Support (#16752)
  - Integrate Paged Attention Kernel from AITER (#15001)
  - Support AITER MLA (#15893)
  - Upstream prefix prefill speed up for vLLM V1 (#13305)
  - Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
  - Add skinny gemms for unquantized linear on ROCm (#15830)
  - Follow-ups for Skinny Gemms on ROCm (#17011)
## Documentation
- Add open-webui example (#16747)
- Document Matryoshka Representation Learning support (#16770); a usage sketch follows this list.
- Add a security guide (#17230)
- Add example to run DeepSeek with Ray Serve LLM (#17134)
- Benchmarks for audio models (#16505)
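For the Matryoshka support noted above, here is a hedged sketch of requesting truncated embeddings through the OpenAI-compatible server; the model name is an assumption, and which sizes are allowed depends on the model's configured `matryoshka_dimensions`.

```python
# Hedged sketch: requesting a truncated Matryoshka embedding (#16770).
# The model name is an assumption; allowed sizes depend on the model's
# matryoshka_dimensions configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="Snowflake/snowflake-arctic-embed-m-v1.5",  # assumed MRL model
    input="vLLM v0.8.5 release notes",
    dimensions=256,  # ask the server for a 256-dim truncated embedding
)
print(len(resp.data[0].embedding))  # expected: 256
```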
## Security and Dependency Updates
- Don't bind tcp zmq socket to all interfaces (#17197)
- Use safe serialization and fix zmq setup for mooncake pipe (#17192)
- Bump Transformers to 4.51.3 (#17116)
## Build and testing
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)
## Breaking changes 🚨
- `--enable-chunked-prefill`, `--multi-step-stream-outputs`, and `--disable-chunked-mm-input` can no longer be explicitly set to `False`. Instead, add `no-` to the start of the argument (i.e. `--enable-chunked-prefill` and `--no-enable-chunked-prefill`) (#16533). A sketch of the new behavior follows.
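The new flags behave like argparse's `BooleanOptionalAction` pattern; the sketch below is a plain-stdlib illustration of that behavior, not vLLM's actual parser code.

```python
# Illustrative only: the new on/off flags follow the same pattern as
# argparse's BooleanOptionalAction. This is plain stdlib code, not
# vLLM's actual parser.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-chunked-prefill",
    action=argparse.BooleanOptionalAction,  # also adds --no-enable-chunked-prefill
    default=None,
)

print(parser.parse_args(["--enable-chunked-prefill"]))
# Namespace(enable_chunked_prefill=True)
print(parser.parse_args(["--no-enable-chunked-prefill"]))
# Namespace(enable_chunked_prefill=False)
# "--enable-chunked-prefill=False" is now rejected with a parser error.
```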
## What's Changed
- Improve configs - `SchedulerConfig` by @hmellor in #16533
- [Misc] remove warning if triton>=3.2.0 by @DefTruth in #16553
- [Misc] refactor examples by @reidliu41 in #16563
- [Misc] Update usage with mooncake lib for kv transfer by @ShangmingCai in #16523
- [fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet by @Shafi-Hussain in #16048
- [Bugfix] Multi-modal caches not acting like LRU caches by @DarkLight1337 in #16593
- [TPU][V1] Fix exponential padding when `max-num-batched-tokens` is not a power of 2 by @NickLucche in #16596
- Fix triton install condition on CPU by @hmellor in #16600
- s390x: Fix PyArrow build and add CPU test script for Buildkite CI by @Nash-123 in #16036
- [Model][VLM] Add Kimi-VL model support by @courage17340 in #16387
- [Hardware][TPU] Add torchvision to tpu dependency file by @lsy323 in #16616
- [DOC][TPU] Add core idea about avoiding recompilation after warmup by @yaochengji in #16614
- config check sleep mode support oot platforms by @celestialli in #16562
- [Core][Bugfix] Fix Offline MM Beam Search by @alex-jw-brooks in #16390
- [Kernel] moe wna16 marlin kernel by @jinzhen-lin in #14447
- [BugFix]: Update minimum `pyzmq` version by @taneem-ibrahim in #16549
- [Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py by @tlrmchlsmth in #16623
- [Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) by @pooyadavoodi in #16631
- Add `vllm bench [latency, throughput]` CLI commands by @mgoin in #16508
- Fix vLLM x torch.compile config caching by @zou3519 in #16491
- [Misc] refactor argument parsing in examples by @reidliu41 in #16635
- [CI/Build] Fix LoRA OOM by @jeejeelee in #16624
- Add "/server_info" endpoint in api_server to retrieve the vllm_config. by @Cangxihui in #16572
- [Kernel] Remove redundant Exp calculations by @DefTruth in #16123
- [Misc] Update `compressed-tensors` WNA16 to support zero-points by @dsikka in #14211
- [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server by @angkywilliam in #10546
- [Model] Add PLaMo2 by @Alnusjaponica in #14323
- [Bugfix] fix gpu docker image mis benchmarks dir by @lengrongfu in #16628
- [Misc] Modify LRUCache touch by @jeejeelee in #16689
- Disable remote caching when calling compile_fx by @zou3519 in #16611
- [Feature] add model aware kv ops helper by @billishyahao in #16020
- [ROCM] Bind triton version to 3.2 in requirements-built.txt by @SageMoore in #16664
- [V1][Structured Output] Move xgrammar related utils to `backend_xgrammar.py` by @shen-shanshan in #16578
- [CI] Cleanup `additional_dependencies: [toml]` for pre-commit yapf hook by @yankay in #16405
- [Misc] refactor examples series by @reidliu41 in #16708
- [Doc] Improve OOM troubleshooting by @DarkLight1337 in #16704
- [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel by @DefTruth in #16693
- [Model] support modernbert by @xsank in #16648
- [Hardware] Add processor inputs to platform validation by @joerunde in #16680
- Improve error for structured output backend selection by @hmellor in #16717
- [Misc] Remove redundant comment by @jianzs in #16703
- Help user create custom model for Transformers backend remote code models by @hmellor in #16719
- [V1][Performance] Implement custom serialization for MultiModalKwargs [Rebased] by @p88h in #16432
- [V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification by @luyuzhe111 in #16636
- Adding vllm buildkite job for IBM Power by @AaruniAggarwal in #16679
- [V1][Frontend] Improve Shutdown And Logs by @robertgshaw2-redhat in #11737
- [rocm][V0] fix selection logic for custom PA in V0 by @divakar-amd in #16426
- [Bugfix] Update Florence-2 tokenizer to make grounding tasks work by @Isotr0py in #16734
- [Bugfix] Revert max_prompt_len validation for decoder-only models. by @davidheineman in #16741
- [V1] Remove log noise when idle by @russellb in #16735
- [Ray] Improve documentation on batch inference by @richardliaw in #16609
- [misc] ignore marlin_moe_wna16 local gen codes by @DefTruth in #16760
- [Doc] Add more tips to avoid OOM by @DarkLight1337 in #16765
- [doc] add open-webui example by @reidliu41 in #16747
- [Bugfix] Fix GLM4 model by @intervitens in #16618
- [Doc] Fix a 404 link in installation/cpu.md by @windsonsea in #16773
- [Misc] refactor examples series - lmcache by @reidliu41 in #16758
- Improve configs - `TokenizerPoolConfig` + `DeviceConfig` by @hmellor in #16603
- fix: hyperlink by @reidliu41 in #16778
- [Doc] Make sure to update vLLM when installing latest code by @DarkLight1337 in #16781
- [Doc] Document Matryoshka Representation Learning support by @noooop in #16770
- [Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion by @insukim1994 in #16784
- [V1][Perf] Faster incremental detokenization by @njhill in #15137
- [Bugfix] Fix index out of range error in api server log by @WangErXiao in #16787
- [Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 by @Ximingwang-09 in #16753
- [Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small by @lengrongfu in #16548
- [TPU][V1] Fix padding recompilation when `max-num-batched-tokens` is not even by @NickLucche in #16726
- [V1][TPU] Enable Top K by @NickLucche in #15489
- [ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints by @sijiac in #16674
- [V1][Metrics] Fix http metrics middleware by @markmc in #15894
- [MLA] Simplification to batch P/D reordering by @njhill in #16673
- [P/D][V1] KV Connector API V1 by @ApostaC in #15960
- [Attention] Update to latest FA3 code by @LucasWilkinson in #13111
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema by @tarukumar in #16721
- [Doc] Improve help examples for `--compilation-config` by @DarkLight1337 in #16729
- [Misc] Update outdated note: LMCache now supports chunked prefill by @chaunceyjiang in #16697
- [V1][Structured Output] Minor modification to `_validate_structured_output()` by @shen-shanshan in #16748
- Add hardware print to TPU V1 test by @mgoin in #16792
- [BugFix] Accuracy fix for llama4 int4 - improperly casted scales by @LucasWilkinson in #16801
- Improve configs - `MultiModalConfig` + `PoolerConfig` + `DecodingConfig` by @hmellor in #16789
- [Misc] add collect_env to cli and docker image by @lengrongfu in #16759
- [ROCm] [Attention] Cleanup ROCm output passing by @ProExpertProg in #16431
- [Bugfix] fix pp for llama4 by @luccafong in #16746
- [Doc] add podman setup instructions for official image by @nathan-weinberg in #16796
- [Docs] Fix a link and grammar issue in production-stack.md by @windsonsea in #16809
- [Model] use AutoWeightsLoader for BigCode, GPT-J by @jonghyunchoe in #16823
- [Misc] Clean up Kimi-VL by @DarkLight1337 in #16833
- Fix `nullable_kvs` fallback by @hmellor in #16837
- [New Model]: Snowflake Arctic Embed (Family) by @noooop in #16649
- [Misc] refactor examples series - Chat Completion Client With Tools by @reidliu41 in #16829
- [Doc] Updated Llama section in tool calling docs to have llama 3.2 config info by @jmho in #16857
- publish neuron docker image by @omrishiv in #16733
- [Model][VLM] Add Qwen2.5-Omni model support (thinker only) by @fyabc in #15130
- [rocm][MI300] llama4 maverick fp8 moe config tp8 by @divakar-amd in #16847
- [Frontend] Add sampling params to `v1/audio/transcriptions` endpoint by @NickLucche in #16591
- [Misc] Benchmarks for audio models by @NickLucche in #16505
- [V1][Misc] stop update prefix cache stats when logs_stats is disabled by @vie-serendipity in #16460
- [Model] Refactor Phi-4-multimodal to use merged processor and support V1 by @Isotr0py in #15477
- [Model] Qwen2.5-Omni Cleanup by @ywang96 in #16872
- [VLM] Clean up models by @DarkLight1337 in #16873
- [doc] update hyperlink by @reidliu41 in #16877
- Log how much time loading a compiled artifact takes by @zou3519 in #16848
- Serialize tensors using int8 views by @p88h in #16866
- Improve configs - `CacheConfig` by @hmellor in #16835
- [easy] Pass compile_fx only the config patches by @zou3519 in #16845
- [Bugfix] Fix v1/spec_decode/test_ngram.py by @zixi-qi in #16895
- [CI/CD][V1] Add spec decode tests to CI by @WoosukKwon in #16900
- [Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in #16907
- [Doc] Split dummy_processor_inputs() in Multimodal Docs by @alex-jw-brooks in #16915
- Restore buffers when wake up from level 2 sleep (#16564) by @fingertap in #16889
- [Misc] fix collect_env version parse by @wangxiyuan in #15267
- [Misc] Refactor platform to get device specific stream and event by @shen-shanshan in #14411
- [Bugfix] Fix GLM rotary_dim issue and support v1 by @Isotr0py in #16912
- Raise error for data-parallel with benchmark_throughput by @kartikx in #16737
- [XPU][Bugfix] minor fix for XPU by @yma11 in #15591
- [doc] install required python3-dev apt package by @davidxia in #16888
- [Doc] mention how to install in CPU editable mode by @davidxia in #16923
- [Core] Speed up decode by remove synchronizing operation in sampler by @chanh in #16436
- [V1][Spec Decode] Handle draft tokens beyond max_model_len by @WoosukKwon in #16087
- [TPU][V1] Implicitly adjust page size when there's SMEM OOM by @yaochengji in #16871
- Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml by @mgoin in #16946
- [TPU][V1] Capture multimodal encoder during model compilation by @NickLucche in #15051
- [V1] V1 FlashInfer Attention by @mgoin in #16684
- [TPU][V1] Enable Top-P by @NickLucche in #16843
- [Doc] Remove unnecessary V1 flag by @DarkLight1337 in #16924
- [BugFix][Spec Decode] No in-place update to draft probs by @WoosukKwon in #16952
- [Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other by @jeffrey-dot-li in #16863
- [ROCm] Add aiter tkw1 kernel for Llama4 fp8 by @kliuae in #16727
- [Misc] Remove the chunked prefill warning for LoRA by @jeejeelee in #16925
- [Kernel] Add expert_map support to Cutlass FP8 MOE by @varun-sundar-rabindranath in #16861
- [V1] Remove additional_config check by @wangxiyuan in #16710
- [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm by @charlifu in #15830
- Support S3 Sharded loading with RunAI Model Streamer by @omer-dayan in #16317
- [Bugfix] Fix f-string for Python 3.9-3.11 by @DarkLight1337 in #16962
- [Doc] Update ai_accelerator/hpu-gaudi.inc.md by @windsonsea in #16956
- [Perf] Optimize `_update_states` for GPU model runner by @SnowCharmQ in #16910
- [Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams by @chaunceyjiang in #16767
- [Model] Use autoweightloader for mamba by @sfeng33 in #16950
- [V1] Remove pre-allocation for KV cache by @WoosukKwon in #16941
- [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS by @LeiWang1999 in #6036
- [BugFix] Fix incremental detokenization perf issue by @njhill in #16963
- [Doc] Improve documentation for multimodal CLI args by @DarkLight1337 in #16960
- [FEAT][ROCm] Integrate Paged Attention Kernel from AITER by @vllmellm in #15001
- [Misc] refactor example series by @reidliu41 in #16972
- [Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in #16974
- Improve configs - `SpeculativeConfig` by @hmellor in #16971
- [BugFix] Pass in correct VLLM config in FlashInfer backend (#13207) by @timzsu in #16973
- [Misc] Add S3 environment variables for better support of MinIO. by @chaunceyjiang in #16977
- [frontend] enhance tool_calls type check by @reidliu41 in #16882
- [FEAT][ROCm]: Support AITER MLA by @vllmellm in #15893
- Add assertion for no objects while hashing hf_config by @zou3519 in #16930
- Fencing Kernels Tests for enabling on AMD by @Alexei-V-Ivanov-AMD in #16929
- [BugFix] Remove default multiproc executor `collective_rpc` timeout by @njhill in #17000
- [Core][V1][TPU] Enable structured decoding on TPU V1 by @Chenyaaang in #16499
- [Bugfix] validate urls object for multimodal content parts by @gcalmettes in #16990
- add Dockerfile build vllm against torch nightly by @yangw-dev in #16936
- [Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 by @maleksan85 in #13305
- [V1][DP] More robust DP/EP dummy request coordination by @njhill in #16277
- [BugFix] Revert ROCm Custom Paged Attention Env Flag Check by @vllmellm in #17022
- Revert "[Misc] Add S3 environment variables for better support of MinIO." by @chaunceyjiang in #17021
- [misc] tune some env vars for GB200 by @youkaichao in #16992
- [INTEL-HPU][v0] Port delayed sampling to upstream by @xuechendi in #16949
- [doc] add download path tips by @reidliu41 in #17013
- [Bugfix] Triton FA function takes no keyword arguments by @vllmellm in #16902
- [V1] Avoid socket errors during shutdown when requests are in-flight by @njhill in #16807
- [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) by @LucasWilkinson in #16998
- [Misc] Improve readability of get_open_port function. by @gitover22 in #17024
- [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers by @chaunceyjiang in #16964
- [CI] Run v1/test_serial_utils.py in CI by @russellb in #16996
- Mistral-format support for compressed-tensors by @mgoin in #16803
- Categorize `tests/kernels/` based on kernel type by @mgoin in #16799
- [Doc] Add top anchor and a note to quantization/bitblas.md by @windsonsea in #17042
- Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` by @hmellor in #17051
- [CI] Update structured-output label automation by @russellb in #17055
- Improve Transformers backend model loading QoL by @hmellor in #17039
- `CacheConfig.block_size` should always be `int` when used by @hmellor in #17052
- Use `@property` and private field for `data_parallel_rank_local` by @hmellor in #17053
- [Frontend] Support guidance:no-additional-properties for compatibility with xgrammar by @tjohnson31415 in #15949
- [BugFix][V1] Fix int32 token index overflow when preparing input ids by @sarckk in #16806
- [V1][Spec Decode] Always use argmax for sampling draft tokens by @WoosukKwon in #16899
- [CI/Build] workaround for CI build failure by @csy1204 in #17070
- [Quantization]add prefix for commandA quantized model by @CXIAAAAA in #17017
- [Minor] Use larger batch sizes for A100/B100/B200/MI300x by @WoosukKwon in #17073
- [Bugfix] Enable V1 usage stats by @mgoin in #16986
- More informative error when using Transformers backend by @hmellor in #16988
- Addendum Fix to support FIPS enabled machines with MD5 hashing by @sydarb in #17043
- [Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… by @zhangyuygss in #16472
- [V1] Update structured output by @reidliu41 in #16812
- [doc] update to hyperlink by @reidliu41 in #17096
- Add docs for runai_streamer_sharded by @omer-dayan in #17093
- [Chore] Remove Sampler from Model Code by @WoosukKwon in #17084
- Disable enforce_eager for V1 TPU sampler and structured output tests by @mgoin in #17016
- Simplify `TokenizerGroup` by @hmellor in #16790
- Fix OOT registration test by @hmellor in #17099
- [V1][PP] Optimization: continue scheduling prefill chunks by @ruisearch42 in #17080
- [Misc] Remove OLMo2 config copy by @Isotr0py in #17066
- Improve static type checking in `LoRAModelRunnerMixin` by @hmellor in #17104
- [V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning by @shen-shanshan in #16954
- [Frontend] Using matryoshka_dimensions control the allowed output dimensions. by @noooop in #16970
- Add missing rocm_skinny_gemms kernel test to CI by @mgoin in #17060
- [Misc] refactor example series - structured outputs by @reidliu41 in #17040
- [V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics by @markmc in #16665
- [CI] Add automation for the `tool-calling` github label by @russellb in #17118
- Updating buildkite job for IBM Power by @AaruniAggarwal in #17111
- existing torch installation pip command fix for docs by @atilla00 in #17059
- Molmo Requirements by @Eyshika in #17026
- Add `:markdownhelp:` to `EngineArgs` docs so markdown docstrings render properly by @hmellor in #17124
- Improve configs - `LoRAConfig` + `PromptAdapterConfig` by @hmellor in #16980
- [Docs] Generate correct github links for decorated functions by @russellb in #17125
- Add collective_rpc to llm engine by @yinghai in #16999
- Add chat template for Llama 4 models by @maxdebayser in #16428
- [Misc] Add example to run DeepSeek with Ray Serve LLM by @ruisearch42 in #17134
- Better error message for missing mistral params.json by @mgoin in #17132
- Use custom address for listening socket by @jglaser in #15988
- [FEAT] [ROCm]: AITER Fused MOE V1 Support by @vllmellm in #16752
- [Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 by @LucasWilkinson in #16864
- fix float16 support for kimi-vl by @zhouzaida in #17156
- [Doc] V1 : Update LoRA status by @varun-sundar-rabindranath in #17133
- [Docs] Fix True->true in supported_models.md by @mgoin in #17141
- Move missed `SchedulerConfig` args into scheduler config group in `EngineArgs` by @hmellor in #17131
- [Misc] Clean up redundant code in uniproc_executor.py by @lifuhuang in #16762
- [Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton by @MengqingCao in #15099
- [Misc] Benchmark Serving Script Support Appending Results by @LucasWilkinson in #17028
- [Perf] Optimize rotary_emb implementation to use Triton operator for improved inference performance by @cynthieye in #16457
- [Bugfix] remove fallback in guided_json (int range, patterns) by @csy1204 in #16725
- [Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization by @rasmith in #15734
- [Doc] Add headings to improve gptqmodel.md by @windsonsea in #17164
- Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 by @houseroad in #17158
- [Doc] Add two links to disagg_prefill.md by @windsonsea in #17168
- [Doc] Move todo out of beam search docstring by @alex-jw-brooks in #17183
- [Bugfix] Fix mistral model tests by @DarkLight1337 in #17181
- [Bugfix] Fix Mistral ChatCompletionRequest Body Exception by @JasmondL in #16769
- Bump Transformers to 4.51.3 by @hmellor in #17116
- Use Transformers helper `get_text_config()` instead of checking for `text_config` by @hmellor in #17105
- [doc] update wrong hf model links by @reidliu41 in #17184
- [Misc] Inline Molmo requirements by @DarkLight1337 in #17190
- [Security] Use safe serialization and fix zmq setup for mooncake pipe by @russellb in #17192
- [V1] Move usage stats to worker and start logging TPU hardware by @dyli-google in #16211
- [Bugfix] Fix hybrid model tests by @DarkLight1337 in #17182
- Fix Python packaging edge cases by @tiran in #17159
- [BugFix][Frontend] Fix `LLM.chat()` tokenization by @njhill in #16081
- [V1][Spec Decode] EAGLE-3 Support by @benchislett in #16937
- [Misc] Refine ray_serve_deepseek example by @ruisearch42 in #17204
- [Bugfix] gemma[2,3] interleaved attention when sliding window is disabled by @heheda12345 in #17180
- [AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary by @rasmith in #17215
- [v1] [P/D] Adding LMCache KV connector for v1 by @ApostaC in #16625
- [Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env by @jamesjwu in #17142
- [MISC][AMD] Add unused annotation to rocm kernel file by @houseroad in #17097
- [doc] add Anything LLM integration by @reidliu41 in #17216
- [Minor][Spec Decode] Add use_eagle to SpeculativeConfig by @WoosukKwon in #17213
- [Doc] Minor fix for the vLLM TPU setup page by @yarongmu-google in #17206
- [Minor][Models] Fix Return Types of Llama & Eagle by @WoosukKwon in #17220
- Allocate kv_cache with stride order by @wenscarl in #16605
- [ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. by @charlifu in #17011
- [V1][Metrics] Allow V1 AsyncLLM to use custom logger by @liuzijing2014 in #14661
- [BugFix] Avoid race conditions in zero-copy tensor transmission by @njhill in #17203
- [CI/test] Fix Eagle Correctness Test by @WoosukKwon in #17209
- [Core] Remove prompt string from engine core data structures by @njhill in #17214
- [Bugfix] Fix missing int type for `-n` in multi-image example by @Isotr0py in #17223
- [Bugfix] Fix standard models tests by @DarkLight1337 in #17217
- [Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device by @adobrzyn in #17186
- [V1] Add `structural_tag` support using xgrammar by @russellb in #17085
- [BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set by @andyxning in #17088
- [Chore] added stubs for `vllm_flash_attn` during development mode by @aarnphm in #17228
- [Docs] Update structured output doc for V1 by @russellb in #17135
- [Bugfix] fix error due to an uninitialized tokenizer when using `skip_tokenizer_init` with `num_scheduler_steps` by @junstar92 in #9276
- Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1 by @houseroad in #16573
- [MISC] rename interval to max_recent_requests by @andyxning in #14285
- [Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation by @imkero in #16878
- [Minor] Fix lint error in main branch by @WoosukKwon in #17233
- [CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh by @reidliu41 in #16271
- Update test_flash_attn.py by @ShuaibinLi in #17102
- [Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel by @rasmith in #12591
- [Misc] Make cached tokenizer pickle-compatible by @DarkLight1337 in #17048
- [Bugfix] Fix QWen2 VL multimodal mapping by @jeejeelee in #17240
- [Bugfix] Get a specific type of layer from forward context by @heheda12345 in #17222
- [MISC] Use string annotation types for class definitions by @jianzs in #17244
- [Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096] to represent number of tokens by @sfc-gh-zhwang in #17033
- [Bugfix] Fix Lora Name Parsing by @alex-jw-brooks in #17196
- [NVIDIA] Support Cutlass MLA for Blackwell GPUs by @kaixih in #16032
- [Feature] support sequence parallelism using compilation pass by @cascade812 in #16155
- [doc] Add feature status legend by @reidliu41 in #17257
- [Metrics] Fix minor inconsistencies in bucket progression by @DarkLight1337 in #17262
- [V1][Spec Decode] Make eagle compatible with prefix caching. by @LiuXiaoxuanPKU in #17137
- [BugFix] Fix vllm_flash_attn install issues by @LucasWilkinson in #17267
- [Bugfix] Fix missing ARG in Dockerfile for arm64 platforms by @lkm-schulz in #17261
- [Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c… by @Ther-LF in #16751
- [Bugfix] Fix Mistral3 spatial merge error by @mgoin in #17270
- [Doc] Fix wrong github link in LMCache examples by @KuntaiDu in #17274
- [Doc] small fix by @reidliu41 in #17277
- [Misc] Validate `stop_token_ids` contents by @njhill in #17268
- [Minor][Models] Pass partial_rotary_factor parameter to rope by @Eviannn in #17266
- [Core] Remove legacy input mapper/processor from V0 by @DarkLight1337 in #15686
- [Model] Add Granite Speech Support by @alex-jw-brooks in #16246
- Update tpu_worker.py 's typo by @idouba in #17288
- Add missing class docstring for `PromptAdapterConfig` by @hmellor in #17302
- [Bugfix] Add missing `get_language_model` to new MLLMs by @DarkLight1337 in #17300
- [doc] update wrong model id by @reidliu41 in #17287
- [Misc] Minor typo/grammar in `platforms/interface.py` by @NickLucche in #17307
- [Misc] Clean up Qwen2.5-Omni code by @DarkLight1337 in #17301
- [Docs] Add a security guide by @russellb in #17230
- Improve conversion from dataclass configs to argparse arguments by @hmellor in #17303
- Make name of `compressed-tensors` quant method consistent across vLLM by @hmellor in #17255
- Explicitly explain quant method override ordering and ensure all overrides are ordered by @hmellor in #17256
- [Security] Don't bind tcp zmq socket to all interfaces by @russellb in #17197
- [Chore] cleanup license indicators in light of SPDX by @aarnphm in #17259
- [BugFix] Fix cascade attention - RuntimeError: scheduler_metadata must have shape (metadata_size) by @LucasWilkinson in #17283
- [Bugfix] Fix moe weight losing all extra attrs after `process_weights_after_loading` by @charlifu in #16854
- [Model] Qwen3 Dense FP8 Compat Fixes by @simon-mo in #17318
## New Contributors
- @Nash-123 made their first contribution in #16036
- @celestialli made their first contribution in #16562
- @taneem-ibrahim made their first contribution in #16549
- @Cangxihui made their first contribution in #16572
- @angkywilliam made their first contribution in #10546
- @Alnusjaponica made their first contribution in #14323
- @xsank made their first contribution in #16648
- @jianzs made their first contribution in #16703
- @p88h made their first contribution in #16432
- @AaruniAggarwal made their first contribution in #16679
- @davidheineman made their first contribution in #16741
- @richardliaw made their first contribution in #16609
- @intervitens made their first contribution in #16618
- @windsonsea made their first contribution in #16773
- @insukim1994 made their first contribution in #16784
- @Ximingwang-09 made their first contribution in #16753
- @sijiac made their first contribution in #16674
- @tarukumar made their first contribution in #16721
- @nathan-weinberg made their first contribution in #16796
- @jmho made their first contribution in #16857
- @vie-serendipity made their first contribution in #16460
- @zixi-qi made their first contribution in #16895
- @fingertap made their first contribution in #16889
- @kartikx made their first contribution in #16737
- @davidxia made their first contribution in #16888
- @chanh made their first contribution in #16436
- @jeffrey-dot-li made their first contribution in #16863
- @sfeng33 made their first contribution in #16950
- @LeiWang1999 made their first contribution in #6036
- @timzsu made their first contribution in #16973
- @yangw-dev made their first contribution in #16936
- @gitover22 made their first contribution in #17024
- @csy1204 made their first contribution in #17070
- @sydarb made their first contribution in #17043
- @zhangyuygss made their first contribution in #16472
- @atilla00 made their first contribution in #17059
- @Eyshika made their first contribution in #17026
- @yinghai made their first contribution in #16999
- @jglaser made their first contribution in #15988
- @zhouzaida made their first contribution in #17156
- @lifuhuang made their first contribution in #16762
- @JasmondL made their first contribution in #16769
- @tiran made their first contribution in #17159
- @jamesjwu made their first contribution in #17142
- @wenscarl made their first contribution in #16605
- @liuzijing2014 made their first contribution in #14661
- @adobrzyn made their first contribution in #17186
- @andyxning made their first contribution in #17088
- @junstar92 made their first contribution in #9276
- @ShuaibinLi made their first contribution in #17102
- @cascade812 made their first contribution in #16155
- @lkm-schulz made their first contribution in #17261
- @Ther-LF made their first contribution in #16751
- @Eviannn made their first contribution in #17266
- @idouba made their first contribution in #17288
Full Changelog: v0.8.4...v0.8.5