Update dependency vllm to v0.7.2 [SECURITY] #16
This PR contains the following updates:

| Package | Change |
| --- | --- |
| vllm | `==0.6.1` -> `==0.7.2` |
GitHub Vulnerability Alerts
CVE-2025-24357
Description
The `vllm/model_executor/weight_utils.py` module implements `hf_model_weights_iterator` to load model checkpoints downloaded from Hugging Face. It uses the `torch.load` function with the `weights_only` parameter left at its default value of `False`. As the security warning at https://pytorch.org/docs/stable/generated/torch.load.html explains, when `torch.load` loads malicious pickle data it will execute arbitrary code during unpickling.
Impact
This vulnerability can be exploited to execute arbitrary code and OS commands on a victim machine that fetches a pretrained repository remotely.
Note that most models now use the safetensors format, which is not vulnerable to this issue.
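For illustration only, here is a minimal sketch of the hardened loading pattern (not vLLM's actual loader; the helper name is made up): pass `weights_only=True` so `torch.load` refuses to unpickle arbitrary objects.

```python
import torch

def load_untrusted_checkpoint(path: str):
    """Illustrative helper, not vLLM's hf_model_weights_iterator."""
    # weights_only=True restricts unpickling to tensors and basic Python
    # types, so a malicious pickle inside the checkpoint cannot run
    # arbitrary code at load time. The vulnerable code path relied on the
    # default weights_only=False.
    return torch.load(path, map_location="cpu", weights_only=True)
```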
CVE-2025-25183
Summary
Maliciously constructed prompts can lead to hash collisions, resulting in prefix cache reuse, which can interfere with subsequent responses and cause unintended behavior.
Details
vLLM's prefix caching makes use of Python's built-in `hash()` function. As of Python 3.12, the behavior of `hash(None)` has changed to be a predictable constant value. This makes it more feasible that someone could try to exploit hash collisions.
Impact
The impact of a collision would be the use of a cache entry that was generated from different content. Given knowledge of prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use.
Solution
We address this problem by initializing hashes in vLLM with a value that is no longer constant and predictable; the value is different each time vLLM runs. This restores the behavior of Python versions prior to 3.12.
Using a hashing algorithm that is less prone to collision (sha256, for example) would be the best way to avoid the possibility of a collision. However, it would have an impact on both performance and memory footprint. Hash collisions may still occur, though they are no longer straightforward to predict.
To give an idea of the likelihood of a collision, for randomly generated hash values (assuming the hash generation built into Python is uniformly distributed), with a cache capacity of 50,000 messages and an average prompt length of 300, a collision will occur on average once every 1 trillion requests.
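A hedged sketch of the mitigation idea (not vLLM's actual implementation; names and structure are assumed): seed the per-block hash chain with a value chosen randomly at process start, so block hashes are not predictable across runs.

```python
import secrets

# Chosen once per process; different on every run of the server.
_HASH_SEED = secrets.randbits(64)

def block_hash(parent_hash: int | None, token_ids: tuple[int, ...]) -> int:
    # The first block of a sequence hashes against the random seed instead
    # of a constant such as hash(None), which Python 3.12 made predictable.
    prev = _HASH_SEED if parent_hash is None else parent_hash
    return hash((prev, token_ids))
```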
Release Notes
vllm-project/vllm (vllm)
v0.7.2
Compare Source
Highlights
- … `transformers` library at the moment (#12604)
- `transformers` backend support via `--model-impl=transformers`. This allows vLLM to be run with arbitrary Hugging Face text models (#11330, #12785, #12727). A hedged usage sketch follows this list.
- `torch.compile` applied to fused_moe/grouped_topk, yielding a 5% throughput enhancement (#12637)
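A minimal usage sketch for the new backend, assuming the `--model-impl` engine argument is also exposed as the `model_impl` keyword of `vllm.LLM` and that `VLLM_LOGITS_PROCESSOR_THREADS` is read from the environment at startup; the model name and thread count are illustrative only.

```python
import os

# Illustrative value: extra logits-processor threads can help structured
# decoding at high batch sizes (see the Core Engine note below).
os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"

from vllm import LLM, SamplingParams

# "gpt2" stands in for any Hugging Face text model you want to run
# through the transformers backend.
llm = LLM(model="gpt2", model_impl="transformers")
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```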
Core Engine

- `VLLM_LOGITS_PROCESSOR_THREADS` to speed up structured decoding in high batch size scenarios (#12368)

Security Update
Other
What's Changed
- `transformers` backend support by @ArthurZucker in https://github.com/vllm-project/vllm/pull/11330
- `uncache_blocks` and support recaching full blocks by @comaniac in https://github.com/vllm-project/vllm/pull/12415
- `VLLM_LOGITS_PROCESSOR_THREADS` by @akeshet in https://github.com/vllm-project/vllm/pull/12368
- `Linear` handling in `TransformersModel` by @hmellor in https://github.com/vllm-project/vllm/pull/12727
- `FinishReason` enum and use constant strings by @njhill in https://github.com/vllm-project/vllm/pull/12760
- `TransformersModel` UX by @hmellor in https://github.com/vllm-project/vllm/pull/12785

New Contributors
Full Changelog: vllm-project/vllm@v0.7.1...v0.7.2
v0.7.1
Compare Source
Highlights
This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0, released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
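As a hedged configuration sketch only (the model name and parallelism degrees are placeholders, and the arguments are the standard `vllm.LLM` engine arguments rather than anything specific to this release), serving with pipeline parallelism for longer contexts looks roughly like this:

```python
from vllm import LLM

# Placeholder model and degrees; pick values that match your hardware.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    tensor_parallel_size=2,     # shard each layer across 2 GPUs
    pipeline_parallel_size=2,   # split the layers into 2 pipeline stages
    trust_remote_code=True,
)
```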
V1
For the V1 architecture, we …
Models
Hardwares
Others
What's Changed
- `prompt_logprobs` with ChunkedPrefill by @NickLucche in https://github.com/vllm-project/vllm/pull/10132
- `pre-commit` hooks by @hmellor in https://github.com/vllm-project/vllm/pull/12475
- `suggestion` `pre-commit` hook multiple times by @hmellor in https://github.com/vllm-project/vllm/pull/12521
- `?device={device}` when changing tab in installation guides by @hmellor in https://github.com/vllm-project/vllm/pull/12560
- `cutlass_scaled_mm` to support 2d group (blockwise) scaling by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/11868
- `sparsity_config.ignore` in Cutlass Integration by @rahul-tuli in https://github.com/vllm-project/vllm/pull/12517

New Contributors
Full Changelog: vllm-project/vllm@v0.7.0...v0.7.1
v0.7.0
Compare Source
Highlights
- The V1 engine can be enabled with `VLLM_USE_V1=1`. See our blog for more details. (44 commits) A hedged smoke-test sketch follows this list.
- New APIs (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for the post training frameworks! (#12361, #12084, #12284)
- `torch.compile` is now fully integrated in vLLM, and enabled by default in V1. You can turn it on via the `-O3` engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246)
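A minimal smoke-test sketch for the items above, assuming `VLLM_USE_V1` is read from the environment before vLLM is imported and that `LLM.reset_prefix_cache()` takes no arguments; the model name is purely illustrative.

```python
import os

# Must be set before vLLM is imported so the V1 engine is selected.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # small placeholder model
print(llm.generate(["The V1 engine says:"])[0].outputs[0].text)

# One of the new post-training helpers listed above (assumed no-arg call).
llm.reset_prefix_cache()
```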
Features
Models
- Models implementing the `get_*_embeddings` methods according to this guide are automatically supported by the V1 engine.

Hardwares
- `W8A8` (#11785)

Features
- `collective_rpc` abstraction (#12151, #11256)
- `moe_align_block_size` for cuda graph and large num_experts (#12222)

Others
- `weights_only=True` when using `torch.load()` (#12366)

What's Changed
- `Detokenizer` and `EngineCore` input by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/11545

Configuration
📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.