Update dependency vllm to v0.7.2 [SECURITY] #16
This PR contains the following updates:

| Package | Change |
| --- | --- |
| vllm | `==0.6.1` -> `==0.7.2` |
GitHub Vulnerability Alerts
CVE-2025-24357
Description
The `vllm/model_executor/weight_utils.py` module implements `hf_model_weights_iterator` to load model checkpoints downloaded from Hugging Face. It uses the `torch.load` function with the `weights_only` parameter left at its default value of `False`. As the security warning at https://pytorch.org/docs/stable/generated/torch.load.html explains, when `torch.load` loads malicious pickle data it will execute arbitrary code during unpickling.
Impact
This vulnerability can be exploited to execute arbitrary code and OS commands on a victim machine that fetches a pretrained repository remotely.
Note that most models now use the safetensors format, which is not vulnerable to this issue.
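For illustration only, here is a minimal sketch of the hardened loading pattern (not vLLM's actual loader; the helper name is made up): pass `weights_only=True` so `torch.load` refuses to unpickle arbitrary objects.

```python
import torch

def load_untrusted_checkpoint(path: str):
    """Illustrative helper, not vLLM's hf_model_weights_iterator."""
    # weights_only=True restricts unpickling to tensors and basic Python
    # types, so a malicious pickle inside the checkpoint cannot run
    # arbitrary code at load time. The vulnerable code path relied on the
    # default weights_only=False.
    return torch.load(path, map_location="cpu", weights_only=True)
```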
CVE-2025-25183
Summary
Maliciously constructed prompts can lead to hash collisions, resulting in prefix cache reuse, which can interfere with subsequent responses and cause unintended behavior.
Details
vLLM's prefix caching makes use of Python's built-in `hash()` function. As of Python 3.12, the behavior of `hash(None)` has changed to be a predictable constant value. This makes it more feasible that someone could try to exploit hash collisions.
Impact
The impact of a collision would be the use of a cache entry that was generated from different content. Given knowledge of prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use.
Solution
We address this problem by initializing hashes in vLLM with a value that is no longer constant and predictable; the value is different each time vLLM runs. This restores the behavior of Python versions prior to 3.12.
Using a hashing algorithm that is less prone to collision (sha256, for example) would be the best way to avoid the possibility of a collision. However, it would have an impact on both performance and memory footprint. Hash collisions may still occur, though they are no longer straightforward to predict.
To give an idea of the likelihood of a collision, for randomly generated hash values (assuming the hash generation built into Python is uniformly distributed), with a cache capacity of 50,000 messages and an average prompt length of 300, a collision will occur on average once every 1 trillion requests.
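A hedged sketch of the mitigation idea (not vLLM's actual implementation; names and structure are assumed): seed the per-block hash chain with a value chosen randomly at process start, so block hashes are not predictable across runs.

```python
import secrets

# Chosen once per process; different on every run of the server.
_HASH_SEED = secrets.randbits(64)

def block_hash(parent_hash: int | None, token_ids: tuple[int, ...]) -> int:
    # The first block of a sequence hashes against the random seed instead
    # of a constant such as hash(None), which Python 3.12 made predictable.
    prev = _HASH_SEED if parent_hash is None else parent_hash
    return hash((prev, token_ids))
```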
Release Notes
vllm-project/vllm (vllm)
v0.7.2
Compare Source
Highlights
- … `transformers` library at the moment (#12604)
- `transformers` backend support via `--model-impl=transformers`. This allows vLLM to be run with arbitrary Hugging Face text models (#11330, #12785, #12727). A hedged usage sketch follows this list.
- `torch.compile` applied to fused_moe/grouped_topk, yielding a 5% throughput enhancement (#12637)
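A minimal usage sketch for the new backend, assuming the `--model-impl` engine argument is also exposed as the `model_impl` keyword of `vllm.LLM` and that `VLLM_LOGITS_PROCESSOR_THREADS` is read from the environment at startup; the model name and thread count are illustrative only.

```python
import os

# Illustrative value: extra logits-processor threads can help structured
# decoding at high batch sizes (see the Core Engine note below).
os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"

from vllm import LLM, SamplingParams

# "gpt2" stands in for any Hugging Face text model you want to run
# through the transformers backend.
llm = LLM(model="gpt2", model_impl="transformers")
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```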
Core Engine

- `VLLM_LOGITS_PROCESSOR_THREADS` to speed up structured decoding in high batch size scenarios (#12368)

Security Update
Other
What's Changed
- `transformers` backend support by @ArthurZucker in https://github.com/vllm-project/vllm/pull/11330
- `uncache_blocks` and support recaching full blocks by @comaniac in https://github.com/vllm-project/vllm/pull/12415
- `VLLM_LOGITS_PROCESSOR_THREADS` by @akeshet in https://github.com/vllm-project/vllm/pull/12368
- `Linear` handling in `TransformersModel` by @hmellor in https://github.com/vllm-project/vllm/pull/12727
- `FinishReason` enum and use constant strings by @njhill in https://github.com/vllm-project/vllm/pull/12760
- `TransformersModel` UX by @hmellor in https://github.com/vllm-project/vllm/pull/12785

New Contributors
Full Changelog: vllm-project/vllm@v0.7.1...v0.7.2
v0.7.1
Compare Source
Highlights
This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0, released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
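As a hedged configuration sketch only (the model name and parallelism degrees are placeholders, and the arguments are the standard `vllm.LLM` engine arguments rather than anything specific to this release), serving with pipeline parallelism for longer contexts looks roughly like this:

```python
from vllm import LLM

# Placeholder model and degrees; pick values that match your hardware.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    tensor_parallel_size=2,     # shard each layer across 2 GPUs
    pipeline_parallel_size=2,   # split the layers into 2 pipeline stages
    trust_remote_code=True,
)
```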
V1
For the V1 architecture, we …
Models
Hardwares
Others
What's Changed
- `prompt_logprobs` with ChunkedPrefill by @NickLucche in https://github.com/vllm-project/vllm/pull/10132
- `pre-commit` hooks by @hmellor in https://github.com/vllm-project/vllm/pull/12475
- `suggestion` `pre-commit` hook multiple times by @hmellor in https://github.com/vllm-project/vllm/pull/12521
- `?device={device}` when changing tab in installation guides by @hmellor in https://github.com/vllm-project/vllm/pull/12560
- `cutlass_scaled_mm` to support 2d group (blockwise) scaling by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/11868
- `sparsity_config.ignore` in Cutlass Integration by @rahul-tuli in https://github.com/vllm-project/vllm/pull/12517

New Contributors
Full Changelog: vllm-project/vllm@v0.7.0...v0.7.1
v0.7.0
Compare Source
Highlights
- The V1 engine can be enabled with `VLLM_USE_V1=1`. See our blog for more details. (44 commits) A hedged smoke-test sketch follows this list.
- New APIs (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for the post training frameworks! (#12361, #12084, #12284)
- `torch.compile` is now fully integrated in vLLM, and enabled by default in V1. You can turn it on via the `-O3` engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246)
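A minimal smoke-test sketch for the items above, assuming `VLLM_USE_V1` is read from the environment before vLLM is imported and that `LLM.reset_prefix_cache()` takes no arguments; the model name is purely illustrative.

```python
import os

# Must be set before vLLM is imported so the V1 engine is selected.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # small placeholder model
print(llm.generate(["The V1 engine says:"])[0].outputs[0].text)

# One of the new post-training helpers listed above (assumed no-arg call).
llm.reset_prefix_cache()
```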
Features
Models
- Models implementing the `get_*_embeddings` methods according to this guide are automatically supported by the V1 engine.

Hardwares
- `W8A8` (#11785)

Features
- `collective_rpc` abstraction (#12151, #11256)
- `moe_align_block_size` for cuda graph and large num_experts (#12222)

Others
- `weights_only=True` when using `torch.load()` (#12366)

What's Changed
- `Detokenizer` and `EngineCore` input by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/11545

Configuration
📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.