[Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. #19298

Merged: 8 commits, Jun 9, 2025

varun-sundar-rabindranath (Contributor) commented Jun 6, 2025

Purpose

Issue 1:

The engine fails to initialize with DeepseekR1 + data-parallel-size 32 + expert-parallel + VLLM_ALL2ALL_BACKEND="deepep_high_throughput" + VLLM_USE_DEEP_GEMM=1

Cause:

During profile runs, we execute the model's forward pass on all GPUs at the maximum batch size, using dummy input_ids that are all set to zero. Because every token is identical, the All2All dispatch call routes all tokens from all GPUs to the same set of experts, and therefore to the same GPU ranks. The ranks that receive all the tokens have an enormous batch to process and run out of memory (OOM) as a result.
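The imbalance can be illustrated with a toy model. This is only a sketch: `route` is a hypothetical deterministic router that picks an expert from the token ID, and `NUM_EXPERTS` and the vocabulary size are made-up values. It is not how the model's learned gating works, but it shares the relevant property that identical inputs produce identical routing decisions.

```python
import random
from collections import Counter

NUM_EXPERTS = 8  # hypothetical expert count (think: one expert per rank)


def route(token_id: int) -> int:
    """Toy router: identical token IDs always map to the same expert."""
    return (token_id * 2654435761) % NUM_EXPERTS


def max_expert_load(token_ids: list[int]) -> int:
    """Return the number of tokens received by the busiest expert."""
    return max(Counter(route(t) for t in token_ids).values())


rng = random.Random(0)
num_tokens = 8192
zeros = [0] * num_tokens                                     # profile-run style dummy input
randomized = [rng.randrange(32000) for _ in range(num_tokens)]

print("all-zero input_ids, busiest expert:", max_expert_load(zeros))
print("random input_ids,   busiest expert:", max_expert_load(randomized))
```

With all-zero inputs a single expert receives every token; with random IDs the load is spread roughly evenly across experts.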

Issue 2:

The engine fails to initialize with DeepseekR1 + data-parallel-size 128 + expert-parallel + VLLM_ALL2ALL_BACKEND="deepep_low_latency" + VLLM_USE_DEEP_GEMM=1

Cause:

RuntimeError: Failed: Assertion error /mnt/data/home/smo/vllm/tools/ep_kernels/ep_kernels_workspace/DeepEP/csrc/deep_ep.cpp:1040 'layout.total_bytes <= num_rdma_bytes'

Issue 3:

CUDA illegal memory access in FP8 block-quant triton kernel

Cause:

Integer overflow in the offset calculations of the block-quant Triton kernel.
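To see why a 32-bit offset can go wrong, here is a small sketch that emulates the stride multiplication with `ctypes`. The row index and stride are hypothetical values picked so the product exceeds 2^31 - 1; the real kernel is written in Triton, not Python.

```python
import ctypes


def offset_int32(row: int, stride: int) -> int:
    """Emulate row * stride computed in 32-bit signed arithmetic (wraps around)."""
    return ctypes.c_int32(row * stride).value


def offset_int64(row: int, stride: int) -> int:
    """Same product computed in 64-bit signed arithmetic."""
    return ctypes.c_int64(row * stride).value


row, stride = 400_000, 7168  # hypothetical token index and row stride
print("int32 offset:", offset_int32(row, stride))  # wraps to a negative value
print("int64 offset:", offset_int64(row, stride))  # the intended offset
```

A negative or wrapped offset points outside the tensor, which is consistent with an illegal memory access; widening the operands to int64 before multiplying avoids the wraparound.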

Changes / Fixes:

  • [issue1 - fix] Randomize _dummy_run input_ids
    • Randomize the dummy-run input IDs so that all experts of the model, and consequently all GPU ranks, receive a balanced number of tokens.
  • [issue1 - fix] Reduce memory usage
    • DeepGemm MOE: Reuse preallocated workspaces for quantization and for "inverse permutation" outputs.
    • DeepEP High Throughput Prepare/Finalize: Remove expensive torch ops in the finalize function.
  • [issue2 - fix] Ask DeepEP for a num_rdma_bytes hint
    • DeepEP should report the buffer size it actually requires.
  • [issue3 - fix] Use int64 in offset calculations (stride multiplication) in the FP8 block-quant kernels.

Test Plan

Tested locally with:

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1  vllm serve Qwen/Qwen3-30B-A3B-FP8  --trust-remote-code  --data-parallel-size 2 --enable-expert-parallel --port 9010  --no-enable-prefix-caching

and

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.81|±  |0.0394|
|     |       |strict-match    |     5|exact_match|↑  | 0.90|±  |0.0302|

github-actions bot commented Jun 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment

Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone quickly understand the changes and context.

This PR addresses two specific issues encountered when using the DeepEP and DeepGEMM backends, particularly with high data parallelism and the DeepseekR1 model. The first issue involved Out-of-Memory (OOM) errors during profile runs with dummy zero inputs, which led to unbalanced token distribution across experts and GPUs. The second issue was an assertion failure related to RDMA buffer size when using the low-latency DeepEP backend with high data parallelism. To fix these, the PR introduces randomization for dummy run input IDs to balance expert load, optimizes memory usage in DeepGEMM by reusing preallocated workspaces, and replaces expensive PyTorch operations with potentially more efficient custom ops in the DeepEP high-throughput finalize function.

Highlights

  • DeepGEMM Memory Optimization: Modifies the DeepGEMM MOE layer to reuse preallocated workspaces for intermediate results, specifically for the activation output, the second quantization step's output, and the final output after inverse permutation. This reduces overall memory footprint.
  • DeepEP High Throughput Optimization: Replaces the torch.sum(dim=1) operation in the DeepEP high-throughput finalize function with a custom ops.moe_sum kernel. This aims to improve performance and potentially reduce memory overhead during the final combination step.
  • New Environment Variable: Adds a new environment variable VLLM_RANDOMIZE_DP_DUMMY_INPUTS to control whether dummy inputs should be randomized during data parallel dummy runs.

Changelog

  • vllm/envs.py
    • Added VLLM_RANDOMIZE_DP_DUMMY_INPUTS boolean environment variable (default: False) at line 113.
    • Added logic to parse the VLLM_RANDOMIZE_DP_DUMMY_INPUTS environment variable (checking for '1') at line 765.
  • vllm/model_executor/layers/fused_moe/deep_gemm_moe.py
    • Adjusted the calculation for workspace2 size in workspace_shapes to M_sum * max(N, K) from M_sum * N at line 87.
    • Modified the apply method to reuse workspace13 for mm1_out and quant_out (viewed as float8) and workspace2 for act_out and mm2_out at lines 140-145.
    • Replaced the tensor indexing workspace3[inv_perm, ...] with torch.index_select(mm2_out, 0, inv_perm, out=out) to perform the inverse permutation directly into the preallocated out tensor at line 161.
  • vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py
    • Imported vllm._custom_ops as ops at line 8.
    • Replaced fused_expert_output.sum(dim=1).to(output_dtype) with creating an output tensor out and calling ops.moe_sum(fused_expert_output, out) at lines 208-211.
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py
    • Added an optional out_q: Optional[torch.Tensor] = None parameter to the per_token_group_quant_fp8 function signature at line 314.
    • Updated the docstring to describe the new out_q parameter at line 326.
    • Modified the function to use the provided out_q tensor if available, otherwise create a new one, at lines 341-344.
  • vllm/v1/worker/gpu_model_runner.py
    • Imported contextmanager from contextlib at line 8.
    • Imported vllm.envs at line 16.
    • Added a new context manager maybe_randomize_inputs at lines 1726-1753, which randomizes input_ids if VLLM_RANDOMIZE_DP_DUMMY_INPUTS is true and DP size > 1, and resets them to zeros upon exiting the context.
    • Applied the maybe_randomize_inputs context manager around the _dummy_run call at line 1835.

@mergify mergify bot added the v1 label Jun 6, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces crucial fixes and optimizations for DeepEP and DeepGEMM, addressing OOM errors and enhancing memory efficiency. The randomization of dummy inputs is a key fix for expert load balancing, and the memory optimizations in both DeepGEMM and DeepEP modules are well-implemented. The code is clear and the changes are well-justified by the PR description.

Summary of Findings

  • Correctness of Workspace Sizing: In deep_gemm_moe.py, the workspace2 sizing was updated. While the new sizing M_sum * max(N, K) correctly fixes issues when K > N, a slightly tighter bound could be M_sum * max(N // 2, K) based on its direct usages for act_out and mm2_out. The current approach is safe, however.
  • Memory Optimizations: Significant memory optimizations were made by reusing tensors (e.g., workspace13, workspace2 in DeepGEMM, out_q in quantization) and using in-place operations or custom ops (e.g., mul_ and ops.moe_sum in DeepEP HT finalize). These are excellent for performance and reducing OOM risks.
  • Dummy Run Input Randomization: The new VLLM_RANDOMIZE_DP_DUMMY_INPUTS flag and the maybe_randomize_inputs context manager effectively address the expert load imbalance OOM during dummy/profile runs. The implementation is clear and robust.
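The workspace sizing point can be checked with a bit of arithmetic. The sketch below is illustrative only: the helper name `workspace2_numel` and the dimensions are hypothetical, and the shapes of act_out (M_sum x N // 2) and mm2_out (M_sum x K) are assumptions inferred from the review comment above rather than read from the code.

```python
def workspace2_numel(m_sum: int, n: int, k: int) -> dict[str, int]:
    """Element counts for the workspace2 sizing options discussed in the review."""
    return {
        "old": m_sum * n,                      # M_sum * N (before this PR)
        "merged": m_sum * max(n, k),           # M_sum * max(N, K) (this PR)
        "tighter": m_sum * max(n // 2, k),     # M_sum * max(N // 2, K) (suggested bound)
        "act_out": m_sum * (n // 2),           # assumed shape M_sum x (N // 2)
        "mm2_out": m_sum * k,                  # assumed shape M_sum x K
    }


# Hypothetical dimensions with K > N, the case the old sizing did not cover.
sizes = workspace2_numel(m_sum=1024, n=4096, k=7168)
for name, numel in sizes.items():
    print(f"{name:8s} {numel:>12,d} elements")
```

With K > N the old size is smaller than what mm2_out needs, while both the merged and the tighter sizes cover act_out and mm2_out.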

Merge Readiness

The pull request introduces important fixes and optimizations. The changes are well-implemented and address the described issues effectively. After considering the feedback, I believe this PR is in good shape for merging. I am unable to approve the pull request myself, so please ensure it is reviewed and approved by other maintainers before merging.

-fused_expert_output.size(0), -1, 1)
-fused_expert_output = fused_expert_output.to(output_dtype)
+m_x_topk = fused_expert_output.size(0)
+fused_expert_output.mul_(topk_weights.view(m_x_topk, -1, 1))
Contributor

medium

The in-place multiplication fused_expert_output.mul_(topk_weights.view(m_x_topk, -1, 1)) is a good optimization, replacing the previous type casting and out-of-place multiplication. This assumes that the dtypes of fused_expert_output and topk_weights are compatible for in-place multiplication, or that PyTorch's type promotion handles it correctly without unexpected side effects. Given this is for performance, this is likely fine, but it's good to be mindful of the dtypes involved. Could you confirm if fused_expert_output's dtype (coming from DeepEP kernels) is consistently compatible with topk_weights (usually float/bfloat16) for this in-place operation?

Contributor Author

topk_weights is float32 and fused_expert_output is bfloat16, so the in-place multiplication relies on type promotion, with the result demoted back to bfloat16.

Comment on lines +341 to +357
assert out_q is None or out_q.shape == x.shape
x_q = out_q
if x_q is None:
    x_q = torch.empty_like(x, device=x.device, dtype=dtype)
Contributor

medium

The addition of the out_q parameter to allow for in-place quantization is a good memory optimization. The assertion assert out_q is None or out_q.shape == x.shape and the conditional assignment x_q = out_q if out_q is not None else torch.empty_like(...) are correctly implemented. This enhances flexibility and efficiency.

Comment on lines +1750 to +1756
input_ids.copy_(rand_input_ids()[:input_ids.size(0)],
                non_blocking=True)
yield
input_ids.fill_(0)
Contributor

medium

The maybe_randomize_inputs context manager correctly randomizes input_ids for dummy runs when the VLLM_RANDOMIZE_DP_DUMMY_INPUTS flag is set. The use of torch.randint_like and slicing ensures appropriate random token IDs are generated and copied.

Restoring input_ids to zeros with input_ids.fill_(0) after the yield is crucial for ensuring that subsequent operations or dummy runs (if randomization is off or for other purposes) start with a clean, expected state. This is well handled.
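The randomize-then-restore pattern can be sketched without PyTorch. This is a minimal illustration, not the PR's implementation: it uses a plain list in place of the input_ids tensor, and the `enabled` and `vocab_size` parameters are hypothetical. It also wraps the yield in try/finally so the buffer is reset even if the body raises, which is a design choice of this sketch and is not visible in the snippet above.

```python
import random
from contextlib import contextmanager


@contextmanager
def maybe_randomize_inputs(input_ids: list[int], enabled: bool,
                           vocab_size: int = 32000):
    """Fill input_ids with random token IDs for the duration of the block."""
    if not enabled:
        # Nothing to do: leave the buffer untouched.
        yield
        return
    input_ids[:] = [random.randrange(vocab_size) for _ in input_ids]
    try:
        yield
    finally:
        # Restore the all-zero state so later runs see the expected buffer.
        input_ids[:] = [0] * len(input_ids)


buf = [0] * 16
with maybe_randomize_inputs(buf, enabled=True):
    seen_inside = list(buf)  # snapshot of the randomized buffer

print("inside :", seen_inside)
print("outside:", buf)
```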

@@ -1801,7 +1832,7 @@ def _dummy_run(
 intermediate_tensors = self.sync_and_slice_intermediate_tensors(
 num_tokens, None, False)

-with set_forward_context(
+with self.maybe_randomize_inputs(input_ids), set_forward_context(
Contributor Author

Randomizing here is not correct for profile runs: during a profile run we do want the system to be stress tested (i.e. all tokens reaching the same set of GPU ranks).
However, at the moment, large-scale DP cannot run without this; things OOM during profile_run.

#19168 should fix the OOM. We can then remove this logic for the profile-run case.

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 6, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) June 6, 2025 22:14
input_ids.copy_(rand_input_ids()[:input_ids.size(0)],
                non_blocking=True)
yield
input_ids.fill_(0)
Contributor Author

This could be optimized. We don't have to fill input_ids and then reset it to zeros every time. For eager-mode runs (i.e. batch size > 512), we could just use the random tensor in place of input_ids. I plan to do this in a follow-up PR.

auto-merge was automatically disabled June 7, 2025 18:06

Head branch was pushed to by a user without write access


-workspace3 = workspace3[inv_perm, ...]
+torch.index_select(mm2_out, 0, inv_perm, out=out)
Contributor Author

Memory optimization: indexing with inv_perm would allocate a brand-new tensor, whereas index_select writes directly into the preallocated out tensor.
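As a rough, PyTorch-free sketch of the idea, the hypothetical helper below gathers rows through an inverse permutation into a buffer the caller already owns instead of building a new list. The row labels and the permutation are made up for illustration.

```python
def index_select_into(src: list, index: list[int], out: list) -> list:
    """Write src[index[i]] into out[i], reusing the caller's buffer."""
    assert len(out) == len(index)
    for i, j in enumerate(index):
        out[i] = src[j]
    return out


rows = ["r0", "r1", "r2", "r3"]
perm = [2, 0, 3, 1]                  # hypothetical expert-sorted order
permuted = [rows[p] for p in perm]   # rows in permuted order

# inv_perm[original_position] = position of that row in the permuted order
inv_perm = [0] * len(perm)
for pos, p in enumerate(perm):
    inv_perm[p] = pos

out = [None] * len(rows)             # preallocated output buffer
result = index_select_into(permuted, inv_perm, out)
print(result)
```

The returned object is the same buffer that was passed in, and its contents are back in the original row order.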

Comment on lines +764 to +767
# Randomize inputs during dummy runs when using Data Parallel
"VLLM_RANDOMIZE_DP_DUMMY_INPUTS":
lambda: os.environ.get("VLLM_RANDOMIZE_DP_DUMMY_INPUTS", "0") == "1",

Member

Is DP important here? I think you would want this for any EP case, so maybe just VLLM_RANDOMIZE_DUMMY_INPUTS

Contributor Author

I investigated this for a bit, and I think it is better to call out DP in the name. It is only in the context of DP that some DP ranks execute dummy runs, so that they can synchronize with the DP ranks that run the model with actual tokens.

The other way we could do expert parallel is with DP=1 and TP > 1. In that case, all the ranks run with actual data (the input data is replicated across all ranks).

Also, I have this statement in the code:
randomize_inputs = envs.VLLM_RANDOMIZE_DP_DUMMY_INPUTS and dp_size > 1

But I see what you are saying. We could rename VLLM_RANDOMIZE_DP_DUMMY_INPUTS -> VLLM_RANDOMIZE_DUMMY_INPUTS, set randomize_inputs = envs.VLLM_RANDOMIZE_DUMMY_INPUTS, and randomize whenever the env var is set.
Let's do it when more use cases for randomizing dummy runs come up? What do you think?
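For reference, the gating described above amounts to something like the sketch below. The function name and the injectable `environ` argument are hypothetical (added so the logic can be exercised without touching the real environment); the env var name, the "1" check, and the dp_size > 1 condition come from this thread.

```python
import os


def dummy_input_randomization_enabled(dp_size: int, environ=os.environ) -> bool:
    """True only when the flag is exactly "1" and data parallelism is in use."""
    flag = environ.get("VLLM_RANDOMIZE_DP_DUMMY_INPUTS", "0") == "1"
    return flag and dp_size > 1


print(dummy_input_randomization_enabled(2, {"VLLM_RANDOMIZE_DP_DUMMY_INPUTS": "1"}))
print(dummy_input_randomization_enabled(1, {"VLLM_RANDOMIZE_DP_DUMMY_INPUTS": "1"}))
```

Dropping the dp_size condition would give the broader behavior suggested in the review, where any run randomizes its dummy inputs once the variable is set.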

Varun added 8 commits June 7, 2025 20:48
Signed-off-by: Varun <vsundarr@redhat.com>
varun-sundar-rabindranath (Contributor Author) commented Jun 7, 2025

^ rebase on to main

@mgoin mgoin merged commit 5cf2dae into vllm-project:main Jun 9, 2025
78 checks passed
amogkam added a commit to character-tech/vllm that referenced this pull request Jun 16, 2025

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472)

Signed-off-by: liusiqian <liusiqian@tal.com>

* [TPU]Fix KV cache sharing tests (vllm-project#19371)

* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374)

Signed-off-by: Pavani Majety <pmajety@nvidia.com>

* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

* [Bugfix] Fix benchmark_moe.py (vllm-project#19016)

Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>

* Use xla flag to improve the quantized model performance (vllm-project#19303)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)

* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Core] Use tuple for kv cache group block ids (vllm-project#19175)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Bugfix] Fix modelscope token passed in (vllm-project#19389)

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* [Core] Batch multi modal input using pinned memory (vllm-project#19169)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* Add security warning to bug report template (vllm-project#19365)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* Add clear documentation around the impact of debugging flag (vllm-project#19369)

Signed-off-by: Anna Pendleton <pendleton@google.com>

* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>

* Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404)

* [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134)

Signed-off-by: Yunqiu Guo <guorachel@meta.com>

* [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411)

Signed-off-by: jiang.li <jiang1.li@intel.com>

* Simplify ep kernels installation (vllm-project#19412)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Slight improvement of the BNB (vllm-project#19418)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Docs] Note that alternative structured output backends are supported (vllm-project#19426)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [ROCm][V1] Adding ROCm to the list of platforms using V1 by default (vllm-project#19440)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [Model] use AutoWeightsLoader for commandr (vllm-project#19399)

Signed-off-by: py-andy-c <pychen1017@gmail.com>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401)

Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>

* [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390)

Signed-off-by: rzou <zou3519@gmail.com>

* [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260)

* [BugFix] Fix docker build cpu-dev image error (vllm-project#19394)

Signed-off-by: niu_he <carlton2tang@gmail.com>

* Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451)

Signed-off-by: Lu Fang <lufang@fb.com>

* [CI] Disable failing GGUF model test (vllm-project#19454)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455)

Signed-off-by: Junhao Li <junhao@ubicloud.com>

* Fix Typo in Documentation and Function Name (vllm-project#19442)

* [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Kernel] Support deep_gemm for linear methods (vllm-project#19085)

Signed-off-by: artetaout <lulala341@gmail.com>

* [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Fix quantization link titles (vllm-project#19478)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Support "important" and "announcement" admonitions (vllm-project#19479)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Reduce warning message introduced in env_override (vllm-project#19476)

Signed-off-by: Lu Fang <lufang@fb.com>

* Support non-string values in JSON keys from CLI (vllm-project#19471)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Add cache to cuda get_device_capability (vllm-project#19436)

Signed-off-by: mgoin <mgoin64@gmail.com>

* Fix some typo (vllm-project#19475)

Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>

* Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>

* [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453)

Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com>

* [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [doc] fix "Other AI accelerators" getting started page (vllm-project#19457)

Signed-off-by: David Xia <david@davidxia.com>

* [Misc] Fix misleading ROCm warning (vllm-project#19486)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Docs] Remove WIP features in V1 guide (vllm-project#19498)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168)

Signed-off-by: Bill Nell <bnell@redhat.com>

* [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501)

Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>

* [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505)

Signed-off-by: Richard Zou <zou3519@gmail.com>

* [CI] change spell checker from codespell to typos (vllm-project#18711)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518)

Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* [Frontend] Improve error message in tool_choice validation (vllm-project#19239)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522)

Signed-off-by: strutive07 <strutive07@gmail.com>

* [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* Fix typo (vllm-project#19525)

Signed-off-by: 2niuhe <carlton2tang@gmail.com>

* [Security] Prevent new imports of (cloud)pickle (vllm-project#18018)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>

* [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503)

Signed-off-by: Jon Swenson <jmswen@gmail.com>

* [Quantization] Improve AWQ logic (vllm-project#19431)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Doc] Add V1 column to supported models list (vllm-project#19523)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532)

Signed-off-by: NickLucche <nlucches@redhat.com>

* [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Fix TorchAOConfig skip layers (vllm-project#19265)

Signed-off-by: mobicham <hicham@mobiuslabs.com>

* [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>

* [doc] Make top navigation sticky (vllm-project#19540)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847)

* [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506)

* [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Doc] Unify structured outputs examples (vllm-project#18196)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [V1] Resolve failed concurrent structured output requests (vllm-project#19565)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378)

* [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570)

Signed-off-by: qizixi <qizixi@meta.com>

* [Doc] uses absolute links for structured outputs (vllm-project#19582)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [doc] fix incorrect link (vllm-project#19586)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Misc] Correct broken docs link (vllm-project#19553)

Signed-off-by: Zerohertz <ohg3417@gmail.com>

* [CPU] Refine default config for the CPU backend (vllm-project#19539)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [Fix] bump mistral common to support magistral (vllm-project#19533)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* use base version for version comparison (vllm-project#19587)

Signed-off-by: Boyuan Feng <boyuan@meta.com>

* [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Model] Fix minimax model cache & lm_head precision (vllm-project#19592)

Signed-off-by: qingjun <qingjun@minimaxi.com>

* [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581)

Signed-off-by: luka <luka@neuralmagic.com>

* [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377)

Signed-off-by: Anna Pendleton <pendleton@google.com>

* [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618)

* Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508)

Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354)

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>

* [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633)

* [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316)

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500)

* Only build CUTLASS MoE kernels on Hopper (vllm-project#19648)

* [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561)

* [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262)

* [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566)

* [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644)

* [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Enable prefix caching with full cuda graphs (vllm-project#19617)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589)

* [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [MISC] Remove unused variables in C++ (vllm-project#19609)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957)

Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>

* [Misc][Frontend] passthrough `bad_words` (vllm-project#19564)

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>

* [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

* [TPU] support attention head dim smaller than 128 (vllm-project#19620)

Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>

* [MISC] typo fix (vllm-project#19672)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [CI] Add mteb testing for rerank models (vllm-project#19344)

* [Docs] Move multiproc doc to v1 dir (vllm-project#19651)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754)

Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>

* [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557)

* [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. (vllm-project#19652)

Signed-off-by: Shawn Tan <shawntan@ibm.com>

* [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Kernels] Use empty for modular MoE workspaces (vllm-project#19667)

Signed-off-by: Bill Nell <bnell@redhat.com>

* [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677)

Signed-off-by: QscQ <qscqesze@gmail.com>

* [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: googs1025 <googs1025@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Varun <vsundarr@redhat.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: calvin chen <120380290@qq.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Signed-off-by: Jon Swenson <jmswen@gmail.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Chiyue Wei <chiyuew@nvidia.com>
Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Signed-off-by: Xu Song <xusong.vip@gmail.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: rzou <zou3519@gmail.com>
Signed-off-by: Siqi Yan <siqi@meta.com>
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Chenyaaang <chenyangli@google.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Lifan Shen <lifans@meta.com>
Signed-off-by: pramkuma <Pramendra.Kumar@amd.com>
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: Xu Wenqing <xuwq1993@qq.com>
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Signed-off-by: yZhen <yZhen@fb.com>
Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com>
Signed-off-by: cr7258 <chengzw258@163.com>
Signed-off-by: Conroy Cheers <conroy@corncheese.org>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: liusiqian <liusiqian@tal.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: Yunqiu Guo <guorachel@meta.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: py-andy-c <pychen1017@gmail.com>
Signed-off-by: niu_he <carlton2tang@gmail.com>
Signed-off-by: Junhao Li <junhao@ubicloud.com>
Signed-off-by: artetaout <lulala341@gmail.com>
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com>
Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com>
Signed-off-by: David Xia <david@davidxia.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: strutive07 <strutive07@gmail.com>
Signed-off-by: 2niuhe <carlton2tang@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: mobicham <hicham@mobiuslabs.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: qingjun <qingjun@minimaxi.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
Signed-off-by: Shawn Tan <shawntan@ibm.com>
Signed-off-by: QscQ <qscqesze@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: CYJiang <86391540+googs1025@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Yikun Jiang <yikun@apache.org>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Calvin Chen <45745657+calvin0327@users.noreply.github.com>
Co-authored-by: Kaixi Hou <kaixih@nvidia.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Xu Wenqing <121550081+Xu-Wenqing@users.noreply.github.com>
Co-authored-by: Lain <fusiyuan2000@hotmail.com>
Co-authored-by: jmswen <jmswen@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Yang Wang <elainewy@meta.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Chiyue Wei <92623189+dubcyfor3@users.noreply.github.com>
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com>
Co-authored-by: Povilas Kanapickas <povilas@radix.lt>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Luis Vega <vegaluisjose@users.noreply.github.com>
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Xu Song <xusong.vip@gmail.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Jinghui Zhang <jinghuizhang0804@gmail.com>
Co-authored-by: jinghui <jinghui@fb.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: Siqi Yan <ysq0807@hotmail.com>
Co-authored-by: Siqi Yan <siqi@meta.com>
Co-authored-by: Yu Guo <82124926+yuguo68@users.noreply.github.com>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Co-authored-by: Adolfo Victoria <adolfokarim@gmail.com>
Co-authored-by: Adolfo Victoria <adovi@meta.com>
Co-authored-by: Chenyaaang <42742451+Chenyaaang@users.noreply.github.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: QiliangCui <derrhein@gmail.com>
Co-authored-by: Aaruni Aggarwal <47731267+AaruniAggarwal@users.noreply.github.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: Lifans <draftbks@gmail.com>
Co-authored-by: pramenku <7664080+pramenku@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Akash kaothalkar <61960177+Akashcodes732@users.noreply.github.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: jennyyyyzhen <47012288+jennyyyyzhen@users.noreply.github.com>
Co-authored-by: yZhen <yZhen@fb.com>
Co-authored-by: Kseniya Parkhamchuk <43078183+KsuParkhamchuk@users.noreply.github.com>
Co-authored-by: Se7en <chengzw258@163.com>
Co-authored-by: Conroy Cheers <conroy@corncheese.org>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: liusiqian-tal <141730978+liusiqian-tal@users.noreply.github.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>
Co-authored-by: XiongfeiWei <isaacwxf23@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Anna Pendleton <pendleton@google.com>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: py-andy-c <37168711+py-andy-c@users.noreply.github.com>
Co-authored-by: niu_he <carlton2tang@gmail.com>
Co-authored-by: Junhao Li <junhao@ubicloud.com>
Co-authored-by: leopardracer <136604165+leopardracer@users.noreply.github.com>
Co-authored-by: artetaout <128046886+artetaout@users.noreply.github.com>
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: runzhen <wangrunzhen@gmail.com>
Co-authored-by: David Xia <david@davidxia.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: wonjun Jang <strutive07@gmail.com>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: mobicham <37179323+mobicham@users.noreply.github.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com>
Co-authored-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>
Co-authored-by: Boyuan Feng <fby.1994@gmail.com>
Co-authored-by: qscqesze <qingjun@minimaxi.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Saheli Bhattacharjee <47847054+sahelib25@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: maobaolong <baoloongmao@tencent.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: quanliu <33453350+quanliu1991@users.noreply.github.com>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: Francesco Bertolotti <f14.bertolotti@gmail.com>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: Navanit Dubey <98005188+Navanit-git@users.noreply.github.com>
Co-authored-by: Shawn Tan <shawntan@ibm.com>
Co-authored-by: qscqesze <qscqesze@gmail.com>
amogkam added a commit to character-tech/vllm that referenced this pull request Jun 16, 2025
* [Bugfix] disable processor cache (vllm-project#19068)

Signed-off-by: raushan <raushan@huggingface.co>

* [Doc] Improve the Pull Request template with key components (vllm-project#19086)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Misc] Add missing `_Backend` enums (vllm-project#19081)

Signed-off-by: nicklucche <nlucches@redhat.com>

* [Misc] fix: add miss best_of param validation (vllm-project#18555)

Signed-off-by: googs1025 <googs1025@gmail.com>

* [Misc] Add SPDX-FileCopyrightText (vllm-project#19100)

Signed-off-by: simon-mo <simon.mo@hey.com>

* [Doc] Readme standardization (vllm-project#18695)

Co-authored-by: Soren Dreano <soren@numind.ai>

* [doc] update docker version (vllm-project#19074)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434)

Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* [V1] Support cross-layer KV sharing (vllm-project#18212)

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

* [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844)

Signed-off-by: mgoin <mgoin64@gmail.com>

* Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikun@apache.org>

* [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971)

* [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411)

Signed-off-by: nicklucche <nlucches@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

* [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* feat: add data parallel rank to KVEventBatch (vllm-project#18925)

* [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919)

* [Docs] Add developer doc about CI failures (vllm-project#18782)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [CPU] V1 support for the CPU backend (vllm-project#16441)

* [Core] Cast multimodal input in hf processor (vllm-project#18862)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437)

* [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059)

Signed-off-by: calvin chen <120380290@qq.com>

* [NVIDIA] Add Cutlass MLA backend (vllm-project#17625)

* [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* Fix vllm-project#19130 (vllm-project#19132)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [TPU] Skip hanging tests (vllm-project#19115)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* Fix ValueError: Missing value for tag key(s): model_name,engine. (vllm-project#19113)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

* [Misc] Add packages for benchmark as extra dependency (vllm-project#19089)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Improve the output precision of embedding models (vllm-project#19092)

* [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Add DeepSeek-R1-0528 function call chat template (vllm-project#18874)

Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>

* Sm100 blockwise fp8 swap ab (vllm-project#18564)

* [Doc] Update V1 Guide for embedding models (vllm-project#19141)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102)

Signed-off-by: Jon Swenson <jmswen@gmail.com>

* [Bugfix][EP+DP] Fix internode check (vllm-project#19112)

Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>

* [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [TPU] Update dynamo dump file name in compilation test (vllm-project#19108)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121)

* [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111)

Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun <vsundarr@redhat.com>

* [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817)

Signed-off-by: googs1025 <googs1025@gmail.com>

* [P/D] Heterogeneous TP (vllm-project#18833)

Signed-off-by: nicklucche <nlucches@redhat.com>

* [doc] small fix (vllm-project#19167)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632)

Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>

* [Bugfix] Fix port handling in make_zmq_path (vllm-project#19117)

* [Torch Nightly]add missing dependency (vllm-project#18770)

Signed-off-by: Yang Wang <elainewy@meta.com>

* Handle non-serializable objects when dumping benchmark results (vllm-project#19114)

* [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Build] Annotate wheel and container path for release workflow (vllm-project#19162)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [Frontend] improve vllm run-batch --help display (vllm-project#19187)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202)

Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>

* [mistral_common] Add v11 tokenizer (vllm-project#19193)

Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205)

* [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110)

Signed-off-by: Chiyue Wei <chiyuew@nvidia.com>
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com>

* [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226)

Signed-off-by: Povilas Kanapickas <povilas@radix.lt>

* [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090)

* [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217)

* [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118)

* [Model] NemotronH support (vllm-project#18863)

Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>

* Fix AOPerModuleConfig name changes (vllm-project#18869)

Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>

* [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033)

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

* [v1] Hybrid Memory Allocator (vllm-project#17996)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [TPU] update torch_xla pin (vllm-project#19231)

Signed-off-by: Chengji Yao <chengjiyao@google.com>

* Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143)

Signed-off-by: Xu Song <xusong.vip@gmail.com>

* [Chore] update CODEOWNERS (vllm-project#19247)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182)

Co-authored-by: jinghui <jinghui@fb.com>

* [TPU] fix kv cache dtype in model runner (vllm-project#19244)

Signed-off-by: Chengji Yao <chengjiyao@google.com>

* [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224)

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

* [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172)

Signed-off-by: Nick Hill <nhill@redhat.com>

* Fix CompilationConfig repr (vllm-project#19091)

Signed-off-by: rzou <zou3519@gmail.com>

* Unit Test for run_dp_sharded_vision_model (vllm-project#19103)

Signed-off-by: Siqi Yan <siqi@meta.com>
Co-authored-by: Siqi Yan <siqi@meta.com>

* [Model] Optimize nemotron_h implementation (vllm-project#19249)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227)

Signed-off-by: Jon Swenson <jmswen@gmail.com>

* improve logits bias (vllm-project#19041)

* Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422)

Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>

* [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225)

Co-authored-by: Adolfo Victoria <adovi@meta.com>

* [Core] Fix abrupt request abort (vllm-project#18485)

Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>

* [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163)

Signed-off-by: Chenyaaang <chenyangli@google.com>

* [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>

* [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039)

Signed-off-by: Qiliang Cui <derrhein@gmail.com>

* [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253)

Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com>

* Add FlexAttention to V1 (vllm-project#16078)

Signed-off-by: drisspg <drisspguessous@gmail.com>

* [Misc] refactor context extension (vllm-project#19246)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311)

Signed-off-by: Lifan Shen <lifans@meta.com>

* [AMD] Update compatible packaging version (vllm-project#19309)

Signed-off-by: pramkuma <Pramendra.Kumar@amd.com>

* [BugFix][V1] Fix memory profiling bug (vllm-project#18974)

Signed-off-by: luka <luka@neuralmagic.com>

* [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299)

Signed-off-by: Richard Zou <zou3519@gmail.com>

* [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302)

Signed-off-by: rzou <zou3519@gmail.com>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315)

Signed-off-by: Xu Wenqing <xuwq1993@qq.com>

* [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082)

Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>

* [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312)

* [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242)

Signed-off-by: yZhen <yZhen@fb.com>
Co-authored-by: yZhen <yZhen@fb.com>

* [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201)

* [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Add documentation update reminder to PR template (vllm-project#19289)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Frontend] Remove unreachable code from llm.py (vllm-project#19288)

Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com>

* [Misc] Cleanup compilation tests (vllm-project#19343)

Signed-off-by: rzou <zou3519@gmail.com>

* [doc] improve ci doc (vllm-project#19307)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333)

Signed-off-by: cr7258 <chengzw258@163.com>

* [CI/Build] Fix LoRA test (vllm-project#19350)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328)

Signed-off-by: Conroy Cheers <conroy@corncheese.org>

* [CI] Introduce rules for llama auto-label (vllm-project#19323)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Docs] Fix a bullet list in usage/security.md (vllm-project#19358)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* [full_graph] Fix query_start_loc padding (vllm-project#19321)

Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>

* [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319)

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298)

Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun <vsundarr@redhat.com>

* [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [Quantization] Bump compressed-tensors version (vllm-project#19295)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472)

Signed-off-by: liusiqian <liusiqian@tal.com>

* [TPU]Fix KV cache sharing tests (vllm-project#19371)

* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374)

Signed-off-by: Pavani Majety <pmajety@nvidia.com>

* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

* [Bugfix] Fix benchmark_moe.py (vllm-project#19016)

Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>

* Use xla flag to improve the quantized model performance (vllm-project#19303)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)

* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Core] Use tuple for kv cache group block ids (vllm-project#19175)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Bugfix] Fix modelscope token passed in (vllm-project#19389)

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* [Core] Batch multi modal input using pinned memory (vllm-project#19169)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* Add security warning to bug report template (vllm-project#19365)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* Add clear documentation around the impact of debugging flag (vllm-project#19369)

Signed-off-by: Anna Pendleton <pendleton@google.com>

* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>

* Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404)

* [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134)

Signed-off-by: Yunqiu Guo <guorachel@meta.com>

* [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411)

Signed-off-by: jiang.li <jiang1.li@intel.com>

* Simplify ep kernels installation (vllm-project#19412)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Slight improvement of the BNB (vllm-project#19418)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Docs] Note that alternative structured output backends are supported (vllm-project#19426)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [Model] use AutoWeightsLoader for commandr (vllm-project#19399)

Signed-off-by: py-andy-c <pychen1017@gmail.com>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401)

Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>

* [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390)

Signed-off-by: rzou <zou3519@gmail.com>

* [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260)

* [BugFix] Fix docker build cpu-dev image error (vllm-project#19394)

Signed-off-by: niu_he <carlton2tang@gmail.com>

* Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451)

Signed-off-by: Lu Fang <lufang@fb.com>

* [CI] Disable failing GGUF model test (vllm-project#19454)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455)

Signed-off-by: Junhao Li <junhao@ubicloud.com>

* Fix Typo in Documentation and Function Name (vllm-project#19442)

* [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Kernel] Support deep_gemm for linear methods (vllm-project#19085)

Signed-off-by: artetaout <lulala341@gmail.com>

* [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Fix quantization link titles (vllm-project#19478)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Support "important" and "announcement" admonitions (vllm-project#19479)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Reduce warning message introduced in env_override (vllm-project#19476)

Signed-off-by: Lu Fang <lufang@fb.com>

* Support non-string values in JSON keys from CLI (vllm-project#19471)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Add cache to cuda get_device_capability (vllm-project#19436)

Signed-off-by: mgoin <mgoin64@gmail.com>

* Fix some typo (vllm-project#19475)

Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>

* Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>

* [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453)

Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com>

* [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [doc] fix "Other AI accelerators" getting started page (vllm-project#19457)

Signed-off-by: David Xia <david@davidxia.com>

* [Misc] Fix misleading ROCm warning (vllm-project#19486)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Docs] Remove WIP features in V1 guide (vllm-project#19498)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168)

Signed-off-by: Bill Nell <bnell@redhat.com>

* [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501)

Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>

* [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505)

Signed-off-by: Richard Zou <zou3519@gmail.com>

* [CI] change spell checker from codespell to typos (vllm-project#18711)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518)

Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* [Frontend] Improve error message in tool_choice validation (vllm-project#19239)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522)

Signed-off-by: strutive07 <strutive07@gmail.com>

* [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* Fix typo (vllm-project#19525)

Signed-off-by: 2niuhe <carlton2tang@gmail.com>

* [Security] Prevent new imports of (cloud)pickle (vllm-project#18018)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>

* [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503)

Signed-off-by: Jon Swenson <jmswen@gmail.com>

* [Quantization] Improve AWQ logic (vllm-project#19431)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Doc] Add V1 column to supported models list (vllm-project#19523)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532)

Signed-off-by: NickLucche <nlucches@redhat.com>

* [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Fix TorchAOConfig skip layers (vllm-project#19265)

Signed-off-by: mobicham <hicham@mobiuslabs.com>

* [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>

* [doc] Make top navigation sticky (vllm-project#19540)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847)

* [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506)

* [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Doc] Unify structured outputs examples (vllm-project#18196)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [V1] Resolve failed concurrent structured output requests (vllm-project#19565)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378)

* [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570)

Signed-off-by: qizixi <qizixi@meta.com>

* [Doc] uses absolute links for structured outputs (vllm-project#19582)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [doc] fix incorrect link (vllm-project#19586)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Misc] Correct broken docs link (vllm-project#19553)

Signed-off-by: Zerohertz <ohg3417@gmail.com>

* [CPU] Refine default config for the CPU backend (vllm-project#19539)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [Fix] bump mistral common to support magistral (vllm-project#19533)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* use base version for version comparison (vllm-project#19587)

Signed-off-by: Boyuan Feng <boyuan@meta.com>

* [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Model] Fix minimax model cache & lm_head precision (vllm-project#19592)

Signed-off-by: qingjun <qingjun@minimaxi.com>

* [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581)

Signed-off-by: luka <luka@neuralmagic.com>

* [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377)

Signed-off-by: Anna Pendleton <pendleton@google.com>

* [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618)

* Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508)

Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354)

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>

* [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633)

* [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316)

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500)

* Only build CUTLASS MoE kernels on Hopper (vllm-project#19648)

* [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561)

* [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262)

* [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566)

* [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644)

* [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Enable prefix caching with full cuda graphs (vllm-project#19617)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589)

* [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [MISC] Remove unused variableds in C++ (vllm-project#19609)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957)

Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>

* [Misc][Frontend] passthrough `bad_words` (vllm-project#19564)

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>

* [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

* [TPU] support attention head dim smaller than 128 (vllm-project#19620)

Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>

* [MISC] typo fix (vllm-project#19672)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [CI] Add mteb testing for rerank models (vllm-project#19344)

* [Docs] Move multiproc doc to v1 dir (vllm-project#19651)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754)

Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>

* [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557)

* [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. (vllm-project#19652)

Signed-off-by: Shawn Tan <shawntan@ibm.com>

* [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Kernels] Use empty for modular MoE workspaces (vllm-project#19667)

Signed-off-by: Bill Nell <bnell@redhat.com>

* [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677)

Signed-off-by: QscQ <qscqesze@gmail.com>

* [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* fix

Signed-off-by: Amog Kamsetty <amogkamsetty@gmail.com>

* remove logging

Signed-off-by: Amog Kamsetty <amogkamsetty@gmail.com>

---------

Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: googs1025 <googs1025@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Varun <vsundarr@redhat.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: calvin chen <120380290@qq.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Signed-off-by: Jon Swenson <jmswen@gmail.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Chiyue Wei <chiyuew@nvidia.com>
Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Signed-off-by: Xu Song <xusong.vip@gmail.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: rzou <zou3519@gmail.com>
Signed-off-by: Siqi Yan <siqi@meta.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Chenyaaang <chenyangli@google.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Lifan Shen <lifans@meta.com>
Signed-off-by: pramkuma <Pramendra.Kumar@amd.com>
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: Xu Wenqing <xuwq1993@qq.com>
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Signed-off-by: yZhen <yZhen@fb.com>
Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com>
Signed-off-by: cr7258 <chengzw258@163.com>
Signed-off-by: Conroy Cheers <conroy@corncheese.org>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: liusiqian <liusiqian@tal.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: Yunqiu Guo <guorachel@meta.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: py-andy-c <pychen1017@gmail.com>
Signed-off-by: niu_he <carlton2tang@gmail.com>
Signed-off-by: Junhao Li <junhao@ubicloud.com>
Signed-off-by: artetaout <lulala341@gmail.com>
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com>
Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com>
Signed-off-by: David Xia <david@davidxia.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: strutive07 <strutive07@gmail.com>
Signed-off-by: 2niuhe <carlton2tang@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: mobicham <hicham@mobiuslabs.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: qingjun <qingjun@minimaxi.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
Signed-off-by: Shawn Tan <shawntan@ibm.com>
Signed-off-by: QscQ <qscqesze@gmail.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@gmail.com>
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: CYJiang <86391540+googs1025@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikun@apache.org>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Calvin Chen <45745657+calvin0327@users.noreply.github.com>
Co-authored-by: Kaixi Hou <kaixih@nvidia.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Xu Wenqing <121550081+Xu-Wenqing@users.noreply.github.com>
Co-authored-by: Lain <fusiyuan2000@hotmail.com>
Co-authored-by: jmswen <jmswen@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Yang Wang <elainewy@meta.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Chiyue Wei <92623189+dubcyfor3@users.noreply.github.com>
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com>
Co-authored-by: Povilas Kanapickas <povilas@radix.lt>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Luis Vega <vegaluisjose@users.noreply.github.com>
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Xu Song <xusong.vip@gmail.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Jinghui Zhang <jinghuizhang0804@gmail.com>
Co-authored-by: jinghui <jinghui@fb.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: Siqi Yan <ysq0807@hotmail.com>
Co-authored-by: Siqi Yan <siqi@meta.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Yu Guo <82124926+yuguo68@users.noreply.github.com>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Co-authored-by: Adolfo Victoria <adolfokarim@gmail.com>
Co-authored-by: Adolfo Victoria <adovi@meta.com>
Co-authored-by: Chenyaaang <42742451+Chenyaaang@users.noreply.github.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: QiliangCui <derrhein@gmail.com>
Co-authored-by: Aaruni Aggarwal <47731267+AaruniAggarwal@users.noreply.github.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: Lifans <draftbks@gmail.com>
Co-authored-by: pramenku <7664080+pramenku@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Akash kaothalkar <61960177+Akashcodes732@users.noreply.github.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: jennyyyyzhen <47012288+jennyyyyzhen@users.noreply.github.com>
Co-authored-by: yZhen <yZhen@fb.com>
Co-authored-by: Kseniya Parkhamchuk <43078183+KsuParkhamchuk@users.noreply.github.com>
Co-authored-by: Se7en <chengzw258@163.com>
Co-authored-by: Conroy Cheers <conroy@corncheese.org>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: liusiqian-tal <141730978+liusiqian-tal@users.noreply.github.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>
Co-authored-by: XiongfeiWei <isaacwxf23@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Anna Pendleton <pendleton@google.com>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: py-andy-c <37168711+py-andy-c@users.noreply.github.com>
Co-authored-by: niu_he <carlton2tang@gmail.com>
Co-authored-by: Junhao Li <junhao@ubicloud.com>
Co-authored-by: leopardracer <136604165+leopardracer@users.noreply.github.com>
Co-authored-by: artetaout <128046886+artetaout@users.noreply.github.com>
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: runzhen <wangrunzhen@gmail.com>
Co-authored-by: David Xia <david@davidxia.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: wonjun Jang <strutive07@gmail.com>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: mobicham <37179323+mobicham@users.noreply.github.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com>
Co-authored-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>
Co-authored-by: Boyuan Feng <fby.1994@gmail.com>
Co-authored-by: qscqesze <qingjun@minimaxi.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Saheli Bhattacharjee <47847054+sahelib25@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: maobaolong <baoloongmao@tencent.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: quanliu <33453350+quanliu1991@users.noreply.github.com>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: Francesco Bertolotti <f14.bertolotti@gmail.com>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: Navanit Dubey <98005188+Navanit-git@users.noreply.github.com>
Co-authored-by: Shawn Tan <shawntan@ibm.com>
Co-authored-by: qscqesze <qscqesze@gmail.com>
amogkam added a commit to character-tech/vllm that referenced this pull request Jun 16, 2025
* [Bugfix] disable processor cache (vllm-project#19068)

Signed-off-by: raushan <raushan@huggingface.co>

* [Doc] Improve the Pull Request template with key components (vllm-project#19086)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Misc] Add missing `_Backend` enums (vllm-project#19081)

Signed-off-by: nicklucche <nlucches@redhat.com>

* [Misc] fix: add missing best_of param validation (vllm-project#18555)

Signed-off-by: googs1025 <googs1025@gmail.com>

* [Misc] Add SPDX-FileCopyrightText (vllm-project#19100)

Signed-off-by: simon-mo <simon.mo@hey.com>

* [Doc] Readme standardization (vllm-project#18695)

Co-authored-by: Soren Dreano <soren@numind.ai>

* [doc] update docker version (vllm-project#19074)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434)

Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* [V1] Support cross-layer KV sharing (vllm-project#18212)

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

* [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844)

Signed-off-by: mgoin <mgoin64@gmail.com>

* Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikun@apache.org>

* [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971)

* [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411)

Signed-off-by: nicklucche <nlucches@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

* [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* feat: add data parallel rank to KVEventBatch (vllm-project#18925)

* [Misc] Fix path and python alias errors in disagg_prefill examples (vllm-project#18919)

* [Docs] Add developer doc about CI failures (vllm-project#18782)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [CPU] V1 support for the CPU backend (vllm-project#16441)

* [Core] Cast multimodal input in hf processor (vllm-project#18862)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437)

* [Cleanup][v1]: remove guided-decoding-backend for example (vllm-project#19059)

Signed-off-by: calvin chen <120380290@qq.com>

* [NVIDIA] Add Cutlass MLA backend (vllm-project#17625)

* [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* Fix vllm-project#19130 (vllm-project#19132)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [TPU] Skip hanging tests (vllm-project#19115)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* Fix ValueError: Missing value for tag key(s): model_name,engine. (vllm-project#19113)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

* [Misc] Add packages for benchmark as extra dependency (vllm-project#19089)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Improve the output precision of embedding models (vllm-project#19092)

* [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Add DeepSeek-R1-0528 function call chat template (vllm-project#18874)

Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>

* Sm100 blockwise fp8 swap ab (vllm-project#18564)

* [Doc] Update V1 Guide for embedding models (vllm-project#19141)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102)

Signed-off-by: Jon Swenson <jmswen@gmail.com>

* [Bugfix][EP+DP] Fix internode check (vllm-project#19112)

Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>

* [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [TPU] Update dynamo dump file name in compilation test (vllm-project#19108)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121)

* [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111)

Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun <vsundarr@redhat.com>

* [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817)

Signed-off-by: googs1025 <googs1025@gmail.com>

* [P/D] Heterogeneous TP (vllm-project#18833)

Signed-off-by: nicklucche <nlucches@redhat.com>

* [doc] small fix (vllm-project#19167)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632)

Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>

* [Bugfix] Fix port handling in make_zmq_path (vllm-project#19117)

* [Torch Nightly]add missing dependency (vllm-project#18770)

Signed-off-by: Yang Wang <elainewy@meta.com>

* Handle non-serializable objects when dumping benchmark results (vllm-project#19114)

* [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Build] Annotate wheel and container path for release workflow (vllm-project#19162)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [Frontend] improve vllm run-batch --help display (vllm-project#19187)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202)

Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>

* [mistral_common] Add v11 tokenizer (vllm-project#19193)

Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205)

* [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110)

Signed-off-by: Chiyue Wei <chiyuew@nvidia.com>
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com>

* [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226)

Signed-off-by: Povilas Kanapickas <povilas@radix.lt>

* [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090)

* [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217)

* [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118)

* [Model] NemotronH support (vllm-project#18863)

Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>

* Fix AOPerModuleConfig name changes (vllm-project#18869)

Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>

* [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033)

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

* [v1] Hybrid Memory Allocator (vllm-project#17996)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [TPU] update torch_xla pin (vllm-project#19231)

Signed-off-by: Chengji Yao <chengjiyao@google.com>

* Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143)

Signed-off-by: Xu Song <xusong.vip@gmail.com>

* [Chore] update CODEOWNERS (vllm-project#19247)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [v1][P/D] Fix an edge case in kv cache schedule (vllm-project#19182)

Co-authored-by: jinghui <jinghui@fb.com>

* [TPU] fix kv cache dtype in model runner (vllm-project#19244)

Signed-off-by: Chengji Yao <chengjiyao@google.com>

* [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224)

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

* [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172)

Signed-off-by: Nick Hill <nhill@redhat.com>

* Fix CompilationConfig repr (vllm-project#19091)

Signed-off-by: rzou <zou3519@gmail.com>

* Unit Test for run_dp_sharded_vision_model (vllm-project#19103)

Signed-off-by: Siqi Yan <siqi@meta.com>
Co-authored-by: Siqi Yan <siqi@meta.com>

* [Model] Optimize nemotron_h implementation (vllm-project#19249)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227)

Signed-off-by: Jon Swenson <jmswen@gmail.com>

* improve logits bias (vllm-project#19041)

* Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422)

Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>

* [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225)

Co-authored-by: Adolfo Victoria <adovi@meta.com>

* [Core] Fix abrupt request abort (vllm-project#18485)

Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>

Co-authored-by: Nick Hill <nhill@redhat.com>

* [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163)

Signed-off-by: Chenyaaang <chenyangli@google.com>

* [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>

* [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039)

Signed-off-by: Qiliang Cui <derrhein@gmail.com>

* [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253)

Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com>

* Add FlexAttention to V1 (vllm-project#16078)

Signed-off-by: drisspg <drisspguessous@gmail.com>

* [Misc] refactor context extension (vllm-project#19246)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311)

Signed-off-by: Lifan Shen <lifans@meta.com>

* [AMD] Update compatible packaging version (vllm-project#19309)

Signed-off-by: pramkuma <Pramendra.Kumar@amd.com>

* [BugFix][V1] Fix memory profiling bug (vllm-project#18974)

Signed-off-by: luka <luka@neuralmagic.com>

* [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299)

Signed-off-by: Richard Zou <zou3519@gmail.com>

* [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302)

Signed-off-by: rzou <zou3519@gmail.com>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315)

Signed-off-by: Xu Wenqing <xuwq1993@qq.com>

* [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082)

Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>

* [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312)

* [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242)

Signed-off-by: yZhen <yZhen@fb.com>
Co-authored-by: yZhen <yZhen@fb.com>

* [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201)

* [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Add documentation update reminder to PR template (vllm-project#19289)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Frontend] Remove unreachable code from llm.py (vllm-project#19288)

Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com>

* [Misc] Cleanup compilation tests (vllm-project#19343)

Signed-off-by: rzou <zou3519@gmail.com>

* [doc] improve ci doc (vllm-project#19307)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333)

Signed-off-by: cr7258 <chengzw258@163.com>

* [CI/Build] Fix LoRA test (vllm-project#19350)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328)

Signed-off-by: Conroy Cheers <conroy@corncheese.org>

* [CI] Introduce rules for llama auto-label (vllm-project#19323)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Docs] Fix a bullet list in usage/security.md (vllm-project#19358)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* [full_graph] Fix query_start_loc padding (vllm-project#19321)

Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>

* [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319)

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298)

Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun <vsundarr@redhat.com>

* [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [Quantization] Bump compressed-tensors version (vllm-project#19295)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472)

Signed-off-by: liusiqian <liusiqian@tal.com>

* [TPU]Fix KV cache sharing tests (vllm-project#19371)

* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374)

Signed-off-by: Pavani Majety <pmajety@nvidia.com>

* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

* [Bugfix] Fix benchmark_moe.py (vllm-project#19016)

Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>

* Use xla flag to improve the quantized model performance (vllm-project#19303)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)

* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Core] Use tuple for kv cache group block ids (vllm-project#19175)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Bugfix] Fix modelscope token passed in (vllm-project#19389)

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* [Core] Batch multi modal input using pinned memory (vllm-project#19169)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* Add security warning to bug report template (vllm-project#19365)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* Add clear documentation around the impact of debugging flag (vllm-project#19369)

Signed-off-by: Anna Pendleton <pendleton@google.com>

* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>

* Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404)

* [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134)

Signed-off-by: Yunqiu Guo <guorachel@meta.com>

* [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411)

Signed-off-by: jiang.li <jiang1.li@intel.com>

* Simplify ep kernels installation (vllm-project#19412)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Slight improvement of the BNB (vllm-project#19418)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Docs] Note that alternative structured output backends are supported (vllm-project#19426)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [Model] use AutoWeightsLoader for commandr (vllm-project#19399)

Signed-off-by: py-andy-c <pychen1017@gmail.com>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401)

Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>

* [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390)

Signed-off-by: rzou <zou3519@gmail.com>

* [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260)

* [BugFix] Fix docker build cpu-dev image error (vllm-project#19394)

Signed-off-by: niu_he <carlton2tang@gmail.com>

* Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451)

Signed-off-by: Lu Fang <lufang@fb.com>

* [CI] Disable failing GGUF model test (vllm-project#19454)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455)

Signed-off-by: Junhao Li <junhao@ubicloud.com>

* Fix Typo in Documentation and Function Name (vllm-project#19442)

* [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Kernel] Support deep_gemm for linear methods (vllm-project#19085)

Signed-off-by: artetaout <lulala341@gmail.com>

* [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Fix quantization link titles (vllm-project#19478)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Support "important" and "announcement" admonitions (vllm-project#19479)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Reduce warning message introduced in env_override (vllm-project#19476)

Signed-off-by: Lu Fang <lufang@fb.com>

* Support non-string values in JSON keys from CLI (vllm-project#19471)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Add cache to cuda get_device_capability (vllm-project#19436)

Signed-off-by: mgoin <mgoin64@gmail.com>

* Fix some typo (vllm-project#19475)

Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>

* Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>

* [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453)

Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com>

* [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [doc] fix "Other AI accelerators" getting started page (vllm-project#19457)

Signed-off-by: David Xia <david@davidxia.com>

* [Misc] Fix misleading ROCm warning (vllm-project#19486)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Docs] Remove WIP features in V1 guide (vllm-project#19498)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168)

Signed-off-by: Bill Nell <bnell@redhat.com>

* [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501)

Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>

* [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505)

Signed-off-by: Richard Zou <zou3519@gmail.com>

* [CI] change spell checker from codespell to typos (vllm-project#18711)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518)

Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* [Frontend] Improve error message in tool_choice validation (vllm-project#19239)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522)

Signed-off-by: strutive07 <strutive07@gmail.com>

* [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* Fix typo (vllm-project#19525)

Signed-off-by: 2niuhe <carlton2tang@gmail.com>

* [Security] Prevent new imports of (cloud)pickle (vllm-project#18018)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>

* [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503)

Signed-off-by: Jon Swenson <jmswen@gmail.com>

* [Quantization] Improve AWQ logic (vllm-project#19431)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Doc] Add V1 column to supported models list (vllm-project#19523)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532)

Signed-off-by: NickLucche <nlucches@redhat.com>

* [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Fix TorchAOConfig skip layers (vllm-project#19265)

Signed-off-by: mobicham <hicham@mobiuslabs.com>

* [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>

* [doc] Make top navigation sticky (vllm-project#19540)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847)

* [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506)

* [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Doc] Unify structured outputs examples (vllm-project#18196)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [V1] Resolve failed concurrent structured output requests (vllm-project#19565)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378)

* [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

* [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570)

Signed-off-by: qizixi <qizixi@meta.com>

* [Doc] uses absolute links for structured outputs (vllm-project#19582)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [doc] fix incorrect link (vllm-project#19586)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Misc] Correct broken docs link (vllm-project#19553)

Signed-off-by: Zerohertz <ohg3417@gmail.com>

* [CPU] Refine default config for the CPU backend (vllm-project#19539)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [Fix] bump mistral common to support magistral (vllm-project#19533)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* use base version for version comparison (vllm-project#19587)

Signed-off-by: Boyuan Feng <boyuan@meta.com>

* [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Model] Fix minimax model cache & lm_head precision (vllm-project#19592)

Signed-off-by: qingjun <qingjun@minimaxi.com>

* [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581)

Signed-off-by: luka <luka@neuralmagic.com>

* [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377)

Signed-off-by: Anna Pendleton <pendleton@google.com>

* [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618)

* Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508)

Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354)

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>

* [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633)

* [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316)

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500)

* Only build CUTLASS MoE kernels on Hopper (vllm-project#19648)

* [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561)

* [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262)

* [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566)

* [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644)

* [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339)

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>

* [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Enable prefix caching with full cuda graphs (vllm-project#19617)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589)

* [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [MISC] Remove unused variableds in C++ (vllm-project#19609)

Signed-off-by: Lu Fang <lufang@fb.com>

* [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957)

Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>

* [Misc][Frontend] passthrough `bad_words` (vllm-project#19564)

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>

* [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

* [TPU] support attention head dim smaller than 128 (vllm-project#19620)

Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>

* [MISC] typo fix (vllm-project#19672)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [CI] Add mteb testing for rerank models (vllm-project#19344)

* [Docs] Move multiproc doc to v1 dir (vllm-project#19651)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754)

Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>

* [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557)

* [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. (vllm-project#19652)

Signed-off-by: Shawn Tan <shawntan@ibm.com>

* [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Kernels] Use empty for modular MoE workspaces (vllm-project#19667)

Signed-off-by: Bill Nell <bnell@redhat.com>

* [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677)

Signed-off-by: QscQ <qscqesze@gmail.com>

* [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* fix

Signed-off-by: Amog Kamsetty <amogkamsetty@gmail.com>

---------

Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: googs1025 <googs1025@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Varun <vsundarr@redhat.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: calvin chen <120380290@qq.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Signed-off-by: Jon Swenson <jmswen@gmail.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Chiyue Wei <chiyuew@nvidia.com>
Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Signed-off-by: Xu Song <xusong.vip@gmail.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: rzou <zou3519@gmail.com>
Signed-off-by: Siqi Yan <siqi@meta.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Chenyaaang <chenyangli@google.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Lifan Shen <lifans@meta.com>
Signed-off-by: pramkuma <Pramendra.Kumar@amd.com>
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: Xu Wenqing <xuwq1993@qq.com>
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Signed-off-by: yZhen <yZhen@fb.com>
Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com>
Signed-off-by: cr7258 <chengzw258@163.com>
Signed-off-by: Conroy Cheers <conroy@corncheese.org>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: liusiqian <liusiqian@tal.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: Yunqiu Guo <guorachel@meta.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: py-andy-c <pychen1017@gmail.com>
Signed-off-by: niu_he <carlton2tang@gmail.com>
Signed-off-by: Junhao Li <junhao@ubicloud.com>
Signed-off-by: artetaout <lulala341@gmail.com>
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com>
Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com>
Signed-off-by: David Xia <david@davidxia.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: strutive07 <strutive07@gmail.com>
Signed-off-by: 2niuhe <carlton2tang@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: mobicham <hicham@mobiuslabs.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: qingjun <qingjun@minimaxi.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
Signed-off-by: Shawn Tan <shawntan@ibm.com>
Signed-off-by: QscQ <qscqesze@gmail.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@gmail.com>
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: CYJiang <86391540+googs1025@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikun@apache.org>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Calvin Chen <45745657+calvin0327@users.noreply.github.com>
Co-authored-by: Kaixi Hou <kaixih@nvidia.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Xu Wenqing <121550081+Xu-Wenqing@users.noreply.github.com>
Co-authored-by: Lain <fusiyuan2000@hotmail.com>
Co-authored-by: jmswen <jmswen@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Yang Wang <elainewy@meta.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Chiyue Wei <92623189+dubcyfor3@users.noreply.github.com>
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com>
Co-authored-by: Povilas Kanapickas <povilas@radix.lt>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Luis Vega <vegaluisjose@users.noreply.github.com>
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Xu Song <xusong.vip@gmail.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Jinghui Zhang <jinghuizhang0804@gmail.com>
Co-authored-by: jinghui <jinghui@fb.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: Siqi Yan <ysq0807@hotmail.com>
Co-authored-by: Siqi Yan <siqi@meta.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Yu Guo <82124926+yuguo68@users.noreply.github.com>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
Co-authored-by: Adolfo Victoria <adolfokarim@gmail.com>
Co-authored-by: Adolfo Victoria <adovi@meta.com>
Co-authored-by: Chenyaaang <42742451+Chenyaaang@users.noreply.github.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: QiliangCui <derrhein@gmail.com>
Co-authored-by: Aaruni Aggarwal <47731267+AaruniAggarwal@users.noreply.github.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: Lifans <draftbks@gmail.com>
Co-authored-by: pramenku <7664080+pramenku@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Akash kaothalkar <61960177+Akashcodes732@users.noreply.github.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: jennyyyyzhen <47012288+jennyyyyzhen@users.noreply.github.com>
Co-authored-by: yZhen <yZhen@fb.com>
Co-authored-by: Kseniya Parkhamchuk <43078183+KsuParkhamchuk@users.noreply.github.com>
Co-authored-by: Se7en <chengzw258@163.com>
Co-authored-by: Conroy Cheers <conroy@corncheese.org>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: liusiqian-tal <141730978+liusiqian-tal@users.noreply.github.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>
Co-authored-by: XiongfeiWei <isaacwxf23@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Anna Pendleton <pendleton@google.com>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: py-andy-c <37168711+py-andy-c@users.noreply.github.com>
Co-authored-by: niu_he <carlton2tang@gmail.com>
Co-authored-by: Junhao Li <junhao@ubicloud.com>
Co-authored-by: leopardracer <136604165+leopardracer@users.noreply.github.com>
Co-authored-by: artetaout <128046886+artetaout@users.noreply.github.com>
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: runzhen <wangrunzhen@gmail.com>
Co-authored-by: David Xia <david@davidxia.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: wonjun Jang <strutive07@gmail.com>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: mobicham <37179323+mobicham@users.noreply.github.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com>
Co-authored-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>
Co-authored-by: Boyuan Feng <fby.1994@gmail.com>
Co-authored-by: qscqesze <qingjun@minimaxi.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Saheli Bhattacharjee <47847054+sahelib25@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: maobaolong <baoloongmao@tencent.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: quanliu <33453350+quanliu1991@users.noreply.github.com>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: Francesco Bertolotti <f14.bertolotti@gmail.com>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: Navanit Dubey <98005188+Navanit-git@users.noreply.github.com>
Co-authored-by: Shawn Tan <shawntan@ibm.com>
Co-authored-by: qscqesze <qscqesze@gmail.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
…lm-project#19298)

Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun <vsundarr@redhat.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
Labels: ready, v1

4 participants