[V1][Spec Decode] Eagle Model loading #16035
Conversation
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
I'm sure this has been discussed elsewhere, but why introduce a V1-specific model? (e.g. it looked like @luyuzhe111 expected …)
# We need to set the vllm_config here to register attention
# layers in the forward context.
Do we need to call `load_model()` from `__init__()` so that this function runs and the attention layers are registered?
Could you elaborate a bit on which `load_model` you are talking about?
The one in this file, `spec_decode/eagle.py`, here: https://github.com/vllm-project/vllm/pull/16035/files/59ee450306d3d719f78ad60c77ba9b739bc5cb11#diff-a4809a837fbf535a8f0999b11087a53ec1c53948b50c0a1fe64396bc86de9461R184
# We need to set the vllm_config here to register attention
# layers in the forward context.
with set_default_torch_dtype(
        draft_model_config.dtype), set_current_vllm_config(
- I didn't get the part which says that setting the current vllm config will lead to registering the attention layers. Can you please share more? My understanding is that we did not change any `self.vllm_config` in this function, and the attention layers are registered when the model is initialized, which saves the attention prefixes in `static_forward_context`; these are then used during `bind_kv_cache()`.
- So if `load_model()` is called in `__init__` in this file, would it not register the attention layers without the need for `set_current_vllm_config(self.vllm_config)`?
Which `load_model` are you referring to? This one?
The one in `spec_decode/eagle.py`, here: https://github.com/vllm-project/vllm/pull/16035/files/59ee450306d3d719f78ad60c77ba9b739bc5cb11#diff-a4809a837fbf535a8f0999b11087a53ec1c53948b50c0a1fe64396bc86de9461R184
I have broken my question above into two parts, along with my understanding, so that it is easier for you to explain what I am missing. Looking forward to your response.
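To make the mechanism under discussion concrete, here is a minimal sketch of the pattern (illustrative only: `load_draft_model` and `build_model` are hypothetical stand-ins, and the import paths may differ across vLLM versions):

```python
from vllm.config import set_current_vllm_config
from vllm.model_executor.model_loader.utils import set_default_torch_dtype


def load_draft_model(vllm_config, draft_model_config, build_model):
    # Any vLLM Attention layer constructed inside this `with` block records
    # itself, keyed by its layer-name prefix, in the *current* vllm config's
    # static_forward_context. bind_kv_cache() later looks layers up there,
    # so constructing the draft model outside this context would leave its
    # attention layers unregistered.
    with set_default_torch_dtype(draft_model_config.dtype), \
            set_current_vllm_config(vllm_config):
        return build_model()
```

In other words, the registration happens as a side effect of model construction, which is why construction has to run inside `set_current_vllm_config` rather than being triggered by a later call.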
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Hi folks, thanks for all the comments so far @markmc @ekagra-ranjan. I am double-checking the correctness and have not started fixing the comments; I will start addressing them tomorrow. Some updates:
Please start reviewing and checking correctness. cc @luyuzhe111 @WoosukKwon.
@LiuXiaoxuanPKU Good results! I was wondering how we are able to run EAGLE given that Tasks 2 and 3 in #15901 are WIP? What are the implications/assumptions of these with respect to the results shared in this PR?
Thanks for asking:
@LiuXiaoxuanPKU - are the results on GSM8K? If possible, can you run them on MTBench so that we can compare the results with SGL and identify gaps? https://docs.google.com/document/d/18ETJLsnxR88Qq3VDk5Mq-Hb7vuE9o3VNZ-hhz-OqAXk/edit?usp=sharing
@LiuXiaoxuanPKU Hi Lily, when running with …
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
I benchmarked this PR on MTBench using the … I wired up the SD metrics from the Scheduler into EngineCoreOutputs so that we can test correctness with the acceptance length. Currently, the SD metrics in the Scheduler get reinitialized every engine step and are not aggregated, so I fixed that. Here is a dummy PR which has the benchmarking script and the changes I made on top of this PR to get the results below. Please let me know if something is incorrect in my setup. I can also raise a PR with the SD metric, with some extra steps, if that makes sense. Here is the cmd used:
k=2 is 36% faster and k=4 is 28% faster than vanilla. Compared to the SGL bench:
The vLLM AL formula is here. Trends:
I feel the formulas for AL are similar and that shouldn't be the cause of the differences in AL, but we are getting lower AL for k=4. Please let me know your thoughts or if I missed something.
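For concreteness, AL here is the mean number of tokens emitted per verification step (the one verified/bonus token plus the accepted draft tokens); a minimal sketch, not vLLM's actual metrics code:

```python
# Illustrative aggregation of acceptance length (AL) over engine steps.
# Each step emits 1 verified/bonus token plus however many draft tokens
# were accepted, so AL = 1 + mean(accepted draft tokens per step).
def acceptance_length(accepted_per_step: list[int]) -> float:
    steps = len(accepted_per_step)
    total_emitted = sum(accepted_per_step) + steps  # +1 token per step
    return total_emitted / steps


# Example: with k=2 drafts, AL = (0+2+1+2+1 + 5) / 5 = 2.2 tokens per step.
print(acceptance_length([0, 2, 1, 2, 1]))
```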
@ekagra-ranjan This PR itself is not enough to support EAGLE correctly. We need to handle the draft model's KV cache properly.
@ekagra-ranjan Hi Ekagra, thanks for sharing the benchmarking results! I can confirm the acceptance length you collected for vLLM should be accurate, as I aggregate acceptance length from request-level, per-step acceptance counts in this PR #16367. I suspect you might have used EAGLE-2 in SGLang, which explains the larger acceptance length for k = 4. Can you double-check on that? I will include the acceptance length comparison of vLLM against the EAGLE-1 repo soon. Regardless, I think we should merge this PR soon to unblock further developments. Without this PR, we can't even debug : )
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
@luyuzhe111 - thank you for your response. In SGL, I am using a chain-based draft where …
Hi @ekagra-ranjan, if it's indeed chain drafts, then I don't think the acceptance length in SGL makes any sense. Basically the result is saying that the first two draft positions have 0.72 tokens accepted, and the next two draft positions also have 0.68 tokens accepted. Even in the best-case scenario (mean number of tokens accepted at each step = [1, 0.37, 0.35, 0.34, 0.33], assuming a minimal acceptance-rate drop between draft steps), it's impossible to have 0.68 tokens accepted at the third and fourth positions combined.
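To make that argument concrete, here is a small check that reproduces the numbers above (the per-position values are the comment's illustrative best case, not measured data):

```python
# With a chain draft, position i can only be accepted if positions 1..i-1
# were, so expected per-position acceptance is non-increasing. Using the
# comment's best-case per-position expectations (the leading 1.0 verified
# token excluded):
mean_accepted = [0.37, 0.35, 0.34, 0.33]

first_two = sum(mean_accepted[:2])  # 0.72 -> matches AL = 1.72 at k=2
next_two = sum(mean_accepted[2:])   # 0.67 -> below the 0.68 implied at k=4
print(first_two, next_two)
```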
For reference, these are the acceptance length comparisons between the EAGLE repo, vLLM v0, and vLLM v1 on MT Bench, assuming a single draft. Acceptance length is computed using #16367. When max number of generated tokens = 256:
When max number of generated tokens = 512:
Observations:
Hope the numbers here from the EAGLE repo can serve as a reference for future development efforts. cc @LiuXiaoxuanPKU @WoosukKwon
Hi @luyuzhe111, great work! Nice to see such early benchmarking!
IIUC, this PR is intended to allocate cache slots for the draft model, which should not affect the acceptance rate? If the implementation is correct, I assume it should be more related to the sampling/rejection method, and there are indeed some gaps, like https://github.com/vllm-project/vllm/blob/main/vllm/v1/spec_decode/eagle.py#L219. Could you double-check whether the default sampling parameters are consistent with those reported in the original EAGLE?
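As a reference for the rejection-method hypothesis, here is a hedged sketch of acceptance under greedy sampling (one common formulation, not necessarily vLLM's exact implementation):

```python
# With temperature 0, a draft token is accepted iff it equals the target
# model's argmax at that position; verification stops at the first mismatch.
def greedy_accept(draft_tokens: list[int], target_argmax: list[int]) -> int:
    accepted = 0
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            break
        accepted += 1
    return accepted  # tokens emitted this step = accepted + 1 (bonus token)


print(greedy_accept([5, 9, 2], [5, 9, 7]))  # -> 2
```

Any divergence between the draft-side sampling and this verification rule (e.g. the gap linked above) would show up directly in the acceptance rate.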
I have the same understanding. Can someone please share why #16370 would improve correctness?
LGTM! Thanks for the PR!
@wwl2755 @ekagra-ranjan Hi Wenlong and Ekagra, thanks for the comments. I actually don't understand #16370 too well and was hoping to dive deeper. Maybe we can keep the discussion under that PR now that this PR is merged? Regarding the hypothesis on the acceptance mechanism, I don't think the sampling parameters are the issue, since I used greedy sampling for both the EAGLE repo and vLLM.
@luyuzhe111 Thanks for sharing your observation! The issue was that SGL uses … Using 1 draft token, SGL gets 1.72 AL.
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: yihong0618 <zouzou0208@gmail.com> * [doc] fix 404 (vllm-project#16082) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * Revert "doc: add info for macos clang errors (vllm-project#16049)" (vllm-project#16091) Signed-off-by: yihong0618 <zouzou0208@gmail.com> * Fix some capitalisations in generated examples doc titles (vllm-project#16094) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Misc] format output for encoder_decoder.py (vllm-project#16095) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [Misc] Remove redundant code (vllm-project#16098) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine (vllm-project#15946) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> * [Model] use AutoWeightsLoader for phi, gemma, deepseek (vllm-project#16088) Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com> * [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 (vllm-project#16112) Signed-off-by: Lu Fang <fanglu@fb.com> * [Benchmark] Add sampling parameters to benchmark_serving. (vllm-project#16022) Signed-off-by: Hyesoo Yang <hyeygit@gmail.com> * [Frontend] Fix typo in tool chat templates for llama3.2 and toolace (vllm-project#14501) Signed-off-by: Ben Jackson <ben@ben.com> * [CI][V1] Fix passing `tokenizer` as kwarg to `validate_guidance_grammar` (vllm-project#16117) Signed-off-by: Roger Wang <ywang@roblox.com> * [Misc] refactor example eagle (vllm-project#16100) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [Doc][Bugfix] Add missing EOF in k8s deploy doc (vllm-project#16025) * [Misc] Improve model redirect to accept json dictionary (vllm-project#16119) Signed-off-by: Isotr0py <2037008807@qq.com> * [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 (vllm-project#16103) Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io> * [Bugfix] LoRA : Fix the order in which the kernels process LoRAs (vllm-project#16040) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [Bugfix] add hf_token to EngineArgs (vllm-project#16093) Signed-off-by: paolovic <paul-philipp.luley@uzh.ch> Co-authored-by: paolovic <paul-philipp.luley@uzh.ch> * [Misc] update requires-python in pyproject.toml (vllm-project#16116) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [TPU] Update PyTorch/XLA (vllm-project#16130) Signed-off-by: Chengji Yao <chengjiyao@google.com> * [V1][Minor] Optimize get_cached_block (vllm-project#16135) * Fix requires-python (vllm-project#16132) * [Metrics] Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` (vllm-project#15202) Signed-off-by: Kay Yan <kay.yan@daocloud.io> * [V1][Minor] Minor simplification for get_computed_blocks (vllm-project#16139) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Misc] Update Mistral-3.1 example (vllm-project#16147) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings (vllm-project#16129) Signed-off-by: Isotr0py <2037008807@qq.com> * [CI] Set max transformers version for Ultravox model test (vllm-project#16149) Signed-off-by: Roger Wang <ywang@roblox.com> * doc: fix some typos in doc (vllm-project#16154) Signed-off-by: yihong0618 
<zouzou0208@gmail.com> * [VLM] Florence-2 supports online serving (vllm-project#16164) Signed-off-by: Isotr0py <2037008807@qq.com> * [V1][Structured Output] Add `supports_structured_output()` method to Platform (vllm-project#16148) Signed-off-by: shen-shanshan <467638484@qq.com> * [Model] Add Qwen3 and Qwen3MoE (vllm-project#15289) Signed-off-by: YamPengLi <yampayne.lyp@alibaba-inc.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> * [Misc] improve example mlpspeculator and llm_engine_example (vllm-project#16175) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [Doc]Update image to latest version (vllm-project#16186) Signed-off-by: WangErXiao <863579016@qq.com> * Upstream Llama4 Support to Main (vllm-project#16113) Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com> Signed-off-by: Chris Thi <chris.c.thi@gmail.com> Signed-off-by: drisspg <drisspguessous@gmail.com> Signed-off-by: Jon Swenson <jmswen@gmail.com> Signed-off-by: Keyun Tong <tongkeyun@gmail.com> Signed-off-by: Lu Fang <fanglu@meta.com> Signed-off-by: Xiaodong Wang <xdwang@meta.com> Signed-off-by: Yang Chen <yangche@fb.com> Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com> Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Zijing Liu <liuzijing2014@gmail.com> Signed-off-by: Lu Fang <lufang@fb.com> Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: Lucia Fang <fanglu@fb.com> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Lu Fang <fanglu@fb.com> Co-authored-by: Roger Wang <ywang@roblox.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix] Re-enable support for `ChatGLMForConditionalGeneration` (vllm-project#16187) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [V1] Revert the default `max_num_seqs` to V0 values for most hardware (vllm-project#16158) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * Print the warning only once (vllm-project#16193) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [Misc] Human-readable `max-model-len` cli arg (vllm-project#16181) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [Misc] Move Llama 4 projector call into encoder execution (vllm-project#16201) * [Bugfix] Fix guidance backend for Qwen models (vllm-project#16210) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> * [V1][BugFix] Exit properly if engine core fails during startup (vllm-project#16137) Signed-off-by: Nick Hill <nhill@redhat.com> * [Misc] add description attribute in CLI (vllm-project#15921) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [Bugfix][V0] XGrammar structured output supports Enum (vllm-project#15878) Signed-off-by: Leon Seidel <leon.seidel@fau.de> * Torchao (vllm-project#14231) Signed-off-by: drisspg <drisspguessous@gmail.com> * [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping (vllm-project#16031) Signed-off-by: mgoin <michael@neuralmagic.com> * [core] do not send error across process (vllm-project#16174) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Update compressed-tensors to version 0.9.3 (vllm-project#16196) Signed-off-by: Miles Williams <42222518+mlsw@users.noreply.github.com> * Update BASE_IMAGE to 2.22 release of Neuron (vllm-project#16218) * [V1] Scatter and gather placeholders in the model runner 
(vllm-project#16076) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Jennifer Zhao <ai.jenniferzhao@gmail.com> * [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 (vllm-project#16161) * Add warning for Attention backends that do not support irope yet (vllm-project#16212) * [Bugfix] Do not skip "empty" parts of chats that are parsable (vllm-project#16219) Signed-off-by: mgoin <mgoin64@gmail.com> * [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version (vllm-project#16194) Signed-off-by: Isotr0py <2037008807@qq.com> * [torch.compile][TPU] Make @support_torch_compile work for XLA backend (vllm-project#15782) Signed-off-by: Siyuan Liu <lsiyuan@google.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> * [V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill (vllm-project#15837) Signed-off-by: mgoin <mgoin64@gmail.com> * [Misc] Merge the logs of pp layers partitions (vllm-project#16225) Signed-off-by: Kebe <mail@kebe7jun.com> * [Docs] Add Slides from Singapore Meetup (vllm-project#16213) Signed-off-by: simon-mo <simon.mo@hey.com> * [Misc] format and refactor some examples (vllm-project#16252) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [Misc] Add warning for multimodal data in LLM.beam_search (vllm-project#16241) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe (vllm-project#16203) Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io> * [BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm (vllm-project#16247) Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * [Bugfix] Remove triton do_bench fast_flush arg (vllm-project#16256) Signed-off-by: Kebe <mail@kebe7jun.com> * Update to transformers==4.51.1 (vllm-project#16257) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [New Model]: jinaai/jina-embeddings-v3 (vllm-project#16120) * [Misc] Avoid stripping meaningful whitespace from `nvidia-smi topo -m` output in collect_env.py (vllm-project#16272) Signed-off-by: imkero <kerorek@outlook.com> * [Bugfix] Proper input validation for multi-modal encoder-decoder models (vllm-project#16156) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix] Handle `process_weights_after_loading` for `QKVCrossParallelLinear` (vllm-project#15328) Signed-off-by: Isotr0py <2037008807@qq.com> * Add warning that content below line in template will be removed (vllm-project#16276) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [BugFix] Fix Llama4 - Index Error When Single Request Near Max Context (vllm-project#16209) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * [Bugfix] fix deepseek fp16 scale bug (vllm-project#14809) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> * [V1] Update structured output offline inference example (vllm-project#15721) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [CI/Build] Fix CI LoRA failure (vllm-project#16270) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * Add support to modelopt quantization of Mixtral model (vllm-project#15961) Signed-off-by: Yue <yueshen@nvidia.com> * [Model] Add smolvlm support (vllm-project#16017) Signed-off-by: chaunceyjiang 
<chaunceyjiang@gmail.com> * [Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs (vllm-project#16198) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: Hongxia Yang <hongxia.yang@amd.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> * [Bugfix] fix gettid method is not define (vllm-project#16084) Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io> * [Feature] Estimate max-model-len use available KV cache memory (vllm-project#16168) Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io> * [Core] Upgrade to xgrammar 0.1.18, add cache size limit (vllm-project#16283) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding (vllm-project#16221) Signed-off-by: mgoin <mgoin64@gmail.com> * [TPU] Update PyTorch/XLA (vllm-project#16288) Signed-off-by: Chengji Yao <chengjiyao@google.com> * [BugFix] Fix fusion test and add them to CI (vllm-project#16287) Signed-off-by: luka <luka@neuralmagic.com> * [Misc] Fix test_sharded_state_loader.py(vllm-project#16004) (vllm-project#16005) Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com> * [Bugfix] Avoid transferring cached multi-modal items from P0 to P1 (vllm-project#16273) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * Update label-tpu mergify and remove removal bot (vllm-project#16298) * [BugFix] logger is not callable (vllm-project#16312) Signed-off-by: yihong0618 <zouzou0208@gmail.com> * [BugFix] llama4 qknorm should be not shared across head (vllm-project#16311) Signed-off-by: Lu Fang <fanglu@fb.com> * update neuron config (vllm-project#16289) Signed-off-by: Ajay Vohra <ajayvohr@amazon.com> * [BugFix] fix some typos found by typos. 
(vllm-project#16314) Signed-off-by: yihong0618 <zouzou0208@gmail.com> * [Model] Add `SupportsMultiModal.get_language_model` interface (vllm-project#16007) Signed-off-by: NickLucche <nlucches@redhat.com> * [Bugfix][Frontend] respect provided default guided decoding backend (vllm-project#15476) Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com> * Revert "Update label-tpu mergify and remove removal bot" (vllm-project#16350) * [Bugfix] Fix profiling.py (vllm-project#16202) Signed-off-by: zh Wang <rekind133@outlook.com> * [Bugfix] catch AssertionError in MistralTokenizer as ValueError (vllm-project#16344) Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com> * [CI]Fix hpu docker and numpy version for CI (vllm-project#16355) Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Fix `benchmark_throughput.py --backend=hf` (vllm-project#16352) Signed-off-by: mgoin <mgoin64@gmail.com> * [Build/CI] Add tracing deps to vllm container image (vllm-project#15224) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [Hardware] add platform-specific request validation api (vllm-project#16291) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> * [Misc] refactor Structured Outputs example (vllm-project#16322) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues (vllm-project#16275) Signed-off-by: Chengji Yao <chengjiyao@google.com> * Add GLM-4-0414 support (vllm-project#16338) Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com> Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: yihong0618 <zouzou0208@gmail.com> Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: Ajay Vohra <ajayvohr@amazon.com> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com> Co-authored-by: Accelerator1996 <lvfei.lv@alibaba-inc.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: yihong <zouzou0208@gmail.com> Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com> Co-authored-by: ajayvohra2005 <ajayvohr@amazon.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com> * [Bugfix]: do not shutdown server if `skip_special_use=False` for MistralTokenizer (vllm-project#14094) Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com> * [Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral (vllm-project#16325) Signed-off-by: Aaron Ang <aaron.angyd@gmail.com> * [TPU] Fix dummy loading OOM (vllm-project#16372) Signed-off-by: Chengji Yao <chengjiyao@google.com> * [bugfix] Avoid the time consumption caused by creating dummy videos. 
(vllm-project#16371) * [CI][Bugfix] Pin triton version for CPU (vllm-project#16384) Signed-off-by: Roger Wang <ywang@roblox.com> * [misc] use tqdm.auto where appropriate (vllm-project#16290) Signed-off-by: Benjamin Kitor <bkitor@gigaio.com> * [Bugfix][TPU] Fix TPU validate_request (vllm-project#16369) Signed-off-by: Michael Goin <mgoin64@gmail.com> * fix sonnet dataset sample when prefix len is very small (vllm-project#16379) Signed-off-by: Chenyaaang <chenyangli@google.com> * [Model] use AutoWeightsLoader for deepseek_v2, internlm2 (vllm-project#16383) Signed-off-by: Aaron Ang <aaron.angyd@gmail.com> * [Misc] Update transformers version limits of multi-modal tests (vllm-project#16381) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix] Fix validation error for text-only Mllama 3.2 (vllm-project#16377) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models (vllm-project#16038) Signed-off-by: mgoin <mgoin64@gmail.com> * [doc] add download model tips (vllm-project#16389) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * Update Numba to 0.61.2 (vllm-project#16376) Signed-off-by: cyy <cyyever@outlook.com> * [Model] Remove image mm limit for LLaMa4 (vllm-project#16365) Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com> * [doc] update the wrong link (vllm-project#16401) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [CI] Add auto update workflow for Dockerfile graph (vllm-project#11879) Signed-off-by: wineandchord <guoqizhou19@gmail.com> * Fix the torch version parsing logic (vllm-project#15857) * [VLM] Remove `BaseProcessingInfo.get_mm_max_tokens_per_item` (vllm-project#16408) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [TPU][V1] Use `language_model` interface for getting text backbone in MM (vllm-project#16410) Signed-off-by: NickLucche <nlucches@redhat.com> * Improve configs - `ParallelConfig` (vllm-project#16332) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [V1] Set structured output backend to `auto` by default (vllm-project#15724) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [V1][Spec Decode] Eagle Model loading (vllm-project#16035) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> * [Bugfix] Fix bug when dataset is json (vllm-project#15899) Signed-off-by: Chenyaaang <chenyangli@google.com> * [Model] Reduce redundant computations in mamba2 blocks for Bamba-9B (vllm-project#15423) Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com> * [V1] Zero-copy tensor/ndarray serialization/transmission (vllm-project#13790) Signed-off-by: Nick Hill <nhill@redhat.com> * [VLM] Avoid unnecessary dummy multimodal data during processing (vllm-project#16416) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Bugfix] Fix output token length check logic (vllm-project#16419) Signed-off-by: look <eeslook@163.com> * [TPU][V1] Disable per-request seed/Generator (vllm-project#16172) Signed-off-by: NickLucche <nlucches@redhat.com> * Fix range_ratio Bug in RandomDataset (vllm-project#16126) Signed-off-by: jadewang21 <jadewangcn@outlook.com> * check input length of sonnet samples (vllm-project#16423) Signed-off-by: alexey-belyakov <alexey.belyakov@intel.com> * update benchmark_serving_structured_output to include auto backend (vllm-project#16438) Signed-off-by: Chenyaaang <chenyangli@google.com> * 
[Llama4] Enable attention temperature tuning by default for long context (>32k) (vllm-project#16439) Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com> Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com> * Update supported_hardware.md for TPU INT8 (vllm-project#16437) * [Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test (vllm-project#16424) Signed-off-by: Isotr0py <2037008807@qq.com> * [CPU][Bugfix] Fix CPU docker issues (vllm-project#16454) Signed-off-by: jiang.li <jiang1.li@intel.com> * [Bugfix] Don't set an upper bound on repetition penalty (vllm-project#16403) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> Co-authored-by: Nick Hill <nhill@redhat.com> * Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" (vllm-project#16453) * [Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner (vllm-project#15990) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True (vllm-project#16447) Signed-off-by: mgoin <mgoin64@gmail.com> * [Misc] Raise error for V1 not supporting Long LoRA. (vllm-project#16415) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [Misc] update api_client example (vllm-project#16459) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * Don't install triton on `ppc64le` platform (vllm-project#16470) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Kernel] support merge_attn_states CUDA kernel, 3x speedup (vllm-project#16173) Signed-off-by: DefTruth <qiustudent_r@163.com> * [Bugfix] Fix bugs of running Quark quantized models (vllm-project#16236) Signed-off-by: chaow <chaow@amd.com> * [Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU (vllm-project#12779) Signed-off-by: Tomasz Zielinski <tomasz.zielinski@intel.com> * Fix erroneous "model doesn't support compile" warning (vllm-project#16486) Signed-off-by: rzou <zou3519@gmail.com> * [TPU][V1] Make `--disable_chunked_mm_input` mandatory for serving MM models (vllm-project#16483) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (vllm-project#16366) Signed-off-by: mgoin <mgoin64@gmail.com> * [Doc] Document InternVL3 support (vllm-project#16495) Signed-off-by: Isotr0py <2037008807@qq.com> * [Bugfix] handle alignment of encoder_seq_lens in mllama.py (vllm-project#14784) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> * Improve configs - `LoadConfig` (vllm-project#16422) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] Added chat templates for LLaMa4 pythonic tool calling (vllm-project#16463) Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com> Co-authored-by: Kai Wu <kaiwu@meta.com> * [Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (vllm-project#16488) * Update openai_compatible_server.md (vllm-project#16507) Signed-off-by: Christian Sears <csears@redhat.com> * [Bugfix] clean up duplicated code (vllm-project#16485) Signed-off-by: Gogs <gogs@fake.local> Co-authored-by: Gogs <gogs@fake.local> * Bugfix for PixtralHF models without spatial_merge_size (vllm-project#16513) Signed-off-by: mgoin <mgoin64@gmail.com> * [Doc] Fix link to vLLM blog (vllm-project#16519) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [CI][Bugfix] Add mistral_tool_use to Ci (vllm-project#16517) Signed-off-by: mgoin <mgoin64@gmail.com> * [BugFix] Handle non-contiguous tensors properly when serializing 
(vllm-project#16492) Signed-off-by: Nick Hill <nhill@redhat.com> * [Doc] Update Llama4 Model Names in Supported Models (vllm-project#16509) Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com> * Optimized topk for topk=1 (Llama-4) (vllm-project#16512) Signed-off-by: mgoin <mgoin64@gmail.com> * [Feature][V1] Add xgrammar to support minLength, maxLength with test (vllm-project#16516) Signed-off-by: Leon Seidel <leon.seidel@fau.de> * [Frontend] support matryoshka representation / support embedding API dimensions (vllm-project#16331) * fix: spelling (vllm-project#16466) Signed-off-by: Tianer Zhou <ezhoureal@gmail.com> * [Misc] Update chat utils tests (vllm-project#16520) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Misc] Openai transcription client example use same Whisper model (vllm-project#16487) Signed-off-by: NickLucche <nlucches@redhat.com> * [V1] Enable multi-input by default (vllm-project#15799) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [MISC] Make GroupCoordinator compatible with out-of-tree devices (vllm-project#16464) Signed-off-by: hzji210@gmail.com <hzji210@gmail.com> * [Misc] Delete redundant code (vllm-project#16530) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * Fix syntaxWarning: invalid escape sequence '\s' (vllm-project#16532) Signed-off-by: Jie Fu <jiefu@tencent.com> * [Perf] Optimize Preparing Inputs for GPU Model Runner (vllm-project#16484) Signed-off-by: snowcharm <snowcharmqq@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com> * [Bugfix] Validate logit biases to prevent out of vocab ids crashing engine (vllm-project#16529) Signed-off-by: Ryan McConville <ryan@ryanmcconville.com> * [V1][Spec Decode] KV cache slots for eagle heads (vllm-project#16370) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> * Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (vllm-project#16537) Signed-off-by: mgoin <mgoin64@gmail.com> * [Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py (vllm-project#16556) * [Core][V0] Enable regex support with xgrammar (vllm-project#13228) Signed-off-by: Russell Bryant <rbryant@redhat.com> * capture only SP * batch_size <= max_batch_size case to cover small max_batch_size --------- Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: chun37 <chun.jb.37@gmail.com> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Chris Thi <chris.c.thi@gmail.com> Signed-off-by: lukas.bluebaum <lukas.bluebaum@aleph-alpha.com> Signed-off-by: Eric <erictang000@gmail.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Kay Yan <kay.yan@daocloud.io> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Liangfu Chen <liangfc@amazon.com> Signed-off-by: Matt, Matthias <matthias.matt@tuwien.ac.at> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io> Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Hyesoo Yang <hyeygit@gmail.com> Signed-off-by: Chengji Yao <chengjiyao@google.com> Signed-off-by: 
NickLucche <nlucches@redhat.com> Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com> Signed-off-by: yihong0618 <zouzou0208@gmail.com> Signed-off-by: Ziji Shi <shi.ziji.sm@gmail.com> Signed-off-by: StevenShi-23 <shi.ziji.sm@gmail.com> Signed-off-by: wwl2755 <wangwenlong2755@gmail.com> Signed-off-by: reidliu41 <reid201711@gmail.com> Signed-off-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com> Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com> Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com> Signed-off-by: zhenwei <zhenweiliu@habana.ai> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: ilmarkov <imarkov@redhat.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: Ben Jackson <ben@ben.com> Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Signed-off-by: paolovic <paul-philipp.luley@uzh.ch> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: YamPengLi <yampayne.lyp@alibaba-inc.com> Signed-off-by: WangErXiao <863579016@qq.com> Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com> Signed-off-by: drisspg <drisspguessous@gmail.com> Signed-off-by: Jon Swenson <jmswen@gmail.com> Signed-off-by: Keyun Tong <tongkeyun@gmail.com> Signed-off-by: Lu Fang <fanglu@meta.com> Signed-off-by: Xiaodong Wang <xdwang@meta.com> Signed-off-by: Yang Chen <yangche@fb.com> Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com> Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Zijing Liu <liuzijing2014@gmail.com> Signed-off-by: Lu Fang <lufang@fb.com> Signed-off-by: Lucia Fang <fanglu@fb.com> Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> Signed-off-by: Leon Seidel <leon.seidel@fau.de> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Miles Williams <42222518+mlsw@users.noreply.github.com> Signed-off-by: Siyuan Liu <lsiyuan@google.com> Signed-off-by: Kebe <mail@kebe7jun.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Signed-off-by: imkero <kerorek@outlook.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Yue <yueshen@nvidia.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: luka <luka@neuralmagic.com> Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com> Signed-off-by: Ajay Vohra <ajayvohr@amazon.com> Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com> Signed-off-by: zh Wang <rekind133@outlook.com> Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> Signed-off-by: Aaron Ang <aaron.angyd@gmail.com> Signed-off-by: Benjamin Kitor <bkitor@gigaio.com> Signed-off-by: Chenyaaang <chenyangli@google.com> Signed-off-by: cyy <cyyever@outlook.com> Signed-off-by: wineandchord <guoqizhou19@gmail.com> Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Signed-off-by: Chih-Chieh-Yang 
<7364402+cyang49@users.noreply.github.com> Signed-off-by: look <eeslook@163.com> Signed-off-by: jadewang21 <jadewangcn@outlook.com> Signed-off-by: alexey-belyakov <alexey.belyakov@intel.com> Signed-off-by: jiang.li <jiang1.li@intel.com> Signed-off-by: DefTruth <qiustudent_r@163.com> Signed-off-by: chaow <chaow@amd.com> Signed-off-by: Tomasz Zielinski <tomasz.zielinski@intel.com> Signed-off-by: rzou <zou3519@gmail.com> Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Christian Sears <csears@redhat.com> Signed-off-by: Gogs <gogs@fake.local> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Signed-off-by: Tianer Zhou <ezhoureal@gmail.com> Signed-off-by: hzji210@gmail.com <hzji210@gmail.com> Signed-off-by: Jie Fu <jiefu@tencent.com> Signed-off-by: snowcharm <snowcharmqq@gmail.com> Signed-off-by: Ryan McConville <ryan@ryanmcconville.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: chun <chun.jb.37@gmail.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Chris Thi <chris.c.thi@gmail.com> Co-authored-by: LukasBluebaum <38468743+LukasBluebaum@users.noreply.github.com> Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Matthias Matt <37695050+meffmadd@users.noreply.github.com> Co-authored-by: Liangfu Chen <liangfc@amazon.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: rongfu.leng <lenronfu@gmail.com> Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Hyesoo Yang <45211235+hyeygit@users.noreply.github.com> Co-authored-by: root <root@t1v-n-822696b7-w-0.us-central2-b.c.tpu-prod-env-large-adhoc.internal> Co-authored-by: Chengji Yao <chengjiyao@google.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com> Co-authored-by: yihong <zouzou0208@gmail.com> Co-authored-by: Ziji Shi (Steven) <shi.ziji.sm@gmail.com> Co-authored-by: wwl2755 <wangwenlong2755@gmail.com> Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com> Co-authored-by: reidliu41 <reid201711@gmail.com> Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: yarongmu-google <150371854+yarongmu-google@users.noreply.github.com> Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com> Co-authored-by: iefgnoix <isaacwxf23@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Huy Do <huydhn@gmail.com> Co-authored-by: Jonghyun Choe <andy.choe729@gmail.com> Co-authored-by: liuzhenwei <zhenweiliu@habana.ai> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Roger Wang 
<ywang@roblox.com> Co-authored-by: Ilya Markov <markovilya197@gmail.com> Co-authored-by: ilmarkov <imarkov@redhat.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Kevin H. Luu <kevin@anyscale.com> Co-authored-by: Nick Hill <nhill@redhat.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Tristan Leclercq <49700633+tristanleclercq@users.noreply.github.com> Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com> Co-authored-by: Ben Jackson <ben@ben.com> Co-authored-by: Paul Schweigert <paul@paulschweigert.com> Co-authored-by: rongfu.leng <rongfu.leng@daocloud.io> Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: paolovic <91155454+paolovic@users.noreply.github.com> Co-authored-by: paolovic <paul-philipp.luley@uzh.ch> Co-authored-by: Martin Hoyer <mhoyer@redhat.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: YamPengLi <yampayne.lyp@alibaba-inc.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Robin <863579016@qq.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Co-authored-by: Lu Fang <fanglu@fb.com> Co-authored-by: Benjamin Chislett <benjamin.chislett@centml.ai> Co-authored-by: leon-seidel <83984854+leon-seidel@users.noreply.github.com> Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com> Co-authored-by: Miles Williams <42222518+mlsw@users.noreply.github.com> Co-authored-by: Satyajith Chilappagari <satchill@amazon.com> Co-authored-by: Jennifer Zhao <ai.jenniferzhao@gmail.com> Co-authored-by: zxfan-cpu <zxfanzhang@tencent.com> Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Alex Brooks <alex.brooks@ibm.com> Co-authored-by: TY-AMD <tianyuan.wu@amd.com> Co-authored-by: wang.yuqi <noooop@126.com> Co-authored-by: Kero Liang <kerorek@outlook.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: yueshen2016 <39203804+yueshen2016@users.noreply.github.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Hongxia Yang <hongxia.yang@amd.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Accelerator1996 <lvfei.lv@alibaba-inc.com> Co-authored-by: ajayvohra2005 <ajayvohr@amazon.com> Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com> Co-authored-by: zh Wang <rekind133@outlook.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Yuxuan Zhang <2448370773@qq.com> Co-authored-by: Aaron Ang <67321817+aaron-ang@users.noreply.github.com> Co-authored-by: Jintao <huangjintao@mail.ustc.edu.cn> Co-authored-by: Benjamin Kitor <bkitor@gigaio.com> Co-authored-by: Chenyaaang <42742451+Chenyaaang@users.noreply.github.com> Co-authored-by: cyyever <cyyever@outlook.com> Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com> Co-authored-by: wineandchord <guoqizhou123123@qq.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Lily Liu <lilyliupku@gmail.com> Co-authored-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com> 
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Co-authored-by: look <eeslook@163.com> Co-authored-by: WWW <jadewangcn@outlook.com> Co-authored-by: Alexey Belyakov <alexey.belyakov@intel.com> Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com> Co-authored-by: chaow-amd <chaow@amd.com> Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com> Co-authored-by: Richard Zou <zou3519@users.noreply.github.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Kai Wu <kaiwu@meta.com> Co-authored-by: Christian Sears <117944059+Chr1st1anSears@users.noreply.github.com> Co-authored-by: Gogs <gogs@fake.local> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Tianer Zhou <ezhoureal@gmail.com> Co-authored-by: Huazhong Ji <hzji210@gmail.com> Co-authored-by: Jie Fu (傅杰) <jiefu@tencent.com> Co-authored-by: SnowCharm <qiuyilun@u.nus.edu> Co-authored-by: Ryan McConville <ryan@ryanmcconville.com>
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Task 1 of #15901
Some limitations:
How to run this PR:
`python examples/offline_inference/eagle.py`
Ignore the metrics that the script prints; they are not meaningful yet.
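For context, here is a minimal sketch of what such an offline EAGLE run looks like. The model names, the `VLLM_USE_V1` opt-in, and the exact `speculative_config` keys are illustrative assumptions and may differ from what the actual example script uses:

```python
# Minimal sketch of an offline EAGLE run on the V1 engine.
# Assumptions (not taken from this PR): VLLM_USE_V1=1 enables the V1 engine,
# speculative_config accepts an EAGLE draft model via "method"/"model" keys,
# and the HF model names below are stand-ins.
import os

os.environ["VLLM_USE_V1"] = "1"  # opt into the V1 engine before importing vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # target model (assumed)
    speculative_config={
        "method": "eagle",                          # select the Eagle proposer (assumed key)
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # Eagle draft head (assumed)
        "num_speculative_tokens": 2,                # draft tokens proposed per step
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```

Running something like the above exercises the new Eagle model-loading path on the V1 engine; the example script in the repo is the authoritative entry point.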