Conversation

lgeiger
Contributor

@lgeiger lgeiger commented May 27, 2025

Gemma3 uses the bfloat16 dtype by default, but the Hugging Face processor outputs the pixel values of the 896x896 images in float32, so these values need to be cast to the Gemma3 dtype. This currently happens on the GPU.

This PR moves the cast to the CPU during input processing, which halves the amount of image data that needs to be copied from CPU to GPU and should also save a bit of GPU memory for prompts with lots of images.
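For illustration, here is a minimal sketch of the idea (the function name and arguments are illustrative, not the actual vLLM processor code):

```python
import torch

def prepare_pixel_values(pixel_values: torch.Tensor,
                         model_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    """Cast image pixel values to the model dtype while still on the CPU.

    The HF processor returns float32 (4 bytes per element). A single 896x896
    image is 3 * 896 * 896 ~= 2.4M values, i.e. roughly 9.6 MB in float32 vs
    4.8 MB in bfloat16, so casting before the host-to-device copy halves the
    bytes moved and avoids a float32 staging copy in GPU memory.
    """
    # Previously the equivalent of pixel_values.to("cuda").to(torch.bfloat16)
    # copied the full float32 tensor to the GPU and only cast it there.
    return pixel_values.to(model_dtype)  # cast on CPU; the smaller tensor is copied to the GPU later
```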

This change seems to improve throughput of the 4B model by 2.4% on an L40S GPU and also improves latency, measured using the following commands:

vllm serve google/gemma-3-4b-it --disable-log-requests
python benchmarks/benchmark_serving.py --backend openai-chat --model google/gemma-3-4b-it --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 5000

Baseline (based on #18710):

============ Serving Benchmark Result ============
Successful requests:                     985
Benchmark duration (s):                  89.50
Total input tokens:                      95454
Total generated tokens:                  115301
Request throughput (req/s):              11.01
Output token throughput (tok/s):         1288.35
Total Token throughput (tok/s):          2354.93
---------------Time to First Token----------------
Mean TTFT (ms):                          52647.14
Median TTFT (ms):                        44711.12
P99 TTFT (ms):                           84957.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          112.01
Median TPOT (ms):                        111.66
P99 TPOT (ms):                           283.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           109.95
Median ITL (ms):                         50.29
P99 ITL (ms):                            423.10
==================================================

This PR:

============ Serving Benchmark Result ============
Successful requests:                     985
Benchmark duration (s):                  87.41
Total input tokens:                      95454
Total generated tokens:                  115232
Request throughput (req/s):              11.27
Output token throughput (tok/s):         1318.31
Total Token throughput (tok/s):          2410.35
---------------Time to First Token----------------
Mean TTFT (ms):                          52161.96
Median TTFT (ms):                        43351.32
P99 TTFT (ms):                           82530.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          102.06
Median TPOT (ms):                        100.74
P99 TPOT (ms):                           263.27
---------------Inter-token Latency----------------
Mean ITL (ms):                           100.50
Median ITL (ms):                         48.11
P99 ITL (ms):                            405.93
==================================================

I'm very new to the vLLM codebase, so please let me know if this could lead to problems with other dtype/quantisation configs. I'm also not sure whether this is the best benchmark for evaluating this change, or how noisy these benchmarks tend to be.

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Member

@DarkLight1337 DarkLight1337 left a comment

Looks reasonable, thanks for fixing! cc @ywang96 @Isotr0py perhaps we should do this for the other processors as well?

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) May 27, 2025 03:57
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 27, 2025
@DarkLight1337 DarkLight1337 merged commit b50602d into vllm-project:main May 27, 2025
75 of 77 checks passed
@Isotr0py
Member

perhaps we should do this for the other processors as well?

I think we can cast the dtype in BaseMultimodalProcessor's apply method, so that we don't need to modify each processor separately.
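As a rough sketch of that suggestion (class and method names here are hypothetical, not the actual vLLM implementation), the cast could live in one shared place:

```python
import torch

class BaseMultiModalProcessorSketch:
    """Hypothetical base class that owns the dtype cast for all processors."""

    def __init__(self, model_dtype: torch.dtype = torch.bfloat16):
        self.model_dtype = model_dtype

    def _cast_if_float(self, value):
        # Only floating-point tensors (e.g. pixel_values) are cast; integer
        # tensors such as token ids or image grid sizes are left untouched.
        if isinstance(value, torch.Tensor) and value.is_floating_point():
            return value.to(self.model_dtype)
        return value

    def apply(self, mm_kwargs: dict) -> dict:
        # Casting once here means individual model processors (Gemma3,
        # InternVL, ...) no longer need their own per-model casting code.
        return {key: self._cast_if_float(val) for key, val in mm_kwargs.items()}
```

For example, `apply({"pixel_values": torch.rand(1, 3, 896, 896), "input_ids": torch.zeros(8, dtype=torch.long)})` would return bfloat16 pixel values while leaving the integer tensor unchanged.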

@lgeiger lgeiger deleted the gemma-image-cast-on-cpu branch May 27, 2025 08:01
gshtras added a commit to ROCm/vllm that referenced this pull request May 27, 2025
* Add files via upload: Add fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup (vllm-project#18337)

* [Misc] Fix typo (vllm-project#18330)

* Neuron up mistral (vllm-project#18222)

Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>

* fix CUDA_check redefinition in vllm-project#17918 (vllm-project#18287)

Signed-off-by: Lucia Fang <fanglu@fb.com>
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com>

* [neuron] fix authorization issue (vllm-project#18364)

Signed-off-by: Liangfu Chen <liangfc@amazon.com>

* [Misc] Allow `AutoWeightsLoader` to skip loading weights with specific substr in name (vllm-project#18358)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Core] [Bugfix]: tensor parallel with prompt embeds (vllm-project#18171)

Signed-off-by: Nan2018 <nan@protopia.ai>
Co-authored-by: Andrew Sansom <andrew@protopia.ai>

* [release] Change dockerhub username for TPU release (vllm-project#18389)

* [Bugfix] fix adding bias twice in ipex GPTQ quantization (vllm-project#18363)

Signed-off-by: rand-fly <randfly@outlook.com>

* [doc] update env variable export (vllm-project#18391)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Misc] Add LoRA code owner (vllm-project#18387)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Update cpu.txt (vllm-project#18398)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [CI] Add mteb testing to test the accuracy of the embedding model (vllm-project#17175)

* [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (vllm-project#18407)

Co-authored-by: 松灵 <wpf272043@alibaba-inc.com>

* [Misc] refactor prompt embedding examples (vllm-project#18405)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Minor] Rename quantization nvfp4 to modelopt_fp4 (vllm-project#18356)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Model] use AutoWeightsLoader for bloom (vllm-project#18300)

Signed-off-by: calvin chen <120380290@qq.com>

* [Kernel] update comment for KV shape in unified triton attn (vllm-project#18099)

Signed-off-by: haochengxia <xhc_1007@163.com>

* fix:Build torch wheel inline rather than picking from nightly (vllm-project#18351)

Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>

* [TPU] Re-enable the Pallas MoE kernel (vllm-project#18025)

Signed-off-by: Michael Goin <mgoin64@gmail.com>

* [Bugfix] config.head_dim is now explicitly set to None (vllm-project#18432)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [Bug] Fix moe_sum signature (vllm-project#18440)

Signed-off-by: Bill Nell <bnell@redhat.com>

* Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (vllm-project#18407)" (vllm-project#18456)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix][Failing Test] Fix nixl connector test when promt size < block size (vllm-project#18429)

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

* [Misc] MultiConnector._connectors type (vllm-project#18423)

Signed-off-by: nicklucche <nlucches@redhat.com>

* [Frontend] deprecate `--device` arg (vllm-project#18399)

Signed-off-by: Kebe <mail@kebe7jun.com>

* [V1] Fix general plugins not loaded in engine for multiproc (vllm-project#18326)

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

* [Misc] refactor disaggregated-prefill-v1 example (vllm-project#18474)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix][Failing Test] Fix test_events.py (vllm-project#18460)

Signed-off-by: rabi <ramishra@redhat.com>

* [MODEL] FalconH1 (vllm-project#18406)

Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae>
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae>

* [Doc] fix arg docstring in linear layers (vllm-project#18410)

Signed-off-by: giantcroc <1204449533@qq.com>

* [Bugfix] Reduce moe_sum test size to avoid OOM (vllm-project#18484)

Signed-off-by: Bill Nell <bnell@redhat.com>

* [Build] fix Dockerfile shell (vllm-project#18402)

* [Misc] Update deprecation message for `--enable-reasoning` (vllm-project#18404)

* [ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 (vllm-project#17004)

Signed-off-by: Hosang Yoon <hosang.yoon@amd.com>

* Remove incorrect env value

* Revert "[v1] Support multiple KV cache groups in GPU model runner (vllm-project#17945)" (vllm-project#18459)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* [FEAT][ROCm] Upgrade AITER MLA v1 backend (vllm-project#18338)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

* [Bugfix] Consistent ascii handling in tool parsers (vllm-project#17704)

Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>

* [FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) (vllm-project#18500)

Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae>
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae>

* [MISC] update project urls in pyproject.toml (vllm-project#18519)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [CI] Fix race condition with StatelessProcessGroup.barrier (vllm-project#18506)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* Intialize io_thread_pool attribute in the beginning. (vllm-project#18331)

Signed-off-by: rabi <ramishra@redhat.com>

* [Bugfix] Inconsistent token calculation compared to HF in llava family (vllm-project#18479)

Signed-off-by: jaycha <jaycha@ncsoft.com>

* [BugFix][DP] Send DP wave completion only from `dp_rank==0` (vllm-project#18502)

Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: kourosh hakhamaneshi <kourosh@anyscale.com>

* [Bugfix][Model] Make Olmo2Model weight loading return loaded weights (vllm-project#18504)

Signed-off-by: Shane A <shanea@allenai.org>

* [Bugfix] Fix LoRA test (vllm-project#18518)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Doc] Fix invalid JSON in example args (vllm-project#18527)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) (vllm-project#18512)

Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>

* Update default neuron config for speculation (vllm-project#18274)

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com>
Co-authored-by: Aakash Shetty <sheaak@amazon.com>

* Order sequence ids + config update to support specifying custom quantization layers (vllm-project#18279)

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Co-authored-by: Tailin Pan <tailinpa@amazon.com>
Co-authored-by: Rishabh Rajesh <rishyraj@amazon.com>
Co-authored-by: Yishan McNabb <yishanm@amazon.com>
Co-authored-by: Patrick Lange <patlange@amazon.com>
Co-authored-by: Maxwell Goldberg <mgld@amazon.com>
Co-authored-by: Aakash Shetty <sheaak@amazon.com>

* [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (vllm-project#18526)

Co-authored-by: 松灵 <wpf272043@alibaba-inc.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Add kwargs to RequestOutput __init__ to be forward compatible (vllm-project#18513)

Signed-off-by: Linkun <github@lkchen.net>

* [CI/Build] Update bamba test model location (vllm-project#18544)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Doc] Support --stream arg in openai_completion_client.py script (vllm-project#18388)

Signed-off-by: googs1025 <googs1025@gmail.com>

* [Bugfix] Use random hidden states in dummy sampler run (vllm-project#18543)

Signed-off-by: Bowen Wang <abmfy@icloud.com>

* [Doc] Add stream flag for chat completion example (vllm-project#18524)

Signed-off-by: calvin chen <120380290@qq.com>

* [BugFix][CPU] Fix x86 SHM distributed module initialization (vllm-project#18536)

Signed-off-by: jiang.li <jiang1.li@intel.com>

* [Misc] improve Automatic Prefix Caching example (vllm-project#18554)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Misc] Call `ndarray.tobytes()` directly instead of `ndarray.data.tobytes()` (vllm-project#18347)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* [Bugfix] make `test_openai_schema.py` pass (vllm-project#18224)

Signed-off-by: David Xia <david@davidxia.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Platform] Move platform check to right place (vllm-project#18470)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Compile][Platform] Make PiecewiseBackend pluggable and extendable (vllm-project#18076)

Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

* [Build/CI] Fix CUDA 11.8 build (vllm-project#17679)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Tool] Add NIXL installation script (vllm-project#18172)

Signed-off-by: Linkun <github@lkchen.net>

* [V1][Spec Decode][Bugfix] Load quantize weights for EAGLE (vllm-project#18290)

* [Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser (vllm-project#17917)

Signed-off-by: Kai Wu <kaiwu@meta.com>

* [Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization (vllm-project#17926)

Signed-off-by: Sanger Steel <sangersteel@gmail.com>

* [AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh (vllm-project#18568)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. (vllm-project#18569)

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>

* [V1][Spec Decoding] Use model_loader.get_model() to load models (vllm-project#18273)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* Enable hybrid attention models for Transformers backend (vllm-project#18494)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs (vllm-project#18482)

Signed-off-by: googs1025 <googs1025@gmail.com>

* [BugFix] Increase TP execute_model timeout (vllm-project#18558)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Bugfix] Set `KVTransferConfig.engine_id` in post_init (vllm-project#18576)

Signed-off-by: Linkun Chen <github@lkchen.net>

* [Spec Decode] Make EAGLE3 draft token ID mapping optional (vllm-project#18488)

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Neuron] Remove bypass on EAGLEConfig and add a test (vllm-project#18514)

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

* [Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key (vllm-project#17291)

Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com>

* [Misc] Replace `cuda` hard code with `current_platform` (vllm-project#16983)

Signed-off-by: shen-shanshan <467638484@qq.com>

* [Hardware] correct method signatures for HPU,ROCm,XPU (vllm-project#18551)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (vllm-project#18034)

Signed-off-by: Ronald Xu <ronaldxu@amazon.com>

* [Feature]Add async tensor parallelism using compilation pass (vllm-project#17882)

Signed-off-by: cascade812 <cascade812@outlook.com>

* [Doc] Update quickstart and install for cu128 using `--torch-backend=auto` (vllm-project#18505)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Feature][V1]: suupports cached_tokens in response usage (vllm-project#18149)

Co-authored-by: simon-mo <xmo@berkeley.edu>

* [Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform (vllm-project#18430)

Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Co-authored-by: Yuqi Zhang <yuqizhang@google.com>

* Migrate docs from Sphinx to MkDocs (vllm-project#18145)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (vllm-project#18034)" (vllm-project#18600)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix][Model] Fix baichuan model loader for tp (vllm-project#18597)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled (vllm-project#17731)

Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>

* Add myself as docs code owner (vllm-project#18605)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to `requirements/cpu.txt`  (vllm-project#18542)

Signed-off-by: Kay Yan <kay.yan@daocloud.io>

* [CI] fix kv_cache_type argument (vllm-project#18594)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [Doc] Fix indent of contributing to vllm (vllm-project#18611)

Signed-off-by: Zerohertz <ohg3417@gmail.com>

* Replace `{func}` with mkdocs style links (vllm-project#18610)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [CI/Build] Fix V1 flag being set in entrypoints tests (vllm-project#18598)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Fix examples with code blocks in docs (vllm-project#18609)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Bugfix] Fix transformers model impl ignored for mixtral quant (vllm-project#18602)

Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com>

* Include private attributes in API documentation (vllm-project#18614)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Misc] add Haystack integration (vllm-project#18601)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS (vllm-project#18579)

* [Doc] Fix markdown list indentation for MkDocs rendering (vllm-project#18620)

Signed-off-by: Zerohertz <ohg3417@gmail.com>

* [Doc] Use a different color for the announcement (vllm-project#18616)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Refactor pplx init logic to make it modular (prepare for deepep) (vllm-project#18200)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* Fix figures in design doc (vllm-project#18612)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Docs] Change mkdocs to not use directory urls (vllm-project#18622)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [v1] Redo "Support multiple KV cache groups in GPU model runner (vllm-project#17945)" (vllm-project#18593)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Doc] fix list formatting (vllm-project#18624)

Signed-off-by: David Xia <david@davidxia.com>

* [Doc] Fix top-level API links/docs (vllm-project#18621)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Avoid documenting dynamic / internal modules (vllm-project#18626)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar (vllm-project#18627)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1] Support Deepseek MTP (vllm-project#18435)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Co-authored-by: Rui Qiao <ruisearch42@gmail.com>

* Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI (vllm-project#18537)

Signed-off-by: Huy Do <huydhn@gmail.com>

* [CI] Enable test_initialization to run on V1 (vllm-project#16736)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [Doc] Update references to doc files (vllm-project#18637)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation (vllm-project#18160)

Signed-off-by: Pavani Majety <pmajety@nvidia.com>

* [Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking (vllm-project#18454)

Signed-off-by: Crucifixion-Fxl <xmufxl@gmail.com>
Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com>

* [Bugfix][Nixl] Fix Preemption Bug (vllm-project#18631)

Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>

* config.py: Clarify that only local GGUF checkpoints are supported. (vllm-project#18623)

Signed-off-by: Mathieu Bordere <mathieu@letmetweakit.com>

* FIX MOE issue in AutoRound format (vllm-project#18586)

Signed-off-by: wenhuach21 <wenhua.cheng@intel.com>

* [V1][Spec Decode] Small refactors to improve eagle bookkeeping performance (vllm-project#18424)

Signed-off-by: qizixi <qizixi@meta.com>

* [Frontend] improve vllm serve --help display (vllm-project#18643)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) (vllm-project#18647)

* [V1][Spec Decode] Support multi-layer eagle draft model (vllm-project#18030)

Signed-off-by: qizixi <qizixi@meta.com>

* [Doc] Update README links, mark external links (vllm-project#18635)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [MISC][pre-commit] Add pre-commit check for triton import (vllm-project#17716)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [Doc] Fix indentation problems in V0 Paged Attention docs (vllm-project#18659)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Add community links (vllm-project#18657)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model] use AutoWeightsLoader for gpt2 (vllm-project#18625)

Signed-off-by: zt2370 <ztang2370@gmail.com>

* [Doc] Reorganize user guide (vllm-project#18661)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI/Build] `chmod +x` to `cleanup_pr_body.sh` (vllm-project#18650)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [MISC] typo fix and clean import (vllm-project#18664)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [BugFix] Fix import error for fused_moe (vllm-project#18642)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [CI] enforce import regex instead of re (vllm-project#18665)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* fix(regression): clone from reference items (vllm-project#18662)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [CI/Build] fix permission denied issue (vllm-project#18645)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding (vllm-project#18668)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... (vllm-project#18640)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

* [MISC] correct signature for LoaderFunction (vllm-project#18670)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [Misc]Replace `cuda` hard code with `current_platform` in Ray (vllm-project#14668)

Signed-off-by: noemotiovon <757486878@qq.com>

* [Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE (vllm-project#18655)

Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [VLM] Initialize video input support for InternVL models (vllm-project#18499)

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* Speed up the `kernels/quantization/` tests (vllm-project#18669)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [BUGFIX] catch subclass first for try...except (vllm-project#18672)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [Misc] Reduce logs on startup (vllm-project#18649)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [doc] fix broken links (vllm-project#18671)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [doc] improve readability (vllm-project#18675)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment (vllm-project#18674)

Signed-off-by: zzzyq <zhangyuqi94@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [CI/build] fix no regex (vllm-project#18676)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Misc] small improve (vllm-project#18680)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] Fix profiling dummy data for Pixtral (vllm-project#18677)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Core][Multimodal] Convert PIL Image to array without data copy when hashing (vllm-project#18682)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* [CI/Build][Doc] Update `gte-Qwen2-1.5B-instruct` usage (vllm-project#18683)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example (vllm-project#18644)

Signed-off-by: zhaohaidao <zhaohaidao2008@hotmail.com>
Signed-off-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com>
Co-authored-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com>

* refactor: simplify request handler, use positive condition check for handler assignment (vllm-project#18690)

Signed-off-by: googs1025 <googs1025@gmail.com>

* [Bugfix] Fix the lm_head in gpt_bigcode in lora mode (vllm-project#6357)

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>

* [CI] add missing argument (vllm-project#18694)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [GH] Add issue template for reporting CI failures (vllm-project#18696)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Fix issue template format (vllm-project#18699)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix Mistral-format models with sliding window (vllm-project#18693)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI/Build] Replace `math.isclose` with `pytest.approx` (vllm-project#18703)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI] fix dump_input for str type (vllm-project#18697)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* [Model] Add support for YARN in NemotronNAS models (vllm-project#18427)

Signed-off-by: Nave Assaf <nassaf@nvidia.com>

* [CI/Build] Split pooling and generation extended language models tests in CI (vllm-project#18705)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI (vllm-project#18709)

Signed-off-by: Lukasz Durejko <ldurejko@habana.ai>

* [Misc] add AutoGen integration (vllm-project#18712)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM (vllm-project#18701)

* [Doc] Improve API docs (vllm-project#18713)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Move examples and further reorganize user guide (vllm-project#18666)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix Llama GGUF initialization (vllm-project#18717)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs (vllm-project#18608)

* Convert `examples` to `ruff-format` (vllm-project#18400)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Model][Gemma3] Simplify image input validation (vllm-project#18710)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* [Misc] improve web section group title display (vllm-project#18684)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [V1][Quantization] Add CUDA graph compatible v1 GGUF support (vllm-project#18646)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>

* [Model][Gemma3] Cast image pixel values already on CPU (vllm-project#18732)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

* [FEAT] [ROCm] Upgrade AITER Fused MoE kernels. (vllm-project#18271)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* [Doc] Update OOT model docs (vllm-project#18742)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Doc] Update reproducibility doc and example (vllm-project#18741)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] improve docs (vllm-project#18734)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* feat(rocm-support): support mamba2 on rocm (vllm-project#18565)

Signed-off-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>

* [Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh (vllm-project#18752)

Signed-off-by: Lukasz Durejko <ldurejko@habana.ai>

* [Doc] cleanup deprecated flag for doc (vllm-project#18715)

Signed-off-by: calvin chen <120380290@qq.com>

* Minor fix about MooncakeStoreConnector (vllm-project#18721)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* [Build] fix cpu build missing libtbbmalloc.so (vllm-project#18744)

Signed-off-by: Kebe <mail@kebe7jun.com>

* [BUG FIX] minicpm (vllm-project#18739)

Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com>
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com>

* [Doc]  Convert Sphinx directives ( `{class}`, `{meth}`, `{attr}`, ...) to MkDocs format for better documentation linking (vllm-project#18663)

Signed-off-by: Zerohertz <ohg3417@gmail.com>

* [CI/Build] Remove imports of built-in `re` (vllm-project#18750)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1][Metrics] Add API for accessing in-memory Prometheus metrics (vllm-project#17010)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* Disable prefix cache by default for benchmark (vllm-project#18639)

Signed-off-by: cascade812 <cascade812@outlook.com>

* optimize get_kv_cache_torch_dtype (vllm-project#18531)

Signed-off-by: idellzheng <idellzheng@tencent.com>

* [Core] Automatically cast multi-modal input dtype (vllm-project#18756)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Mistral tool calling when content is list (vllm-project#18729)

Signed-off-by: mgoin <mgoin64@gmail.com>

---------

Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Nan2018 <nan@protopia.ai>
Signed-off-by: rand-fly <randfly@outlook.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: calvin chen <120380290@qq.com>
Signed-off-by: haochengxia <xhc_1007@163.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: rabi <ramishra@redhat.com>
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae>
Signed-off-by: giantcroc <1204449533@qq.com>
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: jaycha <jaycha@ncsoft.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Shane A <shanea@allenai.org>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Signed-off-by: Linkun <github@lkchen.net>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: googs1025 <googs1025@gmail.com>
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: David Xia <david@davidxia.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Signed-off-by: Kai Wu <kaiwu@meta.com>
Signed-off-by: Sanger Steel <sangersteel@gmail.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
Signed-off-by: cascade812 <cascade812@outlook.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: Crucifixion-Fxl <xmufxl@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Mathieu Bordere <mathieu@letmetweakit.com>
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com>
Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: zt2370 <ztang2370@gmail.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: noemotiovon <757486878@qq.com>
Signed-off-by: zzzyq <zhangyuqi94@gmail.com>
Signed-off-by: zhaohaidao <zhaohaidao2008@hotmail.com>
Signed-off-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com>
Signed-off-by: idellzheng <idellzheng@tencent.com>
Co-authored-by: sunyicode0012 <116338547+sunyicode0012@users.noreply.github.com>
Co-authored-by: Gong Shufan <2624542821@qq.com>
Co-authored-by: Satyajith Chilappagari <satchill@amazon.com>
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com>
Co-authored-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Nan Qin <nan@protopia.ai>
Co-authored-by: Andrew Sansom <andrew@protopia.ai>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Random Fly <renfei8@live.cn>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: wang.yuqi <noooop@126.com>
Co-authored-by: 燃 <wulipc@163.com>
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Calvin Chen <45745657+calvin0327@users.noreply.github.com>
Co-authored-by: Percy <xhc_1007@163.com>
Co-authored-by: Dilip Gowda Bhagavan <110233170+dilipgb@users.noreply.github.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: wwl2755 <wangwenlong2755@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com>
Co-authored-by: Rabi Mishra <ramishra@redhat.com>
Co-authored-by: Dhia Eddine Rhaiem <163106757+dhiaEddineRhaiem@users.noreply.github.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae>
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae>
Co-authored-by: GiantCroc <1204449533@qq.com>
Co-authored-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>
Co-authored-by: Hosang <156028780+hyoon1@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Sebastian Schoennenbeck <sebastian.schoennenbeck@comma-soft.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: youngrok cha <line0930@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: kourosh hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Shane A <shanea@allenai.org>
Co-authored-by: aws-elaineyz <elaineyz@amazon.com>
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com>
Co-authored-by: Aakash Shetty <sheaak@amazon.com>
Co-authored-by: Tailin Pan <tailinpa@amazon.com>
Co-authored-by: Rishabh Rajesh <rishyraj@amazon.com>
Co-authored-by: Yishan McNabb <yishanm@amazon.com>
Co-authored-by: Patrick Lange <patlange@amazon.com>
Co-authored-by: Maxwell Goldberg <mgld@amazon.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: lkchen <github@lkchen.net>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: CYJiang <86391540+googs1025@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: David Xia <david@davidxia.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Kai Wu <kaiwu@meta.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Chenheli Hua <huachenheli@outlook.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Teruaki Ishizaki <tell.ishi@gmail.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: RonaldBXu <72748153+RonaldBXu@users.noreply.github.com>
Co-authored-by: cascade <cascade812@outlook.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Yuqi Zhang <zhangyuqi94@gmail.com>
Co-authored-by: Yuqi Zhang <yuqizhang@google.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Kay Yan <kay.yan@daocloud.io>
Co-authored-by: Tristan Leclercq <49700633+tristanleclercq@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Jiayi Yao <82156730+YaoJiayi@users.noreply.github.com>
Co-authored-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Feng XiaoLong <79261065+Crucifixion-Fxl@users.noreply.github.com>
Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Mathieu Borderé <mathieu@bordere.org>
Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com>
Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com>
Co-authored-by: Yuanhao WU <Nalkey@users.noreply.github.com>
Co-authored-by: ztang2370 <ztang2370@gmail.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: AlexZhao <zhaohaidao2008@hotmail.com>
Co-authored-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com>
Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com>
Co-authored-by: Naveassaf <55059536+Naveassaf@users.noreply.github.com>
Co-authored-by: Łukasz Durejko <lukasz.durejko@intel.com>
Co-authored-by: dylan <xuhao296@qq.com>
Co-authored-by: almersawi <43927639+almersawi@users.noreply.github.com>
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
Co-authored-by: Łukasz Durejko <ldurejko@habana.ai>
Co-authored-by: maobaolong <baoloongmao@tencent.com>
Co-authored-by: Shawn Huang <57223022+huangyuxiang03@users.noreply.github.com>
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com>
Co-authored-by: chunxiaozheng <55471457+chunxiaozheng@users.noreply.github.com>
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
…18732)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: amit <amit.man@gmail.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
…18732)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
wpyszka pushed a commit to HabanaAI/vllm-fork that referenced this pull request Aug 1, 2025
This PR contains the following changes:
1. Port the Gemma3 SLIDING_WINDOW FusedSDPA feature from habana_main and add a
few extra fixes, including:
- Sliding FusedSDPA kernel: a threshold variable is added to enable or
disable the optimized kernel, which gives a performance/memory benefit for
longer sequences. An environment variable is provided to control this per
customer request.
- Based on the threshold, choose a different prompt bucket: if the sequence
is smaller than the threshold, use PROMPT_BUCKET_STEP, otherwise use
SLICE_SIZE.
 - Added mark_step before SLIDING FusedSDPA is run.
 - Misc fixes for bucket-related issues.
 2. upstream fixes
 vllm-project#18732
vllm-project#21479
vllm-project#19788

3. optimized Gemma3RMSNorm with FusedRMSNorm
Dependent on #1647 


Run command with. 
VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false
VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024
VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1

---------

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Hongmin Fan <fanhongmin@google.com>
Co-authored-by: Henry Tang <ytang@habana.ai>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Shiv Kaul <skaul@habana.ai>
Co-authored-by: Shiv Kaul <shiv.kaul@intel.com>
Co-authored-by: Libin Tang <libin.tang@intel.com>
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Hongmin Fan <fanhongmin@google.com>
Co-authored-by: Harish Subramony <hsubramony@habana.ai>
Co-authored-by: Jianhong-Zhang <jianhong.zhang@intel.com>
Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
tianyuan211 added a commit to tianyuan211/vllm-fork that referenced this pull request Aug 7, 2025
commit 0884eb4
Author: Jimin Ha <jimin.ha@intel.com>
Date:   Fri Aug 1 05:42:09 2025 -0700

    Gemma3 v1.22  changes (Sliding_Window feature  + few others) (HabanaAI#1660)

    This PR contains following changes
    1. Port Gemma3 SLIDING_WINDOW FusedSDPA feature from habana_main + Add a
    few extra fixes including..
    - Sliding FusedSDPA kernel, we are adding threshold variable to enable
    or disable to use optimized kernel. This kernel will be
    performance/memory benefit for longer sequence. We are providing
    environment variable to control per customer request.
    - Based on the threshold, choose different prompt bucket, if it's
    smaller than the threshold, use PROMPT_BUCKET_STEP, otherwise use
    SLICE_SIZE.
     - Added mark_step before SLIDING FusedSDPA is run.
     - Misc fixes for bucket related issue.
     2. upstream fixes
     vllm-project#18732
    vllm-project#21479
    vllm-project#19788

    3. optimized Gemma3RMSNorm with FusedRMSNorm
    Dependent on HabanaAI#1647

    Run command with.
    VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false
    VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024
    VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1

    ---------

    Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
    Signed-off-by: Hongmin Fan <fanhongmin@google.com>
    Co-authored-by: Henry Tang <ytang@habana.ai>
    Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
    Co-authored-by: Shiv Kaul <skaul@habana.ai>
    Co-authored-by: Shiv Kaul <shiv.kaul@intel.com>
    Co-authored-by: Libin Tang <libin.tang@intel.com>
    Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
    Co-authored-by: Hongmin Fan <fanhongmin@google.com>
    Co-authored-by: Harish Subramony <hsubramony@habana.ai>
    Co-authored-by: Jianhong-Zhang <jianhong.zhang@intel.com>
    Co-authored-by: Libin Tang <litang@habana.ai>
    Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>

commit 065fde3
Author: Jan Kaniecki <jan.kaniecki@intel.com>
Date:   Thu Jul 31 15:42:13 2025 +0200

    Remove inference_mode() from platforms.hpu (HabanaAI#1690)

    Inference_mode() is causing recompilations with t.compile - we don't
    need it as we already put inference_mode on particular functions in
    model runner. It was introduced by Rebase 0.9.0.1
    (HabanaAI#1507) - previously we didn't
    have such call.

commit 7d6528e
Author: Krzysztof Smusz <ksmusz@habana.ai>
Date:   Wed Jul 30 12:19:34 2025 +0200

    Set hpu-extension to 61dafb3 (HabanaAI#1683)

    Upgrading vllm-hpu-extension with change introducing the fix for
    unsupported block_softmax_adjustment in fp16 precision

commit ff9bff9
Author: Iryna Boiko <iboiko@habana.ai>
Date:   Tue Jul 29 09:19:29 2025 +0200

    Remove dtype.float16 support for hpu config (HabanaAI#1650)

commit 034c756
Author: Chendi.Xue <chendi.xue@intel.com>
Date:   Tue Jul 29 02:17:44 2025 -0500

    [SW-234344] Fix 'RotaryEmbedding' object has no attribute 'sin' (HabanaAI#1659)

    ## Essential Elements of an Effective PR Description Checklist
    - [x] The purpose of the PR, such as "Fix some issue (link existing
    issues this PR will resolve)".
    - [ ] The test plan, such as providing test command.
    - [ ] The test results, such as pasting the results comparison before
    and after, or e2e results

    ## Purpose

    port commit from HabanaAI#1658 for fixing SW-234344 for habana_main

    ## Test Plan

    ## Test Result

    <!--- pyml disable-next-line no-emphasis-as-heading -->

    Signed-off-by: Chendi.Xue <chendi.xue@intel.com>

commit e5a6120
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Tue Jul 29 08:53:48 2025 +0200

    1.22 Warmup one context more - linear - Update sha extension (HabanaAI#1655)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
    Co-authored-by: Jan Kaniecki <jan.kaniecki@intel.com>

commit 9957ca7
Author: Michał Kuligowski <mkuligowski@habana.ai>
Date:   Tue Jul 29 08:52:48 2025 +0200

    ValueError: 'aimv2' is already used by a Transformers config, pick an… (HabanaAI#1673)

    Fix cherrypicked from upstream
    https://github.com/vllm-project/vllm/pull/20921/files

commit f1b60b4
Author: Mohit Deopujari <mdeopujari@habana.ai>
Date:   Thu Jul 24 08:07:04 2025 -0700

    Gemma3 suppport: propogation : pr1589/1597/1558 to v1.22.0_next (HabanaAI#1616)

    Added support for FusedSDPA kernel with window_size for Gemma3.
    This PR relies on vllm-hpu-extension
    [PR302](HabanaAI/vllm-hpu-extension#302)

    ---------

    Co-authored-by: Shiv Kaul <skaul@habana.ai>
    Co-authored-by: Shiv Kaul <shiv.kaul@intel.com>
    Co-authored-by: Jimin Ha <jimin.ha@intel.com>
    Co-authored-by: Henry Tang <ytang@habana.ai>
    Co-authored-by: Libin Tang <litang@habana.ai>
    Co-authored-by: Libin Tang <libin.tang@intel.com>
    Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>

commit 59b8f75
Author: Artur Fierka <artur.fierka@intel.com>
Date:   Thu Jul 24 13:11:57 2025 +0200

    Update hpu.txt on 1.22.0 branch (HabanaAI#1648)

    Set extension SHA for Port: Fix: Round up to sliding window threshold
    HabanaAI#307 (HabanaAI#309)

commit d6b00f4
Author: Artur Fierka <artur.fierka@intel.com>
Date:   Wed Jul 23 15:50:14 2025 +0200

    [Security] Fix: Bad use of null-like value (HabanaAI#1634)

    Signed-off-by: Artur Fierka <artur.fierka@intel.com>

commit 66858d6
Author: Artur Fierka <artur.fierka@intel.com>
Date:   Wed Jul 23 15:48:53 2025 +0200

    [Security] Fix: Structurally dead code (HabanaAI#1625)

    Remove dead code for security reason

    Signed-off-by: Artur Fierka <artur.fierka@intel.com>

commit 33fbed4
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Tue Jul 22 12:49:42 2025 +0200

    Update sha - Port: Fix fallback bucket (HabanaAI#1626)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

commit 1b46f4c
Author: Seunghyuk Park (shepark) <seunghyuk.h.park@intel.com>
Date:   Tue Jul 22 00:52:50 2025 -0700

    Embedding fix: warmup failure in embedding model (HabanaAI#1510) (HabanaAI#1559)

    Merge changes from habana_main for embedding fix
    HabanaAI#1510

    ---- details ----
    Fix the failures at warmup stage in pooling mode

    --
    due to.
    [rank0]: File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line
    2904, in warmup_model
    [rank0]: self.warmup_graphs(
    [rank0]: File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line
    2714, in warmup_graphs
    [rank0]: self.warmup_scenario(batch_size,
    [rank0]: File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line
    2561, in warmup_scenario
    [rank0]: inputs = self.prepare_model_input_align_worker( [rank0]: File
    "/wm/vllm-fork/vllm/worker/model_runner_base.py", line 233, in
    prepare_model_input_align_worker
    [rank0]: raise NotImplementedError
    [rank0]: NotImplementedError

    Co-authored-by: Libin Tang <litang@habana.ai>
    Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>

commit 062f345
Author: Karol Damaszke <kdamaszke@habana.ai>
Date:   Fri Jul 18 17:02:42 2025 +0200

    Fix text-only prompt in Llama Vision (HabanaAI#1621)

    Fixes text-only prompts in Llama Vision. Without setting
    `max_encoder_seq_lens` we are not skipping `cross_attention` for
    text-only prompts, which results in None's `key` and `value`.

    Signed-off-by: Karol Damaszke <kdamaszke@habana.ai>

commit 449fa92
Author: Tomasz Thaddey <76682475+tthaddey@users.noreply.github.com>
Date:   Thu Jul 17 15:44:56 2025 +0200

    docker vllm: update readme (HabanaAI#1596)

    docker vllm: update readme

    Signed-off-by: Tomasz Thaddey <tthaddey@habana.ai>

commit 22ee396
Author: Michal Adamczyk <michal.adamczyk@intel.com>
Date:   Thu Jul 17 09:44:10 2025 +0200

    [1.22] Set vllm-hpu-extension to 22abb7a (HabanaAI#1611)

commit 37888b5
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Thu Jul 17 07:11:00 2025 +0200

    Port: V1 - dont look for bucket we know don't exists (HabanaAI#1606) (HabanaAI#1608)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

commit 18d51d1
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Wed Jul 16 16:29:47 2025 +0200

    Readme update - Dont use apc on v0 (HabanaAI#1607)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

commit 9b1675c
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Wed Jul 16 13:43:59 2025 +0200

    Port: Num blocks fix - V1 (HabanaAI#1594) (HabanaAI#1601)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

commit bdd9171
Author: Yi Liu <yi4.liu@intel.com>
Date:   Tue Jul 15 18:43:49 2025 +0800

    Update Force Channel FP8 Check (HabanaAI#1563)

    Porting HabanaAI#1561

    Signed-off-by: yiliu30 <yi4.liu@intel.com>

commit 23e63c0
Author: liuzhenwei <zhenwei.liu@intel.com>
Date:   Tue Jul 15 16:06:19 2025 +0800

    [V0] Use device as the set_device's parameter by default, update proxy (HabanaAI#1582)

    https://jira.habana-labs.com/browse/SW-234257
    cherry-pick from HabanaAI#1540

    Signed-off-by: zhenwei <zhenweiliu@habana.ai>
    Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

commit 82fc060
Author: Iryna Boiko <iboiko@habana.ai>
Date:   Mon Jul 14 15:58:24 2025 +0200

    Change vllm-hpu-extension revision to 89515f6 (HabanaAI#1584)

    Change vllm-hpu-extension revision to 89515f6

commit 47768d3
Author: Iryna Boiko <iboiko@habana.ai>
Date:   Mon Jul 14 15:18:30 2025 +0200

    Port: temporarely disable deepseek test HabanaAI#1535 (HabanaAI#1586)

    Port: Update hpu-ext sha and temporarely disable deepseek test HabanaAI#1535

commit f1c70dc
Author: Michał Kuligowski <mkuligowski@habana.ai>
Date:   Mon Jul 14 14:57:57 2025 +0200

    Fix AttributeError: 'NoneType' object has no attribute 'getenv' (HabanaAI#1555)

    Fixes
    AttributeError: 'NoneType' object has no attribute 'getenv'
    during tests teardown

commit 617498a
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Mon Jul 14 14:35:07 2025 +0200

    Readme warmup update (HabanaAI#1512) (HabanaAI#1585)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

commit 8bb429d
Author: Tomasz Pawlowski <tpawlowski@habana.ai>
Date:   Fri Jul 11 20:21:57 2025 +0200

    Add accelerate to requirements/hpu.txt (HabanaAI#1564) (v1.22.0) (HabanaAI#1566)

    Cherry picked from HabanaAI#1564

    Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>

commit aca2ddc
Author: Tomasz Thaddey <76682475+tthaddey@users.noreply.github.com>
Date:   Fri Jul 11 12:58:11 2025 +0200

    docker vllm: add server config for model Qwen/Qwen2.5-VL-7B-Instruct (HabanaAI#1569)

    docker vllm: add server config for model Qwen/Qwen2.5-VL-7B-Instruct

    ---------

    Signed-off-by: Tomasz Thaddey <tthaddey@habana.ai>

commit 512caed
Author: Tomasz Thaddey <76682475+tthaddey@users.noreply.github.com>
Date:   Thu Jul 10 08:12:39 2025 +0200

    docker vllm: cleanup configs and add missing models (HabanaAI#1548)

    docker vllm: cleanup configs and add missing models

    ---------

    Signed-off-by: Tomasz Thaddey <tthaddey@habana.ai>

commit 7b69f70
Author: PatW <patryk.wolsza@intel.com>
Date:   Tue Jul 8 13:56:23 2025 +0200

    Cherrypick docker vllm: update readme (HabanaAI#1525) (HabanaAI#1538)

    Cherry pick of the docker vllm: update readme from habana_main

    Signed-off-by: Tomasz Thaddey <tthaddey@habana.ai>
    Signed-off-by: Artur Fierka <artur.fierka@intel.com>
    Co-authored-by: Tomasz Thaddey <76682475+tthaddey@users.noreply.github.com>

commit 79ef0d5
Author: Michal Szutenberg <michal.szutenberg@intel.com>
Date:   Tue Jul 8 12:39:00 2025 +0200

    [SW-234006] Fix requirements (1.22.0) (HabanaAI#1530)

    See
    https://jira.habana-labs.com/browse/SW-234006?focusedId=1073396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1073396
tianyuan211 added a commit to tianyuan211/vllm-fork that referenced this pull request Aug 14, 2025
commit 95f5008
Author: Wei Lin <forever871001@163.com>
Date:   Wed Aug 13 20:46:59 2025 +0800

    Porting DeepSeek v2/r1 PRs (HabanaAI#1756)

    ## Porting List
    1. HabanaAI#1402
    2. HabanaAI#1504
    3. HabanaAI#1404

commit fd41376
Author: Bob Zhu <bob.zhu@intel.com>
Date:   Wed Aug 13 16:21:20 2025 +0800

    link to the correct vllm-hpu-extension branch (HabanaAI#1755)

    The vllm-fork aice/v1.22.0 branch will always use the vllm-hpu-extension
    aice/v1.22.0 branch.

commit 6693645
Author: Wei Lin <forever871001@163.com>
Date:   Wed Aug 13 15:19:09 2025 +0800

    Add profiler for HPU (HabanaAI#1753)

commit 26d4308
Author: Katarzyna Fojcik <katarzyna.fojcik@intel.com>
Date:   Tue Aug 12 15:06:01 2025 +0200

    [SW-234805] Fix target_device for weights load (HabanaAI#1734)

    ## Purpose
    With HabanaAI@4d5ee6c (Rebase 0.9.0.1, HabanaAI#1507) the model-loading path
    changed. The target_device used while loading the model and its weights should
    depend on the load_device config, as it did before, so that the
    --weights-load-device=cpu flag remains functional (see the sketch below).

    ## Test Plan
    One of the affected tests: llama-31-70b-fp8-1x-gaudi3

    ## Test Result
    PASSED:
    https://qa-jenkins-ctrl03.habana-labs.com/job/qa_jobs/job/qa_testers/job/gdn-qa/job/pytorch/job/gaudi3/job/continous_batching/job/VLLM/job/Native/job/llama-31-70b-fp8-1x-gaudi3-native_benchmark_throughput_VLLM_pytorch_gaudi3-ank8s_v1_22_0/108/

    Co-authored-by: Wojciech Pyszka <wpyszka@habana.ai>
    Co-authored-by: Michal Gawarkiewicz <michal.gawarkiewicz@intel.com>
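
A minimal sketch of the behavior described in the commit above, assuming a simple load-config object; the names below are illustrative, not the actual vllm-fork API.

from dataclasses import dataclass
import torch

@dataclass
class LoadConfig:
    # mirrors the --weights-load-device flag; None means "use the runtime device"
    device: str | None = None

def resolve_weights_load_device(load_config: LoadConfig, runtime_device: str) -> torch.device:
    # weights are loaded onto the configured device (e.g. "cpu") when set,
    # otherwise onto the runtime device (e.g. "hpu")
    return torch.device(load_config.device or runtime_device)

print(resolve_weights_load_device(LoadConfig(device="cpu"), "hpu"))  # cpu
print(resolve_weights_load_device(LoadConfig(), "hpu"))              # hpu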

commit 4118669
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Tue Aug 12 13:25:15 2025 +0200

    Fix merged prefill with new bucketing manager (HabanaAI#1746)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
    Signed-off-by: root <root@adobrzyniewicz-sn76-g3-mpijob-worker-0.adobrzyniewicz-sn76-g3-mpijob-worker.framework.svc.cluster.local>
    Co-authored-by: root <root@adobrzyniewicz-sn76-g3-mpijob-worker-0.adobrzyniewicz-sn76-g3-mpijob-worker.framework.svc.cluster.local>

commit d837abf
Author: PatW <patryk.wolsza@intel.com>
Date:   Tue Aug 12 11:35:37 2025 +0200

    1.22 release readme change (HabanaAI#1718)

    Collection of changes in documentation for 1.22 release.

commit cb44d6f
Author: Jimin Ha <jimin.ha@intel.com>
Date:   Mon Aug 11 21:03:50 2025 -0700

    Fixes HPU graph run for Gemma3 vision inputs (HabanaAI#1719)

    Fixes HPU graph issues for gemma3 vision inputs

    - Text warmup now includes attn_mask info, so vision+text data can reuse
    the language-model graph that has already been warmed up.
    - Changed slicing to index_select for multimodal bucketing on HPU, since
    slicing doesn't produce the same hash for the HPU graph given the same
    input shape (see the sketch below).
    - Use buckets for the vision tower as well to reduce GC recompiles.
    - Accuracy bug fix by cloning the output data of the multimodal projector.

    Validated with Muirbench datasets.
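
A minimal sketch of the slicing-to-index_select idea above, assuming the goal is padding a batch of image embeddings to a fixed bucket size; the helper below is illustrative, not the actual hpu_model_runner code.

import torch

def pad_to_bucket(embeds: torch.Tensor, bucket_size: int) -> torch.Tensor:
    # embeds: [num_images, tokens, hidden]; assumes bucket_size >= num_images > 0.
    # Repeating rows via index_select keeps the op and its input shapes identical
    # across calls, so the captured HPU graph can be reused; per the commit above,
    # plain slicing did not hash to the same graph for the same input shape.
    indices = torch.arange(bucket_size) % embeds.shape[0]
    return embeds.index_select(0, indices)

x = torch.randn(3, 256, 1152)
print(pad_to_bucket(x, 4).shape)  # torch.Size([4, 256, 1152])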

commit 2d550e4
Author: Chendi.Xue <chendi.xue@intel.com>
Date:   Thu Aug 7 07:16:45 2025 -0500

    skip softmax/log_softmax when greedy_sampling with no logprobs (HabanaAI#1706)

    ## Purpose

    If all seq_groups use sampling_type == Greedy and no logprobs are requested,
    skip the log_softmax/softmax computation during sampling (see the sketch below).

    ---------

    Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
    Co-authored-by: attafosu <tattafosu@habana.ai>
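
A short sketch of why the skip is safe, assuming all sequences sample greedily and no logprobs are requested: argmax over raw logits equals argmax over softmax(logits), so the (log_)softmax can be dropped. Names are illustrative, not the vLLM sampler API.

import torch

def sample(logits: torch.Tensor, all_greedy: bool, need_logprobs: bool):
    if all_greedy and not need_logprobs:
        # softmax is monotonic, so the argmax of the logits is the greedy token
        return logits.argmax(dim=-1), None
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.argmax(dim=-1), logprobs

tokens, _ = sample(torch.randn(2, 32000), all_greedy=True, need_logprobs=False)
print(tokens.shape)  # torch.Size([2])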

commit 4008f53
Author: Tomasz Thaddey <76682475+tthaddey@users.noreply.github.com>
Date:   Tue Aug 5 14:14:35 2025 +0200

    docker vllm: add VLLM_EXPONENTIAL_BUCKETING param (HabanaAI#1682)

    docker vllm: add VLLM_EXPONENTIAL_BUCKETING param

    ---------

    Signed-off-by: Tomasz Thaddey <tthaddey@habana.ai>

commit a12463c
Author: Chendi.Xue <chendi.xue@intel.com>
Date:   Mon Aug 4 03:09:56 2025 -0500

    [SW-235047] port PR 1629 to 1.22.0 - use w8a8 path for per_channel to fix a performance regression (HabanaAI#1644)

    HabanaAI#1629

    ---------

    Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
    Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
    Co-authored-by: Jan Kaniecki <jan.kaniecki@intel.com>

commit 0884eb4
Author: Jimin Ha <jimin.ha@intel.com>
Date:   Fri Aug 1 05:42:09 2025 -0700

    Gemma3 v1.22  changes (Sliding_Window feature  + few others) (HabanaAI#1660)

    This PR contains the following changes:
    1. Port the Gemma3 SLIDING_WINDOW FusedSDPA feature from habana_main and add
    a few extra fixes, including:
    - Sliding FusedSDPA kernel: add a threshold variable to enable or disable the
    optimized kernel. The kernel gives performance/memory benefits for longer
    sequences; an environment variable is provided to control it per customer
    request.
    - Based on the threshold, choose a different prompt bucket: if the sequence
    is smaller than the threshold, use PROMPT_BUCKET_STEP, otherwise use
    SLICE_SIZE (see the sketch below).
    - Added mark_step before the sliding FusedSDPA is run.
    - Misc fixes for bucket-related issues.
    2. Upstream fixes:
    vllm-project#18732
    vllm-project#21479
    vllm-project#19788

    3. Optimized Gemma3RMSNorm with FusedRMSNorm.
    Dependent on HabanaAI#1647

    Run command:
    VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false
    VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024
    VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1

    ---------

    Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
    Signed-off-by: Hongmin Fan <fanhongmin@google.com>
    Co-authored-by: Henry Tang <ytang@habana.ai>
    Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
    Co-authored-by: Shiv Kaul <skaul@habana.ai>
    Co-authored-by: Shiv Kaul <shiv.kaul@intel.com>
    Co-authored-by: Libin Tang <libin.tang@intel.com>
    Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
    Co-authored-by: Hongmin Fan <fanhongmin@google.com>
    Co-authored-by: Harish Subramony <hsubramony@habana.ai>
    Co-authored-by: Jianhong-Zhang <jianhong.zhang@intel.com>
    Co-authored-by: Libin Tang <litang@habana.ai>
    Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
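
A minimal sketch of the threshold-based bucket choice described above; the constants and function are illustrative assumptions, not the actual bucketing code (VLLM_FUSEDSDPA_SLIDE_THLD=2048 is the threshold used in the run command, SLICE_SIZE=4096 is assumed here).

import math

def choose_prompt_bucket(seq_len: int, threshold: int = 2048,
                         bucket_step: int = 1024, slice_size: int = 4096) -> int:
    # below the sliding-window threshold, pad to the regular prompt bucket step;
    # at or above it, pad to a multiple of the slice size used by the sliding kernel
    step = bucket_step if seq_len < threshold else slice_size
    return math.ceil(seq_len / step) * step

print(choose_prompt_bucket(1500))   # 2048
print(choose_prompt_bucket(6000))   # 8192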

commit 065fde3
Author: Jan Kaniecki <jan.kaniecki@intel.com>
Date:   Thu Jul 31 15:42:13 2025 +0200

    Remove inference_mode() from platforms.hpu (HabanaAI#1690)

    inference_mode() was causing recompilations with torch.compile - we don't
    need it, as we already apply inference_mode to the particular functions in
    the model runner. It was introduced by Rebase 0.9.0.1
    (HabanaAI#1507); previously we didn't
    have such a call.
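
A hedged illustration of the change: instead of enabling inference_mode() globally on the platform, it is applied only to the specific model-runner methods that need it, which avoids the torch.compile recompilations mentioned above. The class and method names are illustrative.

import torch

class HPUModelRunnerSketch:
    @torch.inference_mode()
    def execute_model(self, hidden: torch.Tensor) -> torch.Tensor:
        # autograd tracking is disabled only inside this call, not process-wide
        return hidden * 2

print(HPUModelRunnerSketch().execute_model(torch.ones(2)))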

commit 7d6528e
Author: Krzysztof Smusz <ksmusz@habana.ai>
Date:   Wed Jul 30 12:19:34 2025 +0200

    Set hpu-extension to 61dafb3 (HabanaAI#1683)

    Upgrading vllm-hpu-extension with a change introducing the fix for
    unsupported block_softmax_adjustment in fp16 precision.

commit ff9bff9
Author: Iryna Boiko <iboiko@habana.ai>
Date:   Tue Jul 29 09:19:29 2025 +0200

    Remove dtype.float16 support for hpu config (HabanaAI#1650)

commit 034c756
Author: Chendi.Xue <chendi.xue@intel.com>
Date:   Tue Jul 29 02:17:44 2025 -0500

    [SW-234344] Fix 'RotaryEmbedding' object has no attribute 'sin' (HabanaAI#1659)

    ## Purpose

    Port the commit from HabanaAI#1658 to fix SW-234344 on habana_main.

    Signed-off-by: Chendi.Xue <chendi.xue@intel.com>

commit e5a6120
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Tue Jul 29 08:53:48 2025 +0200

    1.22 Warmup one context more - linear - Update sha extension (HabanaAI#1655)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
    Co-authored-by: Jan Kaniecki <jan.kaniecki@intel.com>

commit 9957ca7
Author: Michał Kuligowski <mkuligowski@habana.ai>
Date:   Tue Jul 29 08:52:48 2025 +0200

    ValueError: 'aimv2' is already used by a Transformers config, pick an… (HabanaAI#1673)

    Fix cherrypicked from upstream
    https://github.com/vllm-project/vllm/pull/20921/files

commit f1b60b4
Author: Mohit Deopujari <mdeopujari@habana.ai>
Date:   Thu Jul 24 08:07:04 2025 -0700

    Gemma3 support: propagation: pr1589/1597/1558 to v1.22.0_next (HabanaAI#1616)

    Added support for FusedSDPA kernel with window_size for Gemma3.
    This PR relies on vllm-hpu-extension
    [PR302](HabanaAI/vllm-hpu-extension#302)

    ---------

    Co-authored-by: Shiv Kaul <skaul@habana.ai>
    Co-authored-by: Shiv Kaul <shiv.kaul@intel.com>
    Co-authored-by: Jimin Ha <jimin.ha@intel.com>
    Co-authored-by: Henry Tang <ytang@habana.ai>
    Co-authored-by: Libin Tang <litang@habana.ai>
    Co-authored-by: Libin Tang <libin.tang@intel.com>
    Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>

commit 59b8f75
Author: Artur Fierka <artur.fierka@intel.com>
Date:   Thu Jul 24 13:11:57 2025 +0200

    Update hpu.txt on 1.22.0 branch (HabanaAI#1648)

    Set extension SHA for Port: Fix: Round up to sliding window threshold
    HabanaAI#307 (HabanaAI#309)

commit d6b00f4
Author: Artur Fierka <artur.fierka@intel.com>
Date:   Wed Jul 23 15:50:14 2025 +0200

    [Security] Fix: Bad use of null-like value (HabanaAI#1634)

    Signed-off-by: Artur Fierka <artur.fierka@intel.com>

commit 66858d6
Author: Artur Fierka <artur.fierka@intel.com>
Date:   Wed Jul 23 15:48:53 2025 +0200

    [Security] Fix: Structurally dead code (HabanaAI#1625)

    Remove dead code for security reason

    Signed-off-by: Artur Fierka <artur.fierka@intel.com>

commit 33fbed4
Author: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Date:   Tue Jul 22 12:49:42 2025 +0200

    Update sha - Port: Fix fallback bucket (HabanaAI#1626)

    Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

commit 1b46f4c
Author: Seunghyuk Park (shepark) <seunghyuk.h.park@intel.com>
Date:   Tue Jul 22 00:52:50 2025 -0700

    Embedding fix: warmup failure in embedding model (HabanaAI#1510) (HabanaAI#1559)

    Merge changes from habana_main for embedding fix
    HabanaAI#1510

    ---- details ----
    Fix the failures at the warmup stage in pooling mode, due to:

    [rank0]: File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line 2904, in warmup_model
    [rank0]:   self.warmup_graphs(
    [rank0]: File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line 2714, in warmup_graphs
    [rank0]:   self.warmup_scenario(batch_size,
    [rank0]: File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line 2561, in warmup_scenario
    [rank0]:   inputs = self.prepare_model_input_align_worker(
    [rank0]: File "/wm/vllm-fork/vllm/worker/model_runner_base.py", line 233, in prepare_model_input_align_worker
    [rank0]:   raise NotImplementedError
    [rank0]: NotImplementedError

    Co-authored-by: Libin Tang <litang@habana.ai>
    Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>

commit 062f345
Author: Karol Damaszke <kdamaszke@habana.ai>
Date:   Fri Jul 18 17:02:42 2025 +0200

    Fix text-only prompt in Llama Vision (HabanaAI#1621)

    Fixes text-only prompts in Llama Vision. Without setting
    `max_encoder_seq_lens` we are not skipping `cross_attention` for
    text-only prompts, which results in `None` `key` and `value` tensors
    (see the sketch below).

    Signed-off-by: Karol Damaszke <kdamaszke@habana.ai>
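
A minimal sketch of the skip described above, assuming a per-request max encoder sequence length of zero for text-only prompts; the function is illustrative, not the actual Llama Vision attention code.

import torch

def maybe_cross_attention(hidden: torch.Tensor,
                          encoder_states: torch.Tensor | None,
                          max_encoder_seq_len: int) -> torch.Tensor:
    if max_encoder_seq_len == 0 or encoder_states is None:
        # text-only prompt: no encoder tokens, so cross-attention is skipped and
        # the (absent) key/value tensors are never touched
        return hidden
    # ... otherwise compute cross-attention against encoder_states here ...
    return hidden

print(maybe_cross_attention(torch.ones(1, 4, 8), None, 0).shape)  # torch.Size([1, 4, 8])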
