Fixes HPU graph run for Gemma3 vision inputs #1719
Merged
Conversation
- Text warmup now includes attention-mask info, so vision inputs can reuse the same HPU graph that was already warmed up for the language model.
- Changed slicing to index_select in multimodal bucketing so the HPU graph is actually reused (slicing produces a different hash key even for the same input shape).

Without this fix, when the bucket size is 1 (i.e. the HPU graph is being reused), the output buffer is reused across iterations, causing accuracy issues.
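For illustration, a minimal PyTorch sketch of the two fixes; the helper names are hypothetical and not the actual hpu_model_runner code:

```python
import torch

def pad_to_bucket(embeddings: torch.Tensor, bucket_size: int) -> torch.Tensor:
    # Hypothetical helper: select the first `bucket_size` rows for a multimodal
    # bucket. A Python slice returns a view that can hash differently between
    # runs, so a cached HPU graph may not be reused; index_select materializes
    # a tensor of the bucketed shape and keeps the graph hash key stable.
    indices = torch.arange(bucket_size, device=embeddings.device)
    # return embeddings[:bucket_size]            # slicing: hash key differs
    return embeddings.index_select(0, indices)   # index_select: graph reused

def project_and_clone(projector: torch.nn.Module,
                      vision_embeds: torch.Tensor) -> torch.Tensor:
    # When the same graph is replayed (bucket of 1), its output buffer is
    # shared across iterations; cloning copies the result out of that buffer
    # so a later replay cannot overwrite it.
    return projector(vision_embeds).clone()
```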
mgawarkiewicz-intel approved these changes on Aug 8, 2025
lgtm
/run-gaudi-tests
Qwen test is failing due to a TPC issue.
/run-gaudi-tests
tianyuan211 added a commit to tianyuan211/vllm-fork that referenced this pull request on Aug 14, 2025
The referenced commit's message bundles the following ported changes:

commit 95f5008 (Wei Lin, Aug 13, 2025): Porting DeepSeek v2/r1 PRs (HabanaAI#1756). Porting list: HabanaAI#1402, HabanaAI#1504, HabanaAI#1404.
commit fd41376 (Bob Zhu, Aug 13, 2025): Link to the correct vllm-hpu-extension branch (HabanaAI#1755). The vllm-fork aice/v1.22.0 branch will always use the vllm-hpu-extension aice/v1.22.0 branch.
commit 6693645 (Wei Lin, Aug 13, 2025): Add profiler for HPU (HabanaAI#1753).
commit 26d4308 (Katarzyna Fojcik, Aug 12, 2025): [SW-234805] Fix target_device for weights load (HabanaAI#1734). Rebase 0.9.0.1 (HabanaAI#1507) changed model loading; target_device during model and weights loading should again depend on the load_device config so that --weights-load-device=cpu works. Test llama-31-70b-fp8-1x-gaudi3 PASSED: https://qa-jenkins-ctrl03.habana-labs.com/job/qa_jobs/job/qa_testers/job/gdn-qa/job/pytorch/job/gaudi3/job/continous_batching/job/VLLM/job/Native/job/llama-31-70b-fp8-1x-gaudi3-native_benchmark_throughput_VLLM_pytorch_gaudi3-ank8s_v1_22_0/108/
commit 4118669 (Agata Dobrzyniewicz, Aug 12, 2025): Fix merged prefill with new bucketing manager (HabanaAI#1746).
commit d837abf (PatW, Aug 12, 2025): 1.22 release readme change (HabanaAI#1718). Collection of documentation changes for the 1.22 release.
commit cb44d6f (Jimin Ha, Aug 11, 2025): Fixes HPU graph run for Gemma3 vision inputs (HabanaAI#1719). Text warmup now includes attn_mask info so vision+text data can reuse the language-model graph that is already warmed up; slicing changed to index_select for multimodal bucketing (slicing does not produce the same hash for the HPU graph given the same input shape); buckets are also used for the vision tower to reduce GC recompiles; accuracy fix by cloning the output of the multimodal projector. Validated with MuirBench datasets.
commit 2d550e4 (Chendi.Xue, Aug 7, 2025): Skip softmax/log_softmax for greedy sampling with no logprobs (HabanaAI#1706). If all seq_groups use greedy sampling, avoid log_softmax/softmax during the sampling calculation.
commit 4008f53 (Tomasz Thaddey, Aug 5, 2025): docker vllm: add VLLM_EXPONENTIAL_BUCKETING param (HabanaAI#1682).
commit a12463c (Chendi.Xue, Aug 4, 2025): [SW-235047] Port PR 1629 to 1.22.0 - use the w8a8 path for per_channel to fix a performance regression (HabanaAI#1644). See HabanaAI#1629.
commit 0884eb4 (Jimin Ha, Aug 1, 2025): Gemma3 v1.22 changes (sliding-window feature plus a few others) (HabanaAI#1660). Ports the Gemma3 sliding-window FusedSDPA feature from habana_main with extra fixes: a threshold variable (exposed as an environment variable per customer request) to enable or disable the optimized sliding FusedSDPA kernel, which gives performance/memory benefits for longer sequences; prompt-bucket selection based on that threshold (PROMPT_BUCKET_STEP below it, SLICE_SIZE otherwise); a mark_step added before the sliding FusedSDPA runs; misc bucket-related fixes. Also includes upstream fixes vllm-project#18732, vllm-project#21479, vllm-project#19788 and an optimized Gemma3RMSNorm using FusedRMSNorm. Depends on HabanaAI#1647. Run with: VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024 VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1
commit 065fde3 (Jan Kaniecki, Jul 31, 2025): Remove inference_mode() from platforms.hpu (HabanaAI#1690). inference_mode() causes recompilations with torch.compile and is not needed since inference_mode is already applied to the relevant model-runner functions; it was introduced by Rebase 0.9.0.1 (HabanaAI#1507).
commit 7d6528e (Krzysztof Smusz, Jul 30, 2025): Set hpu-extension to 61dafb3 (HabanaAI#1683). Upgrades vllm-hpu-extension with the fix for unsupported block_softmax_adjustment in fp16 precision.
commit ff9bff9 (Iryna Boiko, Jul 29, 2025): Remove dtype.float16 support for hpu config (HabanaAI#1650).
commit 034c756 (Chendi.Xue, Jul 29, 2025): [SW-234344] Fix "'RotaryEmbedding' object has no attribute 'sin'" (HabanaAI#1659). Ports the fix from HabanaAI#1658 to habana_main.
commit e5a6120 (Agata Dobrzyniewicz, Jul 29, 2025): 1.22 Warmup one more context - linear - update extension SHA (HabanaAI#1655).
commit 9957ca7 (Michał Kuligowski, Jul 29, 2025): Fix "ValueError: 'aimv2' is already used by a Transformers config" (HabanaAI#1673). Cherry-picked from upstream https://github.com/vllm-project/vllm/pull/20921/files
commit f1b60b4 (Mohit Deopujari, Jul 24, 2025): Gemma3 support: propagation of PRs 1589/1597/1558 to v1.22.0_next (HabanaAI#1616). Adds support for the FusedSDPA kernel with window_size for Gemma3; relies on vllm-hpu-extension PR302 (HabanaAI/vllm-hpu-extension#302).
commit 59b8f75 (Artur Fierka, Jul 24, 2025): Update hpu.txt on the 1.22.0 branch (HabanaAI#1648). Sets the extension SHA for "Fix: Round up to sliding window threshold" (HabanaAI#307, HabanaAI#309).
commit d6b00f4 (Artur Fierka, Jul 23, 2025): [Security] Fix: Bad use of null-like value (HabanaAI#1634).
commit 66858d6 (Artur Fierka, Jul 23, 2025): [Security] Fix: Structurally dead code (HabanaAI#1625). Removes dead code for security reasons.
commit 33fbed4 (Agata Dobrzyniewicz, Jul 22, 2025): Update SHA - Port: Fix fallback bucket (HabanaAI#1626).
commit 1b46f4c (Seunghyuk Park, Jul 22, 2025): Embedding fix: warmup failure in embedding model (HabanaAI#1510) (HabanaAI#1559). Merges the embedding fix from habana_main; resolves the NotImplementedError raised from prepare_model_input_align_worker (via warmup_model -> warmup_graphs -> warmup_scenario in hpu_model_runner.py) during warmup in pooling mode.
commit 062f345 (Karol Damaszke, Jul 18, 2025): Fix text-only prompt in Llama Vision (HabanaAI#1621). Without setting max_encoder_seq_lens, cross_attention is not skipped for text-only prompts, which results in None key and value.
commit 449fa92 (Tomasz Thaddey, Jul 17, 2025): docker vllm: update readme (HabanaAI#1596).
commit 22ee396 (Michal Adamczyk, Jul 17, 2025): [1.22] Set vllm-hpu-extension to 22abb7a (HabanaAI#1611).
commit 37888b5 (Agata Dobrzyniewicz, Jul 17, 2025): Port: V1 - don't look for buckets we know don't exist (HabanaAI#1606) (HabanaAI#1608).
commit 18d51d1 (Agata Dobrzyniewicz, Jul 16, 2025): Readme update - don't use APC on V0 (HabanaAI#1607).
commit 9b1675c (Agata Dobrzyniewicz, Jul 16, 2025): Port: Num blocks fix - V1 (HabanaAI#1594) (HabanaAI#1601).
commit bdd9171 (Yi Liu, Jul 15, 2025): Update Force Channel FP8 Check (HabanaAI#1563). Ports HabanaAI#1561.
commit 23e63c0 (liuzhenwei, Jul 15, 2025): [V0] Use device as the set_device parameter by default, update proxy (HabanaAI#1582). Cherry-picked from HabanaAI#1540 (https://jira.habana-labs.com/browse/SW-234257).
commit 82fc060 (Iryna Boiko, Jul 14, 2025): Change vllm-hpu-extension revision to 89515f6 (HabanaAI#1584).
commit 47768d3 (Iryna Boiko, Jul 14, 2025): Port: temporarily disable deepseek test (HabanaAI#1535) (HabanaAI#1586); also updates the hpu-ext SHA.
commit f1c70dc (Michał Kuligowski, Jul 14, 2025): Fix "AttributeError: 'NoneType' object has no attribute 'getenv'" during test teardown (HabanaAI#1555).
commit 617498a (Agata Dobrzyniewicz, Jul 14, 2025): Readme warmup update (HabanaAI#1512) (HabanaAI#1585).
commit 8bb429d (Tomasz Pawlowski, Jul 11, 2025): Add accelerate to requirements/hpu.txt (HabanaAI#1564) (v1.22.0) (HabanaAI#1566).
commit aca2ddc (Tomasz Thaddey, Jul 11, 2025): docker vllm: add server config for model Qwen/Qwen2.5-VL-7B-Instruct (HabanaAI#1569).
commit 512caed (Tomasz Thaddey, Jul 10, 2025): docker vllm: cleanup configs and add missing models (HabanaAI#1548).
commit 7b69f70 (PatW, Jul 8, 2025): Cherry-pick "docker vllm: update readme" from habana_main (HabanaAI#1525) (HabanaAI#1538).
commit 79ef0d5 (Michal Szutenberg, Jul 8, 2025): [SW-234006] Fix requirements (1.22.0) (HabanaAI#1530). See https://jira.habana-labs.com/browse/SW-234006?focusedId=1073396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1073396
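As context for the ported sampling change (commit 2d550e4), a minimal sketch of the idea; the names are hypothetical and not vLLM's actual sampler API:

```python
import torch

def sample_greedy_fast(logits: torch.Tensor, all_greedy: bool,
                       need_logprobs: bool) -> torch.Tensor:
    # When every sequence group samples greedily and no logprobs are requested,
    # argmax over the raw logits picks the same token as argmax over the
    # (log_)softmax output, because softmax is monotonic, so the normalization
    # can be skipped entirely.
    if all_greedy and not need_logprobs:
        return logits.argmax(dim=-1)
    logprobs = torch.log_softmax(logits, dim=-1)  # only needed for logprobs output
    return logprobs.argmax(dim=-1)
```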
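And for the threshold-based prompt-bucket selection described in commit 0884eb4, a rough sketch under the same caveat (hypothetical names, assumed round-up-to-step bucketing):

```python
def choose_prompt_bucket(seq_len: int, slide_threshold: int,
                         prompt_bucket_step: int, slice_size: int) -> int:
    # Prompts shorter than the FusedSDPA sliding-window threshold are padded
    # with the regular PROMPT_BUCKET_STEP granularity; longer prompts are
    # padded to a multiple of SLICE_SIZE so the sliced sliding-window kernel
    # sees aligned shapes.
    step = prompt_bucket_step if seq_len < slide_threshold else slice_size
    return -(-seq_len // step) * step  # round seq_len up to a multiple of step
```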