hexagon: eliminate scalar VTCM loads via HVX splat helpers by trivikram-reddy1 · Pull Request #22993 · ggml-org/llama.cpp

trivikram-reddy1 · 2026-05-12T23:44:33Z

Overview

Scalar loads from VTCM are expensive on Hexagon. This PR removes the scalar VTCM loads in matmul and flash attention, replacing each with an HVX vector load followed by a splat (vdelta) operation.

Additional information

Add hvx_vec_repl helpers and use them for the splat-from-VTCM use case.
Weight dequantization is significantly faster on the matmul path; the per-group scale stage was a measurable bottleneck before.
Flash attention slope handling is no longer gated on a scalar VTCM read.
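To make the load-then-splat idea concrete, here is a minimal portable C model of it. It is only a sketch and is not the PR's code: `vec_f32`, `vec_load_containing`, `vec_repl_lane`, and `vec_splat_from_mem` are hypothetical names, the plain loop stands in for the vdelta-based replication, and the source buffer is assumed to be 128-byte aligned and padded to whole vectors.

```c
#include <stdint.h>
#include <string.h>

#define VLEN  128                          /* HVX vector width in bytes */
#define LANES (VLEN / (int) sizeof(float)) /* 32 fp32 lanes per vector */

/* Portable stand-in for one HVX vector register (hypothetical type). */
typedef struct { float lane[LANES]; } vec_f32;

/* Load the 128-byte-aligned vector that contains *p.
 * Assumption: the buffer is aligned and padded to whole vectors, so the
 * full-vector read never leaves the allocation. */
static vec_f32 vec_load_containing(const float *p) {
    uintptr_t base = (uintptr_t) p & ~(uintptr_t) (VLEN - 1);
    vec_f32 v;
    memcpy(&v, (const void *) base, sizeof v);
    return v;
}

/* Replicate lane `idx` across the whole vector. On HVX this would be a
 * permute (the PR mentions vdelta); the loop here only models the result. */
static vec_f32 vec_repl_lane(vec_f32 v, unsigned idx) {
    vec_f32 out;
    for (int i = 0; i < LANES; i++) {
        out.lane[i] = v.lane[idx];
    }
    return out;
}

/* Splat the value at *p without a scalar load: vector load, then replicate. */
static vec_f32 vec_splat_from_mem(const float *p) {
    unsigned idx = (unsigned) (((uintptr_t) p & (VLEN - 1)) / sizeof(float));
    return vec_repl_lane(vec_load_containing(p), idx);
}
```

For example, calling `vec_splat_from_mem(&scales[37])` on a 128-byte-aligned `float scales[64]` loads the second vector (elements 32 to 63) and broadcasts its lane 5 to all 32 lanes. The value never passes through a scalar register, which is the property the PR relies on.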

Results on Snapdragon 8 Elite Gen 5

unsloth/Qwen3-4B-GGUF/Qwen3-4B-Q4_0.gguf (44% improvement in prefill TPS; prompt eval time, i.e. TTFT, drops by about 31%)
Before:
prompt eval time = 2909.89 ms / 946 tokens ( 3.08 ms per token, 325.10 tokens per second)
eval time = 3179.95 ms / 63 runs ( 50.48 ms per token, 19.81 tokens per second)

After:
prompt eval time = 2020.47 ms / 946 tokens ( 2.14 ms per token, 468.21 tokens per second)
eval time = 3213.10 ms / 63 runs ( 51.00 ms per token, 19.61 tokens per second)
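The percentages can be rechecked directly from the log lines above. The helper below is a small sketch for that arithmetic; `pct_change` and `report` are illustrative names, not part of llama.cpp.

```c
#include <stdio.h>

/* Relative change of `after` versus `before`, in percent. */
static double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}

/* Print the three comparisons using the numbers from the logs above. */
static void report(void) {
    printf("prefill tokens/s: %+.1f%%\n", pct_change(325.10, 468.21));
    printf("prompt eval time: %+.1f%%\n", pct_change(2909.89, 2020.47));
    printf("decode tokens/s:  %+.1f%%\n", pct_change(19.81, 19.61));
}
```

Prefill throughput rises by about 44%, which corresponds to a prompt eval time about 31% shorter. Decode throughput moves by roughly 1%, which suggests the change mainly affects the prefill path.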

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

trivikram-reddy1 · 2026-05-12T23:48:41Z

@max-krasnyansky @lhez, could you please review this PR?

…22993)
* hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase
* hmx-mm: optimize per-group scale handling
* hmx-fa: optimize slope load from vtcm
* hmx-fa: use aligned access where possible in hmx-utils
* hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

max-krasnyansky and others added 5 commits May 11, 2026 16:29

hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase

219d281

hmx-mm: optimize per-group scale handling

2448526

hmx-fa: optimize slope load from vtcm

ce799b9

hmx-fa: use aligned access where possible in hmx-utils

e6171dd

hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers

72ebe04

trivikram-reddy1 requested a review from a team as a code owner May 12, 2026 23:44

github-actions bot added the script, ggml, and Hexagon labels May 12, 2026

lhez approved these changes May 12, 2026

max-krasnyansky approved these changes May 13, 2026

max-krasnyansky merged commit 856c3ad into ggml-org:master May 13, 2026
47 of 50 checks passed

trivikram-reddy1 deleted the tr/hvx-splat-vtcm branch May 14, 2026 15:40

njsyw1997 added a commit to aizip/llama.cpp that referenced this pull request May 20, 2026

hexagon: apply repl optimization in flash attn softmax as ggml-org#22993

f0e4150

max-krasnyansky merged 5 commits into ggml-org:master from qualcomm:tr/hvx-splat-vtcm
