Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18295

Also handle a trailing GGML_OP_SCALE at the end of the fused sequence (nemotron, deepseek2).
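
For illustration, here is a minimal sketch of what accepting an optional trailing GGML_OP_SCALE in the fused node sequence could look like. The ggml_tensor fields used (op, op_params, src) are real, but the helper and its surrounding matcher logic are hypothetical, not the backend's actual fusion code:

```cpp
#include <cstring>
#include "ggml.h"

// Hypothetical helper: if the node that consumes the topk_moe output is a
// GGML_OP_SCALE, return its scale factor so the fused shader can apply it.
// Sketch only -- the real matching logic in the Vulkan backend differs.
static bool topk_moe_trailing_scale(const ggml_tensor * moe_out,
                                    const ggml_tensor * next,
                                    float * scale) {
    if (next == nullptr || next->op != GGML_OP_SCALE || next->src[0] != moe_out) {
        return false;
    }
    // GGML_OP_SCALE stores its scale factor as a float in op_params[0].
    memcpy(scale, next->op_params, sizeof(float));
    return true;
}
```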

Fewer pipeline variants and spec constants: pass the varying parameters as push constants instead.
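
As a rough illustration of the push-constant approach (the struct layout and helper below are invented, not the backend's actual code), the per-dispatch parameters are written into a small push-constant block at record time, so one pipeline covers all parameter combinations:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Hypothetical parameter block for the fused top-k MoE shader; the real
// field layout in the Vulkan backend may differ.
struct topk_moe_push_constants {
    uint32_t n_experts;
    uint32_t n_expert_used;
    uint32_t has_scale;   // 1 if a trailing GGML_OP_SCALE is fused in
    float    scale;
};

static void record_topk_moe(VkCommandBuffer cmd, VkPipelineLayout layout,
                            const topk_moe_push_constants & pc) {
    // Standard Vulkan call: the data becomes visible to the shader's
    // push_constant block without compiling a spec-constant pipeline variant.
    vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT,
                       0, sizeof(pc), &pc);
}
```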

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.
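
In ggml terms, the expert bias in real networks is a per-expert 1D tensor that broadcasts over the token dimension when added to the router logits, so the test tensor presumably changes shape along these lines (a sketch with illustrative dimension names, not the actual test code):

```cpp
#include "ggml.h"

// Sketch: exp_probs_b as a 1D per-expert bias, broadcast-added to
// [n_experts, n_tokens] router logits, matching how real networks store it.
static ggml_tensor * build_biased_logits(ggml_context * ctx,
                                         int64_t n_experts, int64_t n_tokens) {
    ggml_tensor * logits      = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_experts, n_tokens);
    ggml_tensor * exp_probs_b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_experts); // was 2D before
    // ggml_add broadcasts the 1D bias across the token dimension.
    return ggml_add(ctx, logits, exp_probs_b);
}
```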

Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified.
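
Conceptually, the fusion check now has to compare every node flagged as an output against the reference backend, not just the final node of the fused span. A hedged sketch of that loop (the helper names are invented for illustration; the actual test-backend-ops plumbing differs):

```cpp
#include <vector>
#include "ggml.h"

// Hypothetical comparison helper: returns true if the backend-under-test
// tensor matches the reference backend's tensor within tolerance.
bool tensors_match(const ggml_tensor * test, const ggml_tensor * reference);

// Sketch: verify every output of a fused subgraph.  topk_moe produces two
// outputs (selected expert ids and normalized weights), so checking only
// the final node would leave the other output unverified.
static bool check_fusion_outputs(const std::vector<const ggml_tensor *> & test_outs,
                                 const std::vector<const ggml_tensor *> & ref_outs) {
    if (test_outs.size() != ref_outs.size()) {
        return false;
    }
    for (size_t i = 0; i < test_outs.size(); ++i) {
        if (!tensors_match(test_outs[i], ref_outs[i])) {
            return false;
        }
    }
    return true;
}
```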

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       269.32 ± 13.22 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        260.52 ± 1.17 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        267.10 ± 5.18 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       340.67 ± 22.33 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        356.88 ± 9.24 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       333.40 ± 12.02 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       288.13 ± 13.10 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        284.81 ± 2.36 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        289.09 ± 3.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       343.03 ± 19.78 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        355.02 ± 4.88 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        353.27 ± 0.69 |

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

I've generated a summary report for your project. The report shows performance analysis for pull request #714 in the llama.cpp repository (owned by auroralabs-loci), comparing two versions of the code.

Key highlights:

  • Response times improved by 6.78% to 12.32% for the affected functions
  • Throughput decreased by 9.05% to 15.61% for the same functions
  • The changes primarily affect standard library operations (vector allocation and swap functions)

The report suggests that while individual operations are faster, there may be trade-offs in overall throughput that warrant further investigation.

@loci-dev force-pushed the upstream-PR18295-branch_jeffbolznv-topk_moe_sigmoid_bias branch from 75bcc84 to bfbd40e on December 27, 2025 02:14
@loci-dev force-pushed the upstream-PR18295-branch_jeffbolznv-topk_moe_sigmoid_bias branch from bfbd40e to 03b18c9 on December 27, 2025 03:03
@loci-dev force-pushed the upstream-PR18295-branch_jeffbolznv-topk_moe_sigmoid_bias branch from 03b18c9 to 86df563 on December 27, 2025 03:49
@loci-dev force-pushed the main branch 13 times, most recently from 8645b59 to f2e8c7f on December 29, 2025 00:40