Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution #11525

ORippler · 2025-07-25T13:51:53Z

This PR enables the execution of Gemma3n as CUDA Graphs on NVGPUs by porting ggml-org/llama.cpp#14741 to ollama. Since the model graph is defined differently in ollama compared to llama.cpp, the heuristic used to identify and exclude the per_layer_projection from batch-size determination needed to be modified a bit. As a consequence, the patch will need to be maintained even after llama.cpp is updated to a commit that contains ggml-org/llama.cpp#14741.
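To make the idea of the heuristic concrete, here is a minimal sketch, not the code from this patch. It assumes the backend disables CUDA Graphs when it sees an add node whose batch dimension is larger than 1, and that the per-layer projection must be skipped because it has such a shape even during single-token decode. `Op`, `Node`, `n_cols`, `graph_is_capturable`, and the name prefix passed in are all hypothetical stand-ins, not ggml or ollama identifiers.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for graph nodes; NOT the real ggml types.
enum class Op { Add, MulMat, Other };

struct Node {
    Op op;
    std::string name;
    int64_t n_cols;  // stand-in for the dimension used to infer the batch size
};

// Returns true if no add node (other than the excluded projection) has a
// batch dimension above 1, i.e. the graph looks like single-token decode.
bool graph_is_capturable(const std::vector<Node>& nodes,
                         const std::string& excluded_prefix) {
    for (const Node& n : nodes) {
        if (n.op != Op::Add || n.n_cols <= 1) {
            continue;  // node does not hint at a batch size above 1
        }
        if (n.name.rfind(excluded_prefix, 0) == 0) {
            continue;  // assumed: projection node, ignored for batch-size checks
        }
        return false;  // a genuine batched add: fall back to regular launches
    }
    return true;
}
```

Matching on a node name is only one way to express the exclusion; the actual patch may identify the projection differently, which is why the description notes that the check differs between ollama and llama.cpp.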

On an RTX PRO 6000 Max-Q under Windows, this PR improves throughput by roughly 2.3x to 2.4x (CG = CUDA Graphs); see the table below.

| Model | Configuration | Tokens/sec |
| --- | --- | --- |
| gemma3n:e2b | CG ON | 103 |
| gemma3n:e2b | CG OFF | 43 |
| gemma3n:e4b | CG ON | 79 |
| gemma3n:e4b | CG OFF | 35 |
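The speedup per model follows directly from the measurements above as tokens/sec with CUDA Graphs divided by tokens/sec without. A small sketch of that arithmetic, where `speedup` is a helper name introduced here for illustration:

```cpp
#include <cassert>

// Ratio of throughput with CUDA Graphs enabled to throughput with them disabled.
double speedup(double tokens_per_sec_on, double tokens_per_sec_off) {
    assert(tokens_per_sec_off > 0.0);  // guard against division by zero
    return tokens_per_sec_on / tokens_per_sec_off;
}
```

With the reported numbers, `speedup(103, 43)` is about 2.40 for gemma3n:e2b and `speedup(79, 35)` is about 2.26 for gemma3n:e4b.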

Thanks @mxyng for providing the changes to the gemma3n model graph definition in c4de3ea that make the check more robust.

mxyng

this is awesome! thanks for sharing

* origin/main:
  Revert "CI: switch back to x86 macos builder" (ollama#11588)
  mac: disable bf16 on unsupported OS versions (ollama#11585)
  CI: switch back to x86 macos builder (ollama#11572)
  Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (ollama#11525)
  kvcache: Don't shift empty batches
  docs: fix typos and remove trailing whitespaces (ollama#11554)

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…aph execution (ollama#11525)

* Enable CUDA Graphs for gemma3n.

  Similar to ggml-org/llama.cpp#14741, though ollama has a slightly different model graph than llama.cpp which requires different workaround checks.

* Remove residual check by reshaping differently in gemma3n model

  This should make the heuristics more robust

ORippler added 2 commits July 25, 2025 12:14

Enable CUDA Graphs for gemma3n.

a86286d

Similar to ggml-org/llama.cpp#14741, though ollama has a slightly different model graph than llama.cpp which requires different workaround checks.

Remove residual check by reshaping differently in gemma3n model

c4de3ea

This should make the heuristics more robust

mxyng approved these changes Jul 29, 2025

mxyng merged commit ea85e27 into ollama:main Jul 29, 2025
15 of 16 checks passed

gabe-l-hart mentioned this pull request Jul 30, 2025

Granite four (llama.cpp bump 443e7e7+) #11195

Closed

3 tasks
