
Conversation

@ORippler (Contributor)

This PR enables execution of Gemma3n with CUDA Graphs on NVIDIA GPUs by porting ggml-org/llama.cpp#14741 to ollama. Since the model graph is defined differently in ollama than in llama.cpp, the heuristic used to identify the per_layer_projection and exclude it from batch-size determination had to be adjusted slightly. As a consequence, the patch will need to be maintained even after llama.cpp is updated to a commit that contains ggml-org/llama.cpp#14741.
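For readers unfamiliar with the workaround, here is a rough sketch of the idea (simplified, not the code in this PR; the helper names and the "per_layer_proj" name check are assumptions made purely for illustration): infer the token count from the graph's node shapes while skipping the per-layer projection, whose shape does not track the batch size.

```cpp
// Sketch of the idea only, not the actual patch. Assumption: per-layer
// projection nodes can be recognized via their tensor name; the real
// heuristic in ollama/llama.cpp may key off different properties.
#include <algorithm>
#include <cstring>

#include "ggml.h"

static bool is_per_layer_projection(const struct ggml_tensor * node) {
    // Hypothetical naming convention used purely for this example.
    return strstr(node->name, "per_layer_proj") != nullptr;
}

// Infer the number of tokens in the batch from the compute graph, excluding
// the per-layer projection so its shape cannot inflate the estimate.
static int64_t infer_batch_size(struct ggml_cgraph * gf) {
    int64_t n_tokens = 1;
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        const struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (is_per_layer_projection(node)) {
            continue; // excluded from batch-size determination
        }
        // For ordinary activations, ne[1] tracks the token count.
        n_tokens = std::max<int64_t>(n_tokens, node->ne[1]);
    }
    return n_tokens;
}
```

The exclusion matters because the projection's second dimension is presumably a multiple of the token count rather than the token count itself, so including it would make single-token decode steps look like larger batches and keep the CUDA Graph path from being used.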

On an RTX PRO 6000 Max-Q under Windows, this PR improves performance by ~2.5x; see the table below.

| Model | Configuration | Tokens/sec |
| --- | --- | --- |
| gemma3n:e2b | CG ON | 103 |
| gemma3n:e2b | CG OFF | 43 |
| gemma3n:e4b | CG ON | 79 |
| gemma3n:e4b | CG OFF | 35 |

Thanks @mxyng for providing changes to the gemma3n model graph definition in c4de3ea that make the checks more robust.
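The c4de3ea change itself lives in ollama's Go model code; purely as an illustration at the ggml level (the dimension names and ordering below are assumptions, not the actual diff), the idea is to shape the projection so that the dimension a shape-based heuristic inspects is no longer tied to the token count, removing the need for a separate residual check:

```cpp
// Illustration only: two possible layouts for the same per-layer projection
// data. Dimension names and ordering are assumed for this example and are
// not taken from ollama's gemma3n implementation.
#include "ggml.h"

struct ggml_tensor * shape_per_layer_proj(struct ggml_context * ctx,
                                          struct ggml_tensor  * proj, // flat projection output
                                          int64_t n_embd_per_layer,
                                          int64_t n_layer,
                                          int64_t n_tokens) {
    // Layout A: [n_embd_per_layer, n_layer * n_tokens] leaves a multiple of
    // the token count in ne[1], the slot a generic batch-size heuristic looks
    // at, so the backend would need a special case to ignore this tensor.
    //
    // Layout B (returned here): [n_embd_per_layer, n_layer, n_tokens] keeps
    // ne[1] == n_layer, so the tensor no longer looks like a batch-sized
    // activation and no extra exclusion check is required.
    return ggml_reshape_3d(ctx, proj, n_embd_per_layer, n_layer, n_tokens);
}
```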

ORippler added 2 commits July 25, 2025 12:14
* Enable CUDA Graphs for gemma3n. Similar to ggml-org/llama.cpp#14741, though ollama has a slightly different model graph than llama.cpp, which requires different workaround checks.
* Remove residual check by reshaping differently in gemma3n model. This should make the heuristics more robust.
@mxyng (Contributor) left a comment:

this is awesome! thanks for sharing

@mxyng merged commit ea85e27 into ollama:main Jul 29, 2025
15 of 16 checks passed
gabe-l-hart added a commit to gabe-l-hart/ollama that referenced this pull request Jul 30, 2025
* origin/main:
Revert "CI: switch back to x86 macos builder" (ollama#11588)
mac: disable bf16 on unsupported OS versions (ollama#11585)
CI: switch back to x86 macos builder (ollama#11572)
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (ollama#11525)
kvcache: Don't shift empty batches
docs: fix typos and remove trailing whitespaces (ollama#11554)

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
rick-github pushed a commit to rick-github/ollama that referenced this pull request Aug 20, 2025
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (ollama#11525)

* Enable CUDA Graphs for gemma3n.

Similar to ggml-org/llama.cpp#14741, though ollama has a slightly different model graph than llama.cpp, which requires different workaround checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust
