
Conversation

@ikawrakow
Owner

When performing the RoPE operation one needs to compute the rotation angles for each token in the batch and for each Q and K attention head. When there are no per-layer frequency factors, this computation depends only on the token position and the index within the attention head, so it is exactly the same (per token) for every layer and every head. Hence, one could simply compute the cosine and sine of the rotation angles once per graph and then reuse the result for all layers.
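
To make the idea concrete, here is a minimal sketch of the caching step, assuming the standard NEOX-style angle formula `pos * freq_base^(-2i/n_rot)` with no frequency factors. The names (`RopeAngleCache`, `build_rope_cache`) are hypothetical and are not the PR's actual API; this is only an illustration of why the cache can be shared across layers.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Cos/sin of the rotation angles for the current batch, computed once per graph.
// The values depend only on the token positions and the index within the head,
// so every layer and every Q/K head can reuse the same cache.
struct RopeAngleCache {
    int n_rot;                  // number of rotated dimensions per head
    std::vector<float> cos_v;   // [n_tokens * n_rot/2]
    std::vector<float> sin_v;   // [n_tokens * n_rot/2]
};

RopeAngleCache build_rope_cache(const std::vector<int32_t>& positions,
                                int n_rot, float freq_base = 10000.0f) {
    RopeAngleCache cache;
    cache.n_rot = n_rot;
    const int half = n_rot / 2;
    cache.cos_v.resize(positions.size() * half);
    cache.sin_v.resize(positions.size() * half);
    for (size_t t = 0; t < positions.size(); ++t) {
        for (int i = 0; i < half; ++i) {
            // angle = pos * freq_base^(-2i/n_rot); no per-layer frequency factors
            const float inv_freq = std::pow(freq_base, -2.0f * i / n_rot);
            const float angle    = positions[t] * inv_freq;
            cache.cos_v[t*half + i] = std::cos(angle);
            cache.sin_v[t*half + i] = std::sin(angle);
        }
    }
    return cache;
}
```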

This PR implements this idea for a subset of the supported models (Qwen3, Qwen3-MoE, Ling/Ring, GPT-OSS, GLM-4.5-MoE).

We observe small but noticeable performance gains for PP (prompt processing) and TG (token generation).

The PR needs a bit more work to add a command-line argument to enable/disable this feature, as the implementation currently covers only the CUDA and CPU back-ends. Also, for now only the NEOX and NORM RoPE variants are implemented, so the vision-related RoPE variants still need to be added.

Still, putting it out there for testing.

It will also be interesting to see how long it will take until this optimization is fully independently discovered in mainline llama.cpp /s

Iwan Kawrakow added 9 commits November 1, 2025 15:58
When computing RoPE, the rotation angles in each layer
are exactly the same and depend only on the token positions
(and other constant, model-dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse them for the Q and K RoPE in each layer?

This commit does that as a POC on the CPU and uses it in
the Qwen3-MoE compute graph.
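
As a hedged continuation of the sketch above (again with hypothetical names, not the PR's actual code), this is what the per-layer reuse could look like for the NEOX layout, where element i is paired with element i + n_rot/2: every layer and every head applies the same cached cos/sin values, so only the multiply-adds remain per layer.

```cpp
// Apply NEOX-style RoPE to one head of Q (or K) for one token,
// reusing the cos/sin values cached once per graph.
void apply_rope_neox(float* head,                  // [n_rot] values of one head, one token
                     const RopeAngleCache& cache,
                     size_t token_index) {
    const int half = cache.n_rot / 2;
    const float* c = cache.cos_v.data() + token_index * half;
    const float* s = cache.sin_v.data() + token_index * half;
    for (int i = 0; i < half; ++i) {
        const float x0 = head[i];
        const float x1 = head[i + half];
        head[i]        = x0 * c[i] - x1 * s[i];
        head[i + half] = x0 * s[i] + x1 * c[i];
    }
}
```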