When performing the RoPE operation one needs to compute the rotation angles for each token in the batch and each `Q` and `K` attention head. When there are no per-layer frequency factors, this computation depends only on the token position and the index within the attention head, so it is exactly the same (per token) for every layer and every head. Hence, one can simply compute the cosine and sine of the rotation angles once per graph and then reuse the result for all layers. This PR implements this idea for a subset of the supported models (Qwen3, Qwen3-MoE, Ling/Ring, GPT-OSS, GLM-4.5-MoE).
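To illustrate the idea (this is a minimal standalone sketch, not the PR's actual code; the names `rope_cache_t`, `build_rope_cache`, and `apply_rope_neox` are hypothetical): the angles `theta_i = pos * freq_base^(-2*i/n_dims)` depend only on the token position and the dimension index, so their cosines and sines can be built once and shared by every layer and head.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical cache holding cos/sin of the RoPE angles for one graph/batch.
struct rope_cache_t {
    std::vector<float> cos_theta; // [n_tokens * n_dims/2]
    std::vector<float> sin_theta; // [n_tokens * n_dims/2]
};

// Build the cache once per graph. Without per-layer frequency factors the
// angles are identical for all layers and all Q/K heads.
static rope_cache_t build_rope_cache(const std::vector<int32_t> & positions,
                                     int n_dims, float freq_base) {
    rope_cache_t cache;
    const int half = n_dims / 2;
    cache.cos_theta.resize(positions.size() * half);
    cache.sin_theta.resize(positions.size() * half);
    for (size_t t = 0; t < positions.size(); ++t) {
        for (int i = 0; i < half; ++i) {
            const float theta = positions[t] * std::pow(freq_base, -2.0f * i / n_dims);
            cache.cos_theta[t * half + i] = std::cos(theta);
            cache.sin_theta[t * half + i] = std::sin(theta);
        }
    }
    return cache;
}

// Each layer's NEOX-style RoPE then only applies the cached values,
// rotating the pairs (i, i + n_dims/2), instead of redoing the trigonometry.
static void apply_rope_neox(float * head, int n_dims, size_t token,
                            const rope_cache_t & cache) {
    const int half = n_dims / 2;
    for (int i = 0; i < half; ++i) {
        const float c  = cache.cos_theta[token * half + i];
        const float s  = cache.sin_theta[token * half + i];
        const float x0 = head[i];
        const float x1 = head[i + half];
        head[i]        = x0 * c - x1 * s;
        head[i + half] = x0 * s + x1 * c;
    }
}
```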
We observe small but noticeable performance gains for PP (prompt processing) and TG (token generation).
The PR needs a bit more work: a command-line argument to enable/disable the feature should be added, as the implementation only covers the CUDA and CPU back-ends. Also, for now the implementation handles only the `NEOX` and `NORM` RoPE variants, so the vision-related RoPE variants still need to be implemented. Still, putting it out there for testing.
It will also be interesting to see how long it will take until this optimization is fully independently discovered in mainline llama.cpp /s