
Align gemma3n cache sharing to gemma4 #45489

Merged
Cyrilvallez merged 6 commits into main from gemma3n on Apr 22, 2026
Conversation

@Cyrilvallez (Member)

What does this PR do?

As per the title: this brings the changes from #45312 and #45336 over to gemma3n.
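
For context, a rough sketch of the kind of per-layer KV sharing this touches, inferred from the `shared_kv_states: dict[int, tuple[torch.Tensor, torch.Tensor]]` parameter discussed in the review below; the names and control flow here are illustrative assumptions, not the actual implementation:

```python
import torch

def run_layers(hidden_states: torch.Tensor, num_layers: int, first_shared_layer: int):
    # Early layers compute and publish their KV states; later layers reuse a
    # previously computed pair instead of recomputing their own.
    shared_kv_states: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
    for layer_idx in range(num_layers):
        if layer_idx < first_shared_layer:
            # Real code would apply k/v projections here; identity keeps it short.
            key_states, value_states = hidden_states, hidden_states
            shared_kv_states[layer_idx] = (key_states, value_states)
        else:
            # Reuse the KV states of a designated earlier layer.
            key_states, value_states = shared_kv_states[first_shared_layer - 1]
        # ... attention with (key_states, value_states) would go here ...
    return shared_kv_states
```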

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu (Contributor) left a comment


Just a few nits, can we inherit from Gemma3n directly then? Looks 1:1 to me



```python
class Gemma3nTextAttention(Gemma3Attention):
    @use_kernelized_func(apply_rotary_pos_emb)
```
Contributor

Looks like it doesn't follow the normal function signature. Have you checked that it actually works when you pass `use_kernels=True`?

Member Author

What do you mean? It's how we do it everywhere, and the modeling did not change.

Contributor

The `apply_rotary_pos_emb` that gets exchanged is not equivalent to the kernels version:

```python
def apply_rotary_pos_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, unsqueeze_dim: int = 1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        x (`torch.Tensor`): The tensor to embed.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    return (x * cos) + (rotate_half(x) * sin)
```

This applies RoPE on one tensor only (with the expected chunk split).

However, the kernels version uses

1. `repo_id="kernels-community/rotary"`, `func_name="apply_rotary_transformers"`
2. https://huggingface.co/kernels-community/rotary/blob/main/build/torch210-cxx11-cu128-aarch64-linux/__init__.py#L19-L49

which expects both the q and k tensors at once (and in chunked implementation), so it is not the same as the gemma4/3n version here. This would need to be fixed in the upstream kernels to allow

1. single tensors
2. an interleaved layout

I am not sure which models it affects, but when I saw this, I was pretty sure it was wrong.
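
To make the mismatch concrete, here is a minimal sketch of the eager single-tensor path quoted above (the toy shapes and inputs are illustrative assumptions, not the transformers code):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Chunked split: negate-and-swap the two halves of the last dimension.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(x, cos, sin, unsqueeze_dim=1):
    # Single-tensor eager path: called once for q, then once more for k.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    return (x * cos) + (rotate_half(x) * sin)

# Toy shapes: [batch, heads, seq, head_dim] for q/k, [batch, seq, head_dim] for cos/sin.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 64, 2).float() / 64))
angles = torch.outer(torch.arange(16).float(), inv_freq)     # [seq, head_dim // 2]
cos = torch.cat((angles.cos(), angles.cos()), dim=-1)[None]  # [1, seq, head_dim]
sin = torch.cat((angles.sin(), angles.sin()), dim=-1)[None]

q_rot = apply_rotary_pos_emb(q, cos, sin)  # one tensor per call here...
k_rot = apply_rotary_pos_emb(k, cos, sin)  # ...whereas the kernel takes q and k together
```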

Member Author

Ohhhh I see - this is not from this PR, but it would indeed need to be fixed ASAP! Opening another PR to fix it!

Comment thread: src/transformers/models/gemma3n/modular_gemma3n.py

```python
attention_mask: torch.Tensor | None,
shared_kv_states: dict[int, tuple[torch.Tensor, torch.Tensor]],
past_key_values: Cache | None = None,
**kwargs: Unpack[TransformersKwargs],
```
Contributor

Suggested change:

```diff
-    shared_kv_states: dict[int, tuple[torch.Tensor, torch.Tensor]],
     past_key_values: Cache | None = None,
+    shared_kv_states: dict[int, tuple[torch.Tensor, torch.Tensor]],
     **kwargs: Unpack[TransformersKwargs],
```

Just for BC, we cannot change the order.

Member Author

It's just to make crystal clear that it's a required input. Since it's an internal module, this is fine IMO (and maybe even better to break, as forgetting it would result in a silently wrong model).

Contributor

Then let's add at least a 🚨 to the title.

Imo, I don't see why we can't default to None: if it weren't passed when it was needed, we would encounter a runtime error anyway, no?

Member Author

Yes, indeed, we could default it to None independently and let it crash if not passed... Let me change it then, if you think that's best!
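
For context, a toy sketch of the trade-off being discussed (hypothetical simplified signatures, not the actual modeling code):

```python
import torch

KVPair = tuple[torch.Tensor, torch.Tensor]

def forward_required(hidden_states, attention_mask,
                     shared_kv_states: dict[int, KVPair],
                     past_key_values=None):
    # Required argument: forgetting it raises a TypeError at call time, but it
    # now sits before past_key_values, changing the positional order (BC break).
    ...

def forward_defaulted(hidden_states, attention_mask, past_key_values=None,
                      shared_kv_states: dict[int, KVPair] | None = None):
    # None default keeps the old positional order; a caller that forgets it
    # still crashes at first use ('NoneType' is not subscriptable) rather than
    # silently producing a wrong model.
    key_states, value_states = shared_kv_states[0]
    ...
```

Either way the failure is loud; the difference is whether it happens at call time or at first use, and whether the positional order seen by existing callers survives.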

Comment thread: src/transformers/models/gemma3n/modular_gemma3n.py
Comment thread: src/transformers/models/gemma4/modular_gemma4.py
@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma3n, gemma4

Cyrilvallez merged commit 08244b9 into main on Apr 22, 2026. 29 checks passed.
Cyrilvallez deleted the gemma3n branch on April 22, 2026 at 03:35.