🚀 The feature, motivation and pitch
Add apply_liger_kernel_to_gemma4_text patching support for the Gemma 4 text model family.
Gemma 4 was released by Google DeepMind on April 2, 2026, as their most capable open model family to date. The series includes 31B (Dense), 26B-A4B (MoE), E4B, and E2B variants, all under the Apache 2.0 license. These models are seeing rapid adoption for fine-tuning and downstream tasks.
Gemma 4's text backbone (gemma4_text in HuggingFace transformers) introduces several architectural changes from Gemma 3:
- Hybrid Sliding/Full Attention with GQA, alternating between sliding window (512 tokens) and full attention layers
- GeGLU activation (gelu_pytorch_tanh) in the MLP, with a conditional double-wide MLP for KV-shared layers
- Standard RMSNorm: unlike Gemma 3, which uses (1 + weight) * normed (zero-initialized weights with offset), Gemma 4 uses the standard weight * normed (one-initialized weights, no offset)
- Per-layer v_norm with with_scale=False (no learnable weight) in some attention heads
- Large vocabulary (262,144 tokens), making FLCE particularly beneficial for memory savings
- RoPE with dual configurations: theta=10000 for sliding attention, theta=1000000 with proportional scaling + partial rotary for full attention
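To make the RMSNorm difference concrete, here is a minimal pure-Python sketch of the two formulas described above. It is not the transformers or Liger implementation; the function names, the eps value of 1e-6, and the sample input are assumptions chosen for illustration.

```python
import math

def rms_normalize(x, eps=1e-6):
    # Divide by the root mean square of the vector (eps value is an assumption).
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

def rmsnorm(x, weight, offset, eps=1e-6):
    # offset=1.0 gives the Gemma 3 style (1 + weight) * normed,
    # offset=0.0 gives the standard weight * normed used by Gemma 4.
    normed = rms_normalize(x, eps)
    return [(offset + w) * n for w, n in zip(weight, normed)]

x = [1.0, -2.0, 3.0, 0.5]

# At initialization the two conventions produce the same output.
gemma3_style = rmsnorm(x, weight=[0.0] * 4, offset=1.0)  # zero-init + offset
gemma4_style = rmsnorm(x, weight=[1.0] * 4, offset=0.0)  # one-init, no offset

# Feeding one-centered weights into the offset variant doubles the scale,
# which is why the offset setting has to match the checkpoint convention.
mismatched = rmsnorm(x, weight=[1.0] * 4, offset=1.0)

print(gemma3_style == gemma4_style)
print([round(m / g, 6) for m, g in zip(mismatched, gemma4_style)])
```

The two conventions agree only when the stored weights follow the matching initialization, so the offset setting cannot be carried over from the Gemma 3 patch unchanged.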
Liger Kernel already supports the Gemma family:
- Gemma, via apply_liger_kernel_to_gemma
- Gemma2, via apply_liger_kernel_to_gemma2
- Gemma3 (Text), via apply_liger_kernel_to_gemma3_text
- Gemma3 (Multimodal), via apply_liger_kernel_to_gemma3
Gemma 4 registers as a separate model type (gemma4_text) in HuggingFace transformers (v5.5.0+), so a dedicated patch function is needed for auto-patching to work (e.g., via AutoLigerKernelForCausalLM or the HF Trainer's use_liger_kernel=True).
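The dependency on the model type string can be sketched as a lookup table. This is a toy illustration that assumes auto-patching resolves a patch function from the config's model type; the registry, the auto_patch helper, and the stub functions are hypothetical names, not Liger Kernel's actual internals.

```python
patched = []  # records which stub patch functions ran

def stub_patch_gemma3_text():
    # Stand-in for an existing patch function (hypothetical stub).
    patched.append("gemma3_text")

def stub_patch_gemma4_text():
    # Stand-in for the proposed Gemma 4 text patch (hypothetical stub).
    patched.append("gemma4_text")

# Hypothetical registry keyed by the model type string from the config.
registry = {"gemma3_text": stub_patch_gemma3_text}

def auto_patch(model_type, registry):
    # Returns True if a patch function was found and applied.
    patch_fn = registry.get(model_type)
    if patch_fn is None:
        return False  # unknown model type: nothing is patched
    patch_fn()
    return True

before = auto_patch("gemma4_text", registry)  # no entry yet
registry["gemma4_text"] = stub_patch_gemma4_text
after = auto_patch("gemma4_text", registry)   # entry now present

print(before, after, patched)
```

Under this assumption, a gemma4_text model silently gets no kernels until a dedicated entry exists, even though the Gemma 3 entry is registered.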
The following kernels are applicable:
| Kernel | Applicable? | Notes |
| --- | --- | --- |
| RMSNorm | ✅ | Standard RMSNorm (offset=0.0), not the Gemma 3 offset variant. 6-7 instances per layer. |
| GeGLU | ✅ | Requires a custom wrapper (LigerGemma4TextMLP) due to (config, layer_idx) init signature and conditional double-wide MLP logic. |
| CrossEntropyLoss / FLCE | ✅ | Vocab size is 262,144, so FLCE memory savings are significant. |
| RoPE | ❌ | Gemma 4 uses apply_rotary_pos_emb(x, cos, sin) (single-tensor, called separately for q and k), incompatible with Liger's fused (q, k, cos, sin) variant. |
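The RoPE row comes down to a call-signature mismatch rather than different math. The sketch below uses toy pure-Python functions (rope_single, rope_fused, and cos_sin are illustrative names, not the transformers or Liger functions) to show that a fused (q, k, cos, sin) function cannot serve as a drop-in replacement at a call site that passes (x, cos, sin).

```python
import math

def rotate_half(x):
    # Standard half rotation: (x1, x2) -> (-x2, x1).
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

def rope_single(x, cos, sin):
    # Single-tensor style: rotates one vector per call.
    rotated = rotate_half(x)
    return [xv * c + rv * s for xv, rv, c, s in zip(x, rotated, cos, sin)]

def rope_fused(q, k, cos, sin):
    # Fused style: rotates q and k in one call.
    return rope_single(q, cos, sin), rope_single(k, cos, sin)

def cos_sin(position, dim, theta=10000.0):
    # Per-pair angles, duplicated across both halves of the vector.
    angles = [position * theta ** (-2.0 * i / dim) for i in range(dim // 2)]
    angles = angles + angles
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 0.25, 2.0]
cos, sin = cos_sin(position=3, dim=4)

separate = (rope_single(q, cos, sin), rope_single(k, cos, sin))
fused = rope_fused(q, k, cos, sin)

# A call site written for the single-tensor signature breaks on the fused one.
try:
    rope_fused(q, cos, sin)
    drop_in_ok = True
except TypeError:
    drop_in_ok = False

print(separate == fused, drop_in_ok)
```

The results match numerically, so supporting RoPE would mean adapting the call site or the kernel interface, which this issue leaves out of scope.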
Models: gemma-4-31B · gemma-4-31B-it · gemma-4-26B-A4B · gemma-4-E4B · gemma-4-E2B
I have a working implementation and would be happy to open a PR.
Alternatives
The Gemma 3 text patch (apply_liger_kernel_to_gemma3_text) cannot be reused directly because:
- RMSNorm configuration differs (Gemma 3 uses offset=1.0, init_fn="zeros" while Gemma 4 uses offset=0.0, init_fn="ones")
- The MLP class has a different __init__ signature ((config, layer_idx) with conditional double-wide logic)
- The model type string is different (gemma4_text vs gemma3_text), so auto-patching would not trigger
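The __init__ signature point can be illustrated with a toy example. Every name below (ToyConfig, wide_layers, ConfigOnlyMLP, LayerAwareMLP, build_layer) is hypothetical; in particular, the real model decides which layers are double-wide from its KV-sharing configuration, which is replaced here by an explicit set of layer indices.

```python
from dataclasses import dataclass

@dataclass
class ToyConfig:
    intermediate_size: int
    wide_layers: frozenset  # hypothetical: indices of layers with a 2x-wide MLP

class ConfigOnlyMLP:
    # Replacement class whose constructor only accepts (config).
    def __init__(self, config):
        self.intermediate_size = config.intermediate_size

class LayerAwareMLP:
    # Wrapper-style class matching the (config, layer_idx) signature.
    def __init__(self, config, layer_idx):
        factor = 2 if layer_idx in config.wide_layers else 1
        self.intermediate_size = config.intermediate_size * factor

def build_layer(mlp_cls, config, layer_idx):
    # Stand-in for a decoder layer that constructs its MLP with both arguments.
    return mlp_cls(config, layer_idx)

config = ToyConfig(intermediate_size=1024, wide_layers=frozenset({2}))

widths = [build_layer(LayerAwareMLP, config, i).intermediate_size for i in range(3)]

# Swapping in a (config)-only class fails as soon as the layer is constructed.
try:
    build_layer(ConfigOnlyMLP, config, 0)
    reusable = True
except TypeError:
    reusable = False

print(widths, reusable)
```

A wrapper that accepts layer_idx and applies the width rule itself avoids both the constructor error and a silently wrong intermediate size.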
Additional context
This issue covers the text-only model (Gemma4ForCausalLM). Multimodal support (Gemma4ForConditionalGeneration) with vision/audio encoders could be addressed in a follow-up issue.