Gemma4 text model support

### 🚀 The feature, motivation and pitch

Add `apply_liger_kernel_to_gemma4_text` patching support for the Gemma 4 text model family.

[Gemma 4](https://deepmind.google/models/gemma/gemma-4/) was released by Google DeepMind on April 2, 2026, as its most capable open model family to date. The series includes 31B (Dense), 26B-A4B (MoE), E4B, and E2B variants, all under the Apache 2.0 license. These models are seeing rapid adoption for fine-tuning and downstream tasks.

Gemma 4's text backbone (`gemma4_text` in HuggingFace transformers) introduces several architectural changes from Gemma 3:

- **Hybrid Sliding/Full Attention** with GQA, alternating between sliding window (512 tokens) and full attention layers
- **GeGLU activation** (`gelu_pytorch_tanh`) in the MLP, with conditional double-wide MLP for KV-shared layers
- **Standard RMSNorm**: unlike Gemma 3, which uses `(1 + weight) * normed` (zero-initialized weights with an offset), Gemma 4 uses the standard `weight * normed` (one-initialized weights, no offset)
- **Per-layer v_norm** with `with_scale=False` (no learnable weight) in some attention heads
- **Large vocabulary** (262,144 tokens), making FLCE particularly beneficial for memory savings
- **RoPE with dual configurations**: `theta=10000` for sliding attention, `theta=1000000` with proportional scaling + partial rotary for full attention
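
The RMSNorm difference in the list above can be illustrated with a small, dependency-free sketch. It is plain Python rather than the actual Liger or transformers code; the `rms_norm` helper, its `offset` parameter name, and the `eps` value are assumptions made for illustration only.

```python
import math


def rms_norm(x, weight, offset=0.0, eps=1e-6):
    """RMSNorm over a 1-D list: (offset + weight) * x / sqrt(mean(x^2) + eps).

    Hypothetical helper; eps=1e-6 is an assumed value.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [(offset + w) * v * inv_rms for v, w in zip(x, weight)]


x = [1.0, -2.0, 3.0, 0.5]

# Gemma 3 style: zero-initialized weight combined with an offset of 1.0.
gemma3_out = rms_norm(x, weight=[0.0] * len(x), offset=1.0)

# Gemma 4 style (as described above): one-initialized weight, no offset.
gemma4_out = rms_norm(x, weight=[1.0] * len(x), offset=0.0)

# At initialization both conventions produce the same output.
same_at_init = all(math.isclose(a, b) for a, b in zip(gemma3_out, gemma4_out))

# Applying the Gemma 3 offset to Gemma 4 style weights doubles the scale,
# which is why the Gemma 3 RMSNorm settings cannot simply be reused.
wrong_out = rms_norm(x, weight=[1.0] * len(x), offset=1.0)
```

The two conventions agree only at initialization; with trained weights, using the wrong offset silently rescales every normalized activation.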

Liger Kernel already supports the Gemma family:
- `Gemma` — via `apply_liger_kernel_to_gemma`
- `Gemma2` — via `apply_liger_kernel_to_gemma2`
- `Gemma3 (Text)` — via `apply_liger_kernel_to_gemma3_text`
- `Gemma3 (Multimodal)` — via `apply_liger_kernel_to_gemma3`

Gemma 4 registers as a separate model type (`gemma4_text`) in HuggingFace transformers (v5.5.0+), so a dedicated patch function is needed for auto-patching to work (e.g., via `AutoLigerKernelForCausalLM` or the HF Trainer's `use_liger_kernel=True`).
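
As a rough sketch of why the model type string matters, auto-patching can be pictured as a lookup from `model_type` to a patch function. The registry name `PATCH_FNS`, the `auto_patch` helper, and the stand-in function bodies below are hypothetical and do not reproduce Liger Kernel's actual internals.

```python
def apply_liger_kernel_to_gemma3_text(**kwargs):
    """Stand-in for the existing Gemma 3 text patch (body is a placeholder)."""
    return "patched gemma3_text"


def apply_liger_kernel_to_gemma4_text(**kwargs):
    """Stand-in for the proposed Gemma 4 text patch (body is a placeholder)."""
    return "patched gemma4_text"


# Hypothetical registry keyed by the model type string from the model config.
PATCH_FNS = {"gemma3_text": apply_liger_kernel_to_gemma3_text}


def auto_patch(model_type, registry):
    """Look up and run the patch for a model type; None means no patch ran."""
    fn = registry.get(model_type)
    if fn is None:
        return None  # No entry registered: the model would run unpatched.
    return fn()


# Without a dedicated entry, a gemma4_text model is not patched.
before = auto_patch("gemma4_text", PATCH_FNS)

# Registering the new function under the new model type enables the lookup.
PATCH_FNS["gemma4_text"] = apply_liger_kernel_to_gemma4_text
after = auto_patch("gemma4_text", PATCH_FNS)
```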

The following kernels are applicable:

| Kernel | Applicable? | Notes |
|--------|:-----------:|-------|
| **RMSNorm** | ✅ | Standard RMSNorm (`offset=0.0`), not the Gemma 3 offset variant. 6-7 instances per layer. |
| **GeGLU** | ✅ | Requires a custom wrapper (`LigerGemma4TextMLP`) due to `(config, layer_idx)` init signature and conditional double-wide MLP logic. |
| **CrossEntropyLoss / FLCE** | ✅ | Vocab size is 262,144 — FLCE memory savings are significant. |
| **RoPE** | ❌ | Gemma 4 uses `apply_rotary_pos_emb(x, cos, sin)` (single-tensor, called separately for q and k), incompatible with Liger's fused `(q, k, cos, sin)` variant. |
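
To give a sense of scale for the FLCE row, the back-of-the-envelope calculation below estimates the size of a fully materialized logits tensor, which is the allocation a fused linear cross-entropy is meant to avoid. The token count of 8,192 (for example, batch 4 × sequence length 2,048) is an assumed workload, not a figure from this issue.

```python
VOCAB_SIZE = 262_144  # vocabulary size quoted in the table above
GIB = 1024 ** 3


def logits_bytes(num_tokens, vocab_size=VOCAB_SIZE, bytes_per_element=4):
    """Bytes needed to hold a full [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_element


num_tokens = 8192  # assumption: e.g. batch 4 x sequence length 2048

fp32_gib = logits_bytes(num_tokens, bytes_per_element=4) / GIB
bf16_gib = logits_bytes(num_tokens, bytes_per_element=2) / GIB

print(f"fp32 logits: {fp32_gib:.1f} GiB")  # prints "fp32 logits: 8.0 GiB"
print(f"bf16 logits: {bf16_gib:.1f} GiB")  # prints "bf16 logits: 4.0 GiB"
```

These numbers cover the forward logits only; gradients with respect to the logits would add a tensor of the same shape.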

Models: [gemma-4-31B](https://huggingface.co/google/gemma-4-31B) · [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) · [gemma-4-26B-A4B](https://huggingface.co/google/gemma-4-26B-A4B) · [gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B) · [gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)

I have a working implementation and would be happy to open a PR.

### Alternatives

The Gemma 3 text patch (`apply_liger_kernel_to_gemma3_text`) cannot be reused directly because:
1. RMSNorm configuration differs (Gemma 3 uses `offset=1.0, init_fn="zeros"` while Gemma 4 uses `offset=0.0, init_fn="ones"`)
2. The MLP class has a different `__init__` signature (`(config, layer_idx)` with conditional double-wide logic)
3. The model type string is different (`gemma4_text` vs `gemma3_text`), so auto-patching would not trigger
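
Point 2 can be sketched as follows. This is a plain-Python illustration of a `(config, layer_idx)`-dependent MLP width together with the tanh-approximated GeGLU gate; every config field name here (`num_kv_shared_layers`, `use_double_wide_mlp`, and so on) and the assumption that the KV-shared layers are the last ones in the stack are hypothetical and should be checked against the transformers source.

```python
import math
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # All field names and values are hypothetical, not the real Gemma 4 config.
    intermediate_size: int = 8
    num_hidden_layers: int = 6
    num_kv_shared_layers: int = 2
    use_double_wide_mlp: bool = True


def mlp_width(config, layer_idx):
    """Pick the MLP width for a layer; KV-shared layers get a double-wide MLP.

    Assumes (hypothetically) that the KV-shared layers are the final ones.
    """
    first_shared = config.num_hidden_layers - config.num_kv_shared_layers
    is_kv_shared = layer_idx >= first_shared
    if config.use_double_wide_mlp and is_kv_shared:
        return 2 * config.intermediate_size
    return config.intermediate_size


def gelu_tanh(x):
    """Tanh approximation of GELU (the `gelu_pytorch_tanh` activation)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def geglu(gate, up):
    """Elementwise GeGLU gating: gelu_tanh(gate) * up."""
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


config = SketchConfig()
widths = [mlp_width(config, i) for i in range(config.num_hidden_layers)]
```

Because the width depends on `layer_idx`, a drop-in replacement class that only accepts `config` cannot reproduce it, which is the motivation for a dedicated wrapper such as `LigerGemma4TextMLP`.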

### Additional context

This issue covers the text-only model (`Gemma4ForCausalLM`). Multimodal support (`Gemma4ForConditionalGeneration`) with vision/audio encoders could be addressed in a follow-up issue.

Gemma4 text model support #1186
