Gemma4 text model support #1186

@ruilin-gif

Description

🚀 The feature, motivation and pitch

Add apply_liger_kernel_to_gemma4_text patching support for the Gemma 4 text model family.

Gemma 4 was released by Google DeepMind on April 2, 2026, as their most capable open model family to date. The series includes 31B (Dense), 26B-A4B (MoE), E4B, and E2B variants, all under the Apache 2.0 license. These models are seeing rapid adoption for fine-tuning and downstream tasks.

Gemma 4's text backbone (gemma4_text in HuggingFace transformers) introduces several architectural changes from Gemma 3:

  • Hybrid Sliding/Full Attention with GQA, alternating between sliding window (512 tokens) and full attention layers
  • GeGLU activation (gelu_pytorch_tanh) in the MLP, with conditional double-wide MLP for KV-shared layers
  • Standard RMSNorm: unlike Gemma 3, which uses (1 + weight) * normed (zero-initialized weights with offset), Gemma 4 uses the standard weight * normed (one-initialized weights, no offset)
  • Per-layer v_norm with with_scale=False (no learnable weight) in some attention heads
  • Large vocabulary (262,144 tokens), making FLCE particularly beneficial for memory savings
  • RoPE with dual configurations: theta=10000 for sliding attention, theta=1000000 with proportional scaling + partial rotary for full attention
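The RMSNorm difference described above can be illustrated with a minimal pure-Python sketch. It is not the transformers or Liger implementation; the helper names (`rms_normalize`, `rmsnorm`) and the `eps` value are assumptions chosen for illustration. It shows that the two conventions agree at initialization, and that applying the Gemma 3 offset to Gemma 4-style weights would double the scale.

```python
import math

def rms_normalize(x, eps=1e-6):
    """Divide x by its root-mean-square (eps is an assumed value)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]

def rmsnorm(x, weight, offset=0.0, eps=1e-6):
    """Scale the normalized input by (offset + weight), element-wise."""
    normed = rms_normalize(x, eps)
    return [(offset + w) * n for w, n in zip(weight, normed)]

x = [1.0, -2.0, 3.0, -4.0]

# Gemma 3 convention: zero-initialized weights, offset of 1.0.
gemma3_out = rmsnorm(x, weight=[0.0] * 4, offset=1.0)

# Gemma 4 convention: one-initialized weights, no offset.
gemma4_out = rmsnorm(x, weight=[1.0] * 4, offset=0.0)

# Applying the Gemma 3 offset to Gemma 4-style weights scales by 2.
mismatched = rmsnorm(x, weight=[1.0] * 4, offset=1.0)

print(gemma3_out == gemma4_out)
print([m / g for m, g in zip(mismatched, gemma4_out)])
```

The mismatched case is why the normalization kernel has to be configured per model family rather than shared between Gemma 3 and Gemma 4.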

Liger Kernel already supports the Gemma family:

  • Gemma — via apply_liger_kernel_to_gemma
  • Gemma2 — via apply_liger_kernel_to_gemma2
  • Gemma3 (Text) — via apply_liger_kernel_to_gemma3_text
  • Gemma3 (Multimodal) — via apply_liger_kernel_to_gemma3

Gemma 4 registers as a separate model type (gemma4_text) in HuggingFace transformers (v5.5.0+), so a dedicated patch function is needed for auto-patching to work (e.g., via AutoLigerKernelForCausalLM or the HF Trainer's use_liger_kernel=True).
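To make the auto-patching point concrete, here is a simplified sketch of dispatch keyed on the model type string. It is not Liger Kernel's actual code: `PATCH_REGISTRY`, `auto_patch`, and the two stub patch functions are hypothetical names used only to show why an unregistered model type is silently left unpatched.

```python
from typing import Callable, Dict, Optional

def patch_gemma3_text() -> str:
    # Stand-in for a real patch function; returns a label for illustration.
    return "patched gemma3_text"

def patch_gemma4_text() -> str:
    # Stand-in for the dedicated Gemma 4 text patch requested in this issue.
    return "patched gemma4_text"

# Hypothetical registry mapping a model type string to its patch function.
PATCH_REGISTRY: Dict[str, Callable[[], str]] = {
    "gemma3_text": patch_gemma3_text,
}

def auto_patch(model_type: str) -> Optional[str]:
    """Apply the registered patch, or return None if the type is unknown."""
    patch_fn = PATCH_REGISTRY.get(model_type)
    if patch_fn is None:
        return None
    return patch_fn()

# Without a registered entry, a gemma4_text model is left unpatched.
before = auto_patch("gemma4_text")

# Registering a dedicated function makes the lookup succeed.
PATCH_REGISTRY["gemma4_text"] = patch_gemma4_text
after = auto_patch("gemma4_text")

print(before, after)
```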

The following kernels are applicable:

| Kernel | Applicable? | Notes |
|---|---|---|
| RMSNorm | Yes | Standard RMSNorm (offset=0.0), not the Gemma 3 offset variant. 6-7 instances per layer. |
| GeGLU | Yes | Requires a custom wrapper (LigerGemma4TextMLP) due to the (config, layer_idx) init signature and conditional double-wide MLP logic. |
| CrossEntropyLoss / FLCE | Yes | Vocab size is 262,144, so FLCE memory savings are significant. |
| RoPE | No | Gemma 4 uses apply_rotary_pos_emb(x, cos, sin) (single-tensor, called separately for q and k), which is incompatible with Liger's fused (q, k, cos, sin) variant. |
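As a rough illustration of why the vocabulary size matters for FLCE, the arithmetic below estimates the size of a fully materialized logits tensor, which a fused linear cross entropy avoids holding in memory all at once. The token count (8,192 per step) and the element sizes are assumptions picked for the example, not measurements from any Gemma 4 run.

```python
VOCAB_SIZE = 262_144  # Gemma 4 vocabulary size, as stated in this issue

def logits_bytes(num_tokens: int, vocab_size: int = VOCAB_SIZE,
                 bytes_per_element: int = 4) -> int:
    """Bytes needed to hold a full (num_tokens, vocab_size) logits tensor."""
    return num_tokens * vocab_size * bytes_per_element

# Assumed workload: 8,192 tokens per step (e.g. batch 4 x sequence 2,048).
num_tokens = 8192

fp32_gib = logits_bytes(num_tokens, bytes_per_element=4) / 2**30
bf16_gib = logits_bytes(num_tokens, bytes_per_element=2) / 2**30

print(f"fp32 logits: {fp32_gib} GiB")  # 8.0 GiB
print(f"bf16 logits: {bf16_gib} GiB")  # 4.0 GiB
```

These figures cover only the logits themselves; the gradient with respect to the logits would add a comparable amount under the same assumptions.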

Models: gemma-4-31B · gemma-4-31B-it · gemma-4-26B-A4B · gemma-4-E4B · gemma-4-E2B

I have a working implementation and would be happy to open a PR.

Alternatives

The Gemma 3 text patch (apply_liger_kernel_to_gemma3_text) cannot be reused directly because:

  1. RMSNorm configuration differs (Gemma 3 uses offset=1.0, init_fn="zeros" while Gemma 4 uses offset=0.0, init_fn="ones")
  2. The MLP class has a different __init__ signature ((config, layer_idx) with conditional double-wide logic)
  3. The model type string is different (gemma4_text vs gemma3_text), so auto-patching would not trigger
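The signature mismatch in point 2 can be sketched with toy classes. Nothing here is the transformers or Liger code: `ToyConfig`, its field names (`num_kv_shared_layers`, `use_double_wide_mlp`), and the rule that the last layers are the KV-shared ones are all assumptions made for illustration. Only the (config, layer_idx) signature and the existence of a conditional double-wide MLP come from the description above.

```python
from dataclasses import dataclass

@dataclass
class ToyConfig:
    # All field names and values below are hypothetical.
    intermediate_size: int = 1024
    num_hidden_layers: int = 8
    num_kv_shared_layers: int = 2
    use_double_wide_mlp: bool = True

class Gemma3StyleMLP:
    """Toy MLP whose constructor takes only the config."""
    def __init__(self, config):
        self.intermediate_size = config.intermediate_size

class Gemma4StyleMLP:
    """Toy MLP whose constructor also needs the layer index."""
    def __init__(self, config, layer_idx):
        # Assumption: the last num_kv_shared_layers layers are KV-shared.
        first_shared = config.num_hidden_layers - config.num_kv_shared_layers
        is_kv_shared = layer_idx >= first_shared
        widen = config.use_double_wide_mlp and is_kv_shared
        self.intermediate_size = config.intermediate_size * (2 if widen else 1)

def build_mlps(mlp_cls, config):
    """Construct one MLP per layer the way a Gemma 4-style decoder would."""
    return [mlp_cls(config, layer_idx)
            for layer_idx in range(config.num_hidden_layers)]

config = ToyConfig()
widths = [m.intermediate_size for m in build_mlps(Gemma4StyleMLP, config)]

# A class with the one-argument constructor cannot be dropped in as-is.
try:
    build_mlps(Gemma3StyleMLP, config)
    gemma3_drop_in_ok = True
except TypeError:
    gemma3_drop_in_ok = False

print(widths, gemma3_drop_in_ok)
```

A replacement MLP therefore has to accept the layer index and reproduce the width decision, which is the role the proposed LigerGemma4TextMLP wrapper would play.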

Additional context

This issue covers the text-only model (Gemma4ForCausalLM). Multimodal support (Gemma4ForConditionalGeneration) with vision/audio encoders could be addressed in a follow-up issue.
