
Qwen3.5: DeepSpeed ZeRO-3 fails to load weights for language_model #45313

@debOliveira

Description

System Info

  • transformers version: 5.4.0
  • Platform: Linux (H200 x4)
  • Python version: 3.12.0
  • DeepSpeed version: 0.18.5
  • PyTorch version: 2.8.0+cu128 (CUDA)

Problem

When loading Qwen/Qwen3.5-27B (also tested with the 9B variant) under DeepSpeed ZeRO-3, all language_model parameters are reported as MISSING in the load report.

Key                                                                  | Status  | Details
---------------------------------------------------------------------+---------+--------
model.language_model.layers.{0...63}.post_attention_layernorm.weight | MISSING |        
model.language_model.layers.{0...62}.linear_attn.out_proj.weight     | MISSING |        
model.language_model.norm.weight                                     | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_qkv.weight  | MISSING |        
model.language_model.layers.{0...62}.linear_attn.conv1d.weight       | MISSING |        
model.language_model.layers.{3...63}.self_attn.q_norm.weight         | MISSING |        
model.language_model.layers.{0...62}.linear_attn.dt_bias             | MISSING |        
model.language_model.layers.{0...62}.linear_attn.A_log               | MISSING |        
model.language_model.layers.{0...62}.linear_attn.norm.weight         | MISSING |        
model.language_model.layers.{0...63}.mlp.gate_proj.weight            | MISSING |        
model.language_model.layers.{0...63}.mlp.down_proj.weight            | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_a.weight    | MISSING |        
model.language_model.layers.{3...63}.self_attn.q_proj.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.k_proj.weight         | MISSING |        
model.language_model.layers.{0...63}.input_layernorm.weight          | MISSING |        
model.language_model.layers.{0...63}.mlp.up_proj.weight              | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_b.weight    | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_z.weight    | MISSING |        
model.language_model.layers.{3...63}.self_attn.k_norm.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.o_proj.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.v_proj.weight         | MISSING |        
model.language_model.embed_tokens.weight                             | MISSING | 

Cause hypothesis

In conversion_mapping.py, weight keys under the model.language_model prefix are remapped to the bare model prefix. This conversion runs inside _load_pretrained_model when DeepSpeed ZeRO-3 is enabled.

"qwen3_5_text": [
WeightRenaming(source_patterns=r"^model.language_model", target_patterns="model"),
],

The problem disappears when setting target_patterns="model.language_model" or using ZeRO-2.
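A minimal sketch of the suspected collision, using plain re.sub to stand in for the WeightRenaming helper (the real implementation in conversion_mapping.py may differ): collapsing the prefix produces keys that no longer match the checkpoint's model.language_model.* entries, while the suggested target leaves them intact.

```python
import re

# Illustrative stand-in for WeightRenaming; pattern and targets are
# taken from the issue, the helper itself is a sketch.
def rename_key(key: str, source_pattern: str, target: str) -> str:
    return re.sub(source_pattern, target, key)

checkpoint_key = "model.language_model.layers.0.mlp.up_proj.weight"

# Current mapping collapses the prefix, so the key no longer matches:
print(rename_key(checkpoint_key, r"^model.language_model", "model"))
# → model.layers.0.mlp.up_proj.weight

# Suggested fix maps the prefix onto itself, preserving the key:
print(rename_key(checkpoint_key, r"^model.language_model", "model.language_model"))
# → model.language_model.layers.0.mlp.up_proj.weight
```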

Reproduction

from transformers import Qwen3_5ForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

ds_cfg = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
    },
}
# Keep a reference to the HfDeepSpeedConfig object alive so that
# from_pretrained detects ZeRO-3 and partitions weights at load time.
dschf = HfDeepSpeedConfig(ds_cfg)
model = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-27B", torch_dtype="auto")

Expected Behavior

Model weights should load correctly with DeepSpeed ZeRO-3.
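As noted above, the problem does not reproduce under ZeRO-2. A sketch of the interim workaround config, assuming the same training settings as the reproduction above and changing only the ZeRO stage:

```python
# Workaround reported in the issue: ZeRO-2 loads the weights correctly,
# since the faulty conversion path only runs under ZeRO-3.
ds_cfg = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 2,  # stage 3 triggers the MISSING keys
    },
}
```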


Labels

Should Fix: This has been identified as a bug and should be fixed.