System Info
- `transformers` version: 5.4.0
- Platform: Linux (H200 x4)
- Python version: 3.12.0
- DeepSpeed version: 0.18.5
- PyTorch version: 2.8.0+cu128 (CUDA)
Problem
When loading `Qwen/Qwen3.5-27B` (also tested with 9B) with DeepSpeed ZeRO-3, the `language_model` parameters are reported as MISSING in the load report.
Key | Status | Details
---------------------------------------------------------------------+---------+--------
model.language_model.layers.{0...63}.post_attention_layernorm.weight | MISSING |
model.language_model.layers.{0...62}.linear_attn.out_proj.weight | MISSING |
model.language_model.norm.weight | MISSING |
model.language_model.layers.{0...62}.linear_attn.in_proj_qkv.weight | MISSING |
model.language_model.layers.{0...62}.linear_attn.conv1d.weight | MISSING |
model.language_model.layers.{3...63}.self_attn.q_norm.weight | MISSING |
model.language_model.layers.{0...62}.linear_attn.dt_bias | MISSING |
model.language_model.layers.{0...62}.linear_attn.A_log | MISSING |
model.language_model.layers.{0...62}.linear_attn.norm.weight | MISSING |
model.language_model.layers.{0...63}.mlp.gate_proj.weight | MISSING |
model.language_model.layers.{0...63}.mlp.down_proj.weight | MISSING |
model.language_model.layers.{0...62}.linear_attn.in_proj_a.weight | MISSING |
model.language_model.layers.{3...63}.self_attn.q_proj.weight | MISSING |
model.language_model.layers.{3...63}.self_attn.k_proj.weight | MISSING |
model.language_model.layers.{0...63}.input_layernorm.weight | MISSING |
model.language_model.layers.{0...63}.mlp.up_proj.weight | MISSING |
model.language_model.layers.{0...62}.linear_attn.in_proj_b.weight | MISSING |
model.language_model.layers.{0...62}.linear_attn.in_proj_z.weight | MISSING |
model.language_model.layers.{3...63}.self_attn.k_norm.weight | MISSING |
model.language_model.layers.{3...63}.self_attn.o_proj.weight | MISSING |
model.language_model.layers.{3...63}.self_attn.v_proj.weight | MISSING |
model.language_model.embed_tokens.weight | MISSING |
Cause hypothesis
In `conversion_mapping.py`, the `model.language_model` weight keys are remapped to `model`. This conversion is applied in `_load_pretrained_model` when DeepSpeed ZeRO-3 is enabled.
src/transformers/conversion_mapping.py, lines 155 to 157 at d081c71:

```python
"qwen3_5_text": [
    WeightRenaming(source_patterns=r"^model.language_model", target_patterns="model"),
],
```
The problem disappears when setting `target_patterns="model.language_model"` or when using ZeRO-2.
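If this mapping is the culprit, the key mismatch is easy to illustrate. Below is a minimal sketch using plain `re.sub` (not the actual `WeightRenaming` machinery; the sample keys are taken from the table above) of how the rename diverts checkpoint keys away from the parameter names the ZeRO-3-partitioned module exposes:

```python
import re

# Keys as they appear in the Qwen3.5 checkpoint (sample from the table above).
checkpoint_keys = [
    "model.language_model.embed_tokens.weight",
    "model.language_model.layers.0.mlp.up_proj.weight",
    "model.language_model.norm.weight",
]

# Mimic the WeightRenaming entry: ^model.language_model -> model
renamed = [re.sub(r"^model.language_model", "model", k) for k in checkpoint_keys]
# ['model.embed_tokens.weight', 'model.layers.0.mlp.up_proj.weight', 'model.norm.weight']

# The in-memory module still names its parameters model.language_model.*,
# so none of the renamed keys match, and all of them are reported as MISSING.

# The no-op rename from the workaround leaves the keys untouched:
unchanged = [re.sub(r"^model.language_model", "model.language_model", k) for k in checkpoint_keys]
assert unchanged == checkpoint_keys
```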
Reproduction
```python
from transformers import Qwen3_5ForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

ds_cfg = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
    },
}

# Must be created (and kept alive) before from_pretrained so that
# ZeRO-3 initialization is detected during model loading.
dschf = HfDeepSpeedConfig(ds_cfg)

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-27B", torch_dtype="auto"
)
```
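To confirm the failure programmatically instead of scanning the load report, one option (assuming `from_pretrained` still accepts `output_loading_info=True` in this version) is:

```python
# Hypothetical check: surface the missing keys directly.
model, loading_info = Qwen3_5ForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-27B", torch_dtype="auto", output_loading_info=True
)
missing = loading_info["missing_keys"]
# Per the report above: non-empty under ZeRO-3, empty under ZeRO-2
# or with the target_patterns="model.language_model" change.
print(len(missing), missing[:3])
```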
Expected Behavior
Model weights should load correctly with DeepSpeed ZeRO-3.
Related issues
- `save_pretrained` API since version 5.4.0 #45216