[qwen3] Round-trip dense and MoE HF adapters by ivy-zhou · Pull Request #3557 · pytorch/torchtitan

ivy-zhou · 2026-06-05T19:47:06Z

Summary: Qwen3StateDictAdapter.from_hf was mutating the caller's HF state dict while restoring lm_head.weight for tied-weight dense Qwen3 checkpoints. That leaked a shared-storage alias back into the HF dict, making later safetensors serialization fail.

Fix the tying path by shallow-copying the input before restoring lm_head.weight. Also include Qwen3 MoE in the HF round-trip coverage and bring its adapter in line with the shape-suffix GroupedExperts names (w1_EFD/w2_EDF/w3_EFD). Qwen3 MoE has the same auxiliary-loss-free expert_bias_E gap as llama4/gpt_oss: HF has no persisted buffer for it, so to_hf drops it and from_hf restores zeros to keep the native key set complete.

Test changes:

Add qwen3/debugmodel, qwen3/debugmodel_fused_qkv, and qwen3/debugmodel_moe to test_hf_checkpoint_roundtrip.
Add a Qwen3 MoE expert_bias_E native-value exclusion because the HF format cannot preserve that buffer.

Test Plan: python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v
python -m py_compile tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py
flake8 tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py

Reviewers:

Subscribers:

Tasks:

Tags:

Stack created with Sapling. Best reviewed with ReviewStack.

Summary: torchtitan's per-model StateDictAdapter implementations convert between native state dict layout and HuggingFace's safetensors layout. Every transformation they perform (key rename, RoPE permutation, fused-vs-separate QKV repack, weight tying handling) is by design a pure tensor reshuffle. The end-to-end round-trip must therefore be invertible bit-for-bit, both as in-memory dicts and as serialized safetensors files on disk. Add tests/unit_tests/test_hf_checkpoint_roundtrip.py with a parameterized framework over (model_name, flavor) pairs. Per pair: build the model on CPU with random init, capture native state dict, to_hf once, from_hf, to_hf again, then assert both (a) native-side bit equality across all keys and (b) byte-equal safetensors files for the two HF outputs. The HF-side file comparison catches bugs the in-memory dict comparison would miss (e.g. shared-storage tensors that safetensors refuses to serialize, or non-determinism in the safetensors layer). This first diff covers the llama3 baseline (debugmodel + debugmodel_fused_qkv) — the only adapter currently free of round-trip bugs. Subsequent diffs in this stack add llama4, gpt_oss, deepseek_v3, and qwen3, each fixing the specific bugs the test exposes for that model and auditing the adapter for the from_hf input-mutation antipattern. Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - Both llama3 subTests pass (vanilla GQA + fused QKV path). - Test runtime: ~3 seconds on CPU.

Summary: Llama4StateDictAdapter has three round-trip bugs that the new HF round-trip test catches. This diff fixes all three and adds llama4 to the test parametrization. 1. MoE expert weight names are out of sync with the model. `common/moe.py` declares MoE expert weights as `w1_EFD`, `w2_EDF`, `w3_EFD` (the shape-suffix convention added in PR pytorch#3425), but the adapter still references `w1`, `w2`, `w3` everywhere. The rename PR updated the model code but did not touch the adapters. This diff updates the adapter to use the suffixed names throughout — in `from_hf_map`, in the `to_hf` branch that transposes `w2` for storage, in the `to_combine` collection logic, and in the `from_hf` chunk-and-split that materializes `w1` / `w3` from HF's `gate_up_proj`. 2. Dense FF layers have no mappings. Llama4's interleaved MoE pattern (debugmodel uses `interleave_moe_layer_step=2`) means layers 0, 2, 4 are dense FF with `feed_forward.w1` / `w2` / `w3.weight` keys, while layers 1, 3, 5 are MoE. The adapter only had MoE mappings — dense FF layers had nowhere to go. Add the dense mappings analogous to llama3, with the HF `gate_proj` / `up_proj` / `down_proj` names under the `feed_forward` prefix per HF's `Llama4TextMLP` and `Llama4TextDecoderLayer`. 3. `expert_bias_E` cannot survive HF round-trip. This is auxiliary-loss-free load-balancing bias (DeepSeek paper, adapted for Qwen/Mixtral-style routers including llama4). HF transformers' Llama4 router has no equivalent — `Llama4Router` is plain `nn.Linear(bias=False)` and the MoE forward never adds any bias. So a real HF llama4 safetensors file cannot have any key for it. We keep the existing `None` mapping in `from_hf_map` (drop on `to_hf`, matches HF format) and add an explicit zeros-reinitialization in `from_hf` so the native key set stays complete. The bias resets across an HF round-trip by design. Adapter audit notes: - No `from_hf` input-mutation antipattern (the function constructs a fresh `state_dict = {}` at the top, never mutates the input). - The transpose on `w2_EDF` (and on `w1_EFD` / `w3_EFD` during the `to_combine` step) produces non-contiguous views. safetensors refuses to serialize non-contiguous tensors, so we now `.contiguous()` after the transpose where the view escapes into `hf_state_dict` directly. The `to_combine` path is OK because `torch.cat` produces a fresh contiguous tensor. Test changes: - Add `("llama4", "debugmodel")` to `_MODEL_FLAVORS`. - Introduce `_NATIVE_EXCLUSIONS: dict[tuple[str, str], list[str]]` for keys whose tensor values cannot round-trip bit-exactly (the key set itself must still match — exclusions only suppress the per-tensor `torch.equal` check). Every exclusion is explicitly listed with a comment explaining why; this is the only entry so far. - Add the llama4 exclusion for `.*\.moe\.expert_bias_E$` with a comment citing HF's missing equivalent and the load-balancing semantics. Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - All three subTests pass: llama3/debugmodel, llama3/debugmodel_fused_qkv, llama4/debugmodel. - Test runtime: ~3.6 seconds.

Summary: GptOssStateDictAdapter has the same class of round-trip bugs llama4 had — out-of-sync MoE expert weight names and missing expert_bias_E handling. This diff fixes both and adds gpt_oss to the test parametrization. 1. MoE expert weight names are out of sync with the model. `gpt_oss/moe.py` declares MoE expert weights as `mlp1_weight_EGD`, `mlp1_bias_EG`, `mlp2_weight_EDF`, `mlp2_bias_ED` (the shape-suffix convention added in PR pytorch#3425), but the adapter still references `mlp1_weight`, `mlp1_bias`, `mlp2_weight`, `mlp2_bias`. Update `from_hf_map` to use the suffixed names. 2. `expert_bias_E` is missing from `from_hf_map` entirely. gpt_oss uses the same auxiliary-loss-free load-balancing scheme as llama4. HF transformers' gpt_oss router has no equivalent buffer, so add the `None: "layers.{}.moe.expert_bias_E"` mapping (drop on `to_hf`, matches HF format) and reinitialize it to zeros in `from_hf` so the native key set stays complete. Same design as llama4 — the bias resets across an HF round-trip by design. Adapter audit notes: - No `from_hf` input-mutation antipattern (the function constructs a fresh `state_dict = {}` at the top, never mutates the input). - The `to_hf` function did not guard against `None` hf-key values from the inverted `to_hf_map` (would crash with `AttributeError` on `hf_key.format(layer_num)` once the `None: ...` entry was added). Fix by adding `if hf_key is None: continue` checks in both the layered and non-layered branches — same pattern llama4 already had. - gpt_oss does not transpose expert weights, so no contiguous-view issue. Test changes: - Add `("gpt_oss", "debugmodel")` to `_MODEL_FLAVORS`. - Add gpt_oss exclusion entry to `_NATIVE_EXCLUSIONS` for `.*\.moe\.expert_bias_E$`, citing the same HF-has-no-equivalent reasoning as llama4. Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - All four subTests pass: llama3/debugmodel, llama3/debugmodel_fused_qkv, llama4/debugmodel, gpt_oss/debugmodel. - Test runtime: ~4.6 seconds.

Summary: DeepSeekV3StateDictAdapter had the same MoE expert weight name mismatch as llama4 and gpt_oss — `from_hf_map` references `w1` / `w2` / `w3` while `common/moe.py` declares `w1_EFD` / `w2_EDF` / `w3_EFD` (the shape-suffix convention from PR pytorch#3425). This manifested as a hard crash, not a silent miss: `to_hf` invokes `to_hf_map[abstract_key]` directly with no guard, so the missing entry raises `KeyError: 'layers.{}.moe.experts.w1_EFD'`. Rename the three MoE expert weight entries in `from_hf_map` to use the suffixed names. deepseek_v3 is the only MoE adapter in this stack where `expert_bias_E` actually round-trips: HF transformers' DeepSeek router has `e_score_correction_bias` as a real persisted buffer (originally introduced by the DeepSeek paper as part of the model definition), and the adapter already maps to it on line 56: "model.layers.{}.mlp.gate.e_score_correction_bias": "layers.{}.moe.expert_bias_E" So no `_NATIVE_EXCLUSIONS` entry is needed for deepseek_v3, unlike llama4 / gpt_oss where HF's router has no equivalent buffer and the bias is dropped + reset to zeros across the round-trip. Adapter audit notes: - No `from_hf` input-mutation antipattern (the function constructs a fresh `state_dict = {}` at the top, never mutates the input). - to_hf invokes `to_hf_map[abstract_key]` and `to_hf_map[key]` without a presence check or None guard. Today this works for all non-MoE-expert keys because every `from_hf_map` value is a real titan key (no `None` values like the llama4 / gpt_oss `expert_bias_E` drop pattern). If a future change adds a `None: ...` entry, the same guard work as in llama4 / gpt_oss will be needed. Not adding it preemptively to keep this diff focused. - No expert-weight transposes, so no contiguous-view issue. Test changes: - Add `("deepseek_v3", "debugmodel")` to `_MODEL_FLAVORS`. - No `_NATIVE_EXCLUSIONS` entry (expert_bias_E round-trips via HF's e_score_correction_bias). Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - All five subTests pass: llama3/debugmodel, llama3/debugmodel_fused_qkv, llama4/debugmodel, gpt_oss/debugmodel, deepseek_v3/debugmodel. - Test runtime: ~5.3 seconds.

Summary: Qwen3StateDictAdapter.from_hf was mutating the caller's HF state dict while restoring lm_head.weight for tied-weight dense Qwen3 checkpoints. That leaked a shared-storage alias back into the HF dict, making later safetensors serialization fail. Fix the tying path by shallow-copying the input before restoring lm_head.weight. Also include Qwen3 MoE in the HF round-trip coverage and bring its adapter in line with the shape-suffix GroupedExperts names (w1_EFD/w2_EDF/w3_EFD). Qwen3 MoE has the same auxiliary-loss-free expert_bias_E gap as llama4/gpt_oss: HF has no persisted buffer for it, so to_hf drops it and from_hf restores zeros to keep the native key set complete. Test changes: - Add qwen3/debugmodel, qwen3/debugmodel_fused_qkv, and qwen3/debugmodel_moe to test_hf_checkpoint_roundtrip. - Add a Qwen3 MoE expert_bias_E native-value exclusion because the HF format cannot preserve that buffer. Test Plan: python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v python -m py_compile tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py flake8 tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py Reviewers: Subscribers: Tasks: Tags:

pytorch-bot · 2026-06-08T15:25:52Z

~~Workflows were awaiting approval.~~ CI has now been triggered for the ciflow labels on this PR.

pytorch-bot · 2026-06-08T15:25:52Z

The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:

ciflow/8gpu

Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows.

ivy-zhou requested review from fegin, tianyu-l, wconstab and wwwjn as code owners June 5, 2026 19:47

This was referenced Jun 5, 2026

[tests] Add parameterized HuggingFace checkpoint round-trip test #3553

Open

[gpt_oss] Round-trip MoE expert weights through HF adapter #3555

Open

[deepseek_v3] Complete MoE to_hf_map for shape-suffix expert weights #3556

Open

pytorch-bot Bot added the ciflow/8gpu label Jun 5, 2026

ivy-zhou mentioned this pull request Jun 5, 2026

[llama4] Round-trip MoE expert weights through HF adapter #3554

Open

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 5, 2026

ivy-zhou added 5 commits June 8, 2026 08:08

ivy-zhou force-pushed the pr3557 branch from 43731d4 to d675bc1 Compare June 8, 2026 15:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[qwen3] Round-trip dense and MoE HF adapters#3557

[qwen3] Round-trip dense and MoE HF adapters#3557
ivy-zhou wants to merge 5 commits into
pytorch:mainfrom
ivy-zhou:pr3557

ivy-zhou commented Jun 5, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ivy-zhou commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ivy-zhou commented Jun 5, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 8, 2026 •

edited

Loading