[qwen3] Round-trip dense and MoE HF adapters#3557
Open
ivy-zhou wants to merge 5 commits into
Open
Conversation
Summary: torchtitan's per-model StateDictAdapter implementations convert between native state dict layout and HuggingFace's safetensors layout. Every transformation they perform (key rename, RoPE permutation, fused-vs-separate QKV repack, weight tying handling) is by design a pure tensor reshuffle. The end-to-end round-trip must therefore be invertible bit-for-bit, both as in-memory dicts and as serialized safetensors files on disk. Add tests/unit_tests/test_hf_checkpoint_roundtrip.py with a parameterized framework over (model_name, flavor) pairs. Per pair: build the model on CPU with random init, capture native state dict, to_hf once, from_hf, to_hf again, then assert both (a) native-side bit equality across all keys and (b) byte-equal safetensors files for the two HF outputs. The HF-side file comparison catches bugs the in-memory dict comparison would miss (e.g. shared-storage tensors that safetensors refuses to serialize, or non-determinism in the safetensors layer). This first diff covers the llama3 baseline (debugmodel + debugmodel_fused_qkv) — the only adapter currently free of round-trip bugs. Subsequent diffs in this stack add llama4, gpt_oss, deepseek_v3, and qwen3, each fixing the specific bugs the test exposes for that model and auditing the adapter for the from_hf input-mutation antipattern. Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - Both llama3 subTests pass (vanilla GQA + fused QKV path). - Test runtime: ~3 seconds on CPU.
Summary: Llama4StateDictAdapter has three round-trip bugs that the new HF round-trip test catches. This diff fixes all three and adds llama4 to the test parametrization. 1. MoE expert weight names are out of sync with the model. `common/moe.py` declares MoE expert weights as `w1_EFD`, `w2_EDF`, `w3_EFD` (the shape-suffix convention added in PR pytorch#3425), but the adapter still references `w1`, `w2`, `w3` everywhere. The rename PR updated the model code but did not touch the adapters. This diff updates the adapter to use the suffixed names throughout — in `from_hf_map`, in the `to_hf` branch that transposes `w2` for storage, in the `to_combine` collection logic, and in the `from_hf` chunk-and-split that materializes `w1` / `w3` from HF's `gate_up_proj`. 2. Dense FF layers have no mappings. Llama4's interleaved MoE pattern (debugmodel uses `interleave_moe_layer_step=2`) means layers 0, 2, 4 are dense FF with `feed_forward.w1` / `w2` / `w3.weight` keys, while layers 1, 3, 5 are MoE. The adapter only had MoE mappings — dense FF layers had nowhere to go. Add the dense mappings analogous to llama3, with the HF `gate_proj` / `up_proj` / `down_proj` names under the `feed_forward` prefix per HF's `Llama4TextMLP` and `Llama4TextDecoderLayer`. 3. `expert_bias_E` cannot survive HF round-trip. This is auxiliary-loss-free load-balancing bias (DeepSeek paper, adapted for Qwen/Mixtral-style routers including llama4). HF transformers' Llama4 router has no equivalent — `Llama4Router` is plain `nn.Linear(bias=False)` and the MoE forward never adds any bias. So a real HF llama4 safetensors file cannot have any key for it. We keep the existing `None` mapping in `from_hf_map` (drop on `to_hf`, matches HF format) and add an explicit zeros-reinitialization in `from_hf` so the native key set stays complete. The bias resets across an HF round-trip by design. Adapter audit notes: - No `from_hf` input-mutation antipattern (the function constructs a fresh `state_dict = {}` at the top, never mutates the input). - The transpose on `w2_EDF` (and on `w1_EFD` / `w3_EFD` during the `to_combine` step) produces non-contiguous views. safetensors refuses to serialize non-contiguous tensors, so we now `.contiguous()` after the transpose where the view escapes into `hf_state_dict` directly. The `to_combine` path is OK because `torch.cat` produces a fresh contiguous tensor. Test changes: - Add `("llama4", "debugmodel")` to `_MODEL_FLAVORS`. - Introduce `_NATIVE_EXCLUSIONS: dict[tuple[str, str], list[str]]` for keys whose tensor values cannot round-trip bit-exactly (the key set itself must still match — exclusions only suppress the per-tensor `torch.equal` check). Every exclusion is explicitly listed with a comment explaining why; this is the only entry so far. - Add the llama4 exclusion for `.*\.moe\.expert_bias_E$` with a comment citing HF's missing equivalent and the load-balancing semantics. Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - All three subTests pass: llama3/debugmodel, llama3/debugmodel_fused_qkv, llama4/debugmodel. - Test runtime: ~3.6 seconds.
Summary: GptOssStateDictAdapter has the same class of round-trip bugs llama4 had — out-of-sync MoE expert weight names and missing expert_bias_E handling. This diff fixes both and adds gpt_oss to the test parametrization. 1. MoE expert weight names are out of sync with the model. `gpt_oss/moe.py` declares MoE expert weights as `mlp1_weight_EGD`, `mlp1_bias_EG`, `mlp2_weight_EDF`, `mlp2_bias_ED` (the shape-suffix convention added in PR pytorch#3425), but the adapter still references `mlp1_weight`, `mlp1_bias`, `mlp2_weight`, `mlp2_bias`. Update `from_hf_map` to use the suffixed names. 2. `expert_bias_E` is missing from `from_hf_map` entirely. gpt_oss uses the same auxiliary-loss-free load-balancing scheme as llama4. HF transformers' gpt_oss router has no equivalent buffer, so add the `None: "layers.{}.moe.expert_bias_E"` mapping (drop on `to_hf`, matches HF format) and reinitialize it to zeros in `from_hf` so the native key set stays complete. Same design as llama4 — the bias resets across an HF round-trip by design. Adapter audit notes: - No `from_hf` input-mutation antipattern (the function constructs a fresh `state_dict = {}` at the top, never mutates the input). - The `to_hf` function did not guard against `None` hf-key values from the inverted `to_hf_map` (would crash with `AttributeError` on `hf_key.format(layer_num)` once the `None: ...` entry was added). Fix by adding `if hf_key is None: continue` checks in both the layered and non-layered branches — same pattern llama4 already had. - gpt_oss does not transpose expert weights, so no contiguous-view issue. Test changes: - Add `("gpt_oss", "debugmodel")` to `_MODEL_FLAVORS`. - Add gpt_oss exclusion entry to `_NATIVE_EXCLUSIONS` for `.*\.moe\.expert_bias_E$`, citing the same HF-has-no-equivalent reasoning as llama4. Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - All four subTests pass: llama3/debugmodel, llama3/debugmodel_fused_qkv, llama4/debugmodel, gpt_oss/debugmodel. - Test runtime: ~4.6 seconds.
Summary: DeepSeekV3StateDictAdapter had the same MoE expert weight name mismatch as llama4 and gpt_oss — `from_hf_map` references `w1` / `w2` / `w3` while `common/moe.py` declares `w1_EFD` / `w2_EDF` / `w3_EFD` (the shape-suffix convention from PR pytorch#3425). This manifested as a hard crash, not a silent miss: `to_hf` invokes `to_hf_map[abstract_key]` directly with no guard, so the missing entry raises `KeyError: 'layers.{}.moe.experts.w1_EFD'`. Rename the three MoE expert weight entries in `from_hf_map` to use the suffixed names. deepseek_v3 is the only MoE adapter in this stack where `expert_bias_E` actually round-trips: HF transformers' DeepSeek router has `e_score_correction_bias` as a real persisted buffer (originally introduced by the DeepSeek paper as part of the model definition), and the adapter already maps to it on line 56: "model.layers.{}.mlp.gate.e_score_correction_bias": "layers.{}.moe.expert_bias_E" So no `_NATIVE_EXCLUSIONS` entry is needed for deepseek_v3, unlike llama4 / gpt_oss where HF's router has no equivalent buffer and the bias is dropped + reset to zeros across the round-trip. Adapter audit notes: - No `from_hf` input-mutation antipattern (the function constructs a fresh `state_dict = {}` at the top, never mutates the input). - to_hf invokes `to_hf_map[abstract_key]` and `to_hf_map[key]` without a presence check or None guard. Today this works for all non-MoE-expert keys because every `from_hf_map` value is a real titan key (no `None` values like the llama4 / gpt_oss `expert_bias_E` drop pattern). If a future change adds a `None: ...` entry, the same guard work as in llama4 / gpt_oss will be needed. Not adding it preemptively to keep this diff focused. - No expert-weight transposes, so no contiguous-view issue. Test changes: - Add `("deepseek_v3", "debugmodel")` to `_MODEL_FLAVORS`. - No `_NATIVE_EXCLUSIONS` entry (expert_bias_E round-trips via HF's e_score_correction_bias). Test Plan: - python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v - All five subTests pass: llama3/debugmodel, llama3/debugmodel_fused_qkv, llama4/debugmodel, gpt_oss/debugmodel, deepseek_v3/debugmodel. - Test runtime: ~5.3 seconds.
Summary: Qwen3StateDictAdapter.from_hf was mutating the caller's HF state dict while restoring lm_head.weight for tied-weight dense Qwen3 checkpoints. That leaked a shared-storage alias back into the HF dict, making later safetensors serialization fail. Fix the tying path by shallow-copying the input before restoring lm_head.weight. Also include Qwen3 MoE in the HF round-trip coverage and bring its adapter in line with the shape-suffix GroupedExperts names (w1_EFD/w2_EDF/w3_EFD). Qwen3 MoE has the same auxiliary-loss-free expert_bias_E gap as llama4/gpt_oss: HF has no persisted buffer for it, so to_hf drops it and from_hf restores zeros to keep the native key set complete. Test changes: - Add qwen3/debugmodel, qwen3/debugmodel_fused_qkv, and qwen3/debugmodel_moe to test_hf_checkpoint_roundtrip. - Add a Qwen3 MoE expert_bias_E native-value exclusion because the HF format cannot preserve that buffer. Test Plan: python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v python -m py_compile tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py flake8 tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py Reviewers: Subscribers: Tasks: Tags:
|
|
|
The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:
Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary: Qwen3StateDictAdapter.from_hf was mutating the caller's HF state dict while restoring lm_head.weight for tied-weight dense Qwen3 checkpoints. That leaked a shared-storage alias back into the HF dict, making later safetensors serialization fail.
Fix the tying path by shallow-copying the input before restoring lm_head.weight. Also include Qwen3 MoE in the HF round-trip coverage and bring its adapter in line with the shape-suffix GroupedExperts names (w1_EFD/w2_EDF/w3_EFD). Qwen3 MoE has the same auxiliary-loss-free expert_bias_E gap as llama4/gpt_oss: HF has no persisted buffer for it, so to_hf drops it and from_hf restores zeros to keep the native key set complete.
Test changes:
Test Plan: python -m unittest tests.unit_tests.test_hf_checkpoint_roundtrip -v
python -m py_compile tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py
flake8 tests/unit_tests/test_hf_checkpoint_roundtrip.py torchtitan/models/qwen3/state_dict_adapter.py
Reviewers:
Subscribers:
Tasks:
Tags:
Stack created with Sapling. Best reviewed with ReviewStack.