[Modeling] MLA (DeepSeek) Eagle3 draft architecture #640
Draft
maocheng23 wants to merge 17 commits into
… checkpoint/eval)
Brings the training loop to production parity (plan.md §D / roadmap
domain-refactor.md §D):
- Grad accumulation with no_sync(). TrainerCore now decides the optimizer
boundary BEFORE backward and passes is_boundary to
FSDPTrainingBackend.backward, which wraps the non-boundary micro-steps in
self.module.no_sync() — the FSDP gradient reduction fires once per optimizer
step, not once per micro-step. Single-rank / accumulation_steps=1 is unchanged.
- Full resume. FSDPTrainingBackend.state_dict() now returns the full training
state {model (FSDP FULL_STATE_DICT), optimizer (BF16Optimizer bundles the LR
scheduler), rng (cpu+cuda)}; load_state_dict restores all three. save_checkpoint
persists the export draft weights + optimizer + rng; the domain Trainer gains
resume_from (restore draft weights before the FSDP wrap so the fp32 master is
rebuilt from them, then optimizer+scheduler+rng, then start_step/epoch).
- CheckpointManager (specforge/training/checkpoint.py) — {run_id}-step{step}
layout, keep-last-N rotation that never drops the tracked best, latest/best
symlinks + best_meta.json. Born in its S-home; TrainerController imports it
lazily so the runtime seam stays leaf.
- Evaluator (specforge/eval/evaluator.py) — aggregates per-position accept
counts across the whole eval pass BEFORE the geometric sum, so
simulated_acc_len is batch-size invariant. TrainerController.evaluate delegates
to it; DFlash/Domino scalar accuracy degenerates gracefully.
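
A minimal sketch of the accumulation rule in the first bullet, assuming a stand-in module rather than a real FSDP wrapper. `FakeShardedModule`, `is_boundary`, `backward`, and `train` are hypothetical names for illustration; only the idea (decide the boundary before backward, wrap non-boundary micro-steps in `no_sync()`) comes from the commit message.

```python
from contextlib import contextmanager, nullcontext


class FakeShardedModule:
    """Stand-in for an FSDP-wrapped module; it only counts no_sync() entries."""

    def __init__(self):
        self.deferrals = 0

    @contextmanager
    def no_sync(self):
        # A real sharded module would skip the gradient reduction in here.
        self.deferrals += 1
        yield


def is_boundary(micro_step: int, accumulation_steps: int) -> bool:
    """True when this 0-based micro-step completes an optimizer step."""
    return (micro_step + 1) % accumulation_steps == 0


def backward(module, run_backward, boundary: bool) -> None:
    # Non-boundary micro-steps defer the reduction; the boundary one syncs.
    ctx = nullcontext() if boundary else module.no_sync()
    with ctx:
        run_backward()


def train(module, num_micro_steps: int, accumulation_steps: int) -> int:
    optimizer_steps = 0
    for micro_step in range(num_micro_steps):
        # The boundary is decided BEFORE backward, then passed down.
        boundary = is_boundary(micro_step, accumulation_steps)
        backward(module, lambda: None, boundary)
        if boundary:
            optimizer_steps += 1
    return optimizer_steps


module = FakeShardedModule()
steps = train(module, num_micro_steps=8, accumulation_steps=4)
```

With 8 micro-steps and `accumulation_steps=4`, this toy loop takes 2 optimizer steps and enters `no_sync()` 6 times, i.e. `accumulation_steps - 1` deferrals per optimizer step.
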
Gates: test_checkpoint_resume extended (loss-curve continuity across save->resume,
tol for the bf16-master reconstruction), test_no_sync_equiv (2-rank: no_sync path
== per-step reduction, bit-tight), test_evaluator_aggregation (aggregate-before-
geometric-sum, batch-size invariance, CheckpointManager rotation/best/latest).
Backend doubles + the resume test updated for the composite state_dict.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
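
To illustrate why the Evaluator bullet above aggregates counts before the geometric sum, here is a small sketch. The fold formula (sum over k of the product of the first k per-position accept rates) is an assumption about what "geometric sum" means here, and all function names are hypothetical; the commit does not show the Evaluator's code.

```python
from typing import List, Tuple

# (correct per position, total per position) for one eval batch
Counts = Tuple[List[int], List[int]]


def fold_acc_len(rates: List[float]) -> float:
    """Assumed fold: sum over k of the product of the first k accept rates."""
    total, running = 0.0, 1.0
    for rate in rates:
        running *= rate
        total += running
    return total


def aggregate_then_fold(batches: List[Counts]) -> float:
    """Sum counts over the whole pass, then fold once."""
    n_pos = max(len(correct) for correct, _ in batches)
    correct_sum = [0] * n_pos
    total_sum = [0] * n_pos
    for correct, total in batches:
        for i, (c, t) in enumerate(zip(correct, total)):
            correct_sum[i] += c
            total_sum[i] += t
    rates = [c / t if t else 0.0 for c, t in zip(correct_sum, total_sum)]
    return fold_acc_len(rates)


def fold_then_average(batches: List[Counts]) -> float:
    """The batch-size-dependent alternative: fold per batch, then average."""
    per_batch = [
        fold_acc_len([c / t for c, t in zip(correct, total)])
        for correct, total in batches
    ]
    return sum(per_batch) / len(per_batch)


# The same three samples (A, B, C) grouped into batches two different ways.
batching_1 = [([10, 5], [20, 20]), ([10, 10], [10, 10])]   # [A+B], [C]
batching_2 = [([8, 4], [10, 10]), ([12, 11], [20, 20])]    # [A], [B+C]

invariant = (aggregate_then_fold(batching_1), aggregate_then_fold(batching_2))
skewed = (fold_then_average(batching_1), fold_then_average(batching_2))
```

In this toy data the aggregated counts are identical under both groupings, so `invariant` holds two equal values, while the two entries of `skewed` differ.
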
…nly (move-only)
Collapse execution code into one home per concern (plan.md §2.3, roadmap E0):
- runtime/training/{trainer,backend,strategy,registry} -> training/{controller,backend,strategies/{base,registry}}
- runtime/inference/{rollout_worker,capture} -> inference/; {sglang,dflash}_adapter -> inference/adapters/{eagle3,dflash}
- modeling/target/{base,factory,eagle3,dflash}_target_model + sglang_backend/ -> inference/target_engine/
- runtime/launch.py -> specforge/launch.py
- modeling/target keeps model defs only (target_head, target_utils, custom_backend)
Zero functional change: pure moves + import-path rewrites; every old module path
keeps a re-export shim for one release. Suite + Phase-B byte gate must pass unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…29-training-managers
…fe Evaluator, durable best tracking
Post-review fixes for #637 (adversarial review vs the #630 roadmap):
Multi-rank resume correctness:
- CheckpointManager.save now writes each rank's optimizer/RNG to its own training_state_rank{r}.pt beside the rank0 shared payload (draft weights + counters + world_size); under FSDP use_orig_params the AdamW moments live on rank-local shard views, so persisting only rank0's copy and restoring it everywhere corrupted the other ranks' moments and collapsed their RNG streams. read_resume_state hands each rank back its own state and fails fast on a world-size mismatch (and on legacy single-file checkpoints at world_size>1).
Resume repositions the data stream (plan.md G1 seek()-equivalent):
- TrainerController tracks epoch_batch, persists it, and skips the consumed prefix of the interrupted epoch on resume (via the new FeatureDataLoader.seek — no feature materialization — or islice for plain iterables). The domain Trainer threads it for refs-mode runs; the online queue path keeps control-plane skip_ids reconciliation.
- test_resume_loss_curve_continuity now trains on DISTINCT batches, so resuming on the wrong data cannot pass (the old fixed-batch form was structurally blind to the data position).
Evaluator DP correctness:
- The collective schedule is decided globally (SUM of scalar sums, MAX of the per-position length, then ONE stacked count reduce) so a rank with an empty or scalar-only shard issues the same collectives as its peers — no NCCL desync. Scalar accuracy (DFlash/Domino) is now reduced across ranks like everything else. Collectives use the local device via specforge.utils.get_local_device (CPU for gloo). Accumulation stays on-device (one host sync after the loop), and eval/per_position_acc is reported alongside the folded acc-len.
- New 2-process gloo gate: cross-rank scalar reduction + ragged-shard schedule symmetry (test_evaluator_aggregation).
Durable, decoupled best tracking:
- CheckpointManager rehydrates best_score/best_step from best_meta.json (now also carrying "score") on construction, so a restarted process neither rotates away the on-disk best nor lets a worse score overwrite it; update_best is split into score()/is_better().
- fit() tracks the best on EVERY eval (when checkpointing is enabled), persisting a checkpoint on demand when the best eval lands off the save cadence — previously best only fired when eval_interval and save_interval coincided on the same step.
Surface + cleanup:
- launch: every builder now forwards resume_from/max_checkpoints, so resume and rotation are reachable from the build_* entry points.
- TrainerController: drop the max_checkpoints/checkpoint_manager dual config (inject a configured manager; the domain Trainer does); remove the dead eval_step + mode plumbing (evaluate goes through Evaluator on raw forward_loss; validate_batch still runs inside every strategy's forward_loss); drop the did_eval/last_eval_metrics threading.
- Trainer._load_resume_state removed: CheckpointManager.read_resume_state is the single checkpoint reader; the loaded dict is dropped after the weight copy instead of living through the FSDP wrap.
- CheckpointManager._rotate uses shutil.rmtree; symlink creation is guarded for filesystems without symlink support (dirs + best_meta.json stay the source of truth); save_checkpoint filters draft weights on rank0 only.
- test_no_sync_equiv now also counts no_sync deferrals (exactly accumulation_steps-1 per optimizer step) — the roadmap's "one all-reduce per optimizer step" gate, not just weight equality.
- DESIGN.md updated (per-rank layout, seek, best semantics; fixed the inverted no_sync sentence).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
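
The per-rank resume layout described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the CheckpointManager code: JSON replaces the .pt files to stay dependency-free, `shared.json` is a hypothetical name for the rank0 shared payload, and the function signatures are invented for the example.

```python
import json
import tempfile
from pathlib import Path

SHARED_FILE = "shared.json"  # hypothetical name for the rank0 shared payload


def rank_file(ckpt_dir: Path, rank: int) -> Path:
    # The commit names training_state_rank{r}.pt; JSON is used here instead.
    return ckpt_dir / f"training_state_rank{rank}.json"


def save(ckpt_dir: Path, rank: int, world_size: int, shared: dict, rank_state: dict) -> None:
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    if rank == 0:
        # Shared payload is written once and records the writing world size.
        payload = {**shared, "world_size": world_size}
        (ckpt_dir / SHARED_FILE).write_text(json.dumps(payload))
    # Every rank persists its own optimizer/RNG-like state.
    rank_file(ckpt_dir, rank).write_text(json.dumps(rank_state))


def read_resume_state(ckpt_dir: Path, rank: int, world_size: int):
    shared = json.loads((ckpt_dir / SHARED_FILE).read_text())
    if shared["world_size"] != world_size:
        raise RuntimeError(
            f"checkpoint written with world_size={shared['world_size']}, "
            f"resuming with world_size={world_size}"
        )
    path = rank_file(ckpt_dir, rank)
    if not path.exists():
        raise RuntimeError(f"missing per-rank state for rank {rank}: {path.name}")
    return shared, json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "run-step100"
    for r in range(2):
        save(ckpt, r, 2, {"step": 100}, {"rng_seed": 1000 + r})
    shared, state_rank1 = read_resume_state(ckpt, rank=1, world_size=2)
    try:
        read_resume_state(ckpt, rank=0, world_size=4)
        mismatch_error = None
    except RuntimeError as exc:
        mismatch_error = str(exc)
```

Each rank gets back only its own state, and a resume at a different world size fails before any state is restored.
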
…29-training-managers
# Conflicts:
#	specforge/runtime/launch.py
…e position, bound comm device
Self-review of the fix commit surfaced four defects, all fixed:
- fit()'s on-demand best-save gated collective-bearing save_checkpoint on a PER-RANK is_better() whose best_score comes from rank-local filesystem reads (best_meta.json rehydration) — a divergent view (NFS attribute-cache lag, node-local dirs) would deadlock the group. rank0's verdict is now broadcast (_rank0_decision) and update_best gains force=True so every rank follows it.
- The mid-epoch position was persisted in BATCH units only, so resuming with a different batch_size silently mis-seeked the stream. The position is now also tracked/persisted in SAMPLES (epoch_samples); the domain Trainer converts samples back to this run's batches and fails fast when the sample position does not divide by the resumed batch size — symmetric with the world-size guard.
- Evaluator._comm_device used get_local_device() (LOCAL_RANK env, default 0): a non-torchrun launcher that doesn't export LOCAL_RANK would put every rank's reduction on cuda:0 and break the NCCL communicator. Now uses the rank's BOUND device (torch.cuda.current_device(), set by init_distributed on every path).
- The docstring/DESIGN claim that a ragged shard "cannot desynchronize NCCL" over-promised: it holds for the evaluator's OWN reductions (shard content), but a collective forward (FSDP all-gathers per batch) still requires the same eval-batch COUNT per rank. The docs now state the requirement precisely.
Also removed the stale TrainerCore.eval_step rows from the DESIGN.md/ARCHITECTURE.md endpoint tables (the method was removed in the previous commit).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
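
The samples-to-batches conversion from the second defect above might look like the sketch below. `batches_to_skip` and `resume_iter` are hypothetical helpers; whether the real position is counted per rank or globally is not stated in the commit, so the example treats it as a single stream.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches_to_skip(epoch_samples: int, batch_size: int) -> int:
    """Convert a persisted sample position into this run's batch count."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if epoch_samples % batch_size != 0:
        # Fail fast instead of silently seeking to the wrong place.
        raise ValueError(
            f"sample position {epoch_samples} is not a multiple of "
            f"batch_size {batch_size}"
        )
    return epoch_samples // batch_size


def resume_iter(batches: Iterable, epoch_samples: int, batch_size: int) -> Iterator:
    """Skip the consumed prefix of a plain iterable of batches."""
    return islice(batches, batches_to_skip(epoch_samples, batch_size), None)


# Saved after 6 batches at batch size 4, resumed at batch size 8.
saved_samples = 6 * 4
remaining = list(resume_iter(range(10), saved_samples, batch_size=8))
```

Here the 24 consumed samples map to 3 batches at the new batch size, so the resumed stream starts at batch index 3. A batch size of 5 would raise instead of guessing.
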
A shape-[1] loss tensor (e.g. a parameter-shaped scalar) failed to broadcast into the 0-dim float64 accumulator slot; .mean() normalizes exactly like the trainer's _scalar helper. Caught by the new misaligned-intervals best-tracking gate on the pod.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…0-layout
Brings the C/D review-fix commits into the consolidated layout. The old-path shim files stay shims; the fixes are ported verbatim onto the moved homes (training/controller.py, training/backend.py, launch.py — verified byte-identical to the up-29 sources modulo E0's import-path rewrites). Same-path files (domain trainer, checkpoint, evaluator, data plane, control plane, tests) merged directly.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t/latest pointers
Two follow-ups from the E0/E1 adversarial review that belong to this PR's code:
- Scalar-accuracy strategies (DFlash/Domino) were not batch-size invariant: the Evaluator averaged per-batch means with equal weight, so a ragged last batch skewed eval/avg_acc and simulated_acc_len with eval batching. The scalar path now token-weights (sum of correct over sum of tokens — the ttt_length=1 case of the sum/count rule); new invariance gate, DP-gate expectation updated to the weighted semantics.
- CheckpointManager wrote absolute-target symlinks, so resume_from=<dir>/best broke (or silently followed the old absolute path) after relocating output_dir; targets are now relative.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
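
The token-weighting fix in the first follow-up can be shown with plain arithmetic. The helpers below are hypothetical and only contrast the two averaging rules named in the commit (mean of per-batch means versus sum of correct over sum of tokens).

```python
from typing import List, Tuple

# (correct tokens, total tokens) for one eval batch
Batch = Tuple[int, int]


def mean_of_batch_means(batches: List[Batch]) -> float:
    """Old behavior as described: every batch counts equally."""
    return sum(c / t for c, t in batches) / len(batches)


def token_weighted(batches: List[Batch]) -> float:
    """New behavior as described: sum of correct over sum of tokens."""
    return sum(c for c, _ in batches) / sum(t for _, t in batches)


# The same 102 tokens (91 correct) batched two ways; the first has a ragged last batch.
ragged = [(90, 100), (1, 2)]
even = [(45, 50), (46, 52)]

old = (mean_of_batch_means(ragged), mean_of_batch_means(even))
new = (token_weighted(ragged), token_weighted(even))
```

With the ragged batching the equal-weight mean drops to 0.7 because the 2-token batch counts as much as the 100-token one, while the token-weighted value is 91/102 under both batchings.
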
DeepseekV3ForCausalLMEagle3: MLA attention (compressed KV via kv_a/kv_b LoRA path, split nope/rope head dims, interleaved-pair RoPE from the shared neox caches, YaRN-aware softmax scale) on the UNCHANGED eagle3 surface — same Eagle3DraftModel interface, same TTT suffix-cache (cache_hidden) convention as LlamaAttention, same fc/norm/lm_head/t2d. The eagle3 strategy, runtime, and fixtures are untouched: the two-axis claim (algorithm vs draft architecture) holds by construction.
MLA attention math adapted from TorchSpec (MIT, LightSeek Foundation) with the cache/backend contract rewritten to SpecForge's seam; sdpa backend only for now (flex/fa/usp need MLA-shaped kernels — asymmetric q/k vs v head dims; documented raise).
Gates (tests/test_runtime/test_mla_draft.py, GPU): suffix-cache ≡ causal at step 0; Auto* mapping resolves deepseek_v3; 3-step train smoke through the unchanged Eagle3TrainStrategy over the shared fixtures with E1-compatible per-position metrics.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
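
For readers unfamiliar with the "YaRN-aware softmax scale" and the split nope/rope head dims mentioned above, a rough sketch follows. It assumes the convention used in public DeepSeek-V3 modeling code (scale by the squared YaRN mscale when mscale_all_dim is set); the PR's actual implementation is not shown in this conversation, the function names are hypothetical, and the dimension values are illustrative.

```python
import math


def yarn_mscale(factor: float, mscale: float = 1.0) -> float:
    """YaRN attention-magnitude correction; 1.0 when no context extension."""
    if factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(factor) + 1.0


def mla_softmax_scale(
    qk_nope_head_dim: int,
    qk_rope_head_dim: int,
    rope_factor: float = 1.0,
    mscale_all_dim: float = 0.0,
) -> float:
    # Query/key heads concatenate a non-rotary part and a rotary part,
    # so the base scale uses their combined width, not the value head dim.
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    scale = qk_head_dim ** -0.5
    if mscale_all_dim:
        # Assumed convention: multiply by mscale squared under YaRN scaling.
        m = yarn_mscale(rope_factor, mscale_all_dim)
        scale *= m * m
    return scale


# Illustrative dimensions only: 128 non-rotary + 64 rotary per q/k head.
plain = mla_softmax_scale(128, 64)
scaled = mla_softmax_scale(128, 64, rope_factor=40.0, mscale_all_dim=1.0)
```

Because the q/k width (nope + rope) differs from the value head width, kernels that assume equal head dims do not apply directly, which matches the commit's note about needing MLA-shaped kernels for the other backends.
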
maocheng23 added a commit that referenced this pull request Jul 2, 2026
…ft landed early (PR #640)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jul 2, 2026
…29-training-managers
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…0-layout
# Conflicts:
#	specforge/runtime/training/trainer.py
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
maocheng23 added a commit that referenced this pull request Jul 2, 2026
Now that the MLA draft (#640) sits below this PR in the stack, the exporter recognizes DeepseekV3ForCausalLMEagle3 (identity map; per-key renames belong to the plan 10.4 sglang-load gate, with docs/export_weight_map_mla.md as the spec).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1e3e516 to 3ae106c
MLA (DeepSeek) Eagle3 draft architecture — plan G4
DeepseekV3ForCausalLMEagle3: MLA attention (compressed-KV LoRA path, split nope/rope head dims, DeepSeek interleaved RoPE derived from the shared neox caches, YaRN-aware softmax scale) on the unchanged eagle3 surface — same Eagle3DraftModel interface, same TTT suffix-cache convention as LlamaAttention, same fc/norm/lm_head/t2d. The strategy/runtime/fixtures are untouched: the plan's two-axis claim (algorithm × draft architecture) holds by construction. Stacked on #638.

MLA math adapted from TorchSpec (MIT, LightSeek Foundation) with the cache/backend contract rewritten to SpecForge's seam; sdpa backend for now (flex/fa/usp need MLA-shaped kernels — documented raise). Registered in AutoEagle3DraftModel/AutoDraftModelConfig (deepseek_v3).

Gates (test_mla_draft.py, GPU): TTT suffix-cache ≡ causal at step 0 (fp32, atol 1e-5); auto-mapping resolves; 3-step train smoke through the unchanged Eagle3TrainStrategy with E1-compatible per-position metrics. Validated: full suite 242 OK (2 skip, 1 xfail) on H200 = E0 baseline 240 + 2. An adversarial review pass over the diff returned zero findings.

🤖 Generated with Claude Code