[Modeling] MLA (DeepSeek) Eagle3 draft architecture by maocheng23 · Pull Request #640 · sgl-project/SpecForge

maocheng23 · 2026-07-02T02:41:05Z

MLA (DeepSeek) Eagle3 draft architecture — plan G4

DeepseekV3ForCausalLMEagle3: MLA attention (compressed-KV LoRA path, split nope/rope head dims, DeepSeek interleaved RoPE derived from the shared neox caches, YaRN-aware softmax scale) on the unchanged eagle3 surface — same Eagle3DraftModel interface, same TTT suffix-cache convention as LlamaAttention, same fc/norm/lm_head/t2d. The strategy/runtime/fixtures are untouched: the plan's two-axis claim (algorithm × draft architecture) holds by construction. Stacked on #638.

MLA math adapted from TorchSpec (MIT, LightSeek Foundation) with the cache/backend contract rewritten to SpecForge's seam; sdpa backend for now (flex/fa/usp need MLA-shaped kernels — documented raise). Registered in AutoEagle3DraftModel/AutoDraftModelConfig (deepseek_v3).

Gates (test_mla_draft.py, GPU): TTT suffix-cache ≡ causal at step 0 (fp32, atol 1e-5); auto-mapping resolves; 3-step train smoke through the unchanged Eagle3TrainStrategy with E1-compatible per-position metrics. Validated: full suite 242 OK (2 skip, 1 xfail) on H200 = E0 baseline 240 + 2. An adversarial review pass over the diff returned zero findings.

🤖 Generated with Claude Code
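For readers unfamiliar with the "YaRN-aware softmax scale" named in the description, the following is a minimal sketch of the formula as it appears in publicly released DeepSeek reference modeling code. It is not taken from this PR's diff. The function names and the `rope_scaling` dictionary keys (`factor`, `mscale_all_dim`) are assumptions for illustration, and the draft model's actual implementation may differ.

```python
import math


def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
    """YaRN magnitude correction; returns 1.0 when the context is not extended."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


def mla_softmax_scale(qk_nope_head_dim: int, qk_rope_head_dim: int, rope_scaling=None) -> float:
    """Softmax scale for MLA queries whose head dim is split into nope + rope parts.

    Assumption: rope_scaling is a dict with "factor" and optional "mscale_all_dim".
    """
    q_head_dim = qk_nope_head_dim + qk_rope_head_dim
    softmax_scale = q_head_dim ** -0.5
    if rope_scaling is not None:
        mscale_all_dim = rope_scaling.get("mscale_all_dim", 0)
        if mscale_all_dim:
            m = yarn_get_mscale(rope_scaling["factor"], mscale_all_dim)
            # The correction is applied to both q and k, hence the square.
            softmax_scale *= m * m
    return softmax_scale


base = mla_softmax_scale(128, 64)
scaled = mla_softmax_scale(128, 64, {"factor": 40.0, "mscale_all_dim": 1.0})
print(base, scaled)
```

Without `rope_scaling` the scale reduces to the usual `1/sqrt(q_head_dim)` over the combined nope + rope dimension.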

… checkpoint/eval)

Brings the training loop to production parity (plan.md §D / roadmap domain-refactor.md §D):

- Grad accumulation with no_sync(). TrainerCore now decides the optimizer boundary BEFORE backward and passes is_boundary to FSDPTrainingBackend.backward, which wraps the non-boundary micro-steps in self.module.no_sync(), so the FSDP gradient reduction fires once per optimizer step, not once per micro-step. Single-rank / accumulation_steps=1 is unchanged.
- Full resume. FSDPTrainingBackend.state_dict() now returns the full training state {model (FSDP FULL_STATE_DICT), optimizer (BF16Optimizer bundles the LR scheduler), rng (cpu+cuda)}; load_state_dict restores all three. save_checkpoint persists the export draft weights + optimizer + rng; the domain Trainer gains resume_from (restore draft weights before the FSDP wrap so the fp32 master is rebuilt from them, then optimizer+scheduler+rng, then start_step/epoch).
- CheckpointManager (specforge/training/checkpoint.py): {run_id}-step{step} layout, keep-last-N rotation that never drops the tracked best, latest/best symlinks + best_meta.json. Born in its S-home; TrainerController imports it lazily so the runtime seam stays leaf.
- Evaluator (specforge/eval/evaluator.py): aggregates per-position accept counts across the whole eval pass BEFORE the geometric sum, so simulated_acc_len is batch-size invariant. TrainerController.evaluate delegates to it; DFlash/Domino scalar accuracy degenerates gracefully.

Gates: test_checkpoint_resume extended (loss-curve continuity across save->resume, tol for the bf16-master reconstruction), test_no_sync_equiv (2-rank: no_sync path == per-step reduction, bit-tight), test_evaluator_aggregation (aggregate-before-geometric-sum, batch-size invariance, CheckpointManager rotation/best/latest). Backend doubles + the resume test updated for the composite state_dict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
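The gradient-accumulation change in this commit can be illustrated with a small simulation. This is a sketch only: `FakeShardedModule`, `is_boundary`, `backward`, and `run` are hypothetical stand-ins, not the SpecForge `TrainerCore` or `FSDPTrainingBackend` code. The fake module merely counts how often a gradient reduction would fire.

```python
from contextlib import contextmanager, nullcontext


class FakeShardedModule:
    """Hypothetical stand-in for an FSDP-wrapped module; it only counts events."""

    def __init__(self):
        self.reductions = 0  # backward passes that would trigger a gradient reduction
        self.deferrals = 0   # backward passes run under no_sync()
        self._sync = True

    @contextmanager
    def no_sync(self):
        self._sync = False
        self.deferrals += 1
        try:
            yield
        finally:
            self._sync = True

    def backward(self):
        if self._sync:
            self.reductions += 1


def is_boundary(micro_step: int, accumulation_steps: int) -> bool:
    """True on the micro-step that completes an optimizer step."""
    return (micro_step + 1) % accumulation_steps == 0


def backward(module, boundary: bool):
    # Non-boundary micro-steps accumulate locally; only the boundary step syncs.
    ctx = nullcontext() if boundary else module.no_sync()
    with ctx:
        module.backward()


def run(num_micro_steps: int, accumulation_steps: int):
    module = FakeShardedModule()
    optimizer_steps = 0
    for micro_step in range(num_micro_steps):
        boundary = is_boundary(micro_step, accumulation_steps)  # decided before backward
        backward(module, boundary)
        if boundary:
            optimizer_steps += 1
    return module, optimizer_steps


module, steps = run(num_micro_steps=8, accumulation_steps=4)
print(steps, module.reductions, module.deferrals)
```

With 8 micro-steps and `accumulation_steps=4`, the simulation performs 2 optimizer steps, 2 reductions, and 6 deferrals, which is `accumulation_steps - 1` deferrals per optimizer step.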

…nly (move-only)

Collapse execution code into one home per concern (plan.md §2.3, roadmap E0):

- runtime/training/{trainer,backend,strategy,registry} -> training/{controller,backend,strategies/{base,registry}}
- runtime/inference/{rollout_worker,capture} -> inference/; {sglang,dflash}_adapter -> inference/adapters/{eagle3,dflash}
- modeling/target/{base,factory,eagle3,dflash}_target_model + sglang_backend/ -> inference/target_engine/
- runtime/launch.py -> specforge/launch.py
- modeling/target keeps model defs only (target_head, target_utils, custom_backend)

Zero functional change: pure moves + import-path rewrites; every old module path keeps a re-export shim for one release. Suite + Phase-B byte gate must pass unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…29-training-managers

…fe Evaluator, durable best tracking

Post-review fixes for #637 (adversarial review vs the #630 roadmap):

Multi-rank resume correctness:
- CheckpointManager.save now writes each rank's optimizer/RNG to its own training_state_rank{r}.pt beside the rank0 shared payload (draft weights + counters + world_size); under FSDP use_orig_params the AdamW moments live on rank-local shard views, so persisting only rank0's copy and restoring it everywhere corrupted the other ranks' moments and collapsed their RNG streams. read_resume_state hands each rank back its own state and fails fast on a world-size mismatch (and on legacy single-file checkpoints at world_size>1).

Resume repositions the data stream (plan.md G1 seek()-equivalent):
- TrainerController tracks epoch_batch, persists it, and skips the consumed prefix of the interrupted epoch on resume (via the new FeatureDataLoader.seek, with no feature materialization, or islice for plain iterables). The domain Trainer threads it for refs-mode runs; the online queue path keeps control-plane skip_ids reconciliation.
- test_resume_loss_curve_continuity now trains on DISTINCT batches, so resuming on the wrong data cannot pass (the old fixed-batch form was structurally blind to the data position).

Evaluator DP correctness:
- The collective schedule is decided globally (SUM of scalar sums, MAX of the per-position length, then ONE stacked count reduce) so a rank with an empty or scalar-only shard issues the same collectives as its peers and NCCL cannot desync. Scalar accuracy (DFlash/Domino) is now reduced across ranks like everything else. Collectives use the local device via specforge.utils.get_local_device (CPU for gloo). Accumulation stays on-device (one host sync after the loop), and eval/per_position_acc is reported alongside the folded acc-len.
- New 2-process gloo gate: cross-rank scalar reduction + ragged-shard schedule symmetry (test_evaluator_aggregation).

Durable, decoupled best tracking:
- CheckpointManager rehydrates best_score/best_step from best_meta.json (now also carrying "score") on construction, so a restarted process neither rotates away the on-disk best nor lets a worse score overwrite it; update_best is split into score()/is_better().
- fit() tracks the best on EVERY eval (when checkpointing is enabled), persisting a checkpoint on demand when the best eval lands off the save cadence; previously best only fired when eval_interval and save_interval coincided on the same step.

Surface + cleanup:
- launch: every builder now forwards resume_from/max_checkpoints, so resume and rotation are reachable from the build_* entry points.
- TrainerController: drop the max_checkpoints/checkpoint_manager dual config (inject a configured manager; the domain Trainer does); remove the dead eval_step + mode plumbing (evaluate goes through Evaluator on raw forward_loss; validate_batch still runs inside every strategy's forward_loss); drop the did_eval/last_eval_metrics threading.
- Trainer._load_resume_state removed: CheckpointManager.read_resume_state is the single checkpoint reader; the loaded dict is dropped after the weight copy instead of living through the FSDP wrap.
- CheckpointManager._rotate uses shutil.rmtree; symlink creation is guarded for filesystems without symlink support (dirs + best_meta.json stay the source of truth); save_checkpoint filters draft weights on rank0 only.
- test_no_sync_equiv now also counts no_sync deferrals (exactly accumulation_steps-1 per optimizer step), which is the roadmap's "one all-reduce per optimizer step" gate, not just weight equality.
- DESIGN.md updated (per-rank layout, seek, best semantics; fixed the inverted no_sync sentence).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
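The "keep-last-N rotation that never drops the tracked best" rule from these commits can be sketched as a pure function. The helper name `checkpoints_to_delete` and its signature are hypothetical; the real `CheckpointManager._rotate` works on directories and is not reproduced here.

```python
def checkpoints_to_delete(steps, keep_last, best_step=None):
    """Return the checkpoint steps that rotation may remove.

    Keeps the newest `keep_last` steps plus the tracked best step (if any).
    Hypothetical helper for illustration only.
    """
    ordered = sorted(steps)
    # Guard keep_last == 0: ordered[-0:] would select the whole list.
    keep = set(ordered[-keep_last:]) if keep_last > 0 else set()
    if best_step is not None:
        keep.add(best_step)
    return [s for s in ordered if s not in keep]


# Five checkpoints on disk, keep the last two, best eval was at step 200.
print(checkpoints_to_delete([100, 200, 300, 400, 500], keep_last=2, best_step=200))
```

In this example steps 100 and 300 are eligible for deletion, while step 200 survives because it is the tracked best even though it is older than the last two.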

…29-training-managers # Conflicts: # specforge/runtime/launch.py

…e position, bound comm device

Self-review of the fix commit surfaced four defects, all fixed:

- fit()'s on-demand best-save gated collective-bearing save_checkpoint on a PER-RANK is_better() whose best_score comes from rank-local filesystem reads (best_meta.json rehydration); a divergent view (NFS attribute-cache lag, node-local dirs) would deadlock the group. rank0's verdict is now broadcast (_rank0_decision) and update_best gains force=True so every rank follows it.
- The mid-epoch position was persisted in BATCH units only, so resuming with a different batch_size silently mis-seeked the stream. The position is now also tracked/persisted in SAMPLES (epoch_samples); the domain Trainer converts samples back to this run's batches and fails fast when the sample position does not divide by the resumed batch size, symmetric with the world-size guard.
- Evaluator._comm_device used get_local_device() (LOCAL_RANK env, default 0): a non-torchrun launcher that doesn't export LOCAL_RANK would put every rank's reduction on cuda:0 and break the NCCL communicator. Now uses the rank's BOUND device (torch.cuda.current_device(), set by init_distributed on every path).
- The docstring/DESIGN claim that a ragged shard "cannot desynchronize NCCL" over-promised: it holds for the evaluator's OWN reductions (shard content), but a collective forward (FSDP all-gathers per batch) still requires the same eval-batch COUNT per rank. The docs now state the requirement precisely.

Also removed the stale TrainerCore.eval_step rows from the DESIGN.md/ARCHITECTURE.md endpoint tables (the method was removed in the previous commit).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
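The sample-based resume position described in this commit can be sketched as follows. `batches_to_skip` and `skip_consumed` are hypothetical helper names, and the sketch assumes `epoch_samples` and `batch_size` are expressed in the same (per-rank) units; the actual Trainer logic is not shown on this page.

```python
from itertools import islice


def batches_to_skip(epoch_samples: int, batch_size: int) -> int:
    """Convert a persisted sample position into this run's batch count.

    Fails fast when the position does not align with the resumed batch size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if epoch_samples % batch_size != 0:
        raise ValueError(
            f"cannot resume: {epoch_samples} consumed samples is not a multiple "
            f"of batch_size={batch_size}"
        )
    return epoch_samples // batch_size


def skip_consumed(batches, epoch_samples: int, batch_size: int):
    """Skip the already-consumed prefix of a plain iterable of batches."""
    return islice(batches, batches_to_skip(epoch_samples, batch_size), None)


# Checkpoint written after 96 samples; the resumed run uses batch_size=8.
remaining = list(skip_consumed(iter(range(15)), epoch_samples=96, batch_size=8))
print(remaining)
```

Here 96 samples map to 12 batches, so the resumed run continues from batch index 12. Resuming with `batch_size=10` would raise instead of silently seeking to the wrong position.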

A shape-[1] loss tensor (e.g. a parameter-shaped scalar) failed to broadcast into the 0-dim float64 accumulator slot; .mean() normalizes exactly like the trainer's _scalar helper. Caught by the new misaligned-intervals best-tracking gate on the pod. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…0-layout

Brings the C/D review-fix commits into the consolidated layout. The old-path shim files stay shims; the fixes are ported verbatim onto the moved homes (training/controller.py, training/backend.py, launch.py; verified byte-identical to the up-29 sources modulo E0's import-path rewrites). Same-path files (domain trainer, checkpoint, evaluator, data plane, control plane, tests) merged directly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…t/latest pointers

Two follow-ups from the E0/E1 adversarial review that belong to this PR's code:

- Scalar-accuracy strategies (DFlash/Domino) were not batch-size invariant: the Evaluator averaged per-batch means with equal weight, so a ragged last batch skewed eval/avg_acc and simulated_acc_len with eval batching. The scalar path now token-weights (sum of correct over sum of tokens, the ttt_length=1 case of the sum/count rule); new invariance gate, DP-gate expectation updated to the weighted semantics.
- CheckpointManager wrote absolute-target symlinks, so resume_from=<dir>/best broke (or silently followed the old absolute path) after relocating output_dir; targets are now relative.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
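The batch-size invariance issue fixed in this commit is easy to reproduce on toy numbers. The sketch below is illustrative only: the function names and the `(correct, tokens)` tuple representation are assumptions, not the Evaluator's actual data structures.

```python
def make_batches(samples, batch_size):
    """Group per-sample (correct, tokens) pairs into per-batch totals."""
    batches = []
    for i in range(0, len(samples), batch_size):
        chunk = samples[i:i + batch_size]
        batches.append((sum(c for c, _ in chunk), sum(t for _, t in chunk)))
    return batches


def mean_of_batch_means(batches):
    """Old behaviour as described: every batch gets equal weight."""
    return sum(c / t for c, t in batches) / len(batches)


def token_weighted_accuracy(batches):
    """New behaviour as described: sum of correct over sum of tokens."""
    return sum(c for c, _ in batches) / sum(t for _, t in batches)


# Five samples of 10 tokens each; batch_size=2 leaves a ragged last batch.
samples = [(8, 10), (5, 10), (9, 10), (2, 10), (10, 10)]
for bs in (2, 5):
    b = make_batches(samples, bs)
    print(bs, mean_of_batch_means(b), token_weighted_accuracy(b))
```

The token-weighted value is the same for both batch sizes, while the mean of batch means shifts once the last batch is smaller than the others.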

…0-layout

DeepseekV3ForCausalLMEagle3: MLA attention (compressed KV via kv_a/kv_b LoRA path, split nope/rope head dims, interleaved-pair RoPE from the shared neox caches, YaRN-aware softmax scale) on the UNCHANGED eagle3 surface: same Eagle3DraftModel interface, same TTT suffix-cache (cache_hidden) convention as LlamaAttention, same fc/norm/lm_head/t2d. The eagle3 strategy, runtime, and fixtures are untouched: the two-axis claim (algorithm vs draft architecture) holds by construction.

MLA attention math adapted from TorchSpec (MIT, LightSeek Foundation) with the cache/backend contract rewritten to SpecForge's seam; sdpa backend only for now (flex/fa/usp need MLA-shaped kernels because of asymmetric q/k vs v head dims; documented raise).

Gates (tests/test_runtime/test_mla_draft.py, GPU): suffix-cache ≡ causal at step 0; Auto* mapping resolves deepseek_v3; 3-step train smoke through the unchanged Eagle3TrainStrategy over the shared fixtures with E1-compatible per-position metrics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
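One common way to obtain interleaved-pair RoPE from neox-style cos/sin caches, used in publicly available DeepSeek reference code, is to de-interleave the vector and then apply the usual rotate-half formula. The sketch below checks that equivalence in plain Python on a single head vector. All function names are hypothetical, and whether this PR uses exactly this permutation is an assumption.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """One angle per rotated pair: theta_i = position * base^(-2i/dim)."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def neox_cache(angles):
    """Neox-style cos/sin rows: the half-length table repeated twice."""
    return [math.cos(a) for a in angles] * 2, [math.sin(a) for a in angles] * 2


def rotate_half(x):
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_neox(x, cos, sin):
    """Standard neox RoPE: pairs are (x[i], x[i + dim/2])."""
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotate_half(x), cos, sin)]


def deinterleave(x):
    """[x0, x1, x2, x3, ...] -> [x0, x2, ..., x1, x3, ...]."""
    return x[0::2] + x[1::2]


def apply_interleaved_direct(x, angles):
    """Reference: rotate adjacent pairs (x[2i], x[2i+1]) by theta_i."""
    out = []
    for i, a in enumerate(angles):
        even, odd = x[2 * i], x[2 * i + 1]
        out += [even * math.cos(a) - odd * math.sin(a),
                even * math.sin(a) + odd * math.cos(a)]
    return out


def apply_interleaved_via_neox(x, cos, sin):
    """Same rotation, computed from the shared neox caches after a permutation."""
    return apply_neox(deinterleave(x), cos, sin)


x = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8]
angles = rope_angles(position=5, dim=len(x))
cos, sin = neox_cache(angles)
via_neox = apply_interleaved_via_neox(x, cos, sin)
direct = deinterleave(apply_interleaved_direct(x, angles))
print(max(abs(a - b) for a, b in zip(via_neox, direct)))
```

The result of the neox route stays in de-interleaved order. Attention scores are unaffected as long as queries and keys go through the same permutation, since a dot product is invariant under a shared reordering of both vectors.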

gemini-code-assist · 2026-07-02T02:41:08Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

…ft landed early (PR #640) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…29-training-managers

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…0-layout # Conflicts: # specforge/runtime/training/trainer.py

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Now that the MLA draft (#640) sits below this PR in the stack, the exporter recognizes DeepseekV3ForCausalLMEagle3 (identity map; per-key renames belong to the plan 10.4 sglang-load gate, with docs/export_weight_map_mla.md as the spec). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maocheng23 and others added 11 commits July 1, 2026 02:23

Merge branch 'dataflow-up-28-colocated-lightweight' into dataflow-up-…

b8ffa2a

…29-training-managers

Merge branch 'dataflow-up-28-colocated-lightweight' into dataflow-up-…

04bb60a

…29-training-managers # Conflicts: # specforge/runtime/launch.py

Merge branch 'dataflow-up-29-training-managers' into dataflow-up-30-e…

0e50c9a

…0-layout

maocheng23 added a commit that referenced this pull request Jul 2, 2026

docs(roadmap): O1.3 in review (reforward transport, PR #641); MLA dra…

19cfbd0

…ft landed early (PR #640) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

This was referenced Jul 2, 2026

[SpecForge] Phase E — draft-architecture registry (@register_draft) #642

Open

[SpecForge] Phase E — exporters: DataFlow checkpoint → HF / sglang draft directories #645

Open

maocheng23 and others added 6 commits July 2, 2026 00:52

Merge branch 'dataflow-up-28-colocated-lightweight' into dataflow-up-…

33a6bf1

…29-training-managers

style: apply pre-commit (black/isort/autoflake)

0d6a456

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'dataflow-up-29-training-managers' into dataflow-up-30-e…

ad2ee7e

…0-layout # Conflicts: # specforge/runtime/training/trainer.py

style: apply pre-commit (black/isort/autoflake)

80b906b

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'dataflow-up-30-e0-layout' into dataflow-mla-draft

8b35b4d

style: apply pre-commit (black/isort/autoflake)

fbe8966

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maocheng23 force-pushed the dataflow-up-30-e0-layout branch 2 times, most recently from 1e3e516 to 3ae106c Compare July 4, 2026 07:45


[Modeling] MLA (DeepSeek) Eagle3 draft architecture #640
maocheng23 wants to merge 17 commits into dataflow-up-30-e0-layout from dataflow-mla-draft
