docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) by maocheng23 · Pull Request #630 · sgl-project/SpecForge

maocheng23 · 2026-06-30T20:39:00Z

Docs only — no code change. Rewrites plan.md to reconcile the original "SpecForge Redesign
Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine, and
adds a consolidated docs/roadmap/ that folds in the online-disaggregation roadmap (#618).

Why

There were two architecture efforts for the same goal. They aren't either/or — they're
complementary, given a real multi-node / >100 GB/s / isolated-pool requirement that overturns
the original draft's "no Mooncake / HTTP is sufficient" bet.

What the reconciled plan says

- Canonical substrate = the runtime control + data plane: SampleRef (metadata) + FeatureStore (tensors; Local/SharedDir/Mooncake) + FeatureDataLoader → TrainBatch.
- No separate HiddenStateStream source of truth: FeatureDataLoader over SampleRef+FeatureStore already is the stream; online/offline/disaggregated variation lives in (ref source + FeatureStore) and is shielded from training.
- training / inference become plan.md-style domain packages on top: keep the runtime TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (CheckpointManager/Evaluator/no_sync()/full resume); converge SGLangAdapter → TargetEngine + backends (de-EAGLE3).
- Colocated lightweight path = control plane opt-in/no-op (one canonical path, not a fork), guarded by a colocated≡disaggregated numerical-equivalence gate.
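
To make the data path above concrete, here is a minimal Python sketch of how a metadata-only SampleRef, a FeatureStore, and a FeatureDataLoader could compose into a TrainBatch. The class names come from the plan; every field, method name, and the in-memory backend are assumptions for illustration rather than the SpecForge API, and plain lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol


@dataclass(frozen=True)
class SampleRef:
    """Metadata only: identifies a sample, carries no tensors (fields are hypothetical)."""
    sample_id: str
    num_tokens: int


class FeatureStore(Protocol):
    """Tensor side of the substrate; Local/SharedDir/Mooncake would each implement this."""
    def get(self, sample_id: str) -> List[float]: ...


class InMemoryFeatureStore:
    """Stand-in backend used only for this sketch."""
    def __init__(self) -> None:
        self._data: Dict[str, List[float]] = {}

    def put(self, sample_id: str, features: List[float]) -> None:
        self._data[sample_id] = features

    def get(self, sample_id: str) -> List[float]:
        return self._data[sample_id]


@dataclass
class TrainBatch:
    sample_ids: List[str]
    features: List[List[float]]


class FeatureDataLoader:
    """Joins refs with stored features; training code only ever sees TrainBatch."""
    def __init__(self, refs: Iterable[SampleRef], store: FeatureStore, batch_size: int) -> None:
        self._refs = list(refs)
        self._store = store
        self._batch_size = batch_size

    def __iter__(self) -> Iterator[TrainBatch]:
        for start in range(0, len(self._refs), self._batch_size):
            chunk = self._refs[start:start + self._batch_size]
            yield TrainBatch(
                sample_ids=[r.sample_id for r in chunk],
                features=[self._store.get(r.sample_id) for r in chunk],
            )


# Populate the store and the ref list, then iterate as a trainer would.
store = InMemoryFeatureStore()
refs = []
for i in range(5):
    store.put(f"s{i}", [float(i)] * 3)
    refs.append(SampleRef(sample_id=f"s{i}", num_tokens=3))

batches = list(FeatureDataLoader(refs, store, batch_size=2))
```

In this sketch, swapping the topology means swapping where `refs` come from and which store backs `get`; the loader and the batch type stay the same, which is the shielding property the plan describes.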

Scope decisions (consolidating with #618)

- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming hidden states (W2/W3), not a serve-and-push workload. The predecessor's W4 weight lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream) is out of scope; draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. It is a candidate for the O2 scale-out orchestrator (multi-node N-producer/M-trainer); the decision gate lives in the online roadmap. Until then we keep the home-grown metadata-only control plane.

Consolidated roadmap (`docs/roadmap/`)

Per-phase Goal / Target state / Implementation (files+symbols) / Tests / Done-when across three
tracks, with a README index (standing decisions, phase-status-at-a-glance, cross-track deps):

- domain-refactor.md: A (done) → B (TargetEngine + domain Trainer) → C (colocated) → D (managers) → E (drafts registry / config / CLI / export).
- online-disaggregation.md folds in #618 ([DataFlow runtime] Online disaggregated training roadmap + PR plan (train-with-decode)): O1.1/O1.2 (in review) → O1.3 live frozen-target capture (next) → O2 scale-out (Ray = open) → O3 hardening.
- eval-and-breadth.md: E1 acceptance-length eval harness → E2 algorithm breadth (new algo = a StrategySpec + loss).

Relation to the in-flight work

The composable-launch stack (#627 / #628 / #629 — StrategySpec registry + parameterized
launch.py; eagle3/dflash/domino end-to-end) is Phase A. The online track (folding #618)
proceeds in parallel; #618 is superseded by docs/roadmap/online-disaggregation.md.

Files

- plan.md: rewritten (reconciled); frozen-target + Ray-open applied.
- docs/roadmap/: README + three track docs (new).
- docs/redesign-draft-legacy.md: the original redesign draft, preserved verbatim (with a "superseded by plan.md" banner).

🤖 Generated with Claude Code

Rewrite plan.md to reconcile the original "SpecForge Redesign Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine. Docs only, no code change.

- Canonical substrate = runtime control + data plane (SampleRef + FeatureStore incl. Mooncake + FeatureDataLoader). The isolated-pool / >100GB/s requirement overturns the original "no Mooncake" bet.
- No separate HiddenStateStream source of truth: FeatureDataLoader over SampleRef+FeatureStore IS the stream; topology variation lives in (ref source + FeatureStore), shielded from training.
- training/inference become plan.md-style domain packages on top: keep the runtime TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (checkpoint/eval/no_sync/resume); converge SGLangAdapter -> TargetEngine + backends.
- colocated lightweight path = control plane opt-in/no-op, guarded by a colocated == disaggregated numerical-equivalence gate.
- original redesign draft preserved verbatim in docs/redesign-draft-legacy.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add docs/roadmap/: per-phase Goal/Target/Implementation/Tests/Done-when across three tracks (domain-refactor, online-disaggregation, eval-and-breadth) plus a README index. The online track folds in the former online-disaggregation roadmap (#618) so there is one roadmap home.

Apply two scope decisions to plan.md and the roadmap:

- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming hidden states (W2/W3), not a serve-and-push workload. Drop the W4 weight lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream); draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator; decision gate in the online roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… W3′ naming

Review fixes (verified against the files):

- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/; fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′); disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted: it is already an explicit 🔴 gate): added the valid narrow point instead. The spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:

- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it; clarify that the per-step strategy seam stays and the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-only) + E

Pin the final code structure to one implementation home per concern, so reviewers don't read the current scattered layout as the end state:

- plan.md §2.3 rewritten to S: runtime/ = substrate only (control+data plane + contracts); top-level training/ and inference/ are the single execution homes; modeling/ = model definitions only; launch.py lifts to top level; no facade package.
- roadmap: new standing decision (one home per concern; runtime substrate-only; new code born in its final home). Phase E split into E0 (move-only layout consolidation, gated by the unchanged suite + Phase-B byte-identical gate) + E (composition, re-pointed to the S layout, incl. the (algorithm x backend) target-engine collapse). Dependency arrows + phase table updated (D -> E0 -> E).
- reconcile stale paths so the plan is internally consistent: models/drafts -> modeling/draft and seam-home wording across §2.2/§3/§6/§7/§10 and eval-and-breadth.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- C: scope note. The no-op axis is (metadata store + durable ack); leasing/queue bookkeeping stays shared (in-process, no I/O) and backpressure was already opt-in; deployment_mode is selectable on the offline builder while the colocated online builder stays pinned (its rank-private queue is fed by commit dedup). Gate description matches the landed test (one builder, SQLite disagg leg, durable-marker assert).
- D: full resume is per-rank (optimizer/RNG are FSDP-shard-local) and includes the mid-epoch data position + offline-stream seek; loss continuity is tolerance-based (BF16Optimizer rebuilds its fp32 master from bf16 weights, matching legacy), with weights and data position exact; Done-when updated accordingly (+ same-world-size constraint, DP-reduced evaluator, restart-surviving best/latest).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
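
The "mid-epoch data position" part of full resume can be illustrated for the plain-iterable case. The sketch below is an assumption-laden toy, not the TrainerController code: `resume_epoch` is a hypothetical helper, and `epoch_batch` borrows the name of the persisted counter mentioned elsewhere in this PR. It uses distinct batches so that resuming at the wrong position would be visible.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def resume_epoch(batches: Iterable[List[int]], epoch_batch: int) -> Iterator[List[int]]:
    """Skip the first `epoch_batch` batches that were consumed before the interruption."""
    if epoch_batch < 0:
        raise ValueError("epoch_batch must be non-negative")
    return islice(batches, epoch_batch, None)


# Distinct batches: a resume at the wrong offset would yield different data.
epoch = [[0, 1], [2, 3], [4, 5], [6, 7]]

# Reference: one uninterrupted pass over the epoch.
full_run = [b for b in epoch]

# Interrupted after 2 batches, then resumed from the persisted position.
before_crash = list(islice(epoch, 2))
after_resume = list(resume_epoch(epoch, epoch_batch=2))
```

Note that `islice` still iterates over the skipped prefix, so for a loader that materializes features this costs real I/O; that is presumably why the PR pairs it with a seek on the feature loader that avoids materialization.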

…fe Evaluator, durable best tracking

Post-review fixes for #637 (adversarial review vs the #630 roadmap):

Multi-rank resume correctness:

- CheckpointManager.save now writes each rank's optimizer/RNG to its own training_state_rank{r}.pt beside the rank0 shared payload (draft weights + counters + world_size); under FSDP use_orig_params the AdamW moments live on rank-local shard views, so persisting only rank0's copy and restoring it everywhere corrupted the other ranks' moments and collapsed their RNG streams. read_resume_state hands each rank back its own state and fails fast on a world-size mismatch (and on legacy single-file checkpoints at world_size>1).

Resume repositions the data stream (plan.md G1 seek()-equivalent):

- TrainerController tracks epoch_batch, persists it, and skips the consumed prefix of the interrupted epoch on resume (via the new FeatureDataLoader.seek, which does no feature materialization, or islice for plain iterables). The domain Trainer threads it for refs-mode runs; the online queue path keeps control-plane skip_ids reconciliation.
- test_resume_loss_curve_continuity now trains on DISTINCT batches, so resuming on the wrong data cannot pass (the old fixed-batch form was structurally blind to the data position).

Evaluator DP correctness:

- The collective schedule is decided globally (SUM of scalar sums, MAX of the per-position length, then ONE stacked count reduce) so a rank with an empty or scalar-only shard issues the same collectives as its peers, avoiding NCCL desync. Scalar accuracy (DFlash/Domino) is now reduced across ranks like everything else. Collectives use the local device via specforge.utils.get_local_device (CPU for gloo). Accumulation stays on-device (one host sync after the loop), and eval/per_position_acc is reported alongside the folded acc-len.
- New 2-process gloo gate: cross-rank scalar reduction + ragged-shard schedule symmetry (test_evaluator_aggregation).

Durable, decoupled best tracking:

- CheckpointManager rehydrates best_score/best_step from best_meta.json (now also carrying "score") on construction, so a restarted process neither rotates away the on-disk best nor lets a worse score overwrite it; update_best is split into score()/is_better().
- fit() tracks the best on EVERY eval (when checkpointing is enabled), persisting a checkpoint on demand when the best eval lands off the save cadence; previously best only fired when eval_interval and save_interval coincided on the same step.

Surface + cleanup:

- launch: every builder now forwards resume_from/max_checkpoints, so resume and rotation are reachable from the build_* entry points.
- TrainerController: drop the max_checkpoints/checkpoint_manager dual config (inject a configured manager; the domain Trainer does); remove the dead eval_step + mode plumbing (evaluate goes through Evaluator on raw forward_loss; validate_batch still runs inside every strategy's forward_loss); drop the did_eval/last_eval_metrics threading.
- Trainer._load_resume_state removed: CheckpointManager.read_resume_state is the single checkpoint reader; the loaded dict is dropped after the weight copy instead of living through the FSDP wrap.
- CheckpointManager._rotate uses shutil.rmtree; symlink creation is guarded for filesystems without symlink support (dirs + best_meta.json stay the source of truth); save_checkpoint filters draft weights on rank0 only.
- test_no_sync_equiv now also counts no_sync deferrals (exactly accumulation_steps-1 per optimizer step), which is the roadmap's "one all-reduce per optimizer step" gate, not just weight equality.
- DESIGN.md updated (per-rank layout, seek, best semantics; fixed the inverted no_sync sentence).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
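
The per-rank checkpoint layout and the world-size fail-fast described in this commit can be sketched with the standard library alone. Everything here is a simplified assumption: `save_rank_state`, `read_rank_state`, and the `shared_meta.json` file are hypothetical stand-ins (the real shared payload also carries draft weights and counters), and `pickle` replaces `torch.save` so the sketch runs without PyTorch. Only the `training_state_rank{r}.pt` naming is taken from the commit message.

```python
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_rank_state(ckpt_dir: Path, rank: int, world_size: int, state: Dict[str, Any]) -> None:
    """Each rank writes its own optimizer/RNG state; rank 0 also writes shared metadata."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    with open(ckpt_dir / f"training_state_rank{rank}.pt", "wb") as f:
        pickle.dump(state, f)  # stand-in for torch.save in this sketch
    if rank == 0:
        # Hypothetical shared file recording the world size used at save time.
        (ckpt_dir / "shared_meta.json").write_text(json.dumps({"world_size": world_size}))


def read_rank_state(ckpt_dir: Path, rank: int, world_size: int) -> Dict[str, Any]:
    """Return this rank's own state, failing fast if the world size changed."""
    meta = json.loads((ckpt_dir / "shared_meta.json").read_text())
    if meta["world_size"] != world_size:
        raise RuntimeError(
            f"checkpoint was saved with world_size={meta['world_size']}, "
            f"but the current run has world_size={world_size}"
        )
    with open(ckpt_dir / f"training_state_rank{rank}.pt", "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "step_100"
    # Simulate two ranks saving distinct shard-local state.
    for rank in range(2):
        save_rank_state(ckpt, rank, world_size=2, state={"rng_seed": 100 + rank})

    # Rank 1 gets back its own state, not rank 0's.
    restored = read_rank_state(ckpt, rank=1, world_size=2)

    # Resuming with a different world size is rejected up front.
    try:
        read_rank_state(ckpt, rank=0, world_size=4)
        mismatch_raised = False
    except RuntimeError:
        mismatch_raised = True
```

The point of the layout is that shard-local state is never broadcast from rank 0: each rank reads back exactly what it wrote, and a topology change is an explicit error rather than silent corruption.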

E1's evaluator/best-tracking half landed with Phase D (#637); the EvalConfig/EvalCache half is the E1 PR, with cache wiring into the run surface deferred to Phase E's config+CLI (noted explicitly). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ft landed early (PR #640) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maocheng23 mentioned this pull request Jun 30, 2026

[DataFlow runtime] Online disaggregated training roadmap + PR plan (train-with-decode) #618

Closed

maocheng23 mentioned this pull request Jul 1, 2026

[DataFlow runtime] Phase B1 — TargetEngine ABC + de-EAGLE3 the target boundary #631

Merged

maocheng23 marked this pull request as ready for review July 1, 2026 08:22

This was referenced Jul 1, 2026

[DataFlow runtime] Phase C — colocated lightweight control plane #636

Merged

[DataFlow runtime] Phase D — training managers (no_sync, full resume, checkpoint/eval) #637

Open

This was referenced Jul 1, 2026

[DataFlow runtime] E0 — layout consolidation: runtime/ is substrate-only (move-only) #638

Open

[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates #639

Draft

docs(roadmap): O1.3 in review (reforward transport, PR #641); MLA dra…

19cfbd0

…ft landed early (PR #640) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jiapingW self-requested a review July 3, 2026 01:57

jiapingW approved these changes Jul 3, 2026

jiapingW merged commit 31f3eab into main Jul 3, 2026
5 checks passed

jiapingW deleted the docs/plan-reconciled branch July 3, 2026 01:58

jiapingW merged 7 commits into main from docs/plan-reconciled
