docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) #630

Merged
jiapingW merged 7 commits into main from docs/plan-reconciled
Jul 3, 2026

@maocheng23 maocheng23 commented Jun 30, 2026

Collaborator

Docs only — no code change. Rewrites plan.md to reconcile the original "SpecForge Redesign
Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine, and
adds a consolidated docs/roadmap/ that folds in the online-disaggregation roadmap (#618).

Why

There were two architecture efforts for the same goal. They aren't either/or — they're
complementary, given a real multi-node / >100 GB/s / isolated-pool requirement that overturns
the original draft's "no Mooncake / HTTP is sufficient" bet.

What the reconciled plan says

  • Canonical substrate = the runtime control + data plane. SampleRef (metadata) +
    FeatureStore (tensors, Local/SharedDir/Mooncake) + FeatureDataLoader → TrainBatch.
  • No separate HiddenStateStream source of truth: FeatureDataLoader over
    SampleRef+FeatureStore already is the stream; online/offline/disaggregated variation lives
    in (ref source + FeatureStore) and is shielded from training.
  • training / inference become plan.md-style domain packages on top: keep the runtime
    TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers
    (CheckpointManager/Evaluator/no_sync()/full resume); converge SGLangAdapter →
    TargetEngine + backends (de-EAGLE3).
  • Colocated lightweight path = control plane opt-in/no-op (one canonical path, not a fork),
    guarded by a colocated≡disaggregated numerical-equivalence gate.
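
To make the substrate concrete, here is a minimal sketch of how a metadata-only `SampleRef`, a `FeatureStore`, and a `FeatureDataLoader` could combine to yield `TrainBatch` objects. Only the four class names come from this PR. The field names (`sample_id`, `feature_key`), the `get`/`put` methods, and the `LocalFeatureStore` stand-in are assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol


@dataclass(frozen=True)
class SampleRef:
    """Metadata only: points at features held by a store (fields are hypothetical)."""
    sample_id: str
    feature_key: str


class FeatureStore(Protocol):
    """Feature storage; Local / SharedDir / Mooncake backends would each satisfy this."""
    def get(self, key: str) -> List[float]: ...


class LocalFeatureStore:
    """In-memory stand-in for a local backend (hypothetical implementation)."""
    def __init__(self) -> None:
        self._data: Dict[str, List[float]] = {}

    def put(self, key: str, value: List[float]) -> None:
        self._data[key] = value

    def get(self, key: str) -> List[float]:
        return self._data[key]


@dataclass
class TrainBatch:
    sample_ids: List[str]
    hidden_states: List[List[float]]


class FeatureDataLoader:
    """Resolves refs against a store, so training code only ever sees TrainBatch."""
    def __init__(self, refs: Iterable[SampleRef], store: FeatureStore, batch_size: int) -> None:
        self._refs = list(refs)
        self._store = store
        self._batch_size = batch_size

    def __iter__(self) -> Iterator[TrainBatch]:
        for start in range(0, len(self._refs), self._batch_size):
            chunk = self._refs[start:start + self._batch_size]
            yield TrainBatch(
                sample_ids=[r.sample_id for r in chunk],
                hidden_states=[self._store.get(r.feature_key) for r in chunk],
            )


# Populate a store and the matching refs, then iterate as a trainer would.
store = LocalFeatureStore()
refs = []
for i in range(5):
    store.put(f"feat/{i}", [float(i), float(i) + 0.5])
    refs.append(SampleRef(sample_id=f"s{i}", feature_key=f"feat/{i}"))

batches = list(FeatureDataLoader(refs, store, batch_size=2))
```

Under these assumptions, swapping the ref source or the store changes the topology (online, offline, disaggregated) without touching the loop that consumes `TrainBatch`.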

Scope decisions (consolidating with #618)

  • Frozen target — no weight sync. "Train-with-decode" = a frozen target streaming hidden
    states (W2/W3), not a serve-and-push workload. The predecessor's W4 weight lifecycle
    (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream) is out of
    scope; draft_weight_version is provenance only.
  • Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator
    (multi-node N-producer/M-trainer); decision gate lives in the online roadmap. Until then we keep
    the home-grown metadata-only control plane.

Consolidated roadmap (docs/roadmap/)

Per-phase Goal / Target state / Implementation (files+symbols) / Tests / Done-when across three
tracks, with a README index (standing decisions, phase-status-at-a-glance, cross-track deps):

  • domain-refactor.md — A (done) → B (TargetEngine + domain Trainer) → C (colocated) → D
    (managers) → E (drafts registry / config / CLI / export).
  • online-disaggregation.md: folds in "[DataFlow runtime] Online disaggregated training roadmap +
    PR plan (train-with-decode)" (#618). O1.1/O1.2 (in review) → O1.3 live frozen-target
    capture (next) → O2 scale-out (Ray = open) → O3 hardening.
  • eval-and-breadth.md — E1 acceptance-length eval harness → E2 algorithm breadth (new algo = a
    StrategySpec + loss).

Relation to the in-flight work

The composable-launch stack (#627 / #628 / #629: StrategySpec registry + parameterized
launch.py; eagle3/dflash/domino end-to-end) is Phase A. The online track (folding #618)
proceeds in parallel; #618 is superseded by docs/roadmap/online-disaggregation.md.

Files

  • plan.md — rewritten (reconciled); frozen-target + Ray-open applied.
  • docs/roadmap/ — README + three track docs (new).
  • docs/redesign-draft-legacy.md — the original redesign draft, preserved verbatim (with a
    "superseded by plan.md" banner).

🤖 Generated with Claude Code

Rewrite plan.md to reconcile the original "SpecForge Redesign Plan" (from-scratch,
torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine. Docs only — no
code change.

- Canonical substrate = runtime control + data plane (SampleRef + FeatureStore incl.
  Mooncake + FeatureDataLoader). The isolated-pool / >100GB/s requirement overturns the
  original "no Mooncake" bet.
- No separate HiddenStateStream source of truth — FeatureDataLoader over
  SampleRef+FeatureStore IS the stream; topology variation lives in (ref source +
  FeatureStore), shielded from training.
- training/inference become plan.md-style domain packages on top: keep the runtime
  TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (checkpoint/
  eval/no_sync/resume); converge SGLangAdapter -> TargetEngine + backends.
- colocated lightweight path = control plane opt-in/no-op, guarded by a
  colocated == disaggregated numerical-equivalence gate.
- original redesign draft preserved verbatim in docs/redesign-draft-legacy.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add docs/roadmap/ — per-phase Goal/Target/Implementation/Tests/Done-when
across three tracks (domain-refactor, online-disaggregation, eval-and-breadth)
plus a README index. The online track folds in the former online-disaggregation
roadmap (#618) so there is one roadmap home.

Apply two scope decisions to plan.md and the roadmap:
- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming
  hidden states (W2/W3), not a serve-and-push workload. Drop the W4 weight
  lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream);
  draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out
  orchestrator; decision gate in the online roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629)
  "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one
  consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A).
  Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers
  (specforge/eval/), not specforge/runtime/eval/ — fix the eval-and-breadth.md outlier to match
  plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports
  (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′) — disambiguate in §2.2,
  the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted — it is already an explicit 🔴 gate): added the valid
  narrow point instead — the spike scopes only the sglang_server slice of Phase B; the de-EAGLE3
  extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E
  move it — clarify the per-step strategy seam stays, the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not
  "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-only) + E

Pin the final code structure to one implementation home per concern, so reviewers don't
read the current scattered layout as the end state:

- plan.md §2.3 rewritten to S: runtime/ = substrate only (control+data plane + contracts);
  top-level training/ and inference/ are the single execution homes; modeling/ = model
  definitions only; launch.py lifts to top level; no facade package.
- roadmap: new standing decision (one home per concern; runtime substrate-only; new code
  born in its final home). Phase E split into E0 (move-only layout consolidation, gated by
  the unchanged suite + Phase-B byte-identical gate) + E (composition, re-pointed to the S
  layout, incl. the (algorithm x backend) target-engine collapse). Dependency arrows +
  phase table updated (D -> E0 -> E).
- reconcile stale paths so the plan is internally consistent: models/drafts -> modeling/draft
  and seam-home wording across §2.2/§3/§6/§7/§10 and eval-and-breadth.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- C: scope note — the no-op axis is (metadata store + durable ack);
  leasing/queue bookkeeping stays shared (in-process, no I/O) and
  backpressure was already opt-in; deployment_mode is selectable on the
  offline builder while the colocated online builder stays pinned (its
  rank-private queue is fed by commit dedup). Gate description matches
  the landed test (one builder, SQLite disagg leg, durable-marker
  assert).
- D: full resume is per-rank (optimizer/RNG are FSDP-shard-local) and
  includes the mid-epoch data position + offline-stream seek; loss
  continuity is tolerance-based (BF16Optimizer rebuilds its fp32 master
  from bf16 weights — matching legacy), with weights and data position
  exact; Done-when updated accordingly (+ same-world-size constraint,
  DP-reduced evaluator, restart-surviving best/latest).
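
The Phase C equivalence gate described above (one builder, a SQLite disaggregated leg, a durable-marker assert) can be pictured with the toy sketch below. It is not the landed test: `NoOpControlPlane`, `SQLiteControlPlane`, `build_and_train`, and the `acks` table are hypothetical names, and a running float sum stands in for a training step. The point it illustrates is that only the control plane differs between the two legs.

```python
import sqlite3
from typing import List, Tuple


class NoOpControlPlane:
    """Colocated leg: metadata store and durable ack are no-ops."""
    def ack(self, sample_id: str) -> None:
        pass

    def acked(self) -> List[str]:
        return []


class SQLiteControlPlane:
    """Disaggregated leg: every consumed sample leaves a durable marker."""
    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE acks (sample_id TEXT PRIMARY KEY)")

    def ack(self, sample_id: str) -> None:
        self._db.execute("INSERT OR IGNORE INTO acks VALUES (?)", (sample_id,))
        self._db.commit()

    def acked(self) -> List[str]:
        rows = self._db.execute("SELECT sample_id FROM acks ORDER BY sample_id")
        return [row[0] for row in rows]


def build_and_train(control_plane, samples: List[Tuple[str, float]]) -> float:
    """One builder for both modes; the control plane is the only variable."""
    weight = 0.0
    for sample_id, value in samples:
        weight += 0.1 * value          # stand-in for a training step
        control_plane.ack(sample_id)   # no-op when colocated
    return weight


samples = [("s0", 1.0), ("s1", 2.0), ("s2", 3.0)]
colocated_plane = NoOpControlPlane()
disagg_plane = SQLiteControlPlane()

colocated_weight = build_and_train(colocated_plane, samples)
disagg_weight = build_and_train(disagg_plane, samples)
```

The gate then has two halves: the resulting weights must match exactly, and the durable markers must exist only on the disaggregated leg.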

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
maocheng23 added a commit that referenced this pull request Jul 1, 2026
…fe Evaluator, durable best tracking

Post-review fixes for #637 (adversarial review vs the #630 roadmap):

Multi-rank resume correctness:
- CheckpointManager.save now writes each rank's optimizer/RNG to its own
  training_state_rank{r}.pt beside the rank0 shared payload (draft
  weights + counters + world_size); under FSDP use_orig_params the AdamW
  moments live on rank-local shard views, so persisting only rank0's
  copy and restoring it everywhere corrupted the other ranks' moments
  and collapsed their RNG streams. read_resume_state hands each rank
  back its own state and fails fast on a world-size mismatch (and on
  legacy single-file checkpoints at world_size>1).
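
A simplified sketch of the per-rank layout follows. The `training_state_rank{r}.pt` file name and the fail-fast world-size check come from the description above. The shared file name `training_state.pt`, the function signatures, and the use of `pickle` instead of `torch.save` are assumptions chosen to keep the example dependency-free.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Tuple


def save_checkpoint(ckpt_dir: Path, rank: int, world_size: int,
                    shared: dict, rank_state: dict) -> None:
    """Rank 0 writes the shared payload; every rank writes its own optimizer/RNG state."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    if rank == 0:
        payload = dict(shared, world_size=world_size)
        # Shared file name is hypothetical.
        (ckpt_dir / "training_state.pt").write_bytes(pickle.dumps(payload))
    (ckpt_dir / f"training_state_rank{rank}.pt").write_bytes(pickle.dumps(rank_state))


def read_resume_state(ckpt_dir: Path, rank: int, world_size: int) -> Tuple[dict, dict]:
    """Return (shared, this rank's state); fail fast instead of restoring the wrong shard."""
    shared = pickle.loads((ckpt_dir / "training_state.pt").read_bytes())
    if shared["world_size"] != world_size:
        raise RuntimeError(
            f"checkpoint was saved with world_size={shared['world_size']}, "
            f"cannot resume with world_size={world_size}"
        )
    rank_file = ckpt_dir / f"training_state_rank{rank}.pt"
    if not rank_file.exists():
        raise RuntimeError(f"no per-rank state for rank {rank} in {ckpt_dir}")
    return shared, pickle.loads(rank_file.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "step_100"
    for r in range(2):
        save_checkpoint(ckpt, r, 2, {"step": 100}, {"rng_seed": 1000 + r})

    shared, mine = read_resume_state(ckpt, rank=1, world_size=2)

    try:
        read_resume_state(ckpt, rank=0, world_size=4)
        mismatch_detected = False
    except RuntimeError:
        mismatch_detected = True
```

Each rank gets back exactly the state it wrote, which is what keeps shard-local optimizer moments and RNG streams from being overwritten by rank 0's copy.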

Resume repositions the data stream (plan.md G1 seek()-equivalent):
- TrainerController tracks epoch_batch, persists it, and skips the
  consumed prefix of the interrupted epoch on resume (via the new
  FeatureDataLoader.seek — no feature materialization — or islice for
  plain iterables). The domain Trainer threads it for refs-mode runs;
  the online queue path keeps control-plane skip_ids reconciliation.
- test_resume_loss_curve_continuity now trains on DISTINCT batches, so
  resuming on the wrong data cannot pass (the old fixed-batch form was
  structurally blind to the data position).
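
The `islice` fallback for plain iterables, and the reason the continuity test needs distinct batches, can be sketched as below. The `resume_epoch` and `run` helpers are hypothetical and integers stand in for batches; `FeatureDataLoader.seek` itself is not reproduced here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def resume_epoch(batches: Iterable[int], epoch_batch: int) -> Iterator[int]:
    """Skip the already-consumed prefix of the interrupted epoch."""
    return islice(batches, epoch_batch, None)


def run(batches: List[int], stop_after: Optional[int] = None, start_at: int = 0) -> List[int]:
    """Consume batches from position start_at, optionally stopping at absolute position stop_after."""
    seen = []
    for position, batch in enumerate(resume_epoch(batches, start_at), start=start_at):
        seen.append(batch)
        if stop_after is not None and position + 1 == stop_after:
            break
    return seen


epoch = [10, 11, 12, 13, 14]                       # distinct batches
uninterrupted = run(epoch)
first_leg = run(epoch, stop_after=2)               # interrupted after two batches
second_leg = run(epoch, start_at=len(first_leg))   # resumed at the persisted position

# With a fixed batch, resuming at the wrong position is indistinguishable.
fixed = [7] * 5
wrong_resume = list(islice(fixed, 0, 3))           # restarts at position 0
right_resume = list(islice(fixed, 2, None))        # continues at position 2
fixed_batch_is_blind = wrong_resume == right_resume

# With distinct batches, the same mistake is visible.
distinct_batches_catch_it = list(islice(epoch, 0, 3)) != second_leg
```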

Evaluator DP correctness:
- The collective schedule is decided globally (SUM of scalar sums, MAX
  of the per-position length, then ONE stacked count reduce) so a rank
  with an empty or scalar-only shard issues the same collectives as its
  peers — no NCCL desync. Scalar accuracy (DFlash/Domino) is now
  reduced across ranks like everything else. Collectives use the local
  device via specforge.utils.get_local_device (CPU for gloo).
  Accumulation stays on-device (one host sync after the loop), and
  eval/per_position_acc is reported alongside the folded acc-len.
- New 2-process gloo gate: cross-rank scalar reduction + ragged-shard
  schedule symmetry (test_evaluator_aggregation).
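
The schedule can be illustrated without a process group by simulating the three reductions over a list of per-rank shards. This is a sketch under stated assumptions: the shard dictionary keys and helper names are hypothetical, and the `all_reduce_*` functions are local stand-ins for real collectives. What it shows is that a rank with an empty per-position shard still takes part in the same three steps, because padding to the global maximum length happens before the single stacked reduce.

```python
from typing import Dict, List, Tuple


def all_reduce_sum(per_rank: List[List[float]]) -> List[float]:
    """Stand-in for an element-wise SUM all-reduce across ranks."""
    return [sum(column) for column in zip(*per_rank)]


def all_reduce_max(per_rank: List[int]) -> int:
    """Stand-in for a MAX all-reduce across ranks."""
    return max(per_rank)


def aggregate(shards: List[Dict[str, list]]) -> Tuple[float, List[float]]:
    # 1) SUM of scalar sums: every rank contributes, even with zeros.
    scalar = all_reduce_sum([[s["scalar_correct"], s["scalar_total"]] for s in shards])

    # 2) MAX of the per-position length, so every rank agrees on the padded shape.
    max_len = all_reduce_max([len(s["pos_correct"]) for s in shards])

    # 3) ONE stacked count reduce: correct and total counts, zero-padded to max_len.
    stacked = []
    for s in shards:
        pad = [0.0] * (max_len - len(s["pos_correct"]))
        stacked.append(s["pos_correct"] + pad + s["pos_total"] + pad)
    counts = all_reduce_sum(stacked)

    correct, total = counts[:max_len], counts[max_len:]
    per_position_acc = [c / t if t else 0.0 for c, t in zip(correct, total)]
    scalar_acc = scalar[0] / scalar[1] if scalar[1] else 0.0
    return scalar_acc, per_position_acc


shards = [
    # Per-position only.
    {"scalar_correct": 0.0, "scalar_total": 0.0,
     "pos_correct": [3.0, 2.0, 1.0], "pos_total": [4.0, 4.0, 4.0]},
    # Scalar only: empty per-position shard.
    {"scalar_correct": 6.0, "scalar_total": 8.0,
     "pos_correct": [], "pos_total": []},
    # Ragged: shorter per-position shard plus scalars.
    {"scalar_correct": 2.0, "scalar_total": 8.0,
     "pos_correct": [1.0], "pos_total": [4.0]},
]

scalar_acc, per_position_acc = aggregate(shards)
```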

Durable, decoupled best tracking:
- CheckpointManager rehydrates best_score/best_step from best_meta.json
  (now also carrying "score") on construction, so a restarted process
  neither rotates away the on-disk best nor lets a worse score
  overwrite it; update_best is split into score()/is_better().
- fit() tracks the best on EVERY eval (when checkpointing is enabled),
  persisting a checkpoint on demand when the best eval lands off the
  save cadence — previously best only fired when eval_interval and
  save_interval coincided on the same step.
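
A minimal sketch of restart-surviving best tracking, assuming a `best_meta.json` that carries `"score"` as described above. The `BestTracker` class, its `record` method, and the `"step"` key are hypothetical; only the file name, the `"score"` field, and the `is_better` split come from the commit message.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class BestTracker:
    """Rehydrates the best score from disk so a restart cannot clobber it."""

    def __init__(self, root: Path, higher_is_better: bool = True) -> None:
        self._meta_path = root / "best_meta.json"
        self._higher_is_better = higher_is_better
        self.best_score: Optional[float] = None
        self.best_step: Optional[int] = None
        if self._meta_path.exists():
            meta = json.loads(self._meta_path.read_text())
            self.best_score = meta["score"]
            self.best_step = meta["step"]   # key name is an assumption

    def is_better(self, score: float) -> bool:
        if self.best_score is None:
            return True
        if self._higher_is_better:
            return score > self.best_score
        return score < self.best_score

    def record(self, score: float, step: int) -> bool:
        """Persist a new best; return False and leave disk untouched otherwise."""
        if not self.is_better(score):
            return False
        self.best_score, self.best_step = score, step
        self._meta_path.write_text(json.dumps({"score": score, "step": step}))
        return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    first_run = BestTracker(root)
    first_accepted = first_run.record(score=2.5, step=100)

    # Simulate a process restart: a fresh object reads best_meta.json.
    restarted = BestTracker(root)
    rehydrated_score = restarted.best_score
    worse_accepted = restarted.record(score=2.1, step=200)
    better_accepted = restarted.record(score=3.0, step=300)
    final_step = restarted.best_step
```

Calling `record` on every eval, rather than only on save steps, is what decouples best tracking from the save cadence in this sketch.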

Surface + cleanup:
- launch: every builder now forwards resume_from/max_checkpoints, so
  resume and rotation are reachable from the build_* entry points.
- TrainerController: drop the max_checkpoints/checkpoint_manager dual
  config (inject a configured manager; the domain Trainer does);
  remove the dead eval_step + mode plumbing (evaluate goes through
  Evaluator on raw forward_loss; validate_batch still runs inside every
  strategy's forward_loss); drop the did_eval/last_eval_metrics
  threading.
- Trainer._load_resume_state removed: CheckpointManager.read_resume_state
  is the single checkpoint reader; the loaded dict is dropped after the
  weight copy instead of living through the FSDP wrap.
- CheckpointManager._rotate uses shutil.rmtree; symlink creation is
  guarded for filesystems without symlink support (dirs + best_meta.json
  stay the source of truth); save_checkpoint filters draft weights on
  rank0 only.
- test_no_sync_equiv now also counts no_sync deferrals (exactly
  accumulation_steps-1 per optimizer step) — the roadmap's "one
  all-reduce per optimizer step" gate, not just weight equality.
- DESIGN.md updated (per-rank layout, seek, best semantics; fixed the
  inverted no_sync sentence).
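
The deferral count that `test_no_sync_equiv` checks can be pictured with a counting stand-in for a data-parallel wrapper. `FakeDataParallel` and `train` are hypothetical and run no real gradients. The sketch only shows why a standard accumulation loop yields exactly `accumulation_steps - 1` deferrals and one all-reduce per optimizer step.

```python
from contextlib import contextmanager, nullcontext


class FakeDataParallel:
    """Counts no_sync deferrals and gradient all-reduces; no real tensors involved."""

    def __init__(self) -> None:
        self.deferrals = 0
        self.all_reduces = 0
        self._sync_enabled = True

    @contextmanager
    def no_sync(self):
        self.deferrals += 1
        self._sync_enabled = False
        try:
            yield
        finally:
            self._sync_enabled = True

    def backward(self) -> None:
        # Gradients are all-reduced only when sync is enabled.
        if self._sync_enabled:
            self.all_reduces += 1


def train(model: FakeDataParallel, num_micro_batches: int, accumulation_steps: int) -> int:
    """Run a gradient-accumulation loop and return the number of optimizer steps."""
    optimizer_steps = 0
    for i in range(num_micro_batches):
        at_boundary = (i + 1) % accumulation_steps == 0
        # Defer the all-reduce on every micro-batch except the last of each group.
        context = nullcontext() if at_boundary else model.no_sync()
        with context:
            model.backward()
        if at_boundary:
            optimizer_steps += 1
    return optimizer_steps


accumulation_steps = 4
model = FakeDataParallel()
optimizer_steps = train(model, num_micro_batches=12, accumulation_steps=accumulation_steps)
```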

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
E1's evaluator/best-tracking half landed with Phase D (#637); the
EvalConfig/EvalCache half is the E1 PR, with cache wiring into the run
surface deferred to Phase E's config+CLI (noted explicitly).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ft landed early (PR #640)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jiapingW jiapingW self-requested a review July 3, 2026 01:57
@jiapingW jiapingW merged commit 31f3eab into main Jul 3, 2026
5 checks passed
@jiapingW jiapingW deleted the docs/plan-reconciled branch July 3, 2026 01:58
2 participants