docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) #630
Merged
Rewrite plan.md to reconcile the original "SpecForge Redesign Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine. Docs only, no code change.

- Canonical substrate = runtime control + data plane (SampleRef + FeatureStore incl. Mooncake + FeatureDataLoader). The isolated-pool / >100GB/s requirement overturns the original "no Mooncake" bet.
- No separate HiddenStateStream source of truth: FeatureDataLoader over SampleRef+FeatureStore IS the stream; topology variation lives in (ref source + FeatureStore), shielded from training.
- training/inference become plan.md-style domain packages on top: keep the runtime TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (checkpoint/eval/no_sync/resume); converge SGLangAdapter -> TargetEngine + backends.
- colocated lightweight path = control plane opt-in/no-op, guarded by a colocated == disaggregated numerical-equivalence gate.
- original redesign draft preserved verbatim in docs/redesign-draft-legacy.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add docs/roadmap/: per-phase Goal/Target/Implementation/Tests/Done-when across three tracks (domain-refactor, online-disaggregation, eval-and-breadth) plus a README index. The online track folds in the former online-disaggregation roadmap (#618) so there is one roadmap home.

Apply two scope decisions to plan.md and the roadmap:

- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming hidden states (W2/W3), not a serve-and-push workload. Drop the W4 weight lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream); draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator; decision gate in the online roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… W3′ naming

Review fixes (verified against the files):

- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/; fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′); disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted: it is already an explicit 🔴 gate): added the valid narrow point instead. The spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:

- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it; clarify that the per-step strategy seam stays and the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-only) + E

Pin the final code structure to one implementation home per concern, so reviewers don't read the current scattered layout as the end state:

- plan.md §2.3 rewritten to S: runtime/ = substrate only (control+data plane + contracts); top-level training/ and inference/ are the single execution homes; modeling/ = model definitions only; launch.py lifts to top level; no facade package.
- roadmap: new standing decision (one home per concern; runtime substrate-only; new code born in its final home). Phase E split into E0 (move-only layout consolidation, gated by the unchanged suite + Phase-B byte-identical gate) + E (composition, re-pointed to the S layout, incl. the (algorithm x backend) target-engine collapse). Dependency arrows + phase table updated (D -> E0 -> E).
- reconcile stale paths so the plan is internally consistent: models/drafts -> modeling/draft and seam-home wording across §2.2/§3/§6/§7/§10 and eval-and-breadth.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This was referenced Jul 1, 2026
- C (scope note): the no-op axis is (metadata store + durable ack); leasing/queue bookkeeping stays shared (in-process, no I/O) and backpressure was already opt-in; deployment_mode is selectable on the offline builder while the colocated online builder stays pinned (its rank-private queue is fed by commit dedup). Gate description matches the landed test (one builder, SQLite disagg leg, durable-marker assert).
- D: full resume is per-rank (optimizer/RNG are FSDP-shard-local) and includes the mid-epoch data position + offline-stream seek; loss continuity is tolerance-based (BF16Optimizer rebuilds its fp32 master from bf16 weights, matching legacy), with weights and data position exact; Done-when updated accordingly (+ same-world-size constraint, DP-reduced evaluator, restart-surviving best/latest).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
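The "mid-epoch data position" and "same-world-size constraint" in the D note can be pictured with a small sketch. This is only an illustration under stated assumptions: `skip_consumed_prefix` and `check_world_size` are hypothetical helper names, not functions from the SpecForge codebase, and the real resume path may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_consumed_prefix(batches: Iterable[T], epoch_batch: int) -> Iterator[T]:
    """Resume an interrupted epoch by skipping the batches already consumed.

    `epoch_batch` is the number of batches the run had finished in this epoch
    when the checkpoint was written (hypothetical name for the saved counter).
    """
    if epoch_batch < 0:
        raise ValueError("epoch_batch must be non-negative")
    # islice drops the first `epoch_batch` items without keeping them.
    return islice(iter(batches), epoch_batch, None)


def check_world_size(saved_world_size: int, current_world_size: int) -> None:
    """Fail fast when a per-rank checkpoint is loaded at a different world size."""
    if saved_world_size != current_world_size:
        raise RuntimeError(
            f"checkpoint was written with world_size={saved_world_size}, "
            f"but this run has world_size={current_world_size}"
        )


# Usage: a run that stopped after 4 batches of a 10-batch epoch.
remaining = list(skip_consumed_prefix(range(10), epoch_batch=4))
```

Skipping with `islice` still pulls the skipped items from the underlying iterable, which is why a loader that can seek without materializing features is the cheaper option when one is available.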
maocheng23 added a commit that referenced this pull request Jul 1, 2026
…fe Evaluator, durable best tracking

Post-review fixes for #637 (adversarial review vs the #630 roadmap):

Multi-rank resume correctness:
- CheckpointManager.save now writes each rank's optimizer/RNG to its own training_state_rank{r}.pt beside the rank0 shared payload (draft weights + counters + world_size); under FSDP use_orig_params the AdamW moments live on rank-local shard views, so persisting only rank0's copy and restoring it everywhere corrupted the other ranks' moments and collapsed their RNG streams. read_resume_state hands each rank back its own state and fails fast on a world-size mismatch (and on legacy single-file checkpoints at world_size>1).

Resume repositions the data stream (plan.md G1 seek()-equivalent):
- TrainerController tracks epoch_batch, persists it, and skips the consumed prefix of the interrupted epoch on resume (via the new FeatureDataLoader.seek, with no feature materialization, or islice for plain iterables). The domain Trainer threads it for refs-mode runs; the online queue path keeps control-plane skip_ids reconciliation.
- test_resume_loss_curve_continuity now trains on DISTINCT batches, so resuming on the wrong data cannot pass (the old fixed-batch form was structurally blind to the data position).

Evaluator DP correctness:
- The collective schedule is decided globally (SUM of scalar sums, MAX of the per-position length, then ONE stacked count reduce) so a rank with an empty or scalar-only shard issues the same collectives as its peers (no NCCL desync). Scalar accuracy (DFlash/Domino) is now reduced across ranks like everything else. Collectives use the local device via specforge.utils.get_local_device (CPU for gloo). Accumulation stays on-device (one host sync after the loop), and eval/per_position_acc is reported alongside the folded acc-len.
- New 2-process gloo gate: cross-rank scalar reduction + ragged-shard schedule symmetry (test_evaluator_aggregation).

Durable, decoupled best tracking:
- CheckpointManager rehydrates best_score/best_step from best_meta.json (now also carrying "score") on construction, so a restarted process neither rotates away the on-disk best nor lets a worse score overwrite it; update_best is split into score()/is_better().
- fit() tracks the best on EVERY eval (when checkpointing is enabled), persisting a checkpoint on demand when the best eval lands off the save cadence; previously best only fired when eval_interval and save_interval coincided on the same step.

Surface + cleanup:
- launch: every builder now forwards resume_from/max_checkpoints, so resume and rotation are reachable from the build_* entry points.
- TrainerController: drop the max_checkpoints/checkpoint_manager dual config (inject a configured manager; the domain Trainer does); remove the dead eval_step + mode plumbing (evaluate goes through Evaluator on raw forward_loss; validate_batch still runs inside every strategy's forward_loss); drop the did_eval/last_eval_metrics threading.
- Trainer._load_resume_state removed: CheckpointManager.read_resume_state is the single checkpoint reader; the loaded dict is dropped after the weight copy instead of living through the FSDP wrap.
- CheckpointManager._rotate uses shutil.rmtree; symlink creation is guarded for filesystems without symlink support (dirs + best_meta.json stay the source of truth); save_checkpoint filters draft weights on rank0 only.
- test_no_sync_equiv now also counts no_sync deferrals (exactly accumulation_steps-1 per optimizer step): the roadmap's "one all-reduce per optimizer step" gate, not just weight equality.
- DESIGN.md updated (per-rank layout, seek, best semantics; fixed the inverted no_sync sentence).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
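The durable best-tracking behaviour described in this commit (rehydrate from best_meta.json on construction, never let a worse score overwrite the stored best) might look roughly like the sketch below. It is an assumption-laden illustration, not the CheckpointManager code: the `BestTracker` class, the `record` method, the `"step"` key and the higher-is-better default are all hypothetical; only the best_meta.json file name, the `"score"` key and the `is_better` name come from the commit message.

```python
import json
from pathlib import Path
from typing import Optional


class BestTracker:
    """Hypothetical stand-in for the best-tracking half of a checkpoint manager."""

    def __init__(self, ckpt_dir: str, higher_is_better: bool = True) -> None:
        self.meta_path = Path(ckpt_dir) / "best_meta.json"
        self.higher_is_better = higher_is_better
        self.best_score: Optional[float] = None
        self.best_step: Optional[int] = None
        # Rehydrate on construction so a restarted process remembers the best.
        if self.meta_path.exists():
            meta = json.loads(self.meta_path.read_text())
            self.best_score = meta.get("score")
            self.best_step = meta.get("step")  # "step" key is an assumption

    def is_better(self, score: float) -> bool:
        if self.best_score is None:
            return True
        if self.higher_is_better:
            return score > self.best_score
        return score < self.best_score

    def record(self, score: float, step: int) -> bool:
        """Persist a new best; return False (and write nothing) otherwise."""
        if not self.is_better(score):
            return False
        self.best_score, self.best_step = score, step
        self.meta_path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file and rename so a crash cannot leave a torn file.
        tmp = self.meta_path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"score": score, "step": step}))
        tmp.replace(self.meta_path)
        return True
```

Calling `record` on every eval, independent of the save cadence, is one way to get the "best on EVERY eval" behaviour; the caller would then save a checkpoint on demand whenever `record` returns True.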
E1's evaluator/best-tracking half landed with Phase D (#637); the EvalConfig/EvalCache half is the E1 PR, with cache wiring into the run surface deferred to Phase E's config+CLI (noted explicitly). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jul 1, 2026
…ft landed early (PR #640) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
jiapingW approved these changes Jul 3, 2026
Docs only, no code change. Rewrites plan.md to reconcile the original "SpecForge Redesign Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine, and adds a consolidated docs/roadmap/ that folds in the online-disaggregation roadmap (#618).

Why
There were two architecture efforts for the same goal. They aren't either/or — they're
complementary, given a real multi-node / >100 GB/s / isolated-pool requirement that overturns
the original draft's "no Mooncake / HTTP is sufficient" bet.
What the reconciled plan says
- Canonical substrate = runtime control + data plane: SampleRef (metadata) + FeatureStore (tensors, Local/SharedDir/Mooncake) + FeatureDataLoader → TrainBatch.
- No separate HiddenStateStream source of truth: FeatureDataLoader over SampleRef + FeatureStore already is the stream; online/offline/disaggregated variation lives in (ref source + FeatureStore) and is shielded from training.
- training/inference become plan.md-style domain packages on top: keep the runtime TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (CheckpointManager/Evaluator/no_sync()/full resume); converge SGLangAdapter → TargetEngine + backends (de-EAGLE3).
- Colocated lightweight path = control plane opt-in/no-op, guarded by a colocated≡disaggregated numerical-equivalence gate.
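To make the "loader over refs + store is the stream" idea concrete, here is a minimal sketch. The four type names come from the plan, but every field, method and constructor argument below is an assumption made for illustration, and plain Python lists stand in for tensors so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class SampleRef:
    """Metadata only: which sample, and where its features live (fields assumed)."""
    sample_id: str
    key: str


class LocalFeatureStore:
    """Toy in-memory store; a SharedDir or Mooncake store would expose the same get()."""

    def __init__(self) -> None:
        self._data: Dict[str, List[float]] = {}

    def put(self, key: str, features: List[float]) -> None:
        self._data[key] = list(features)

    def get(self, key: str) -> List[float]:
        return self._data[key]


@dataclass
class TrainBatch:
    sample_ids: List[str]
    features: List[List[float]]


class FeatureDataLoader:
    """Turns any iterable of refs into batches; training never sees the ref source."""

    def __init__(self, refs: Iterable[SampleRef], store: LocalFeatureStore,
                 batch_size: int) -> None:
        self._refs, self._store, self._batch_size = refs, store, batch_size

    def _materialize(self, refs: List[SampleRef]) -> TrainBatch:
        return TrainBatch([r.sample_id for r in refs],
                          [self._store.get(r.key) for r in refs])

    def __iter__(self) -> Iterator[TrainBatch]:
        pending: List[SampleRef] = []
        for ref in self._refs:
            pending.append(ref)
            if len(pending) == self._batch_size:
                yield self._materialize(pending)
                pending = []
        if pending:  # final short batch
            yield self._materialize(pending)


store = LocalFeatureStore()
for i in range(5):
    store.put(f"k{i}", [float(i), float(i) + 0.5])

# "Offline" refs are a precomputed list; "online" refs arrive from a generator.
offline_refs = [SampleRef(f"s{i}", f"k{i}") for i in range(5)]
online_refs = (SampleRef(f"s{i}", f"k{i}") for i in range(5))

offline_batches = list(FeatureDataLoader(offline_refs, store, batch_size=2))
online_batches = list(FeatureDataLoader(online_refs, store, batch_size=2))
```

Because only the ref source differs, comparing `offline_batches` with `online_batches` is a toy analogue of the equivalence gate: the training side receives identical TrainBatch objects either way.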
Scope decisions (consolidating with #618)
- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming hidden states (W2/W3), not a serve-and-push workload. The predecessor's W4 weight lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream) is out of scope; draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator (multi-node N-producer/M-trainer); decision gate lives in the online roadmap. Until then we keep the home-grown metadata-only control plane.

Consolidated roadmap (docs/roadmap/)
Per-phase Goal / Target state / Implementation (files+symbols) / Tests / Done-when across three tracks, with a README index (standing decisions, phase-status-at-a-glance, cross-track deps):
- domain-refactor.md: A (done) → B (TargetEngine + domain Trainer) → C (colocated) → D (managers) → E (drafts registry / config / CLI / export).
- online-disaggregation.md: folds in #618 ([DataFlow runtime] Online disaggregated training roadmap + PR plan (train-with-decode)): O1.1/O1.2 (in review) → O1.3 live frozen-target capture (next) → O2 scale-out (Ray = open) → O3 hardening.
- eval-and-breadth.md: E1 acceptance-length eval harness → E2 algorithm breadth (new algo = a StrategySpec + loss).

Relation to the in-flight work
The composable-launch stack (#627 / #628 / #629: StrategySpec registry + parameterized launch.py; eagle3/dflash/domino end-to-end) is Phase A. The online track (folding #618) proceeds in parallel; #618 is superseded by docs/roadmap/online-disaggregation.md.

Files
- plan.md: rewritten (reconciled); frozen-target + Ray-open applied.
- docs/roadmap/: README + three track docs (new).
- docs/redesign-draft-legacy.md: the original redesign draft, preserved verbatim (with a "superseded by plan.md" banner).
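
The roadmap's rule that a new algorithm is "a StrategySpec + loss" can be pictured as a small registry. The sketch below is hypothetical throughout: the fields of `StrategySpec`, the `register_strategy` / `get_strategy` helpers and the toy loss are assumptions for illustration, not the registry from #627 to #629.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# A loss maps (predictions, targets) to a scalar; real strategies use tensors.
LossFn = Callable[[Sequence[float], Sequence[float]], float]


@dataclass(frozen=True)
class StrategySpec:
    """Hypothetical spec: a name plus the loss that defines the algorithm."""
    name: str
    loss_fn: LossFn


_REGISTRY: Dict[str, StrategySpec] = {}


def register_strategy(spec: StrategySpec) -> StrategySpec:
    """Add a spec to the registry, refusing silent overwrites."""
    if spec.name in _REGISTRY:
        raise ValueError(f"strategy {spec.name!r} is already registered")
    _REGISTRY[spec.name] = spec
    return spec


def get_strategy(name: str) -> StrategySpec:
    """Look up a spec by name, listing the known names on a miss."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise KeyError(f"unknown strategy {name!r}; known: {known}") from None


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Toy loss standing in for an algorithm-specific draft loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Adding an algorithm is one registration; a launcher would select it by name.
register_strategy(StrategySpec(name="toy_mse", loss_fn=mse_loss))
loss = get_strategy("toy_mse").loss_fn([1.0, 2.0], [1.0, 4.0])
```

Keeping the launcher keyed on a name is what lets a parameterized entry point stay unchanged as algorithms are added.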
🤖 Generated with Claude Code