[DataFlow runtime] Composable launch: StrategySpec registry + parameterized builders by maocheng23 · Pull Request #627 · sgl-project/SpecForge

maocheng23 · 2026-06-30T19:07:25Z

What

Refactor the DataFlow runtime launch layer so adding a draft model is a StrategySpec entry, not a new build_*_runtime family. The topology stays a named builder; the model becomes a strategy= parameter resolved through a registry. launch.py no longer grows as (topologies × models).

Why

Every build_*_eagle3_runtime hardcoded Eagle3TrainStrategy + strategy="eagle3" + the eagle3 reader/collate, so adding dflash/domino would multiply the ~7 topology builders (N×M). The components (TrainerCore / FSDPTrainingBackend / FeatureDataLoader) were already model-agnostic — the duplication was purely in the wiring layer.

Changes

specforge/runtime/training/registry.py (new): StrategySpec + register_strategy / resolve_strategy / available_strategies + concat_collate. eagle3 spec fully wired (reader / transform / collate / online-collate / adapter).
launch.py: extract _assemble_trainer + _assemble_rollout_workers shared by every topology (offline / disagg-offline / online / disagg-online producer+consumer + one-process + interleaved); each builder takes strategy= and resolves a spec. The eagle3-named builders are kept as back-compat aliases — eagle3 behavior is byte-identical.
scripts/train_eagle3_dataflow.py, examples/disagg/run_disagg_eagle3.py: use the strategy-neutral builders.
tests/test_runtime/test_strategy_registry.py (new, CPU): registry / alias / unwired-strategy-guard contract.
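
To make the shape of the change concrete, here is a minimal, self-contained sketch of a strategy registry plus one parameterized builder. It is not the code from specforge/runtime/training/registry.py. The names StrategySpec, register_strategy, resolve_strategy and available_strategies come from the description above, but every field, signature, return value, and the builder names build_offline_runtime / build_offline_eagle3_runtime are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class StrategySpec:
    """Per-model wiring. Field names here are hypothetical."""

    name: str
    make_reader: Callable[[], Any]
    make_collate: Callable[[], Callable[[List[Any]], Any]]
    supports_online: bool = False
    make_online_collate: Optional[Callable[[], Callable[[List[Any]], Any]]] = None


_REGISTRY: Dict[str, StrategySpec] = {}


def register_strategy(spec: StrategySpec) -> StrategySpec:
    # Refuse silent overwrites so two modules cannot claim the same name.
    if spec.name in _REGISTRY:
        raise ValueError(f"strategy {spec.name!r} is already registered")
    _REGISTRY[spec.name] = spec
    return spec


def available_strategies() -> List[str]:
    return sorted(_REGISTRY)


def resolve_strategy(name: str) -> StrategySpec:
    # The error message reuses available_strategies() as the single source of truth.
    if name not in _REGISTRY:
        raise ValueError(
            f"unknown strategy {name!r}; available: {available_strategies()}"
        )
    return _REGISTRY[name]


# Placeholder reader/collate factories stand in for the real eagle3 components.
register_strategy(
    StrategySpec(
        name="eagle3",
        make_reader=lambda: "eagle3-reader",
        make_collate=lambda: list,
    )
)


def build_offline_runtime(*, strategy: str) -> Dict[str, Any]:
    """One builder per topology; the model is looked up, not hardcoded."""
    spec = resolve_strategy(strategy)
    return {
        "strategy": spec.name,
        "reader": spec.make_reader(),
        "collate": spec.make_collate(),
    }


def build_offline_eagle3_runtime() -> Dict[str, Any]:
    """Back-compat alias: same topology with the strategy pinned to eagle3."""
    return build_offline_runtime(strategy="eagle3")
```

Under this layout, supporting another draft model means one more register_strategy(...) call; the topology builders are untouched, so the wiring grows as topologies + models rather than topologies × models.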

Testing

Full tests/test_runtime green on sci-h200 (H200) at the tip of this stack: 197 tests OK (2 env skips, 1 pre-existing Mooncake xfail). eagle3's existing launch + equivalence tests pass unchanged.

Stacked on dataflow-up-20-online-async-loop (#625). Part 1/3 of the algorithm-organization cleanup; dflash + domino follow as stacked PRs.

🤖 Generated with Claude Code


… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/; fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′); disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted: it is already an explicit 🔴 gate): added the valid narrow point instead. The spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it; clarify the per-step strategy seam stays, the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


maocheng23 · 2026-06-30T23:27:06Z

Code review

No high-confidence issues found. Checked for bugs in the StrategySpec registry refactor, builder aliases, offline/online assembly, and strategy-specific guard behavior.

…erized builders

Adding a draft model is now a StrategySpec entry, not a new build_*_runtime family. The topology stays a named builder; the model becomes a `strategy=` parameter resolved through a registry. launch.py no longer grows as (topologies x models).

- registry.py (new): StrategySpec + register_strategy/resolve_strategy/available_strategies + concat_collate. eagle3 spec fully wired (reader/transform/collate/online-collate/adapter).
- launch.py: extract _assemble_trainer + _assemble_rollout_workers shared by every topology (offline / disagg-offline / online / disagg-online producer+consumer + one-process + interleaved); each builder takes `strategy=` and resolves a spec. eagle3-named builders kept as back-compat aliases; eagle3 behavior is byte-identical.
- scripts/train_eagle3_dataflow.py, examples/disagg/run_disagg_eagle3.py: use the strategy-neutral builders.
- tests/test_runtime/test_strategy_registry.py (new, CPU): registry / alias / unwired-strategy-guard contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 · 2026-07-01T00:30:48Z

Addressed review feedback (self-review pass).

registry.py: resolve_strategy error now reuses available_strategies() (one source of truth); dropped the dead StrategySpec.offline_target_repr field (never read — the reader hardcodes target_repr).
launch.py: removed the unreferenced _online_cat_collate back-compat alias; added _online_collate() guard so a supports_online strategy with no make_online_collate fails with an actionable NotImplementedError instead of TypeError: NoneType is not callable.

Validated: full tests/test_runtime = 200 OK (2 skipped, 1 xfail), zero failures, on a 2-node H200 pod. Lint clean (black 24.10.0 / isort 5.13.2 / autoflake).

maocheng23 mentioned this pull request Jun 30, 2026

docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) #630

Merged

maocheng23 marked this pull request as ready for review June 30, 2026 23:26

maocheng23 requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners June 30, 2026 23:26

maocheng23 force-pushed the dataflow-up-20-online-async-loop branch from 9dba169 to 17a8770 Compare July 1, 2026 00:29

maocheng23 force-pushed the dataflow-up-21-composable-launch branch from 457b20a to 8faf111 Compare July 1, 2026 00:29

Base automatically changed from dataflow-up-20-online-async-loop to dataflow-up-16-zerocopy July 2, 2026 05:19

jiapingW approved these changes Jul 2, 2026

jiapingW merged commit ec10f47 into dataflow-up-16-zerocopy Jul 2, 2026
1 check passed

jiapingW deleted the dataflow-up-21-composable-launch branch July 2, 2026 05:41

maocheng23 mentioned this pull request Jul 4, 2026

Merge DataFlow runtime branch into main #648

Open
