[DataFlow runtime] Composable launch: StrategySpec registry + parameterized builders #627

Merged
jiapingW merged 1 commit into dataflow-up-16-zerocopy from dataflow-up-21-composable-launch on Jul 2, 2026

Conversation

@maocheng23

Collaborator

What

Refactor the DataFlow runtime launch layer so adding a draft model is a StrategySpec entry, not a new build_*_runtime family. The topology stays a named builder; the model becomes a strategy= parameter resolved through a registry. launch.py no longer grows as (topologies × models).

Why

Every build_*_eagle3_runtime hardcoded Eagle3TrainStrategy + strategy="eagle3" + the eagle3 reader/collate, so adding dflash/domino would multiply the ~7 topology builders (N×M). The components (TrainerCore / FSDPTrainingBackend / FeatureDataLoader) were already model-agnostic — the duplication was purely in the wiring layer.
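
To make the N×M versus N+M point concrete, here is a minimal sketch of the shape described above. Everything in it is a hypothetical stand-in rather than the actual SpecForge code: `Spec`, `SPECS`, the two builder names, and the string placeholders for strategy classes are invented for illustration, and `DFlashTrainStrategy` is an assumed name.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Hypothetical stand-in for a strategy description; not the real StrategySpec.
@dataclass(frozen=True)
class Spec:
    name: str
    make_strategy: Callable[[], str]  # a real factory would return a strategy object


# One table entry per model (M entries).
SPECS: Dict[str, Spec] = {
    "eagle3": Spec("eagle3", lambda: "Eagle3TrainStrategy"),
}


# One builder per topology (N builders); the model arrives as a parameter.
def build_offline_runtime(strategy: str = "eagle3") -> dict:
    spec = SPECS[strategy]
    return {"topology": "offline", "strategy": spec.make_strategy()}


def build_online_runtime(strategy: str = "eagle3") -> dict:
    spec = SPECS[strategy]
    return {"topology": "online", "strategy": spec.make_strategy()}


# A back-compat alias keeps the old eagle3-named entry point working.
def build_offline_eagle3_runtime() -> dict:
    return build_offline_runtime(strategy="eagle3")


# Adding a model touches only the table; no builder is duplicated.
SPECS["dflash"] = Spec("dflash", lambda: "DFlashTrainStrategy")
```

With this layout, a new model costs one table entry and a new topology costs one builder, so the wiring grows additively rather than multiplicatively.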

Changes

  • specforge/runtime/training/registry.py (new): StrategySpec + register_strategy / resolve_strategy / available_strategies + concat_collate. eagle3 spec fully wired (reader / transform / collate / online-collate / adapter).
  • launch.py: extract _assemble_trainer + _assemble_rollout_workers shared by every topology (offline / disagg-offline / online / disagg-online producer+consumer + one-process + interleaved); each builder takes strategy= and resolves a spec. The eagle3-named builders are kept as back-compat aliases — eagle3 behavior is byte-identical.
  • scripts/train_eagle3_dataflow.py, examples/disagg/run_disagg_eagle3.py: use the strategy-neutral builders.
  • tests/test_runtime/test_strategy_registry.py (new, CPU): registry / alias / unwired-strategy-guard contract.
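
The registry contract listed above could look roughly like the following sketch. The function names `register_strategy`, `resolve_strategy`, and `available_strategies` come from this PR; the field list, the signatures, the duplicate-registration check, and the exception type are assumptions made for illustration and may differ from registry.py.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class StrategySpec:
    # Field names other than `name` are assumed for this sketch.
    name: str
    make_reader: Optional[Callable] = None
    make_collate: Optional[Callable] = None
    make_online_collate: Optional[Callable] = None
    supports_online: bool = False


_REGISTRY: Dict[str, StrategySpec] = {}


def register_strategy(spec: StrategySpec) -> StrategySpec:
    # Assumption: registering the same name twice is treated as an error.
    if spec.name in _REGISTRY:
        raise ValueError(f"strategy {spec.name!r} is already registered")
    _REGISTRY[spec.name] = spec
    return spec


def available_strategies() -> List[str]:
    return sorted(_REGISTRY)


def resolve_strategy(name: str) -> StrategySpec:
    try:
        return _REGISTRY[name]
    except KeyError:
        # The error lists the known names so the caller can correct a typo.
        raise ValueError(
            f"unknown strategy {name!r}; available: {available_strategies()}"
        ) from None


register_strategy(StrategySpec(name="eagle3", supports_online=True))
```

A builder would then call `resolve_strategy(strategy)` once and pass the resulting spec to the shared assembly helpers.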

Testing

Full tests/test_runtime green on sci-h200 (H200) at the tip of this stack: 197 tests OK (2 env skips, 1 pre-existing Mooncake xfail). eagle3's existing launch + equivalence tests pass unchanged.

Stacked on dataflow-up-20-online-async-loop (#625). Part 1/3 of the algorithm-organization cleanup; dflash + domino follow as stacked PRs.

🤖 Generated with Claude Code

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

maocheng23 added a commit that referenced this pull request Jun 30, 2026
… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629)
  "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one
  consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A).
  Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers
  (specforge/eval/), not specforge/runtime/eval/ — fix the eval-and-breadth.md outlier to match
  plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports
  (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′) — disambiguate in §2.2,
  the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted — it is already an explicit 🔴 gate): added the valid
  narrow point instead — the spike scopes only the sglang_server slice of Phase B; the de-EAGLE3
  extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E
  move it — clarify the per-step strategy seam stays, the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not
  "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 marked this pull request as ready for review June 30, 2026 23:26

@maocheng23

Collaborator Author

Code review

No high-confidence issues found. Checked for bugs in the StrategySpec registry refactor, builder aliases, offline/online assembly, and strategy-specific guard behavior.

…erized builders

Adding a draft model is now a StrategySpec entry, not a new build_*_runtime
family. The topology stays a named builder; the model becomes a `strategy=`
parameter resolved through a registry. launch.py no longer grows as
(topologies x models).

- registry.py (new): StrategySpec + register_strategy/resolve_strategy/
  available_strategies + concat_collate. eagle3 spec fully wired
  (reader/transform/collate/online-collate/adapter).
- launch.py: extract _assemble_trainer + _assemble_rollout_workers shared by
  every topology (offline / disagg-offline / online / disagg-online
  producer+consumer + one-process + interleaved); each builder takes `strategy=`
  and resolves a spec. eagle3-named builders kept as back-compat aliases;
  eagle3 behavior is byte-identical.
- scripts/train_eagle3_dataflow.py, examples/disagg/run_disagg_eagle3.py: use
  the strategy-neutral builders.
- tests/test_runtime/test_strategy_registry.py (new, CPU): registry / alias /
  unwired-strategy-guard contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the dataflow-up-20-online-async-loop branch from 9dba169 to 17a8770 Compare July 1, 2026 00:29
@maocheng23 maocheng23 force-pushed the dataflow-up-21-composable-launch branch from 457b20a to 8faf111 Compare July 1, 2026 00:29
@maocheng23

Collaborator Author

Addressed review feedback (self-review pass).

  • registry.py: resolve_strategy error now reuses available_strategies() (one source of truth); dropped the dead StrategySpec.offline_target_repr field (never read — the reader hardcodes target_repr).
  • launch.py: removed the unreferenced _online_cat_collate back-compat alias; added _online_collate() guard so a supports_online strategy with no make_online_collate fails with an actionable NotImplementedError instead of TypeError: NoneType is not callable.

Validated: full tests/test_runtime = 200 OK (2 skipped, 1 xfail), zero failures, on a 2-node H200 pod. Lint clean (black 24.10.0 / isort 5.13.2 / autoflake).

Base automatically changed from dataflow-up-20-online-async-loop to dataflow-up-16-zerocopy July 2, 2026 05:19
@jiapingW jiapingW merged commit ec10f47 into dataflow-up-16-zerocopy Jul 2, 2026
1 check passed
@jiapingW jiapingW deleted the dataflow-up-21-composable-launch branch July 2, 2026 05:41