[DataFlow runtime] Composable launch: StrategySpec registry + parameterized builders #627
Merged
jiapingW merged 1 commit intoJul 2, 2026
maocheng23
added a commit
that referenced
this pull request
Jun 30, 2026
… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/; fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′); disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted; it is already an explicit 🔴 gate): added the valid narrow point instead: the spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it; clarify the per-step strategy seam stays, the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collaborator
Author
Code review
No high-confidence issues found. Checked for bugs in the StrategySpec registry refactor, builder aliases, offline/online assembly, and strategy-specific guard behavior.
…erized builders

Adding a draft model is now a StrategySpec entry, not a new build_*_runtime family. The topology stays a named builder; the model becomes a `strategy=` parameter resolved through a registry. launch.py no longer grows as (topologies x models).

- registry.py (new): StrategySpec + register_strategy/resolve_strategy/available_strategies + concat_collate. eagle3 spec fully wired (reader/transform/collate/online-collate/adapter).
- launch.py: extract _assemble_trainer + _assemble_rollout_workers shared by every topology (offline / disagg-offline / online / disagg-online producer+consumer + one-process + interleaved); each builder takes `strategy=` and resolves a spec. eagle3-named builders kept as back-compat aliases; eagle3 behavior is byte-identical.
- scripts/train_eagle3_dataflow.py, examples/disagg/run_disagg_eagle3.py: use the strategy-neutral builders.
- tests/test_runtime/test_strategy_registry.py (new, CPU): registry / alias / unwired-strategy-guard contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
9dba169 to
17a8770
Compare
457b20a to
8faf111
Compare
Collaborator
Author
|
Addressed review feedback (self-review pass).
Validated: full |
Base automatically changed from
dataflow-up-20-online-async-loop
to
dataflow-up-16-zerocopy
July 2, 2026 05:19
jiapingW
approved these changes
Jul 2, 2026
What
Refactor the DataFlow runtime launch layer so adding a draft model is a `StrategySpec` entry, not a new `build_*_runtime` family. The topology stays a named builder; the model becomes a `strategy=` parameter resolved through a registry. `launch.py` no longer grows as (topologies × models).

Why
Every `build_*_eagle3_runtime` hardcoded `Eagle3TrainStrategy` + `strategy="eagle3"` + the eagle3 reader/collate, so adding dflash/domino would multiply the ~7 topology builders (N×M). The components (`TrainerCore` / `FSDPTrainingBackend` / `FeatureDataLoader`) were already model-agnostic; the duplication was purely in the wiring layer.

Changes
- `specforge/runtime/training/registry.py` (new): `StrategySpec` + `register_strategy` / `resolve_strategy` / `available_strategies` + `concat_collate`. eagle3 spec fully wired (reader / transform / collate / online-collate / adapter).
- `launch.py`: extract `_assemble_trainer` + `_assemble_rollout_workers` shared by every topology (offline / disagg-offline / online / disagg-online producer+consumer + one-process + interleaved); each builder takes `strategy=` and resolves a spec. The eagle3-named builders are kept as back-compat aliases; eagle3 behavior is byte-identical.
- `scripts/train_eagle3_dataflow.py`, `examples/disagg/run_disagg_eagle3.py`: use the strategy-neutral builders.
- `tests/test_runtime/test_strategy_registry.py` (new, CPU): registry / alias / unwired-strategy-guard contract.

Testing
Full `tests/test_runtime` green on sci-h200 (H200) at the tip of this stack: 197 tests OK (2 env skips, 1 pre-existing Mooncake xfail). eagle3's existing launch + equivalence tests pass unchanged.

Stacked on `dataflow-up-20-online-async-loop` (#625). Part 1/3 of the algorithm-organization cleanup; dflash + domino follow as stacked PRs.

🤖 Generated with Claude Code
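For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a strategy registry feeding a parameterized topology builder. It is not the code in this PR. The names `StrategySpec`, `register_strategy`, `resolve_strategy` and `available_strategies` come from the description above, but their fields and signatures here are assumptions. `build_offline_runtime`, `build_offline_eagle3_runtime`, `make_reader`, `collate` and `wired` are hypothetical stand-ins, and placing the unwired-strategy guard inside `resolve_strategy` is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class StrategySpec:
    """Hypothetical shape: the per-model pieces a topology builder needs."""
    name: str
    make_reader: Optional[Callable[[str], Any]] = None  # assumed field
    collate: Optional[Callable[[List[Any]], Any]] = None  # assumed field

    @property
    def wired(self) -> bool:
        # A spec counts as wired only when every component is provided.
        return self.make_reader is not None and self.collate is not None


_REGISTRY: Dict[str, StrategySpec] = {}


def register_strategy(spec: StrategySpec) -> StrategySpec:
    if spec.name in _REGISTRY:
        raise ValueError(f"strategy {spec.name!r} is already registered")
    _REGISTRY[spec.name] = spec
    return spec


def available_strategies() -> List[str]:
    return sorted(_REGISTRY)


def resolve_strategy(name: str) -> StrategySpec:
    if name not in _REGISTRY:
        raise KeyError(
            f"unknown strategy {name!r}; available: {available_strategies()}"
        )
    spec = _REGISTRY[name]
    if not spec.wired:
        # Guard: registered, but its components are not implemented yet.
        raise NotImplementedError(f"strategy {name!r} is not wired yet")
    return spec


def build_offline_runtime(data_path: str, strategy: str = "eagle3") -> Dict[str, Any]:
    """One builder per topology; the model arrives as a parameter."""
    spec = resolve_strategy(strategy)
    return {
        "topology": "offline",
        "strategy": spec.name,
        "reader": spec.make_reader(data_path),
        "collate": spec.collate,
    }


def build_offline_eagle3_runtime(data_path: str) -> Dict[str, Any]:
    """Back-compat alias: same result as the neutral builder with eagle3."""
    return build_offline_runtime(data_path, strategy="eagle3")


# Placeholder components stand in for the real reader/collate functions.
register_strategy(
    StrategySpec("eagle3", make_reader=lambda p: f"eagle3-reader:{p}", collate=list)
)
register_strategy(StrategySpec("dflash"))  # registered, deliberately unwired
```

With this layout, a new draft model costs one `register_strategy(...)` call and a new topology costs one builder, so the wiring grows as N + M rather than N × M.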