[DataFlow runtime] Domino end-to-end + StepContext for schedule-dependent loss by maocheng23 · Pull Request #629 · sgl-project/SpecForge

maocheng23 · 2026-06-30T19:07:30Z

What

Domino is the third algorithm on the composable launch. Beyond a StrategySpec, it needs exactly one shared-contract extension: StepContext.

Changes

- strategy.py: StepContext{global_step, total_steps} threaded into forward_loss (optional; eagle3/dflash ignore it). DominoTrainStrategy reuses the DFlash feature schema + adapter; its forward_loss reads ctx to compute the decaying lambda_base that blends Domino's base loss (mirrors train_domino.get_lambda_base).
- trainer.py: TrainerCore.train_step / eval_step accept a StepContext; fit passes StepContext(global_step, total_steps=max_steps). Backward-compatible.
- contracts.py: DraftStrategyName += "domino".
- registry.py: domino spec that reuses the DFlash transform/collate/adapter and adds a domino reader and DominoTrainStrategy. No new builder, no launch.py change.
- tests/_fixtures.py: build_domino (DFlash draft w/ projector_type="domino" head → OnlineDominoModel).
- tests/test_domino_launch.py (new): CPU lambda-schedule test + offline/online GPU end-to-end.
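
The StepContext threading described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in strategy.py or trainer.py: only the StepContext field names (global_step, total_steps) and the forward_loss(ctx=...) keyword come from this PR. The two stand-in strategy classes, the toy fit loop, and the way the base loss is blended are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class StepContext:
    """Schedule state handed to a strategy on every step (field names from the PR)."""
    global_step: int
    total_steps: Optional[int] = None  # None: no known training horizon


class ScheduleFreeStrategy:
    """Hypothetical stand-in for eagle3/dflash: accepts ctx but never reads it."""

    def forward_loss(self, batch: Sequence[float],
                     ctx: Optional[StepContext] = None) -> float:
        return float(sum(batch))


class ScheduleAwareStrategy:
    """Hypothetical stand-in for a Domino-like strategy whose loss depends on ctx."""

    def forward_loss(self, batch: Sequence[float],
                     ctx: Optional[StepContext] = None) -> float:
        final_loss = float(sum(batch))
        base_loss = float(len(batch))
        if ctx is None or not ctx.total_steps:
            weight = 0.0  # no horizon: the base-loss term drops out
        else:
            # Toy linear decay; the real blend is an assumption here.
            weight = 1.0 - min(ctx.global_step / ctx.total_steps, 1.0)
        return final_loss + weight * base_loss


def fit(strategy, batches: List[Sequence[float]], max_steps: Optional[int]) -> List[float]:
    """Toy training loop: builds one StepContext per step and passes it down."""
    losses = []
    for step, batch in enumerate(batches):
        ctx = StepContext(global_step=step, total_steps=max_steps)
        losses.append(strategy.forward_loss(batch, ctx=ctx))
    return losses
```

Because ctx is an optional keyword, a caller that predates StepContext can keep calling forward_loss(batch) unchanged, which is the sense in which the extension is backward-compatible.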

Adding domino required no changes to launch.py and reused the dflash data path, which is the "new algorithm = a spec + its loss" goal. StepContext is the one genuine, deliberate contract change (as opposed to leaking schedule state through ad-hoc kwargs).
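
To make the "new algorithm = a spec + its loss" idea concrete, here is a hedged sketch of a registry in which a new entry is derived from an existing one. The StrategySpec field names, the REGISTRY dict, the launch function, and every helper below are hypothetical illustrations; the actual layout of registry.py is not shown in this PR text.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class StrategySpec:
    """Hypothetical spec shape: a data path plus the class that owns the loss."""
    transform: Callable
    collate: Callable
    adapter: Callable
    reader: Callable
    strategy_cls: type


# Hypothetical DFlash data-path pieces (identity functions for illustration).
def dflash_transform(sample):
    return sample


def dflash_collate(samples):
    return list(samples)


def dflash_adapter(batch):
    return batch


def dflash_reader(record):
    return {**record, "strategy": "dflash"}


def domino_reader(record):
    # Only the strategy tag differs from the DFlash reader in this sketch.
    return {**record, "strategy": "domino"}


class DFlashTrainStrategy:
    pass


class DominoTrainStrategy:
    pass


REGISTRY: Dict[str, StrategySpec] = {
    "dflash": StrategySpec(dflash_transform, dflash_collate, dflash_adapter,
                           dflash_reader, DFlashTrainStrategy),
}

# The new algorithm reuses the existing data path and swaps reader + strategy.
REGISTRY["domino"] = replace(REGISTRY["dflash"],
                             reader=domino_reader,
                             strategy_cls=DominoTrainStrategy)


def launch(name: str):
    """Generic launch path: looks the spec up by name, with no per-algorithm branch."""
    spec = REGISTRY[name]
    return spec.strategy_cls()
```

In this shape the launch function never mentions a specific algorithm, so registering "domino" needs no edit to it.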

Testing

The suite run at this tip (sci-h200 / H200) reported 197 tests OK, including the new domino offline/online GPU tests and the CPU schedule test.

Stacked on the dflash PR. Part 3/3.

🤖 Generated with Claude Code

gemini-code-assist · 2026-06-30T19:07:33Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/; fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′); disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted; it is already an explicit 🔴 gate): added the valid narrow point instead. The spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it; clarify that the per-step strategy seam stays and the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 · 2026-06-30T23:27:06Z

Code review

No high-confidence issues found. Checked for bugs in the Domino StrategySpec, StepContext threading, lambda schedule parity with the script, and offline/online launcher wiring.

…dent loss

Domino is the third algorithm on the composable launch: a StrategySpec plus the ONE genuine shared-contract extension the analysis predicted.
- strategy.py: StepContext{global_step, total_steps} threaded into forward_loss (optional; eagle3/dflash ignore it). DominoTrainStrategy: reuses the DFlash feature schema + adapter; its forward_loss reads ctx to compute the decaying lambda_base that blends Domino's base loss (mirrors train_domino.get_lambda_base).
- trainer.py: TrainerCore.train_step / eval_step accept a StepContext; fit passes StepContext(global_step, total_steps=max_steps). Backward-compatible.
- contracts.py: DraftStrategyName += "domino".
- registry.py: domino spec that reuses the DFlash transform/collate/adapter, with a domino reader (strategy tag) + DominoTrainStrategy. No new builder, no launch.py change.
- tests/_fixtures.py: build_domino (DFlash draft w/ projector_type="domino" head -> OnlineDominoModel).
- tests/test_domino_launch.py (new): CPU lambda-schedule test + offline/online GPU end-to-end.

Adding domino touched ZERO launch.py and reused the dflash data path, exactly the "new algorithm = a spec + its loss" goal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 · 2026-07-01T00:30:50Z

Addressed review feedback (self-review pass) — includes one real bug fix.

- Bug: Domino base-loss schedule was silently disabled. fit() built StepContext(total_steps=self.max_steps), and max_steps defaults to None in every builder, so _lambda_base returned 0 for the whole run (blend collapsed to pure final loss). Added an explicit total_steps knob threaded builders → TrainerController → StepContext (falls back to max_steps), plus a TestStepContextThreading regression test that asserts fit() threads a real horizon.
- Dedup: extracted the linear decay into strategy.linear_lambda_base, now the single source shared by DominoTrainStrategy._lambda_base and scripts/train_domino.get_lambda_base.
- Folded in the test_trainer.py forward_loss(ctx=...) signature fix required by the StepContext change.
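
The failure mode and the fix described above can be illustrated with a small sketch. The function name linear_lambda_base and the knob names lambda_start and decay_ratio appear in this comment, but the signature, the exact decay formula, and the resolve_total_steps helper below are assumptions for illustration; the real defaults live in train_domino.py and are not reproduced here.

```python
from typing import Optional


def linear_lambda_base(global_step: int,
                       total_steps: Optional[int],
                       lambda_start: float,
                       decay_ratio: float) -> float:
    """Linearly decay lambda from lambda_start to 0 (assumed formula).

    The decay is assumed to finish after decay_ratio * total_steps steps.
    With no horizon (total_steps is None or 0) the weight is 0, which is
    the silent-disable behaviour the bug report describes.
    """
    if not total_steps:
        return 0.0
    decay_steps = max(1, int(total_steps * decay_ratio))
    remaining = max(0.0, 1.0 - global_step / decay_steps)
    return lambda_start * remaining


def resolve_total_steps(total_steps: Optional[int],
                        max_steps: Optional[int]) -> Optional[int]:
    """Hypothetical helper: prefer the explicit knob, fall back to max_steps."""
    return total_steps if total_steps is not None else max_steps


# Before the fix: max_steps is None, so the schedule is flat zero.
disabled = [linear_lambda_base(s, None, 1.0, 0.5) for s in (0, 25, 50)]

# After the fix: an explicit horizon gives a real decay.
horizon = resolve_total_steps(100, None)
enabled = [linear_lambda_base(s, horizon, 1.0, 0.5) for s in (0, 25, 50)]
```

A regression test in the spirit of TestStepContextThreading would assert that the horizon reaching the strategy is not None, since a None horizon produces a plausible-looking but constant schedule rather than an error.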

Deferred: exposing lambda_start/decay_ratio as launch-layer knobs (a config gap, not a bug; defaults match train_domino.py).

Validated: full tests/test_runtime = 200 OK (2 skipped, 1 xfail), zero failures, on a 2-node H200 pod. Lint clean (black 24.10.0 / isort 5.13.2 / autoflake).

maocheng23 mentioned this pull request Jun 30, 2026

docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) #630

Merged

maocheng23 marked this pull request as ready for review June 30, 2026 23:26

maocheng23 requested a review from FrankLeeeee as a code owner June 30, 2026 23:26

maocheng23 force-pushed the dataflow-up-22-dflash branch from 6945bfe to a6b8b7d Compare July 1, 2026 00:29

maocheng23 requested review from FlamingoPg, shuaills and sleepcoo as code owners July 1, 2026 00:29

maocheng23 force-pushed the dataflow-up-23-domino branch from 64eb276 to 9c1c020 Compare July 1, 2026 00:29

maocheng23 mentioned this pull request Jul 1, 2026

[DataFlow runtime] Phase B1 — TargetEngine ABC + de-EAGLE3 the target boundary #631

Merged

Base automatically changed from dataflow-up-22-dflash to dataflow-up-16-zerocopy July 2, 2026 05:47

jiapingW approved these changes Jul 3, 2026

View reviewed changes

jiapingW merged commit 9ede82d into dataflow-up-16-zerocopy Jul 3, 2026
1 check passed

jiapingW deleted the dataflow-up-23-domino branch July 3, 2026 01:49

maocheng23 mentioned this pull request Jul 4, 2026

Merge DataFlow runtime branch into main #648

Open

jiapingW merged 1 commit into dataflow-up-16-zerocopy from dataflow-up-23-domino
