
[DataFlow runtime] Domino end-to-end + StepContext for schedule-dependent loss#629

Merged: jiapingW merged 1 commit into dataflow-up-16-zerocopy from dataflow-up-23-domino on Jul 3, 2026

Conversation

@maocheng23

Collaborator

What

Domino — the third algorithm on the composable launch. Beyond a StrategySpec it needs exactly one shared-contract extension: StepContext.

Changes

  • strategy.py: StepContext{global_step, total_steps} threaded into forward_loss (optional; eagle3/dflash ignore it). DominoTrainStrategy reuses the DFlash feature schema + adapter; its forward_loss reads ctx to compute the decaying lambda_base that blends Domino's base loss (mirrors train_domino.get_lambda_base).
  • trainer.py: TrainerCore.train_step / eval_step accept a StepContext; fit passes StepContext(global_step, total_steps=max_steps). Backward-compatible.
  • contracts.py: DraftStrategyName += "domino".
  • registry.py: domino spec — reuses DFlash transform/collate/adapter + a domino reader; DominoTrainStrategy. No new builder, no launch.py change.
  • tests/_fixtures.py: build_domino (DFlash draft w/ projector_type="domino" head → OnlineDominoModel).
  • tests/test_domino_launch.py (new): CPU lambda-schedule test + offline/online GPU end-to-end.
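
The StepContext threading described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in strategy.py or trainer.py: only the `StepContext` field names and the `forward_loss(ctx=...)` keyword come from this PR. The two stand-in strategy classes, the dict-shaped batch, the linear weight, and the additive blend are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepContext:
    """Schedule state handed to the loss; field names follow the PR description."""
    global_step: int
    total_steps: Optional[int] = None


class ScheduleFreeStrategy:
    """Hypothetical stand-in for eagle3/dflash: accepts ctx but never reads it."""

    def forward_loss(self, batch, ctx: Optional[StepContext] = None) -> float:
        return float(batch["final_loss"])


class ScheduleAwareStrategy:
    """Hypothetical stand-in for a Domino-like strategy with a step-dependent loss."""

    def forward_loss(self, batch, ctx: Optional[StepContext] = None) -> float:
        # Assumed weight: 1.0 at step 0, decaying linearly to 0.0 at total_steps.
        if ctx is None or not ctx.total_steps:
            weight = 0.0
        else:
            weight = max(0.0, 1.0 - ctx.global_step / ctx.total_steps)
        # Assumed additive blend; the real combination may differ.
        return float(batch["final_loss"]) + weight * float(batch["base_loss"])


def train_step(strategy, batch, ctx: Optional[StepContext] = None) -> float:
    """Trainer-side seam: forwards ctx unchanged, so callers without ctx still work."""
    return strategy.forward_loss(batch, ctx=ctx)
```

Because `ctx` defaults to `None`, a caller that predates the change can keep calling `train_step(strategy, batch)`, which is the sense in which the change is backward-compatible.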

Adding domino touched zero launch.py and reused the dflash data path — the "new algorithm = a spec + its loss" goal. StepContext is the one genuine, deliberate contract change (vs. leaking schedule state through ad-hoc kwargs).
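
As a rough picture of "a spec + its loss", the registration could look like the sketch below. The real StrategySpec fields are not shown in this PR, so the field names, the `REGISTRY` dict, and every helper function here are hypothetical. Only the idea is taken from the description: the domino entry reuses the DFlash transform/collate and swaps in its own reader and strategy class.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class StrategySpec:
    """Hypothetical spec shape, for illustration only."""
    name: str
    transform: Callable[[Any], Any]
    collate: Callable[[List[Any]], Any]
    reader: Callable[[Dict[str, Any]], Dict[str, Any]]
    strategy_cls: type


def dflash_transform(sample):
    return sample  # placeholder for the shared DFlash transform


def dflash_collate(samples):
    return list(samples)  # placeholder for the shared DFlash collate


def dflash_reader(record):
    return {**record, "strategy": "dflash"}


def domino_reader(record):
    # Assumed behaviour: same record, tagged with the domino strategy name.
    return {**record, "strategy": "domino"}


class DFlashTrainStrategy:
    pass


class DominoTrainStrategy:
    pass


DFLASH_SPEC = StrategySpec(
    name="dflash",
    transform=dflash_transform,
    collate=dflash_collate,
    reader=dflash_reader,
    strategy_cls=DFlashTrainStrategy,
)

REGISTRY: Dict[str, StrategySpec] = {"dflash": DFLASH_SPEC}

# The new algorithm overrides only what differs; the data path is shared.
REGISTRY["domino"] = replace(
    DFLASH_SPEC,
    name="domino",
    reader=domino_reader,
    strategy_cls=DominoTrainStrategy,
)
```

Under this layout the launcher only looks up a name in the registry, which is why a new algorithm would not require a launch.py change.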

Testing

Covered by the 197-test suite run at this tip (all OK, sci-h200 / H200), including the new domino offline/online GPU tests and the CPU schedule test.

Stacked on the dflash PR. Part 3/3.

🤖 Generated with Claude Code

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

maocheng23 added a commit that referenced this pull request Jun 30, 2026
… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629)
  "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one
  consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A).
  Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers
  (specforge/eval/), not specforge/runtime/eval/ — fix the eval-and-breadth.md outlier to match
  plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports
  (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′) — disambiguate in §2.2,
  the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted — it is already an explicit 🔴 gate): added the valid
  narrow point instead — the spike scopes only the sglang_server slice of Phase B; the de-EAGLE3
  extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E
  move it — clarify the per-step strategy seam stays, the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not
  "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 marked this pull request as ready for review June 30, 2026 23:26
@maocheng23 maocheng23 requested a review from FrankLeeeee as a code owner June 30, 2026 23:26

@maocheng23

Collaborator Author

Code review

No high-confidence issues found. Checked for bugs in the Domino StrategySpec, StepContext threading, lambda schedule parity with the script, and offline/online launcher wiring.

…dent loss

Domino is the third algorithm on the composable launch — a StrategySpec plus the
ONE genuine shared-contract extension the analysis predicted.

- strategy.py: StepContext{global_step, total_steps} threaded into forward_loss
  (optional; eagle3/dflash ignore it). DominoTrainStrategy: reuses the DFlash
  feature schema + adapter; its forward_loss reads ctx to compute the decaying
  lambda_base that blends Domino's base loss (mirrors train_domino.get_lambda_base).
- trainer.py: TrainerCore.train_step / eval_step accept a StepContext; fit passes
  StepContext(global_step, total_steps=max_steps). Backward-compatible.
- contracts.py: DraftStrategyName += "domino".
- registry.py: domino spec — reuses DFlash transform/collate/adapter, domino reader
  (strategy tag) + DominoTrainStrategy. No new builder, no launch.py change.
- tests/_fixtures.py: build_domino (DFlash draft w/ projector_type="domino" head ->
  OnlineDominoModel).
- tests/test_domino_launch.py (new): CPU lambda-schedule test + offline/online GPU
  end-to-end.

Adding domino touched ZERO launch.py and reused the dflash data path — exactly the
"new algorithm = a spec + its loss" goal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23

Collaborator Author

Addressed review feedback (self-review pass) — includes one real bug fix.

  • Bug: Domino base-loss schedule was silently disabled. fit() built StepContext(total_steps=self.max_steps), and max_steps defaults to None in every builder, so _lambda_base returned 0 for the whole run (the blend collapsed to pure final loss). Added an explicit total_steps knob threaded builders → TrainerController → StepContext (falls back to max_steps), plus a TestStepContextThreading regression test that asserts fit() threads a real horizon.
  • Dedup: extracted the linear decay into strategy.linear_lambda_base, now the single source shared by DominoTrainStrategy._lambda_base and scripts/train_domino.get_lambda_base.
  • Folded in the test_trainer.py forward_loss(ctx=...) signature fix required by the StepContext change.
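
The bug and its fix can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the body of strategy.linear_lambda_base: the signature, the default values of `lambda_start` and `decay_ratio`, and the helper `resolve_total_steps` are hypothetical. What it takes from the comment above is the behaviour: with no horizon the schedule returns 0, and an explicit `total_steps` falls back to `max_steps`.

```python
from typing import Optional


def linear_lambda_base(
    global_step: int,
    total_steps: Optional[int],
    lambda_start: float = 1.0,   # hypothetical default
    decay_ratio: float = 0.5,    # hypothetical default
) -> float:
    """Linearly decay from lambda_start to 0 over the first decay_ratio of training."""
    if not total_steps or decay_ratio <= 0:
        # No horizon means no schedule: this is the silent failure mode from the bug.
        return 0.0
    decay_steps = total_steps * decay_ratio
    return lambda_start * max(0.0, 1.0 - global_step / decay_steps)


def resolve_total_steps(
    total_steps: Optional[int], max_steps: Optional[int]
) -> Optional[int]:
    """Prefer the explicit knob; otherwise fall back to max_steps (which may be None)."""
    return total_steps if total_steps is not None else max_steps


# Before the fix: max_steps is None, so the weight is 0 at every step.
print(linear_lambda_base(0, resolve_total_steps(None, None)))

# After the fix: an explicit horizon gives a real decay.
print(linear_lambda_base(25, resolve_total_steps(100, None)))
```

A regression test in the spirit of TestStepContextThreading would then assert that the horizon seen by the strategy during fit() is not None, rather than only checking the schedule function in isolation.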

Deferred: exposing lambda_start/decay_ratio as launch-layer knobs (a config gap, not a bug; defaults match train_domino.py).

Validated: full tests/test_runtime = 200 OK (2 skipped, 1 xfail), zero failures, on a 2-node H200 pod. Lint clean (black 24.10.0 / isort 5.13.2 / autoflake).

Base automatically changed from dataflow-up-22-dflash to dataflow-up-16-zerocopy July 2, 2026 05:47
@jiapingW jiapingW merged commit 9ede82d into dataflow-up-16-zerocopy Jul 3, 2026
1 check passed
@jiapingW jiapingW deleted the dataflow-up-23-domino branch July 3, 2026 01:49