[DataFlow runtime] Phase B3 — domain Trainer wrapping the runtime seam #633

Merged: jiapingW merged 1 commit into dataflow-up-16-zerocopy from dataflow-up-26-domain-trainer on Jul 3, 2026

@maocheng23

Collaborator

Phase B (domain abstractions) — 3/3. Stacked on #632 (B2).

Introduces the domain training layer specforge/training/ with a caller-facing Trainer that composes the whole spine behind one object + .fit():

FeatureDataLoader  +  FSDPTrainingBackend.prepare_model (FSDP wrap)
                   +  spec.make_strategy -> TrainerCore -> TrainerController

Trainer is the canonical assembler now; launch._assemble_trainer delegates to it and returns the same (TrainerController, FeatureDataLoader) tuple, so every build_*_runtime path is byte-for-byte unchanged (one wiring path, no fork). The runtime seam (TrainerController / TrainerCore / DraftTrainStrategy / FSDPTrainingBackend) is untouched — this is the domain facade over it. Topology (offline/online/disagg) stays invisible to Trainer; it's absorbed by the (ref source + FeatureStore) it's handed. No HiddenStateStream — the loader is the stream.
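The composition described above can be pictured with a minimal sketch. Every class here is a hypothetical stand-in named after the role it plays (loader, backend, core, controller); the constructor arguments and method bodies are assumptions for illustration and are not the real specforge signatures.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Loader:
    """Stand-in for FeatureDataLoader: iterating it is the stream."""
    refs: List[Any]

    def __iter__(self):
        return iter(self.refs)


class Backend:
    """Stand-in for FSDPTrainingBackend."""

    def prepare_model(self, model):
        # The real backend would apply the FSDP wrap; here we only tag the model.
        return ("wrapped", model)


@dataclass
class Core:
    """Stand-in for TrainerCore: one step per batch."""
    model: Any
    strategy: Any

    def step(self, batch):
        return f"step({batch})"


@dataclass
class Controller:
    """Stand-in for TrainerController: drives the loop and acks each batch."""
    core: Core
    ack_fn: Callable[[Any], None]

    def fit(self, loader):
        results = []
        for batch in loader:
            results.append(self.core.step(batch))
            self.ack_fn(batch)
        return results


class Trainer:
    """Facade: builds loader, wrapped model, strategy, core and controller once."""

    def __init__(self, model, refs, make_strategy, ack_fn):
        self.loader = Loader(list(refs))
        wrapped = Backend().prepare_model(model)
        strategy = make_strategy(wrapped)
        self.controller = Controller(Core(wrapped, strategy), ack_fn)

    def fit(self):
        return self.controller.fit(self.loader)


def assemble_trainer(model, refs, make_strategy, ack_fn):
    """Delegating assembler: returns the same (controller, loader) tuple shape."""
    trainer = Trainer(model, refs, make_strategy, ack_fn)
    return trainer.controller, trainer.loader


acked = []
trainer = Trainer("draft", ["r0", "r1"], lambda m: "strategy", acked.append)
steps = trainer.fit()
controller, loader = assemble_trainer("draft", ["r0"], lambda m: "strategy", acked.append)
```

Because the assembler only unpacks what the facade already built, there is a single wiring path, which is the property the description relies on for the unchanged `build_*_runtime` behavior.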

  • specforge/training/{__init__.py, trainer.py} (new)
  • launch.py: _assemble_trainer delegates to Trainer; drops the now-unused FeatureDataLoader/FSDPTrainingBackend/ParallelConfig/TrainerCore/TrainerController imports.
  • New test: tests/test_runtime/test_domain_trainer.py — fakes the runtime pieces and asserts the composition (refs enqueued, loader/backend/core/controller args, ack_fn wired to the DataFlowController, .fit() delegates over the loader).
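A composition test of the kind listed above might look roughly like the following sketch. It uses `unittest.mock.MagicMock` as the fake for every runtime piece; the `assemble` function, its parameters, and the `parts` container are hypothetical and do not mirror the actual contents of test_domain_trainer.py.

```python
from unittest.mock import MagicMock


def assemble(parts, *, model, store, spec, ack_fn):
    """Hypothetical wiring under test: loader, FSDP wrap, strategy, core, controller."""
    loader = parts.Loader(store)
    wrapped = parts.Backend().prepare_model(model)
    strategy = spec.make_strategy(wrapped)
    core = parts.Core(wrapped, strategy)
    controller = parts.Controller(core, ack_fn=ack_fn)
    return controller, loader


# Fakes: each attribute of `parts` plays the role of one runtime class.
parts = MagicMock()
spec = MagicMock()
dataflow = MagicMock()  # plays the DataFlowController that owns the ack callback

controller, loader = assemble(
    parts, model="draft", store="store", spec=spec, ack_fn=dataflow.ack
)
controller.fit(loader)

# Assert the composition rather than any training result.
wrapped = parts.Backend.return_value.prepare_model.return_value
parts.Loader.assert_called_once_with("store")
parts.Backend.return_value.prepare_model.assert_called_once_with("draft")
spec.make_strategy.assert_called_once_with(wrapped)
parts.Core.assert_called_once_with(wrapped, spec.make_strategy.return_value)
parts.Controller.assert_called_once_with(parts.Core.return_value, ack_fn=dataflow.ack)
controller.fit.assert_called_once_with(loader)
```

Asserting on call arguments keeps the test free of GPUs and real models, while still failing if a piece is wired to the wrong neighbor.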

Validation

Full tests/test_runtime suite: 214 OK (2 skipped, 1 xfail) on 8×H200. The existing launch/equivalence tests now run through this Trainer, so byte-identical loss is covered. Adversarial review: 0 confirmed defects.

Known gap (not blocking)

No single test drives the online loop with a real SGLang target (backend="sglang" → capture backend → adapter → RolloutWorker → fit); the sglang capture is validated end-to-end at the capture level (parity test) and the online rollout→train loop is validated with an HF target, but not combined. Can add on request.

🤖 Generated with Claude Code

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Introduce the domain training layer `specforge/training/` with a caller-facing
`Trainer` that composes the whole training spine behind one object + `.fit()`:

    FeatureDataLoader  +  FSDPTrainingBackend.prepare_model (FSDP wrap)
                       +  spec.make_strategy -> TrainerCore -> TrainerController

`Trainer` is the canonical assembler now; `launch._assemble_trainer` delegates to
it and returns the same `(TrainerController, FeatureDataLoader)` tuple, so every
`build_*_runtime` path is byte-for-byte unchanged (no fork — one wiring path).
The runtime seam (TrainerController / TrainerCore / DraftTrainStrategy /
FSDPTrainingBackend) is untouched; this is the domain facade over it. Topology
(offline/online/disagg) stays invisible to Trainer — absorbed by the (ref source
+ FeatureStore) it is handed. No HiddenStateStream: the loader is the stream.

- specforge/training/{__init__.py (PEP 562 lazy Trainer export), trainer.py}
- launch.py: _assemble_trainer delegates to Trainer; drops the now-unused
  FeatureDataLoader / FSDPTrainingBackend / ParallelConfig / TrainerCore /
  TrainerController imports.
- tests/test_runtime/test_domain_trainer.py: fakes the runtime pieces and asserts
  the composition (refs enqueued, loader/backend/core/controller args, ack_fn
  wired to the DataFlowController, .fit() delegates over the loader).
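The "PEP 562 lazy Trainer export" mentioned in the first bullet refers to a module-level `__getattr__` that imports a name only on first access. The sketch below shows the mechanism on a synthetic module; `make_lazy_package`, the package name `demo_training`, and the use of `collections.OrderedDict` as a stand-in for the heavy `Trainer` import are all assumptions made so the example runs without specforge installed.

```python
import importlib
import types


def make_lazy_package(name, exports):
    """Build a module whose listed attributes are imported on first access (PEP 562)."""
    pkg = types.ModuleType(name)
    loaded = []  # records which names have actually been resolved

    def _lazy_getattr(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr not in exports:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(exports[attr]), attr)
        setattr(pkg, attr, value)  # cache, so later lookups bypass this hook
        loaded.append(attr)
        return value

    # PEP 562: a function named __getattr__ in the module namespace is the hook.
    pkg.__getattr__ = _lazy_getattr
    pkg._loaded = loaded
    return pkg


# OrderedDict stands in for Trainer; "collections" stands in for the trainer module.
lazy = make_lazy_package("demo_training", {"OrderedDict": "collections"})
before = list(lazy._loaded)   # nothing resolved yet
cls = lazy.OrderedDict        # first access triggers the import
again = lazy.OrderedDict      # served from the cached attribute
```

In a real package the same hook would normally be a plain `def __getattr__(name)` at the top level of `__init__.py`; the point of the pattern is that importing the package stays cheap until the exported name is actually used.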

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the dataflow-up-26-domain-trainer branch from 4e3c214 to 1d9060e Compare July 1, 2026 06:34
@maocheng23 maocheng23 marked this pull request as ready for review July 1, 2026 08:22
@maocheng23 maocheng23 requested a review from FrankLeeeee as a code owner July 1, 2026 08:22
Base automatically changed from dataflow-up-25-sglang-capture-backend to dataflow-up-16-zerocopy July 3, 2026 02:12
@jiapingW jiapingW self-requested a review July 3, 2026 02:17
@jiapingW jiapingW merged commit c0234ed into dataflow-up-16-zerocopy Jul 3, 2026
1 check passed
@jiapingW jiapingW deleted the dataflow-up-26-domain-trainer branch July 3, 2026 02:17