[DataFlow runtime] Phase B3: domain Trainer wrapping the runtime seam #633
Merged
Introduce the domain training layer `specforge/training/` with a caller-facing
`Trainer` that composes the whole training spine behind one object + `.fit()`:
FeatureDataLoader + FSDPTrainingBackend.prepare_model (FSDP wrap)
+ spec.make_strategy -> TrainerCore -> TrainerController
`Trainer` is the canonical assembler now; `launch._assemble_trainer` delegates to
it and returns the same `(TrainerController, FeatureDataLoader)` tuple, so every
`build_*_runtime` path is byte-for-byte unchanged (no fork — one wiring path).
The runtime seam (TrainerController / TrainerCore / DraftTrainStrategy /
FSDPTrainingBackend) is untouched; this is the domain facade over it. Topology
(offline/online/disagg) stays invisible to Trainer — absorbed by the (ref source
+ FeatureStore) it is handed. No HiddenStateStream: the loader is the stream.
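The composition described above can be sketched roughly as follows. This is only an illustration under stated assumptions: every class here is a hypothetical stand-in that reuses the names from this PR, and the constructor arguments, the `train_step` method, and the meaning of `ack_fn` are guesses, not the real specforge signatures.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


# Hypothetical stand-ins: same names as in the PR, invented signatures.
class FeatureDataLoader:
    def __init__(self, store: Any, refs: Iterable[Any]) -> None:
        self.store = store
        self.refs = list(refs)  # refs are enqueued up front

    def __iter__(self) -> Iterator[Any]:
        # "The loader is the stream": one batch per ref in this toy version.
        return iter(self.refs)


class FSDPTrainingBackend:
    def prepare_model(self, model: Any) -> Any:
        # Stand-in for the FSDP wrap: tags the model instead of sharding it.
        return ("fsdp", model)


class TrainerCore:
    def __init__(self, model: Any, strategy: Callable[[Any, Any], float]) -> None:
        self.model = model
        self.strategy = strategy

    def train_step(self, batch: Any) -> float:
        return self.strategy(self.model, batch)


class TrainerController:
    def __init__(self, core: TrainerCore, ack_fn: Callable[[Any], None]) -> None:
        self.core = core
        self.ack_fn = ack_fn

    def fit(self, loader: FeatureDataLoader) -> List[float]:
        losses = []
        for batch in loader:
            losses.append(self.core.train_step(batch))
            self.ack_fn(batch)  # assumed meaning: acknowledge the consumed sample
        return losses


class Trainer:
    """Facade: one object that wires loader, backend, core, and controller."""

    def __init__(self, model, make_strategy, store, refs, ack_fn) -> None:
        self.loader = FeatureDataLoader(store, refs)
        wrapped = FSDPTrainingBackend().prepare_model(model)
        core = TrainerCore(wrapped, make_strategy())
        self.controller = TrainerController(core, ack_fn)

    def fit(self) -> List[float]:
        return self.controller.fit(self.loader)


def assemble_trainer(**kwargs) -> Tuple[TrainerController, FeatureDataLoader]:
    # Launch-side helper: delegates to Trainer, returns the old tuple shape.
    trainer = Trainer(**kwargs)
    return trainer.controller, trainer.loader


def make_strategy() -> Callable[[Any, Any], float]:
    # Toy strategy: the "loss" is just the batch length.
    return lambda model, batch: float(len(batch))


acked: List[Any] = []
kwargs = dict(model="draft", make_strategy=make_strategy, store={},
              refs=["ab", "cde"], ack_fn=acked.append)
losses = Trainer(**kwargs).fit()
controller, loader = assemble_trainer(**kwargs)
```

Because `assemble_trainer` only unpacks what `Trainer` already built, callers that expect the `(controller, loader)` tuple see one wiring path rather than a second copy of it.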
- specforge/training/{__init__.py (PEP 562 lazy Trainer export), trainer.py}
- launch.py: _assemble_trainer delegates to Trainer; drops the now-unused
FeatureDataLoader / FSDPTrainingBackend / ParallelConfig / TrainerCore /
TrainerController imports.
- tests/test_runtime/test_domain_trainer.py: fakes the runtime pieces and asserts
the composition (refs enqueued, loader/backend/core/controller args, ack_fn
wired to the DataFlowController, .fit() delegates over the loader).
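The PEP 562 lazy export noted for `__init__.py` relies on a module-level `__getattr__`, which Python calls only when normal attribute lookup on the module fails. The sketch below shows that mechanism in isolation; the `demo_training` module and the `_import_trainer` helper are hypothetical stand-ins (a real package would do `from .trainer import Trainer` inside `__getattr__`), and this is not the actual content of `specforge/training/__init__.py`.

```python
import types

load_calls = []


def _import_trainer():
    # Stand-in for `from .trainer import Trainer`; records that a load happened.
    load_calls.append("trainer")

    class Trainer:
        pass

    return Trainer


# Hypothetical package object; a real package would define __getattr__
# at module level in its __init__.py instead of attaching it like this.
pkg = types.ModuleType("demo_training")


def _module_getattr(name):
    if name == "Trainer":
        value = _import_trainer()
        setattr(pkg, name, value)  # cache so later lookups bypass __getattr__
        return value
    raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")


pkg.__getattr__ = _module_getattr

assert load_calls == []           # nothing is loaded when the package is created
trainer_cls = pkg.Trainer         # first access triggers the deferred load
assert load_calls == ["trainer"]
assert pkg.Trainer is trainer_cls and load_calls == ["trainer"]  # served from cache
```

A plausible motivation, which the PR does not state, is that importing the package stays cheap until `Trainer` is actually requested.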
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4e3c214 to 1d9060e
Base automatically changed from `dataflow-up-25-sglang-capture-backend` to `dataflow-up-16-zerocopy` July 3, 2026 02:12
jiapingW approved these changes Jul 3, 2026
Phase B (domain abstractions) — 3/3. Stacked on #632 (B2).
Introduces the domain training layer `specforge/training/` with a caller-facing `Trainer` that composes the whole spine behind one object + `.fit()`: `FeatureDataLoader` + `FSDPTrainingBackend.prepare_model` (FSDP wrap) + `spec.make_strategy` -> `TrainerCore` -> `TrainerController`.

`Trainer` is the canonical assembler now; `launch._assemble_trainer` delegates to it and returns the same `(TrainerController, FeatureDataLoader)` tuple, so every `build_*_runtime` path is byte-for-byte unchanged (one wiring path, no fork). The runtime seam (`TrainerController` / `TrainerCore` / `DraftTrainStrategy` / `FSDPTrainingBackend`) is untouched; this is the domain facade over it. Topology (offline/online/disagg) stays invisible to `Trainer`; it's absorbed by the (ref source + `FeatureStore`) it's handed. No `HiddenStateStream`: the loader is the stream.

- `specforge/training/{__init__.py, trainer.py}` (new)
- `launch.py`: `_assemble_trainer` delegates to `Trainer`; drops the now-unused `FeatureDataLoader` / `FSDPTrainingBackend` / `ParallelConfig` / `TrainerCore` / `TrainerController` imports.
- `tests/test_runtime/test_domain_trainer.py`: fakes the runtime pieces and asserts the composition (refs enqueued, loader/backend/core/controller args, `ack_fn` wired to the `DataFlowController`, `.fit()` delegates over the loader).

Validation

Full `tests/test_runtime`: 214 OK (2 skip, 1 xfail) on 8×H200. The existing launch/equivalence tests now run through this `Trainer`, so byte-identical loss is covered. Adversarial review: 0 confirmed defects.

Known gap (not blocking)

No single test drives the online loop with a real SGLang target (`backend="sglang"` → capture backend → adapter → `RolloutWorker` → `fit`); the sglang capture is validated end-to-end at the capture level (parity test) and the online rollout→train loop is validated with an HF target, but the two are not combined. Can add on request.

🤖 Generated with Claude Code