[DataFlow runtime 6/7] Training: TrainerCore/Controller, FSDP backend, strategies #599

Merged: jiapingW merged 1 commit into sgl-project:dataflow-up-5-inference from maocheng23:dataflow-up-6-training on Jun 25, 2026
Conversation

@maocheng23 commented Jun 24, 2026

Collaborator

DataFlow runtime: stacked PR. Stacked on #598 (true-stacked): this PR's base is the previous PR's branch, so the diff below shows only this layer.

Part 6/7 — training (TrainerCore/Controller, FSDP backend, strategies).

Adds specforge/runtime/training/. TrainerCore/TrainerController is a branch-free loop (global_step counts optimizer steps; gradient accumulation; checkpoint/eval hooks; ack at the optimizer boundary). FSDPTrainingBackend + ParallelConfig wrap the model preserving TP + Ulysses/Ring SP, with the optimizer built over the wrapped module so FSDP is in the forward/backward path. Eagle3TrainStrategy (+ DFlashTrainStrategy) own loss/projection so the core stays branch-free (the FeatureSpec.target_repr tagged union selects hidden_state vs (pruned-)logits). Tests: test_trainer, test_seam_fixes. Additive.
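The loop semantics described above (a strategy that owns the loss, `global_step` counting optimizer steps rather than micro-batches, and acks issued only at the optimizer boundary) can be illustrated with a toy sketch. This is not the code in `specforge/runtime/training/`: the names `ToyTrainerCore`, `SquaredErrorStrategy`, `loss_and_grad`, `accum_steps`, and `acked` are hypothetical, the "model" is a single scalar weight, and "branch-free" is read here as "no per-method conditionals in the core".

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Protocol, Tuple


class TrainStrategy(Protocol):
    """Hypothetical interface: the strategy owns the loss, so the core
    never needs to know which training method is in use."""

    def loss_and_grad(self, weight: float, batch: float) -> Tuple[float, float]:
        ...


class SquaredErrorStrategy:
    """Toy strategy: loss = (weight - batch)^2 on one scalar weight."""

    def loss_and_grad(self, weight: float, batch: float) -> Tuple[float, float]:
        diff = weight - batch
        return diff * diff, 2.0 * diff


@dataclass
class ToyTrainerCore:
    strategy: TrainStrategy
    accum_steps: int
    lr: float = 0.1
    weight: float = 0.0
    global_step: int = 0  # counts optimizer steps, not micro-batches
    acked: List[List[int]] = field(default_factory=list)

    def fit(self, batches: Iterable[float]) -> None:
        grad_sum = 0.0
        pending: List[int] = []  # micro-batches consumed but not yet acked
        for idx, batch in enumerate(batches):
            _, grad = self.strategy.loss_and_grad(self.weight, batch)
            grad_sum += grad / self.accum_steps  # average over the window
            pending.append(idx)
            if len(pending) == self.accum_steps:
                # Optimizer boundary: apply the update, advance global_step,
                # and only now acknowledge the micro-batches that fed it.
                self.weight -= self.lr * grad_sum
                self.global_step += 1
                self.acked.append(pending)
                grad_sum, pending = 0.0, []
        # Micro-batches left in `pending` never reached an optimizer step,
        # so in this sketch they are deliberately not acknowledged.


core = ToyTrainerCore(strategy=SquaredErrorStrategy(), accum_steps=2)
core.fit([1.0, 3.0, 2.0, 2.0, 5.0])
print(core.global_step, core.acked)
```

With five micro-batches and an accumulation window of two, the sketch takes two optimizer steps and acks indices `[0, 1]` and `[2, 3]`; the fifth micro-batch stays unacknowledged because no optimizer step consumed it. Swapping in a different strategy object changes the loss without touching `fit`.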

Part of a 7-PR series adding the DataFlow runtime (specforge/runtime/, milestones M1–M4). Verified on current upstream main: all subpackages import and 65 component tests pass. The integration launcher (launch.py + train_eagle3_dataflow.py) and the end-to-end equivalence gates are a deliberate follow-up, not in this series.

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

…ies)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 force-pushed the dataflow-up-6-training branch from 6f47ecd to 2775580 on June 24, 2026 17:49
@maocheng23 changed the base branch from main to dataflow-up-5-inference on June 25, 2026 00:14
@jiapingW self-requested a review on June 25, 2026 08:47
@jiapingW merged commit 39b3d11 into sgl-project:dataflow-up-5-inference on Jun 25, 2026
2 checks passed
2 participants