[DataFlow runtime 6/7] Training: TrainerCore/Controller, FSDP backend, strategies by maocheng23 · Pull Request #599 · sgl-project/SpecForge

maocheng23 · 2026-06-24T17:37:46Z

DataFlow runtime, stacked PR. Stacked on #598 (true-stacked): this PR's base is the previous PR's branch, so the diff below shows only this layer.

Part 6/7 — training (TrainerCore/Controller, FSDP backend, strategies).

Adds specforge/runtime/training/. TrainerCore/TrainerController is a branch-free loop: global_step counts optimizer steps, and the loop handles gradient accumulation, checkpoint/eval hooks, and the ack at the optimizer boundary. FSDPTrainingBackend + ParallelConfig wrap the model while preserving TP and Ulysses/Ring SP; the optimizer is built over the wrapped module so that FSDP is in the forward/backward path. Eagle3TrainStrategy (and DFlashTrainStrategy) own the loss and projection so the core stays branch-free; the FeatureSpec.target_repr tagged union selects hidden_state vs. (pruned-)logits. Tests: test_trainer, test_seam_fixes. Additive.
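For readers skimming the description, here is a minimal sketch of the loop accounting it describes. This is not the PR's code. Every name below (`TrainStrategy`, `SquaredErrorStrategy`, `RecordingBackend`, `run_training`, the `ack` and `on_step` callbacks) is hypothetical and was chosen for illustration. It also assumes that "branch-free" means the core never branches on which strategy is in use. The accumulation-boundary check is still an ordinary conditional.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Protocol


class TrainStrategy(Protocol):
    """Hypothetical interface: the strategy, not the loop, computes the loss."""

    def loss(self, batch: Dict) -> float: ...


class SquaredErrorStrategy:
    """Toy strategy: squared error between a predicted and a target scalar."""

    def loss(self, batch: Dict) -> float:
        return (batch["pred"] - batch["target"]) ** 2


@dataclass
class RecordingBackend:
    """Toy stand-in for a training backend; it only records the calls made."""

    events: List[str] = field(default_factory=list)

    def backward(self, loss: float) -> None:
        self.events.append("backward")

    def optimizer_step(self) -> None:
        self.events.append("step")


def run_training(
    batches: Iterable[Dict],
    strategy: TrainStrategy,
    backend: RecordingBackend,
    accum_steps: int,
    ack: Callable[[List[int]], None],
    on_step: Callable[[int], None],
) -> int:
    """Run the loop and return the number of optimizer steps taken."""
    global_step = 0            # counts optimizer steps, not micro-batches
    pending: List[int] = []    # batch ids consumed but not yet acknowledged
    for micro_idx, batch in enumerate(batches, start=1):
        # Scale so the accumulated gradient matches one large batch.
        backend.backward(strategy.loss(batch) / accum_steps)
        pending.append(batch["id"])
        if micro_idx % accum_steps == 0:
            backend.optimizer_step()
            global_step += 1
            ack(pending)          # acknowledge only at the optimizer boundary
            pending = []
            on_step(global_step)  # checkpoint/eval hooks would hang off this
    # Assumption: a trailing partial accumulation window is left unacked here;
    # the description does not say how the PR handles it.
    return global_step
```

With five micro-batches and `accum_steps=2`, this sketch takes two optimizer steps and acknowledges the first four batches in two groups. Swapping in a different strategy object changes the loss without touching `run_training`.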

Part of a 7-PR series adding the DataFlow runtime (specforge/runtime/, milestones M1–M4). Verified on current upstream main: all subpackages import and 65 component tests pass. The integration launcher (launch.py + train_eagle3_dataflow.py) and the end-to-end equivalence gates are a deliberate follow-up, not in this series.

gemini-code-assist · 2026-06-24T17:38:25Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

maocheng23 requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners June 24, 2026 17:37

runtime(6/7): training (TrainerCore/Controller, FSDP backend, strategies) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2775580

maocheng23 force-pushed the dataflow-up-6-training branch from 6f47ecd to 2775580 Compare June 24, 2026 17:49

maocheng23 mentioned this pull request Jun 24, 2026

[DataFlow runtime 7/7] Integration: launcher + end-to-end equivalence gates #600

Merged

maocheng23 changed the base branch from main to dataflow-up-5-inference June 25, 2026 00:14

jiapingW self-requested a review June 25, 2026 08:47

jiapingW approved these changes Jun 25, 2026

jiapingW merged commit 39b3d11 into sgl-project:dataflow-up-5-inference Jun 25, 2026
2 checks passed

maocheng23 mentioned this pull request Jun 26, 2026

runtime (6/7): training (TrainerCore/Controller, FSDP backend, strategies) maocheng23/SpecForge#7

Closed
