
runtime (7/7): integration — launcher + end-to-end equivalence gates #8

Closed
maocheng23 wants to merge 3 commits into stack/6-training from stack/7-integration


@maocheng23

Owner

Part 7/7 of the DataFlow runtime stack (M1–M4). Review bottom-up.

Wires the offline-EAGLE3 dataflow (launch.py plus a thin train_eagle3_dataflow.py) together with the README and the cross-layer GPU gates: offline/online/trainer-split/checkpoint equivalence, extraction-vs-HF, and the FSDP launcher path. Validated 62/62 on H200.

  • Base: stack/6-training; Head: stack/7-integration
  • CPU tests at this layer: 62, 7 of them GPU-gated (cumulative count from unittest discover tests/test_runtime; GPU tests are guarded with skipUnless(cuda)).
  • Stack order: 1 sglang-guard → 2 contracts → 3 data-plane → 4 control-plane → 5 inference → 6 training → 7 integration.
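
For readers unfamiliar with the GPU-gating pattern mentioned above, here is a minimal sketch of how a test can be skipped on CPU-only machines with unittest.skipUnless. The class name, test names, and the HAS_CUDA flag are hypothetical and are not taken from tests/test_runtime; the sketch only assumes that PyTorch may or may not be installed.

```python
import unittest

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    # CPU-only environment without PyTorch: GPU tests are skipped.
    torch = None
    HAS_CUDA = False


class TestRuntimeLayer(unittest.TestCase):  # hypothetical test case
    def test_cpu_contract(self):
        # Runs everywhere, with or without a GPU.
        self.assertEqual(sum([1, 2, 3]), 6)

    @unittest.skipUnless(HAS_CUDA, "requires CUDA")
    def test_gpu_gate(self):
        # Only executed when a CUDA device is visible.
        x = torch.ones(4, device="cuda")
        self.assertEqual(float(x.sum()), 4.0)


def run_suite():
    """Load and run the hypothetical test case, returning the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRuntimeLayer)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On a CPU-only machine the GPU test is reported as skipped rather than failed, so a cumulative run such as python -m unittest discover tests/test_runtime can still pass without a GPU.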

Each layer compiles and passes its own CPU tests; the cross-layer GPU equivalence gates land in 7/7. Split out of the monolithic PR for reviewability.
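
As a rough illustration of what an equivalence gate can look like, the sketch below compares per-step losses from two runs within a tolerance. This is an assumption about the general shape of such a check, not the gate implemented in this PR; the function name, the choice of losses as the compared quantity, and the tolerances are all hypothetical.

```python
import math


def assert_runs_equivalent(reference, candidate, rel_tol=1e-5, abs_tol=1e-7):
    """Raise AssertionError if two per-step loss sequences diverge.

    Hypothetical helper: `reference` and `candidate` are lists of floats,
    e.g. losses logged by an old and a new training path.
    """
    if len(reference) != len(candidate):
        raise AssertionError(
            f"step count differs: {len(reference)} vs {len(candidate)}"
        )
    for step, (ref, cand) in enumerate(zip(reference, candidate)):
        if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
            raise AssertionError(
                f"step {step}: reference={ref!r} candidate={cand!r}"
            )
```

Reporting the first diverging step, rather than only a pass/fail flag, makes it easier to tell an immediate mismatch from slow numerical drift.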

🤖 Generated with Claude Code

Wire the offline-EAGLE3 dataflow (launch.py + thin train_eagle3_dataflow.py),
README, and the cross-layer GPU gates: offline/online/trainer-split/checkpoint
equivalence, extraction-vs-HF, and the FSDP launcher path.
maocheng23 and others added 2 commits June 20, 2026 16:05
Whole-system map (autonomous compute loops, a passive metadata-only control
plane, a tensor-only data plane), an endpoint reference table, and the autonomy
model. Endpoints verified against source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stroy_distributed

- train_eagle3_dataflow.py: parse_args() does not set args.target_batch_size
  (train_eagle3.main derives it inline), so the offline dataflow launcher
  crashed with AttributeError in build_dataloaders before training. Derive it
  right after parse_args, mirroring train_eagle3.
- distributed.py: destroy_distributed() raised "Invalid process group specified"
  during teardown when a group is None or already destroyed, making a successful
  run exit non-zero. Destroy each group defensively so cleanup never crashes.

Both surfaced by a real Qwen2.5-7B offline old-vs-new e2e training run.
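
The defensive teardown described in the second bullet can be sketched as follows. This is not the code in distributed.py: the helper name and the injected destroy_fn parameter are hypothetical, chosen so the sketch runs without a GPU or PyTorch. In a real run, destroy_fn would presumably be torch.distributed.destroy_process_group.

```python
import logging

logger = logging.getLogger(__name__)


def destroy_groups_defensively(groups, destroy_fn):
    """Destroy every group in `groups` without ever raising.

    groups: dict mapping a label to a process-group handle (or None).
    destroy_fn: callable that destroys one group (hypothetical injection
        point; assumed to raise if the group is invalid or already gone).
    Returns the labels of the groups that were actually destroyed.
    """
    destroyed = []
    for name, group in groups.items():
        if group is None:
            # Never created on this rank: nothing to clean up.
            continue
        try:
            destroy_fn(group)
            destroyed.append(name)
        except Exception as exc:
            # Cleanup must not turn a successful run into a non-zero exit.
            logger.warning("skipping group %s during teardown: %s", name, exc)
    return destroyed
```

Catching Exception broadly is a deliberate trade-off here: at teardown the training result is already determined, so a warning is more useful than a crash.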

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23

Owner Author

Superseded: merged upstream as sgl-project#600. Closing this fork-internal review PR.

maocheng23 closed this on Jun 26, 2026