[DataFlow runtime 7/7] Integration: launcher + end-to-end equivalence gates #600

Merged: jiapingW merged 1 commit into sgl-project:dataflow-up-6-training from maocheng23:dataflow-up-7-integration on Jun 25, 2026

@maocheng23 maocheng23 commented Jun 24, 2026

DataFlow runtime, part 7/7 (integration). Stacked on #599 (true-stacked): this PR's base is the previous PR's branch, so the diff shows only this layer.

Turns the runtime into a thin launcher and adds the end-to-end equivalence gates.

What

  • specforge/runtime/launch.py: build_offline_eagle3_runtime assembles OfflineManifestReader → DataFlowController → LocalFeatureStore → FeatureDataLoader → Eagle3TrainStrategy → TrainerController/Core → FSDP.
  • scripts/train_eagle3_dataflow.py — thin offline launcher; reuses train_eagle3's model/data builders (no training logic in the script).
  • GPU equivalence gates (CPU-stub-importable, @skipUnless(cuda) for the GPU ones): test_equiv_offline_eagle3 (old run_forward vs new Eagle3TrainStrategy.forward_loss, bit-exact per-batch loss), test_equiv_online_eagle3, test_equiv_trainer_split, test_offline_launch_fsdp, test_checkpoint_resume, test_extraction_vs_hf_reference, plus _fixtures.py.
  • Two launcher robustness fixes surfaced by a real 7B run: derive args.target_batch_size in the dataflow launcher (it was read before being set → crash); harden destroy_distributed() against None/already-destroyed groups so a successful run does not exit non-zero on teardown.
  • Docs: runtime/README.md, runtime/ARCHITECTURE.md.
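
The teardown hardening described in the list above can be illustrated with a minimal sketch. This is not the PR's destroy_distributed(); it only shows the general pattern of skipping None groups and tolerating groups that are already gone. The helper name destroy_groups_safely, the FakeDist stand-in, and the exception types caught are all assumptions made for illustration. In real use the `dist` argument would be something like torch.distributed, which provides is_initialized() and destroy_process_group(group).

```python
class FakeDist:
    """Hypothetical stand-in for a torch.distributed-like module (no GPUs needed)."""

    def __init__(self, live_groups):
        self.live = set(live_groups)

    def is_initialized(self):
        # Treated as initialized while any group is still alive.
        return bool(self.live)

    def destroy_process_group(self, group=None):
        # Mimic a backend that rejects unknown or already-destroyed groups.
        if group not in self.live:
            raise ValueError("Invalid process group specified")
        self.live.remove(group)


def destroy_groups_safely(dist, groups):
    """Destroy each group at most once; never raise during teardown."""
    destroyed = 0
    for group in groups:
        # Skip groups that were never created or a backend that is already down.
        if group is None or not dist.is_initialized():
            continue
        try:
            dist.destroy_process_group(group)
            destroyed += 1
        except (ValueError, RuntimeError, AssertionError):
            # Assumed exception types: the group was already destroyed elsewhere.
            continue
    return destroyed


dist = FakeDist(live_groups={"tp", "dp"})
# "tp" appears twice and one entry is None; neither should break teardown.
count = destroy_groups_safely(dist, ["tp", "tp", None, "dp"])
print(count, dist.live)
```

The point of the pattern is that a failure while cleaning up should not turn an otherwise successful training run into a non-zero exit.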

How to run the full 7B old-vs-new offline comparison

# 0) Offline features (.ckpt) — either scripts/prepare_hidden_states.py (sglang),
#    or HF-only: run the target with output_hidden_states and save per prompt
#      input_ids:(seq,)  loss_mask:(seq,)
#      hidden_state:(1,seq,H)      = hidden_states[-1]                (lm-head input)
#      aux_hidden_state:(1,seq,3H) = cat(layers [1, L//2-1, L-4])     (default aux ids)
#    into <i>.ckpt (same format as prepare_hidden_states.DataPoint).
M="Qwen/Qwen2.5-7B-Instruct"; C="configs/qwen2.5-7b-eagle3.json"
ARGS="--target-model-path $M --draft-model-config $C --train-data-path prompts.jsonl \
      --train-hidden-states-path feats/ --target-model-backend hf --chat-template qwen \
      --max-num-steps 200 --batch-size 1 --seed 0"
# old path
torchrun --standalone --nproc_per_node 1 scripts/train_eagle3.py          $ARGS --output-dir out_old
# new path (identical args)
torchrun --standalone --nproc_per_node 1 scripts/train_eagle3_dataflow.py  $ARGS --output-dir out_new
# then diff per-step loss / acc / grad_norm from the two logs.
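
The feature layout described in step 0 can be written down as a small sanity check. The sketch below only encodes what the comments state: the default aux layer ids [1, L//2-1, L-4] and the per-sample tensor shapes. The function names are hypothetical, and the layer count of 28 used in the example is an assumption to be replaced with the target model's actual num_hidden_layers.

```python
def default_aux_layer_ids(num_layers):
    """Default aux layer ids as listed in the comments: [1, L//2-1, L-4]."""
    return [1, num_layers // 2 - 1, num_layers - 4]


def expected_shapes(seq_len, hidden_size):
    """Shapes each saved sample is expected to have, per the comments above."""
    return {
        "input_ids": (seq_len,),
        "loss_mask": (seq_len,),
        "hidden_state": (1, seq_len, hidden_size),
        # Three aux layers concatenated along the hidden dimension.
        "aux_hidden_state": (1, seq_len, 3 * hidden_size),
    }


# Example values only: 28 layers is an assumed layer count, not read from a config.
aux_ids = default_aux_layer_ids(28)
shapes = expected_shapes(seq_len=5, hidden_size=8)
print(aux_ids)
print(shapes)
```

Checking a few saved samples against these shapes before launching either training path helps rule out a feature-format mismatch as the cause of any old-vs-new difference.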

Results — Qwen2.5-7B, 200 steps, HF backend, seed 0 (offline)

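| step | loss (old / new) | acc (old / new) | accept (old / new) | grad (old / new) |
|------|------------------|-----------------|--------------------|------------------|
| 1    | 5.51 / 5.11      | 0.00 / 0.00     | 0.11 / 0.11        | 13.7 / 14.1      |
| 100  | 4.37 / 4.28      | 0.54 / 0.55     | 0.21 / 0.19        | 2.3 / 3.0        |
| 200  | 4.17 / 4.14      | 0.77 / 0.69     | 0.24 / 0.23        | 5.1 / 5.2        |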

Old and new converge to the same point (loss ≈ 4.15, acc ≈ 0.7, acceptance ≈ 0.23, grad ≈ 5). Per-step values are not bit-identical because the two paths iterate samples in different order and report loss slightly differently; test_equiv_offline_eagle3 isolates the per-batch math as bit-exact.
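
One way to quantify the old-vs-new agreement is to compute a relative gap per metric. The sketch below hard-codes the three reported rows from the results above; the relative_gap definition (absolute difference divided by the larger magnitude) and the helper names are choices made for illustration, not part of the PR's tooling.

```python
# (old, new) pairs copied from the reported results at steps 1, 100 and 200.
RESULTS = {
    1: {"loss": (5.51, 5.11), "acc": (0.00, 0.00),
        "accept": (0.11, 0.11), "grad": (13.7, 14.1)},
    100: {"loss": (4.37, 4.28), "acc": (0.54, 0.55),
          "accept": (0.21, 0.19), "grad": (2.3, 3.0)},
    200: {"loss": (4.17, 4.14), "acc": (0.77, 0.69),
          "accept": (0.24, 0.23), "grad": (5.1, 5.2)},
}


def relative_gap(old, new):
    """Absolute difference scaled by the larger magnitude (0.0 if both are zero)."""
    scale = max(abs(old), abs(new))
    return 0.0 if scale == 0 else abs(old - new) / scale


def gaps_at(step):
    """Relative old-vs-new gap for every metric at the given step."""
    return {name: relative_gap(old, new)
            for name, (old, new) in RESULTS[step].items()}


final = gaps_at(200)
for name, gap in final.items():
    print(f"{name}: {gap:.1%}")
```

What counts as "close enough" is a judgment call that depends on the metric's noise at batch size 1; the bit-exact guarantee comes from test_equiv_offline_eagle3, not from these end-to-end numbers.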

Part of an 8-PR stack adding the DataFlow runtime (M1–M4 + integration). Verified on current main: imports + full tests/test_runtime pass.

@maocheng23 maocheng23 changed the base branch from main to dataflow-up-6-training June 25, 2026 00:15
@maocheng23 maocheng23 force-pushed the dataflow-up-7-integration branch 2 times, most recently from ea463fc to d005a13 Compare June 25, 2026 00:57
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the dataflow-up-7-integration branch from d005a13 to 7a81ce5 Compare June 25, 2026 01:26
@jiapingW jiapingW self-requested a review June 25, 2026 08:49
@jiapingW jiapingW merged commit 8b4db4f into sgl-project:dataflow-up-6-training Jun 25, 2026
1 check passed