[DataFlow runtime] Online EAGLE3 launcher (build_online_eagle3_runtime + RolloutWorker) by maocheng23 · Pull Request #601 · sgl-project/SpecForge

maocheng23 · 2026-06-24T21:36:33Z

DataFlow runtime — online EAGLE3 launcher. Stacked on #600 — true-stacked: this PR's base is the previous PR's branch, so the diff below shows only this layer.

Adds the online counterpart of the offline launcher: training features are produced in-loop by a rollout worker instead of read from disk.

What

specforge/runtime/launch.py: build_online_eagle3_runtime mirrors build_offline_eagle3_runtime, except that SampleRefs are produced by a RolloutWorker + SGLangAdapter over the target's generate_eagle3_data (any backend: HF, SGLang, or custom). The data path is RolloutWorker → mem:// FeatureStore → SampleRefQueue → FeatureDataLoader → trainer. Returns (trainer, loader, workers, controller, drive_rollout), with target_head=None and target_repr="logits" because the online path has already materialized the target distribution.
scripts/train_eagle3_dataflow.py: online branch. When --train-hidden-states-path is absent, the script builds the target (is_online=True), ingests prompts, calls drive_rollout() to populate the queue, then runs trainer.fit.
tests/test_runtime/test_online_launch.py: single-rank launcher end-to-end test (HF target, no sglang) covering rollout → mem:// store → FSDP train. It asserts that the controller carries no tensors and checks optimizer-step semantics.
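The data path described above (a rollout worker writes features to a store, and only lightweight references travel through the queue) can be illustrated with a toy sketch. This is not the SpecForge implementation: InMemoryFeatureStore, ToyRolloutWorker, toy_loader, and fake_generate are hypothetical stand-ins invented for illustration, and plain Python lists take the place of tensors.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterator, List


@dataclass(frozen=True)
class SampleRef:
    """Lightweight handle to stored features; holds a URI, never feature data."""
    uri: str


class InMemoryFeatureStore:
    """Hypothetical stand-in for a mem:// store: maps a URI to a feature payload."""

    def __init__(self) -> None:
        self._data: Dict[str, List[float]] = {}

    def put(self, key: str, features: List[float]) -> SampleRef:
        uri = f"mem://{key}"
        self._data[uri] = features
        return SampleRef(uri)

    def get(self, ref: SampleRef) -> List[float]:
        return self._data[ref.uri]


class ToyRolloutWorker:
    """Hypothetical worker: generates features per prompt and enqueues only refs."""

    def __init__(
        self,
        generate: Callable[[str], List[float]],
        store: InMemoryFeatureStore,
        queue: Deque[SampleRef],
    ) -> None:
        self._generate = generate
        self._store = store
        self._queue = queue

    def rollout(self, prompts: List[str]) -> int:
        for i, prompt in enumerate(prompts):
            features = self._generate(prompt)
            self._queue.append(self._store.put(f"sample-{i}", features))
        return len(prompts)


def toy_loader(queue: Deque[SampleRef], store: InMemoryFeatureStore) -> Iterator[List[float]]:
    """Drain the queue and resolve each ref to its features on the consumer side."""
    while queue:
        yield store.get(queue.popleft())


def fake_generate(prompt: str) -> List[float]:
    # Placeholder for the target's feature generation: one float per whitespace token.
    return [float(len(token)) for token in prompt.split()]


store = InMemoryFeatureStore()
queue: Deque[SampleRef] = deque()
worker = ToyRolloutWorker(fake_generate, store, queue)

produced = worker.rollout(["hello world", "speculative decoding rocks"])
queued_refs = list(queue)                 # what a controller would see: refs only
batches = list(toy_loader(queue, store))  # what a trainer would consume
```

Because the queue holds only SampleRef objects, anything that inspects or schedules the queue never touches feature payloads, which is the property the launcher test checks for the controller.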

The online old-vs-new bit-exact math is covered by test_equiv_online_eagle3 (in the integration PR).

How to run the full 7B old-vs-new online comparison

Online generates features in-loop (no precompute). Old = is_online (prompts, no hidden-states path); new = this launcher's online branch.

M="Qwen/Qwen2.5-7B-Instruct"; C="configs/qwen2.5-7b-eagle3.json"
ARGS="--target-model-path $M --draft-model-config $C --train-data-path prompts.jsonl \
      --target-model-backend hf --chat-template qwen --max-num-steps 200 --batch-size 1 --seed 0"
# old (in-loop generate_eagle3_data)
torchrun --standalone --nproc_per_node 1 scripts/train_eagle3.py          $ARGS --output-dir out_old_online
# new (RolloutWorker; online branch auto-selected since --train-hidden-states-path is absent)
torchrun --standalone --nproc_per_node 1 scripts/train_eagle3_dataflow.py  $ARGS --output-dir out_new_online
# NOTE: one rollout pass = #prompts steps, so provide >= max_steps prompts (e.g. 256 for 200 steps).
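The note about prompt count can be checked before launching. The following is a minimal sketch, assuming prompts.jsonl holds one prompt per non-empty line and that batch size is 1 as in the commands above; count_prompts and enough_prompts are hypothetical helpers, not part of the scripts.

```python
import json
import os
import tempfile


def count_prompts(path: str) -> int:
    """Count non-empty lines, assuming one JSON-encoded prompt per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def enough_prompts(num_prompts: int, max_num_steps: int) -> bool:
    """With batch size 1, one rollout pass yields one step per prompt."""
    return num_prompts >= max_num_steps


# Demo on a temporary file with 256 synthetic prompts.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "prompts.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for i in range(256):
            f.write(json.dumps({"prompt": f"question {i}"}) + "\n")
    n_prompts = count_prompts(demo_path)

ok_for_200 = enough_prompts(n_prompts, 200)
ok_for_300 = enough_prompts(n_prompts, 300)
```

Here 256 prompts cover a 200-step run but would fall short of a 300-step run in a single rollout pass.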

Results — Qwen2.5-7B, 200 steps, HF backend, in-loop target forward

step	loss (old / new)	acc (old / new)	accept (old / new)	grad (old / new)
1	5.27 / 5.51	0.00 / 0.00	0.00 / 0.01	42 / 37
100	2.16 / 1.83	0.27 / 0.40	0.21 / 0.33	22 / 21
200	1.07 / 1.15	0.67 / 0.71	0.61 / 0.58	11 / 14

Old and new converge together (loss ≈ 1.1, acc ≈ 0.7, acceptance ≈ 0.6); both exit cleanly (rc=0, confirming the destroy_distributed hardening). Online converges far better than offline here because it generates fresh features over 256 diverse prompts rather than reusing a small precomputed set.
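To make the old-vs-new gap concrete, the reported numbers can be compared directly. This sketch only restates the results table; the dictionary layout and the abs_gap helper are illustrative choices, not project code.

```python
# (old, new) pairs copied from the results table, keyed by training step.
RESULTS = {
    1:   {"loss": (5.27, 5.51), "acc": (0.00, 0.00), "accept": (0.00, 0.01), "grad": (42, 37)},
    100: {"loss": (2.16, 1.83), "acc": (0.27, 0.40), "accept": (0.21, 0.33), "grad": (22, 21)},
    200: {"loss": (1.07, 1.15), "acc": (0.67, 0.71), "accept": (0.61, 0.58), "grad": (11, 14)},
}


def abs_gap(step: int) -> dict:
    """Absolute old-vs-new difference per metric at a given step, rounded to 2 places."""
    return {
        metric: round(abs(old - new), 2)
        for metric, (old, new) in RESULTS[step].items()
    }


gap_at_100 = abs_gap(100)
gap_at_200 = abs_gap(200)
```

At step 200 the gaps are 0.08 in loss, 0.04 in accuracy, and 0.03 in acceptance, smaller than the corresponding gaps at step 100 (0.33, 0.13, 0.12).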



maocheng23 requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners June 24, 2026 21:36

maocheng23 changed the base branch from main to dataflow-up-7-integration June 25, 2026 00:15

maocheng23 force-pushed the dataflow-up-8-online branch from e435968 to a3b1517 Compare June 25, 2026 00:38

maocheng23 force-pushed the dataflow-up-7-integration branch from 1cc7c69 to ea463fc Compare June 25, 2026 00:38

maocheng23 force-pushed the dataflow-up-8-online branch from a3b1517 to 4358686 Compare June 25, 2026 00:57

maocheng23 force-pushed the dataflow-up-7-integration branch from ea463fc to d005a13 Compare June 25, 2026 00:57

runtime: online EAGLE3 launcher (build_online_eagle3_runtime + Rollou…

d3c5da9

…tWorker) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 force-pushed the dataflow-up-8-online branch from 4358686 to d3c5da9 Compare June 25, 2026 01:26

maocheng23 force-pushed the dataflow-up-7-integration branch from d005a13 to 7a81ce5 Compare June 25, 2026 01:26

jiapingW self-requested a review June 25, 2026 08:51

jiapingW approved these changes Jun 25, 2026


jiapingW merged commit 966fba1 into sgl-project:dataflow-up-7-integration Jun 25, 2026

jiapingW merged 1 commit into sgl-project:dataflow-up-7-integration from maocheng23:dataflow-up-8-online
