[DataFlow runtime] Online EAGLE3 launcher (build_online_eagle3_runtime + RolloutWorker)#601

Merged: jiapingW merged 1 commit into sgl-project:dataflow-up-7-integration from maocheng23:dataflow-up-8-online on Jun 25, 2026
@maocheng23 (Collaborator) commented Jun 24, 2026

DataFlow runtime: online EAGLE3 launcher. Stacked on #600 (true-stacked: this PR's base is the previous PR's branch, so the diff below shows only this layer).

Adds the online counterpart of the offline launcher: training features are produced in-loop by a rollout worker instead of read from disk.

What

  • specforge/runtime/launch.py, build_online_eagle3_runtime: mirrors build_offline_eagle3_runtime, but the SampleRef producer is a RolloutWorker + SGLangAdapter over the target's generate_eagle3_data (any backend: HF, SGLang, or custom) → mem:// FeatureStore → SampleRefQueue → FeatureDataLoader → trainer. Returns (trainer, loader, workers, controller, drive_rollout); target_head=None and target_repr="logits" (the online path has already materialized the target distribution).
  • scripts/train_eagle3_dataflow.py, online branch: when --train-hidden-states-path is absent, build the target (is_online=True), ingest prompts, call drive_rollout() to populate the queue, then run trainer.fit.
  • tests/test_runtime/test_online_launch.py: single-rank launcher e2e (HF target, no sglang) covering rollout → mem:// store → FSDP train, asserting that the controller carries no tensors and checking optimizer-step semantics.
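
The data path in the first bullet (producer → store → reference queue → loader) can be pictured with a toy, stdlib-only sketch. Everything below is a hypothetical stand-in written for illustration: the class and function names only echo the roles named above and are not SpecForge's actual classes or signatures, and plain Python lists stand in for tensors. The point it tries to show is the separation the test asserts: the queue carries lightweight references, while the feature payload lives only in the store.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRef:
    """Hypothetical handle: holds a store URI, never the feature payload."""
    uri: str


class InMemoryFeatureStore:
    """Toy stand-in for a mem:// store (an assumption, not the real class)."""

    def __init__(self):
        self._data = {}

    def put(self, key, features):
        self._data[key] = features
        return SampleRef(uri=f"mem://{key}")

    def get(self, ref):
        return self._data[ref.uri[len("mem://"):]]


def fake_generate_features(prompt):
    """Placeholder for the target forward pass; returns lists, not tensors."""
    return {"hidden": [float(len(prompt))], "logits": [0.0, 1.0]}


def drive_rollout(prompts, store, queue):
    """Producer side: write features to the store, enqueue only the refs."""
    for i, prompt in enumerate(prompts):
        ref = store.put(f"sample-{i}", fake_generate_features(prompt))
        queue.append(ref)


def iter_batches(store, queue):
    """Consumer side: resolve each queued ref back into its features."""
    while queue:
        yield store.get(queue.popleft())


store = InMemoryFeatureStore()
queue = deque()
drive_rollout(["hello", "hi"], store, queue)
refs = list(queue)                          # snapshot of what the queue holds
batches = list(iter_batches(store, queue))  # drains the queue
```

In this sketch the control path (the queue) only ever sees `SampleRef` objects, which is the property the e2e test checks for the real controller.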

The online old-vs-new bit-exact math is covered by test_equiv_online_eagle3 (in the integration PR).

How to run the full 7B old-vs-new online comparison

Online generates features in-loop (no precompute). Old = is_online (prompts, no hidden-states path); new = this launcher's online branch.

M="Qwen/Qwen2.5-7B-Instruct"; C="configs/qwen2.5-7b-eagle3.json"
ARGS="--target-model-path $M --draft-model-config $C --train-data-path prompts.jsonl \
      --target-model-backend hf --chat-template qwen --max-num-steps 200 --batch-size 1 --seed 0"
# old (in-loop generate_eagle3_data)
torchrun --standalone --nproc_per_node 1 scripts/train_eagle3.py          $ARGS --output-dir out_old_online
# new (RolloutWorker; online branch auto-selected since --train-hidden-states-path is absent)
torchrun --standalone --nproc_per_node 1 scripts/train_eagle3_dataflow.py  $ARGS --output-dir out_new_online
# NOTE: one rollout pass = #prompts steps, so provide >= max_steps prompts (e.g. 256 for 200 steps).
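
The NOTE above implies a simple pre-flight check: at batch size 1, one rollout pass yields one step per prompt, so the prompt file needs at least as many non-empty lines as the step budget. A minimal sketch of that check follows; the helper names are hypothetical and not part of either training script, and the lines are passed in memory so the example stays self-contained.

```python
def count_prompts(jsonl_lines):
    """Count non-empty lines, treating each as one prompt record."""
    return sum(1 for line in jsonl_lines if line.strip())


def has_enough_prompts(num_prompts, max_steps):
    """True when a single rollout pass covers the step budget (batch size 1)."""
    return num_prompts >= max_steps


# Example mirroring the NOTE: 256 prompts for a 200-step run.
lines = ['{"prompt": "p%d"}\n' % i for i in range(256)] + ["\n"]
n = count_prompts(lines)
ok = has_enough_prompts(n, max_steps=200)
```

With a real file, the same count could be obtained by iterating over `open("prompts.jsonl")` before launching `torchrun`.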

Results — Qwen2.5-7B, 200 steps, HF backend, in-loop target forward

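| step | loss (old / new) | acc (old / new) | accept (old / new) | grad (old / new) |
|------|------------------|-----------------|--------------------|------------------|
| 1    | 5.27 / 5.51      | 0.00 / 0.00     | 0.00 / 0.01        | 42 / 37          |
| 100  | 2.16 / 1.83      | 0.27 / 0.40     | 0.21 / 0.33        | 22 / 21          |
| 200  | 1.07 / 1.15      | 0.67 / 0.71     | 0.61 / 0.58        | 11 / 14          |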

Old and new converge together (loss ≈ 1.1, acc ≈ 0.7, acceptance ≈ 0.6); both exit cleanly (rc=0, confirming the destroy_distributed hardening). Online converges far better than offline here because it generates fresh features over 256 diverse prompts rather than reusing a small precomputed set.
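
To put a number on "converge together", the step-200 values reported in the results can be compared directly. The snippet below is only arithmetic on those reported values; the dictionary layout and helper name are illustrative assumptions, not output from either training script.

```python
# Step-200 metrics as reported in the results: (old, new).
step_200 = {
    "loss": (1.07, 1.15),
    "acc": (0.67, 0.71),
    "accept": (0.61, 0.58),
    "grad": (11, 14),
}


def abs_gaps(metrics):
    """Absolute old-vs-new gap per metric, rounded to two decimals."""
    return {name: round(abs(old - new), 2) for name, (old, new) in metrics.items()}


gaps = abs_gaps(step_200)
```

The gaps at step 200 are smaller than at step 100 for loss, acc, and accept, which is consistent with the two runs ending up in the same regime rather than matching step for step.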


@maocheng23 maocheng23 changed the base branch from main to dataflow-up-7-integration June 25, 2026 00:15
@maocheng23 maocheng23 force-pushed the dataflow-up-8-online branch from e435968 to a3b1517 Compare June 25, 2026 00:38
@maocheng23 maocheng23 force-pushed the dataflow-up-7-integration branch from 1cc7c69 to ea463fc Compare June 25, 2026 00:38
@maocheng23 maocheng23 force-pushed the dataflow-up-8-online branch from a3b1517 to 4358686 Compare June 25, 2026 00:57
@maocheng23 maocheng23 force-pushed the dataflow-up-7-integration branch from ea463fc to d005a13 Compare June 25, 2026 00:57
…tWorker)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the dataflow-up-8-online branch from 4358686 to d3c5da9 Compare June 25, 2026 01:26
@maocheng23 maocheng23 force-pushed the dataflow-up-7-integration branch from d005a13 to 7a81ce5 Compare June 25, 2026 01:26
@jiapingW jiapingW self-requested a review June 25, 2026 08:51