[DataFlow runtime] Online EAGLE3 launcher (build_online_eagle3_runtime + RolloutWorker) #601
Merged
jiapingW merged 1 commit on Jun 25, 2026
jiapingW approved these changes on Jun 25, 2026
DataFlow runtime — online EAGLE3 launcher. Stacked on #600 — true-stacked: this PR's base is the previous PR's branch, so the diff below shows only this layer.
Adds the online counterpart of the offline launcher: training features are produced in-loop by a rollout worker instead of read from disk.
What
- `specforge/runtime/launch.py`, `build_online_eagle3_runtime`: mirrors `build_offline_eagle3_runtime`, but the `SampleRef` producer is a `RolloutWorker` + `SGLangAdapter` over the target's `generate_eagle3_data` (any backend: HF, SGLang, or custom) → `mem://` `FeatureStore` → `SampleRefQueue` → `FeatureDataLoader` → trainer. Returns `(trainer, loader, workers, controller, drive_rollout)`; `target_head=None` and `target_repr="logits"` (the online path has already materialized the target distribution).
- `scripts/train_eagle3_dataflow.py`, online branch: when `--train-hidden-states-path` is absent, build the target (`is_online=True`), ingest prompts, call `drive_rollout()` to populate the queue, then `trainer.fit`.
- `tests/test_runtime/test_online_launch.py`: single-rank launcher e2e (HF target, no sglang): rollout → mem:// store → FSDP train, asserting that the controller carries no tensors and checking optimizer-step semantics.

The online old-vs-new bit-exact math is covered by `test_equiv_online_eagle3` (in the integration PR).

How to run the full 7B old-vs-new online comparison
Online generates features in-loop (no precompute). Old = `is_online` (prompts, no hidden-states path); new = this launcher's online branch.

Results: Qwen2.5-7B, 200 steps, HF backend, in-loop target forward
Old and new converge together (loss ≈ 1.1, acc ≈ 0.7, acceptance ≈ 0.6); both exit cleanly (`rc=0`, confirming the `destroy_distributed` hardening). Online converges far better than offline here because it generates fresh features over 256 diverse prompts rather than reusing a small precomputed set.
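
For readers new to the data path listed under What, here is a minimal, stdlib-only sketch of the general pattern: a rollout worker writes features to an in-memory store and enqueues lightweight references, and a loader resolves those references at consumption time. It is not the specforge implementation. Every class and function below (including the ones that reuse names such as `SampleRef` and `RolloutWorker`) is a hypothetical stand-in, and `fake_target` is an assumed placeholder for the target forward pass.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterator, List


@dataclass(frozen=True)
class SampleRef:
    """Hypothetical handle: carries a store key, never the feature payload."""
    key: str


class InMemoryFeatureStore:
    """Hypothetical stand-in for an in-memory feature store."""

    def __init__(self) -> None:
        self._data: Dict[str, List[float]] = {}

    def put(self, key: str, features: List[float]) -> SampleRef:
        self._data[key] = features
        return SampleRef(key)

    def get(self, ref: SampleRef) -> List[float]:
        return self._data[ref.key]


class RolloutWorker:
    """Hypothetical worker: runs the target on prompts, publishes refs."""

    def __init__(
        self,
        generate: Callable[[str], List[float]],
        store: InMemoryFeatureStore,
        queue: Deque[SampleRef],
    ) -> None:
        self.generate = generate
        self.store = store
        self.queue = queue

    def run(self, prompts: List[str]) -> int:
        for i, prompt in enumerate(prompts):
            features = self.generate(prompt)
            # Only the reference is enqueued; the payload stays in the store.
            self.queue.append(self.store.put(f"sample-{i}", features))
        return len(prompts)


def feature_loader(
    store: InMemoryFeatureStore, queue: Deque[SampleRef]
) -> Iterator[List[float]]:
    """Drain the queue, resolving each ref to its features on demand."""
    while queue:
        yield store.get(queue.popleft())


def fake_target(prompt: str) -> List[float]:
    # Assumed placeholder for the target model; real features would be tensors.
    return [float(len(prompt)), float(prompt.count(" ") + 1)]


store = InMemoryFeatureStore()
queue: Deque[SampleRef] = deque()
worker = RolloutWorker(fake_target, store, queue)

produced = worker.run(["hello world", "speculative decoding is fast"])
queued_refs = list(queue)  # snapshot of what a controller would see
batches = list(feature_loader(store, queue))
```

Keeping payloads in the store and passing only references through the queue is what lets a coordinating component stay free of tensors, which is the property the launcher test asserts for the controller.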