[DataFlow runtime · M6 2/4] Disaggregated offline EAGLE3 example + build_disagg_eagle3_runtime #610
Merged
jiapingW merged 7 commits on Jun 29, 2026
Adds the consumer/producer assembly for the M6 disaggregation seam (SharedDirFeatureStore), plus a runnable 2-node example:
- launch.py: build_disagg_eagle3_runtime (consumer side) + factor the shared offline trainer assembly out of build_offline_eagle3_runtime, so colocated and disaggregated paths produce byte-identical batches/training.
- data_plane/disagg_ingest.py: ingest_offline_features (producer: load .ckpt -> SharedDirFeatureStore.put) + JSON ref-manifest (the tensor-free metadata bridge between pools; asserts the no-tensor invariant).
- examples/disagg/: run_disagg_eagle3.py (role-branched producer/consumer driver), run_qwen2.5_7b_eagle3_disagg.sh (rcli --per-node wrapper), README.
- tests/test_runtime/test_disagg_launch.py: CPU bit-exact differential (disagg store serves identical tensors to the colocated path; manifest round-trips tensor-free; B9 auth) + a GPU FSDP train smoke.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
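For readers unfamiliar with the tensor-free manifest idea, here is a minimal sketch of how such an invariant could be enforced. It is not the code in data_plane/disagg_ingest.py: the function names (`assert_tensor_free`, `write_manifest`, `read_manifest`) and the record fields (`key`, `shape`, `dtype`) are hypothetical, and the check simply rejects anything that is not a JSON-native value.

```python
import json

# Assumption: a ref manifest carries only JSON-native metadata, never payloads.
JSON_LEAVES = (str, int, float, bool, type(None))


def assert_tensor_free(obj, path="manifest"):
    """Raise TypeError if anything other than JSON-native values is present."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if not isinstance(key, str):
                raise TypeError(f"{path}: non-string key {key!r}")
            assert_tensor_free(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            assert_tensor_free(value, f"{path}[{i}]")
    elif not isinstance(obj, JSON_LEAVES):
        raise TypeError(
            f"{path}: {type(obj).__name__} is not allowed in a ref manifest"
        )


def write_manifest(refs, path):
    """Validate, then write the manifest as deterministic JSON."""
    assert_tensor_free(refs)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(refs, fh, sort_keys=True)


def read_manifest(path):
    """Load the manifest and re-check the invariant on the consumer side."""
    with open(path, encoding="utf-8") as fh:
        refs = json.load(fh)
    assert_tensor_free(refs)
    return refs
```

Checking on both write and read means either pool fails fast if a payload leaks into the metadata bridge.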
The thin launchers skip sanity_check(); the train_eagle3 builders read args.target_batch_size/dp_size which only sanity_check derives. Call it on the consumer after init_distributed (it needs the process group). Also wire chat-template/cache-dir/learning-rate into the rcli wrapper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thread log_interval through build_offline/build_disagg_eagle3_runtime (default 50) so the example can emit a finer training curve; driver logs every 25 steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Offline training re-iterates the ref set across epochs, but SharedDirFeatureStore frees each entry on release() (consume-once), so get() raised KeyError in epoch 2. Add retain_on_release (read-only mode): release() drops the lease but keeps the file, mirroring LocalFeatureStore's file:// no-op release. The disagg consumer sets it; online rollout keeps consume-once (default False). Whole-store cleanup happens at run end. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
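The difference between consume-once and retain-on-release can be illustrated with a toy store. This is a sketch only: `ToySharedDirStore` is a hypothetical stand-in that pickles values into a directory, and it does not reproduce the real SharedDirFeatureStore's leasing, auth, or file format.

```python
import os
import pickle


class ToySharedDirStore:
    """Toy stand-in for a shared-directory feature store (not the real class)."""

    def __init__(self, root, retain_on_release=False):
        self.root = root
        self.retain_on_release = retain_on_release
        self._leases = set()

    def _path(self, key):
        return os.path.join(self.root, f"{key}.pkl")

    def put(self, key, value):
        with open(self._path(key), "wb") as fh:
            pickle.dump(value, fh)

    def get(self, key):
        path = self._path(key)
        if not os.path.exists(path):
            raise KeyError(key)
        with open(path, "rb") as fh:
            value = pickle.load(fh)
        self._leases.add(key)
        return value

    def release(self, key):
        # Always drop the lease.
        self._leases.discard(key)
        # Consume-once mode also frees the backing file; read-only mode keeps it.
        if not self.retain_on_release and os.path.exists(self._path(key)):
            os.remove(self._path(key))
```

With `retain_on_release=True`, a second epoch can call `get()` on the same key after `release()`; with the default, the second `get()` raises `KeyError`.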
DISAGG_ROLE=colocated runs the SAME model build + assembly via build_offline_eagle3_runtime (LocalFeatureStore), so disagg vs colocated can be compared on identical features/seed. Factored the shared model/optimizer build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
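A role-branched driver of this kind might be organized roughly as follows. Only the DISAGG_ROLE variable and the `colocated` value come from this PR; the `producer` and `consumer` value strings, the helper names, and the fail-on-unset behavior are assumptions made for illustration.

```python
import os

# Assumption: the three role strings; only "colocated" is confirmed by the PR text.
VALID_ROLES = ("producer", "consumer", "colocated")


def resolve_role(env=None):
    """Read DISAGG_ROLE and fail loudly on a missing or unknown value."""
    env = os.environ if env is None else env
    role = env.get("DISAGG_ROLE", "").strip().lower()
    if role not in VALID_ROLES:
        raise ValueError(
            f"DISAGG_ROLE must be one of {VALID_ROLES}, got {role!r}"
        )
    return role


def dispatch(role, handlers):
    """Call the handler registered for this role and return its result."""
    return handlers[role]()
```

Keeping the colocated branch in the same driver lets both paths share one model build and seed, which is what makes the comparison meaningful.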
Disagg consumer vs colocated baseline on Qwen2.5-7B (identical features/seed): training metrics (acceptance_rate/ploss/acc) match to ~5 sig figs; residual is GPU floating-point noise, not the transport. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
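One way to make "match to ~5 sig figs" checkable is a relative-tolerance comparison per metric. The sketch below is an assumption about how such a check could be written, not the comparison used for this PR; `mismatched_metrics` is a hypothetical helper, and any numbers passed to it in examples are placeholders rather than measured results.

```python
import math


def agree_to_sig_figs(a, b, sig=5):
    """True if a and b agree to roughly `sig` significant figures."""
    # Relative error of at most 5 * 10**-sig, e.g. 5e-5 for sig=5.
    return math.isclose(a, b, rel_tol=0.5 * 10.0 ** (1 - sig))


def mismatched_metrics(run_a, run_b, sig=5):
    """Return the sorted metric names whose values disagree between two runs."""
    if set(run_a) != set(run_b):
        raise KeyError("runs report different metric names")
    return sorted(
        name
        for name in run_a
        if not agree_to_sig_figs(run_a[name], run_b[name], sig)
    )
```

An empty list means every metric agrees within the tolerance; a purely relative tolerance is used so that metrics on different scales are treated alike.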
Lint-only: formats the files this PR adds/changes; no behavior change. The shell wrapper is marked executable (check-shebang-scripts-are-executable). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
e7bb5e1 to 31b823f
Base automatically changed from dataflow-up-12-m6-disagg to dataflow-up-11-m5-recovery June 29, 2026 15:56
jiapingW approved these changes Jun 29, 2026
Adds the disaggregated offline EAGLE3 example: build_disagg_eagle3_runtime + _assemble_offline_eagle3, ingest_offline_features + tensor-free ref manifest, producer/consumer runner. Bit-exact vs colocated.
Part of the DataFlow runtime M5/M6 stacked series (continues the M1–M4 work in #594–#601 / #603). Stacked PRs — merge bottom-up (up-9 first). Lint (pre-commit) + runtime CPU test suite green.
🤖 Generated with Claude Code