[DataFlow runtime · M6 2/4] Disaggregated offline EAGLE3 example + build_disagg_eagle3_runtime #610

Merged
jiapingW merged 7 commits into dataflow-up-11-m5-recovery from dataflow-up-13-disagg-example
Jun 29, 2026

@maocheng23

Collaborator

Adds the disaggregated offline EAGLE3 example: build_disagg_eagle3_runtime + _assemble_offline_eagle3, ingest_offline_features + tensor-free ref manifest, producer/consumer runner. Bit-exact vs colocated.

Part of the DataFlow runtime M5/M6 stacked series (continues the M1–M4 work in #594#601 / #603). Stacked PRs — merge bottom-up (up-9 first). Lint (pre-commit) + runtime CPU test suite green.

🤖 Generated with Claude Code

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

maocheng23 and others added 7 commits June 28, 2026 20:41
Adds the consumer/producer assembly for the M6 disaggregation seam
(SharedDirFeatureStore), plus a runnable 2-node example:

- launch.py: build_disagg_eagle3_runtime (consumer side) + factor the shared
  offline trainer assembly out of build_offline_eagle3_runtime, so colocated and
  disaggregated paths produce byte-identical batches/training.
- data_plane/disagg_ingest.py: ingest_offline_features (producer: load .ckpt ->
  SharedDirFeatureStore.put) + JSON ref-manifest (the tensor-free metadata bridge
  between pools; asserts the no-tensor invariant).
- examples/disagg/: run_disagg_eagle3.py (role-branched producer/consumer driver),
  run_qwen2.5_7b_eagle3_disagg.sh (rcli --per-node wrapper), README.
- tests/test_runtime/test_disagg_launch.py: CPU bit-exact differential (disagg
  store serves identical tensors to the colocated path; manifest round-trips
  tensor-free; B9 auth) + a GPU FSDP train smoke.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
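
The tensor-free ref manifest in the commit above can be illustrated with a minimal sketch. It is not the code in data_plane/disagg_ingest.py: the `FeatureRef` fields, the `version` key, and the helper names `assert_tensor_free`, `dump_manifest`, and `load_manifest` are all hypothetical. The idea shown is only that the manifest carries plain JSON metadata and refuses anything else, so tensors stay in the feature store.

```python
import json
from dataclasses import asdict, dataclass


# Hypothetical ref record: the fields are illustrative, not the PR's schema.
@dataclass(frozen=True)
class FeatureRef:
    key: str         # key the consumer would pass to the store's get()
    num_tokens: int  # lightweight metadata only
    dtype: str       # dtype recorded as a string, never as a tensor object


_JSON_SCALARS = (str, int, float, bool, type(None))


def assert_tensor_free(obj):
    """Raise TypeError if obj holds anything other than plain JSON types."""
    if isinstance(obj, _JSON_SCALARS):
        return
    if isinstance(obj, dict):
        for name, value in obj.items():
            if not isinstance(name, str):
                raise TypeError(f"non-string manifest key: {name!r}")
            assert_tensor_free(value)
        return
    if isinstance(obj, (list, tuple)):
        for value in obj:
            assert_tensor_free(value)
        return
    raise TypeError(f"non-JSON value in manifest: {type(obj).__name__}")


def dump_manifest(refs):
    """Serialize refs to JSON after checking the no-tensor invariant."""
    payload = {"version": 1, "refs": [asdict(r) for r in refs]}
    assert_tensor_free(payload)
    return json.dumps(payload, sort_keys=True)


def load_manifest(text):
    """Parse a manifest and rebuild the ref records."""
    payload = json.loads(text)
    assert_tensor_free(payload)
    return [FeatureRef(**r) for r in payload["refs"]]
```

Under these assumptions the producer would write the output of `dump_manifest` next to the shared store and the consumer would call `load_manifest` to learn which keys to fetch.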
The thin launchers skip sanity_check(); the train_eagle3 builders read
args.target_batch_size/dp_size which only sanity_check derives. Call it on the
consumer after init_distributed (it needs the process group). Also wire
chat-template/cache-dir/learning-rate into the rcli wrapper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thread log_interval through build_offline/build_disagg_eagle3_runtime (default
50) so the example can emit a finer training curve; driver logs every 25 steps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Offline training re-iterates the ref set across epochs, but SharedDirFeatureStore
consume-once-frees on release() -> epoch 2 get() raised KeyError. Add
retain_on_release (read-only mode): release() drops the lease but keeps the file,
mirroring LocalFeatureStore's file:// no-op release. The disagg consumer sets it;
online rollout keeps consume-once (default False). Whole-store cleanup at run end.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
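
The release semantics described in the commit above can be sketched with a toy store. `ToySharedDirStore` is a hypothetical stand-in, not SharedDirFeatureStore: it stores raw bytes, its file layout is invented, and only the `retain_on_release` flag name is taken from the commit message. It shows why a second epoch fails under consume-once and succeeds when files are retained.

```python
import os
import tempfile


class ToySharedDirStore:
    """Toy stand-in for a shared-directory feature store (not the real class)."""

    def __init__(self, root, retain_on_release=False):
        self.root = root
        self.retain_on_release = retain_on_release
        self._leases = set()

    def _path(self, key):
        return os.path.join(self.root, key + ".bin")

    def put(self, key, payload):
        with open(self._path(key), "wb") as f:
            f.write(payload)

    def get(self, key):
        path = self._path(key)
        if not os.path.exists(path):
            raise KeyError(key)
        self._leases.add(key)
        with open(path, "rb") as f:
            return f.read()

    def release(self, key):
        # Always drop the lease; delete the file only in consume-once mode.
        self._leases.discard(key)
        if not self.retain_on_release:
            os.remove(self._path(key))


def run_epochs(store, keys, epochs):
    """Re-iterate the same key set once per epoch, releasing after each read."""
    seen = []
    for _ in range(epochs):
        for key in keys:
            seen.append(store.get(key))
            store.release(key)
    return seen


def demo():
    """Two epochs over one key with retain_on_release=True."""
    with tempfile.TemporaryDirectory() as root:
        store = ToySharedDirStore(root, retain_on_release=True)
        store.put("sample-0", b"features")
        return run_epochs(store, ["sample-0"], epochs=2)
```

With the default `retain_on_release=False`, the same `run_epochs` call raises KeyError on the first `get()` of epoch 2, which is the failure the commit describes.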
DISAGG_ROLE=colocated runs the SAME model build + assembly via
build_offline_eagle3_runtime (LocalFeatureStore), so disagg vs colocated can be
compared on identical features/seed. Factored the shared model/optimizer build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Disagg consumer vs colocated baseline on Qwen2.5-7B (identical features/seed):
training metrics (acceptance_rate/ploss/acc) match to ~5 sig figs; residual is
GPU floating-point noise, not the transport.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
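
A comparison like the one in the commit above can be automated with a relative-tolerance check. This is a sketch under assumptions: the helper name `metrics_match` is hypothetical, and `rel_tol=1e-4` is chosen here as a rough proxy for "about 5 significant figures", not a threshold taken from the PR.

```python
import math


def metrics_match(a, b, rel_tol=1e-4):
    """True if both dicts share the same keys and every value agrees within rel_tol."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)
```

A caller would pass one dict per run, keyed by metric name (for example acceptance_rate, ploss, acc), and treat a False result as a sign that the two paths diverged beyond floating-point noise.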
Lint-only: formats the files this PR adds/changes; no behavior change. The shell
wrapper is marked executable (check-shebang-scripts-are-executable).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the dataflow-up-13-disagg-example branch from e7bb5e1 to 31b823f Compare June 29, 2026 03:57
Base automatically changed from dataflow-up-12-m6-disagg to dataflow-up-11-m5-recovery June 29, 2026 15:56
@jiapingW jiapingW merged commit 2c8c66c into dataflow-up-11-m5-recovery Jun 29, 2026
1 check passed
@jiapingW jiapingW deleted the dataflow-up-13-disagg-example branch June 29, 2026 16:03