feat(runtime): disaggregated offline EAGLE3 assemble example + 2-node 7B e2e#16
maocheng23 wants to merge 7 commits into
Adds the consumer/producer assembly for the M6 disaggregation seam (SharedDirFeatureStore), plus a runnable 2-node example:

- launch.py: build_disagg_eagle3_runtime (consumer side) + factor the shared offline trainer assembly out of build_offline_eagle3_runtime, so colocated and disaggregated paths produce byte-identical batches/training.
- data_plane/disagg_ingest.py: ingest_offline_features (producer: load .ckpt -> SharedDirFeatureStore.put) + JSON ref-manifest (the tensor-free metadata bridge between pools; asserts the no-tensor invariant).
- examples/disagg/: run_disagg_eagle3.py (role-branched producer/consumer driver), run_qwen2.5_7b_eagle3_disagg.sh (rcli --per-node wrapper), README.
- tests/test_runtime/test_disagg_launch.py: CPU bit-exact differential (disagg store serves identical tensors to the colocated path; manifest round-trips tensor-free; B9 auth) + a GPU FSDP train smoke.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
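The ref-manifest idea described in this commit (tensor-free metadata written as JSON, with the no-tensor invariant asserted) could look roughly like the sketch below. It is not the code in `data_plane/disagg_ingest.py`: the function names, the manifest layout, and the ref fields (`sample_id`, `key`, `seq_len`) are hypothetical, and "tensor-free" is approximated here as "only plain JSON types".

```python
import json
import tempfile
from pathlib import Path

# Only these leaf types may appear in a ref; anything else (for example a
# tensor or raw bytes) is rejected. This stands in for the real invariant.
_JSON_SCALARS = (str, int, float, bool, type(None))


def assert_tensor_free(value, where="ref"):
    """Raise TypeError if `value` contains anything but plain JSON types."""
    if isinstance(value, _JSON_SCALARS):
        return
    if isinstance(value, list):
        for i, item in enumerate(value):
            assert_tensor_free(item, f"{where}[{i}]")
        return
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError(f"{where}: non-string key {key!r}")
            assert_tensor_free(item, f"{where}.{key}")
        return
    raise TypeError(f"{where}: {type(value).__name__} not allowed in a manifest")


def write_manifest_sketch(path, refs):
    """Validate every ref, then write the manifest as JSON (hypothetical layout)."""
    for i, ref in enumerate(refs):
        assert_tensor_free(ref, f"refs[{i}]")
    Path(path).write_text(json.dumps({"version": 1, "refs": refs}, indent=2))


def read_manifest_sketch(path):
    """Load the manifest and re-check the invariant on the consumer side."""
    refs = json.loads(Path(path).read_text())["refs"]
    for i, ref in enumerate(refs):
        assert_tensor_free(ref, f"refs[{i}]")
    return refs


# Round trip with made-up ref fields: the consumer sees exactly what was written.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "refs.json"
    refs = [{"sample_id": "s0", "key": "features/s0", "seq_len": 128}]
    write_manifest_sketch(manifest, refs)
    assert read_manifest_sketch(manifest) == refs
```

Checking on both write and read keeps the invariant enforced at each pool boundary, so a ref that picked up a tensor-like payload fails before it crosses the control plane.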
The thin launchers skip sanity_check(); the train_eagle3 builders read args.target_batch_size/dp_size which only sanity_check derives. Call it on the consumer after init_distributed (it needs the process group). Also wire chat-template/cache-dir/learning-rate into the rcli wrapper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thread log_interval through build_offline/build_disagg_eagle3_runtime (default 50) so the example can emit a finer training curve; driver logs every 25 steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Offline training re-iterates the ref set across epochs, but SharedDirFeatureStore consume-once-frees on release() -> epoch 2 get() raised KeyError. Add retain_on_release (read-only mode): release() drops the lease but keeps the file, mirroring LocalFeatureStore's file:// no-op release. The disagg consumer sets it; online rollout keeps consume-once (default False). Whole-store cleanup at run end. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
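The difference between consume-once and retain mode can be illustrated with a toy directory-backed store. This is a sketch under assumptions, not `SharedDirFeatureStore` itself: the class name, file layout, lease bookkeeping, and `cleanup()` method are hypothetical; only the `retain_on_release` flag and the `put`/`get`/`release` behavior described in the commit message are taken from the source.

```python
import tempfile
from pathlib import Path


class ToySharedDirStore:
    """Toy stand-in for a shared-directory feature store (hypothetical)."""

    def __init__(self, root, retain_on_release=False):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.retain_on_release = retain_on_release
        self._leases = set()

    def _path(self, key):
        return self.root / f"{key}.bin"

    def put(self, key, payload):
        # Write to a temp name, then rename, so readers never see a partial file.
        tmp = self._path(key).with_suffix(".tmp")
        tmp.write_bytes(payload)
        tmp.replace(self._path(key))

    def get(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        self._leases.add(key)
        return path.read_bytes()

    def release(self, key):
        # Always drop the lease; delete the file only in consume-once mode.
        self._leases.discard(key)
        if not self.retain_on_release:
            self._path(key).unlink(missing_ok=True)

    def cleanup(self):
        # Whole-store cleanup at the end of the run.
        for path in self.root.glob("*.bin"):
            path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    store = ToySharedDirStore(tmp, retain_on_release=True)
    store.put("s0", b"features")
    for _epoch in range(2):  # the second epoch still finds the file
        assert store.get("s0") == b"features"
        store.release("s0")
    store.cleanup()
```

With the default `retain_on_release=False`, the same two-epoch loop would raise `KeyError` on the second `get()`, which is the failure mode the commit describes.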
DISAGG_ROLE=colocated runs the SAME model build + assembly via build_offline_eagle3_runtime (LocalFeatureStore), so disagg vs colocated can be compared on identical features/seed. Factored the shared model/optimizer build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Disagg consumer vs colocated baseline on Qwen2.5-7B (identical features/seed): training metrics (acceptance_rate/ploss/acc) match to ~5 sig figs; residual is GPU floating-point noise, not the transport. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
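A comparison of this kind can be scripted as a relative-tolerance check over the logged metrics. The sketch below is illustrative only: the helper name, the `rel_tol=1e-4` default (a loose stand-in for "about 5 significant figures"), and the numbers in the demo are placeholders, not values from the Qwen2.5-7B run.

```python
import math


def compare_metrics(disagg, colocated, rel_tol=1e-4):
    """Return {metric: (disagg_value, colocated_value)} for metrics that differ
    by more than `rel_tol` in relative terms. An empty dict means they match."""
    if disagg.keys() != colocated.keys():
        raise ValueError("metric sets differ between the two runs")
    mismatches = {}
    for name in sorted(disagg):
        a, b = disagg[name], colocated[name]
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0):
            mismatches[name] = (a, b)
    return mismatches


# Placeholder values, not measurements: a tiny floating-point difference passes.
run_a = {"ploss": 5.400012, "acc": 0.083001}
run_b = {"ploss": 5.400010, "acc": 0.083000}
assert compare_metrics(run_a, run_b) == {}
```

A purely relative tolerance is used because the metrics span different magnitudes (loss near 5, accuracy below 0.1); an absolute tolerance would be too strict for one or too loose for the other.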
Lint-only: formats the files this PR adds/changes; no behavior change. The shell wrapper is marked executable (check-shebang-scripts-are-executable). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mooncake integration: The current

How SGLang uses Mooncake (the canonical integration)

SGLang's Mooncake integration (

Cost of the current approach

For the offline one-shot ingestion use case this may be acceptable today, but it will not scale to larger feature sets or an online hot path.

Suggested refactoring

Replace the serialize-then-put pattern with Mooncake's zero-copy pointer API. The

Producer side:

Consumer side:

Lifecycle:

This aligns with SGLang's proven pattern and would make the Mooncake backend a genuine upgrade over
@Boreas618 Good call, and we have done this zero-copy part upstream.

Superseded: upstreamed + merged as sgl-project#610 (in up-11-m5-recovery). Closing this fork-internal PR.
Builds the assemble example for the M6 disaggregation seam (`SharedDirFeatureStore`, from #12): runs offline EAGLE3 training with the rollout/feature pool and the training pool on different GPU nodes sharing only a filesystem mount. The control plane carries only tensor-free `SampleRef` metadata; feature tensors travel through the shared store. Disaggregation changes where features live, not their values, so results match the colocated path.

## What's added

- `launch.py`: `build_disagg_eagle3_runtime` (consumer side). Factored the shared trainer/loader assembly out of `build_offline_eagle3_runtime` into `_assemble_offline_eagle3`, so colocated and disaggregated paths are byte-identical by construction. Added a `log_interval` knob to both builders.
- `data_plane/disagg_ingest.py`: `ingest_offline_features` (producer: load `.ckpt` → `SharedDirFeatureStore.put`) + a JSON ref-manifest (`write/read_ref_manifest`) as the tensor-free metadata bridge between pools (asserts the no-tensor invariant).
- `data_plane/disaggregated.py`: `retain_on_release` (read-only mode). Offline training re-iterates the ref set across epochs; consume-once free deleted files mid-run (→ epoch-2 `KeyError` and corrupted epoch-1 data). Retain mode keeps files (mirrors `LocalFeatureStore`'s `file://` no-op release); online rollout keeps consume-once (default).
- `examples/disagg/`: `run_disagg_eagle3.py` (role-branched producer/consumer), `run_qwen2.5_7b_eagle3_disagg.sh` (rcli `--per-node` wrapper), README.
- `tests/test_runtime/test_disagg_launch.py`: CPU bit-exact differential + GPU FSDP train smoke.

## Validation: alignment proven three ways

- `SharedDirFeatureStore` serves byte-identical tensors to the colocated `LocalFeatureStore` `file://` path; the ref manifest round-trips carrying no tensors. (33 data-plane tests green.)
- `build_disagg_eagle3_runtime` trains end-to-end through FSDP (31 runtime tests green on H200).
- `/workspace` mount: producer ingests 64 features → shared store; consumer trains. Run completes EXIT=0, crosses epoch boundaries (retain fix), ploss starts ~5.4 (≈ colocated baseline ~5.5), and acc 0.027→0.083 / acceptance 0.0013→0.034 climb over training (baseline direction). Per-step values are noisy at `batch_size=1`.

🤖 Generated with Claude Code