runtime-m6 (4/6): disaggregation + resharding + auth (B5/B9) + 4-rank gate #12
Closed
maocheng23 wants to merge 3 commits into
…/B9)

Framework-first M6: build the seams a real Mooncake/RDMA backend slots behind, contract-tested on CPU now (no multi-node infra required).

- SharedDirFeatureStore: a disaggregated FeatureStore over a shared directory. Producer and consumer are separate processes sharing only the dir; get() resolves from the ref + filesystem alone (true cross-process boundary), control plane still moves only SampleRef metadata. A real MooncakeFeatureStore swaps the shared-dir transport for RDMA behind this same API.
- B5 (no use-after-free): get() after release/abort raises; a generation guard rejects a stale ref after re-put; clone-on-fetch is the default.
- B9 (auth in disaggregated mode): AuthPolicy shared-secret gate at attach time and on the data path; missing/mismatched token is a PermissionError.
- Resharding contract: SampleRefQueue.get(partition=(index, num)) re-partitions a stable committed pool by a consumer-side hash of sample_id, so the same pool redistributes when DP width changes: no sample is leased twice or dropped.

Numerical resharding equivalence (tp>1 & sp>1, >=4 ranks) is the GPU gate, added next. Real RDMA Mooncake backend + cross-node deploy need infra not available here; the seam + contract are locked down.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
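The B5 and B9 behaviour described in this commit can be illustrated with a toy store. This is a minimal sketch, not the code in this PR: `ToySharedDirStore`, `ToyAuthPolicy`, `ToyRef`, and `StaleRefError` are hypothetical names, the generation tag is assumed to be a random per-put identifier, and pickle files stand in for the real feature payloads. Only `PermissionError` on a bad token and "get() raises after release or re-put" come from the commit message.

```python
import hmac
import pickle
import tempfile
import uuid
from dataclasses import dataclass
from pathlib import Path


class StaleRefError(LookupError):
    """Hypothetical error type for a released or superseded ref."""


@dataclass(frozen=True)
class ToyRef:
    # Metadata only: this is all the control plane would carry.
    sample_id: str
    generation: str


class ToyAuthPolicy:
    """Assumed shape of a shared-secret gate (B9)."""

    def __init__(self, secret: bytes):
        self._secret = secret

    def check(self, token):
        # Constant-time comparison; a missing token fails the same way.
        if token is None or not hmac.compare_digest(self._secret, token):
            raise PermissionError("missing or mismatched token")


class ToySharedDirStore:
    def __init__(self, root, policy, token):
        policy.check(token)  # attach-time gate
        self._root = Path(root)
        self._policy = policy

    def _path(self, sample_id):
        return self._root / f"{sample_id}.pkl"

    def put(self, sample_id, payload):
        # Every put gets a fresh generation tag, so refs handed out for an
        # earlier put of the same sample_id no longer match.
        ref = ToyRef(sample_id, uuid.uuid4().hex)
        self._path(sample_id).write_bytes(pickle.dumps((ref.generation, payload)))
        return ref

    def release(self, ref):
        # This toy does not guard against a stale ref releasing a newer payload.
        self._path(ref.sample_id).unlink(missing_ok=True)

    def get(self, ref, token):
        self._policy.check(token)  # data-path gate
        path = self._path(ref.sample_id)
        if not path.exists():
            raise StaleRefError(f"{ref.sample_id} was released")
        generation, payload = pickle.loads(path.read_bytes())
        if generation != ref.generation:
            raise StaleRefError(f"{ref.sample_id} was re-put; ref is stale")
        # Deserializing yields a fresh object on every fetch (clone-on-fetch).
        return payload


# Two store objects over one directory stand in for producer and consumer
# processes; they share nothing except the directory and the ref.
with tempfile.TemporaryDirectory() as root:
    policy = ToyAuthPolicy(b"shared-secret")
    producer = ToySharedDirStore(root, policy, b"shared-secret")
    consumer = ToySharedDirStore(root, policy, b"shared-secret")
    ref = producer.put("s0", [1, 2, 3])
    fetched = consumer.get(ref, b"shared-secret")
```

Because the generation tag travels inside the ref, the consumer can detect a stale ref from the ref plus the filesystem alone, without asking the producer.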
Spawns a 4-process tp2 x sp2 group; on each rank runs one offline EAGLE3 step through both the legacy path and the new TrainerCore/strategy/FSDP-backend path on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm reduction parity. This is the falsifiable scale-out gate (not FSDP-only).

Adds _fixtures.init_rank_distributed for multi-process TP x SP group setup. Runs on the 4xH200 pod via rcli.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
M6 · Disaggregation seam + resharding contract + security (B5/B9) + ≥4-rank gate
PR 4 of the M5–M7 stack. Framework-first M6: the seams a real Mooncake/RDMA backend slots behind, contract-tested on CPU now (no multi-node infra required), plus the falsifiable scale-out gate.
What's in this PR
- SharedDirFeatureStore (data_plane/disaggregated.py): a disaggregated FeatureStore over a shared directory. Producer (rollout) and consumer (trainer) run as separate processes that share only the directory; get() resolves a sample from the SampleRef + filesystem alone (a true cross-process boundary), and the control plane still moves only metadata. A real MooncakeFeatureStore swaps the shared-dir transport for RDMA behind this same API.
- B5 (no use-after-free): get() after release/abort raises; a generation guard rejects a stale ref after a re-put; clone-on-fetch is the default.
- B9 (auth in disaggregated mode): AuthPolicy shared-secret gate at attach time and on the data path; a missing/mismatched token is a PermissionError.
- Resharding contract: SampleRefQueue.get(partition=(index, num)) + dp_partition(sample_id, num). Partitioning is a consumer-side decision over a stable key, so the same committed pool re-distributes cleanly when the DP width changes: no ref is leased twice or dropped across a reshard.
- ≥4-rank gate (test_equiv_4rank.py): spawns a 4-process tp2×sp2 group and, on each rank, runs one offline EAGLE3 step through both the legacy path and the new TrainerCore/strategy/FSDP-backend path on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm parity. The "scale-out claim met, not FSDP-only" gate.

Gate tests
- tests/test_runtime/test_disaggregated.py (cross-process, B5 use-after-free, B9 auth)
- tests/test_runtime/test_resharding.py (consumer re-partitions a stable pool)
- tests/test_runtime/test_equiv_4rank.py (GPU ×4 + flash-attn v2)

Known constraints (documented in specforge/runtime/README.md)
- A real RDMA MooncakeFeatureStore, cross-node deployment, and the head-to-head accept-length benchmark vs TorchSpec need infra/baselines not present here; the seams are built and contract-tested with the local disaggregated backend.
- test_equiv_4rank.py needs a free 4-GPU node and flash-attn v2; the cached sglang:dev image ships flash-attn v4 (incompatible API), so it skips there and runs on a training image with v2.

Stack
Base: runtime-m5-recovery (PR 3). Note: the consumer-side stale-release/atomic-publish hardening of SharedDirFeatureStore and the usp-backend fix for the 4-rank test land in PR 6 (runtime-hardening).

🤖 Generated with Claude Code
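The resharding contract in this description can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the name `dp_partition(sample_id, num)` comes from the description, but its body here (SHA-256 of the id, modulo the DP width) is assumed, and `partition_pool` and `shards` are hypothetical helpers standing in for `SampleRefQueue.get(partition=(index, num))`.

```python
import hashlib


def dp_partition(sample_id: str, num: int) -> int:
    """Map a sample id to a DP rank in [0, num).

    Assumed implementation: a cryptographic digest is used because it is
    identical in every process, unlike Python's built-in hash() on str,
    which is randomized per interpreter unless PYTHONHASHSEED is fixed.
    """
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num


def partition_pool(pool, index, num):
    """Consumer-side filter: the ids that rank `index` of `num` would lease."""
    return [sid for sid in pool if dp_partition(sid, num) == index]


def shards(pool, num):
    """All per-rank views of the same committed pool at DP width `num`."""
    return [partition_pool(pool, rank, num) for rank in range(num)]


# The committed pool never changes; only the consumer's view of it does.
pool = [f"sample-{i}" for i in range(1000)]

for width in (2, 4, 3):  # simulate resharding across DP widths
    flat = [sid for part in shards(pool, width) for sid in part]
    assert len(flat) == len(set(flat))   # no sample leased twice
    assert sorted(flat) == sorted(pool)  # no sample dropped
```

Since each rank computes its own membership from `sample_id` alone, changing the DP width needs no coordination with the producer and no rewrite of the pool.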