feat(runtime/m6): MooncakeFeatureStore — RDMA fast-path for disaggregation#17
Closed
maocheng23 wants to merge 5 commits into
…isaggregation

Adds the M6 fast-path FeatureStore backend named by disaggregated.py: backs the data plane with the Mooncake distributed object store (cross-node RDMA zero-copy) behind the unchanged FeatureStore API, so producer put() on one node and consumer get() on another move bytes peer-to-peer instead of via a shared mount.

- One hard-pinned Mooncake object per sample (torch.save blob with embedded generation) so Mooncake's cache-LRU never silently evicts a committed-unacked feature; SpecForge is the sole lifetime authority via explicit remove().
- Carries the contract: B5 no-use-after-free (KeyError after release/abort, generation guard on re-put, clone-on-fetch), B9 shared-secret auth, consume-once free, retain_on_release for re-iterable offline epochs, max_resident_bytes backpressure, and max-hold gc.
- Re-instates the fallible-free retry seam SharedDirFeatureStore dropped: a failed remote remove() parks in _release_pending and gc() retries (Mooncake remove is a real RPC). Generation-aware lease accounting (a stale lease never pins the current generation).
- store= is injectable; mooncake is imported lazily, so the data plane imports without the package and the contract is unit-tested against an in-memory fake.

Scope: offline single-consumer path (in-process gen/lease index, mirroring SharedDirFeatureStore's documented single-host limitation). Online multi-node needs a shared metadata index, which is a separate follow-up.

Real RDMA e2e test gated on the mooncake package; 15 fake-backed contract tests run on CPU.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
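The fallible-free retry seam in this commit message can be illustrated with a minimal sketch. It is not the PR's code: `FlakyFakeStore` and `RetryingReleaser` are hypothetical names, the fake store is an in-memory stand-in rather than Mooncake's API, and only `_release_pending`, `remove()`, and `gc()` are taken from the message itself. It assumes a failed remote remove surfaces as an exception.

```python
class FlakyFakeStore:
    """In-memory stand-in for a remote object store (hypothetical, not Mooncake's API)."""

    def __init__(self, fail_removes=0):
        self.objects = {}
        self.fail_removes = fail_removes  # how many remove() calls will fail

    def put(self, key, blob):
        self.objects[key] = blob

    def remove(self, key):
        # Simulate a remote RPC that can fail transiently.
        if self.fail_removes > 0:
            self.fail_removes -= 1
            raise ConnectionError("simulated RPC failure")
        self.objects.pop(key, None)


class RetryingReleaser:
    """Parks failed frees and retries them on gc() (assumed shape of the seam)."""

    def __init__(self, store):
        self._store = store
        self._release_pending = set()

    def release(self, key):
        try:
            self._store.remove(key)
        except ConnectionError:
            # The free failed: remember the key so gc() can retry later.
            self._release_pending.add(key)

    def gc(self):
        # Retry every parked free; keys that fail again stay parked.
        for key in list(self._release_pending):
            try:
                self._store.remove(key)
            except ConnectionError:
                continue
            self._release_pending.discard(key)
        return len(self._release_pending)
```

With `fail_removes=1`, the first `release()` leaves the object in the store and parks its key; the next `gc()` removes it and returns 0 pending keys.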
…=mooncake)

The disagg example can now route features through MooncakeFeatureStore instead of the shared mount: _store() builds the backend from DISAGG_BACKEND (default shared_dir) + MOONCAKE_* env. Because a Mooncake object lives in the producer's memory segment, the producer holds open until the consumer writes <manifest>.consumed (or DISAGG_PRODUCER_HOLD_S elapses); shared_dir is unchanged. Documents the backend switch + caveats in the README and the .sh wrapper (opt-in, commented).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
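A rough sketch of the two behaviours this commit describes, under stated assumptions: `select_backend` and `hold_until_consumed` are hypothetical helpers (the example's real entry point is `_store()`, whose body is not shown here), the rejection of unknown backend names is an assumption, and the polling interval `poll_s` is invented for illustration. Only `DISAGG_BACKEND`, the `shared_dir` default, the `MOONCAKE_*` prefix, the `<manifest>.consumed` marker, and the hold timeout come from the message.

```python
import os
import time


def select_backend(env=None):
    """Pick the backend name and collect MOONCAKE_* settings from the environment."""
    env = os.environ if env is None else env
    backend = env.get("DISAGG_BACKEND", "shared_dir")
    if backend not in ("shared_dir", "mooncake"):
        # Assumption: unknown names are rejected rather than silently ignored.
        raise ValueError(f"unknown DISAGG_BACKEND: {backend!r}")
    if backend != "mooncake":
        return backend, {}
    options = {k: v for k, v in env.items() if k.startswith("MOONCAKE_")}
    return backend, options


def hold_until_consumed(manifest_path, hold_s, poll_s=0.05):
    """Block until '<manifest>.consumed' exists or hold_s seconds elapse.

    Returns True if the consumer's marker file was seen, False on timeout.
    """
    marker = manifest_path + ".consumed"
    deadline = time.monotonic() + hold_s
    while time.monotonic() < deadline:
        if os.path.exists(marker):
            return True
        time.sleep(poll_s)
    return os.path.exists(marker)
```

The producer would call something like `hold_until_consumed(manifest, float(os.environ["DISAGG_PRODUCER_HOLD_S"]))` before exiting, so its memory segment stays alive while the consumer is still reading.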
…NT_SIZE

The default 1 GiB segment only fits tiny feature sets; expose MOONCAKE_GLOBAL_SEGMENT_SIZE / MOONCAKE_LOCAL_BUFFER_SIZE so the contributed store memory can hold the hard-pinned feature set for a real run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…deferred remove

Mooncake's remove() is lease-deferred (objects keep a short read-lease), so the bytes can linger after release/abort and a just-read object's is_exist() stays 1 within the lease window. This breaks the B5 'get after release raises' contract that SharedDir/Local guarantee synchronously. Track (sample_id, generation) freed in-process and reject get() of a freed ref immediately, while physical reclamation stays lease-deferred / gc-retried. Adds a lease-defer regression test (fake remove() reports success but keeps the object). Found via real 2-node Mooncake e2e on sci-h200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
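The freed-ref tracking described in this commit can be sketched as follows. This is an illustration only: `LeaseDeferredFake`, `TombstoningStore`, the `_freed` set, and the `sample_id@generation` key layout are all hypothetical, and the fake models the regression test's scenario (remove() reports success but the object stays) rather than Mooncake's real client.

```python
class LeaseDeferredFake:
    """Fake store whose remove() claims success but keeps the bytes,
    imitating an object that lingers inside a read-lease window."""

    def __init__(self):
        self.objects = {}

    def put(self, key, blob):
        self.objects[key] = blob

    def get(self, key):
        return self.objects[key]

    def remove(self, key):
        return 0  # reports success; the object is deliberately not deleted


class TombstoningStore:
    """Rejects reads of freed (sample_id, generation) refs in-process,
    regardless of whether the backing store has reclaimed the bytes yet."""

    def __init__(self, store):
        self._store = store
        self._freed = set()  # in-process record of freed refs

    @staticmethod
    def _key(sample_id, generation):
        return f"{sample_id}@{generation}"  # hypothetical key layout

    def put(self, sample_id, generation, blob):
        self._store.put(self._key(sample_id, generation), blob)

    def get(self, sample_id, generation):
        if (sample_id, generation) in self._freed:
            # Fail immediately even if the bytes still exist remotely.
            raise KeyError((sample_id, generation))
        return self._store.get(self._key(sample_id, generation))

    def release(self, sample_id, generation):
        # Mark freed first so a concurrent-looking get() cannot slip through,
        # then ask the store to reclaim (which may be deferred).
        self._freed.add((sample_id, generation))
        self._store.remove(self._key(sample_id, generation))
```

After `release("s0", 1)`, the fake still holds the bytes under `"s0@1"`, yet `get("s0", 1)` raises `KeyError`, which is the synchronous behaviour the B5 contract asks for.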
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Superseded: upstreamed + merged as sgl-project#612 (mooncake, in up-11); zero-copy follow-up is sgl-project#621. Closing this fork-internal PR.
What

Adds the M6 fast-path FeatureStore backend that disaggregated.py names but didn't implement: MooncakeFeatureStore backs the data plane with the Mooncake distributed object store (cross-node RDMA/TCP zero-copy) behind the unchanged FeatureStore API. Producer put()s on one node and consumer get()s on another, peer-to-peer, with no shared data mount.

Stacked on #16 (kept separate to keep that PR small); base will retarget to main once #16 lands.

Contents

- data_plane/mooncake_store.py: MooncakeFeatureStore, with one hard-pinned object per sample (so Mooncake's cache-LRU never silently evicts a committed-unacked feature; SpecForge is the sole lifetime authority via explicit remove()). Carries the contract: B5 no-use-after-free (KeyError after release/abort, generation guard, clone-on-fetch), B9 auth, consume-once free, retain_on_release, max_resident_bytes backpressure, and max-hold gc. Re-instates the fallible-free retry seam (_release_pending + gc retry) that SharedDir dropped, since remote remove() is a real RPC. mooncake is imported lazily and the store is injectable, so the data plane imports without the package.
- tests/test_runtime/test_mooncake_store.py: 15 CPU contract tests against an in-memory fake (incl. equivalence vs LocalFeatureStore and the release-pending retry path) + a real-Mooncake e2e gated on the package + MOONCAKE_* env.
- examples/disagg/run_disagg_eagle3.py selects the backend via DISAGG_BACKEND (default shared_dir); the producer holds its segment open until the consumer signals done (Mooncake objects live in the producer's segment). README + .sh document it.

Scope / follow-ups
🤖 Generated with Claude Code