
feat(runtime/m6): MooncakeFeatureStore — RDMA fast-path for disaggregation #17

Closed
maocheng23 wants to merge 5 commits into runtime-hardening from runtime-m6-mooncake-store

Conversation

@maocheng23

Owner

What

Adds the M6 fast-path FeatureStore backend that disaggregated.py names but does not implement. MooncakeFeatureStore backs the data plane with the Mooncake distributed object store (cross-node RDMA/TCP zero-copy) behind the unchanged FeatureStore API. The producer calls put() on one node and the consumer calls get() on another, peer-to-peer, with no shared data mount.

Stacked on #16 (kept separate to keep that PR small); base will retarget to main once #16 lands.

Contents

  • data_plane/mooncake_store.py: MooncakeFeatureStore stores one hard-pinned object per sample, so Mooncake's cache-LRU never silently evicts a committed-unacked feature; SpecForge is the sole lifetime authority via explicit remove(). It carries the contract: B5 no-use-after-free (KeyError after release/abort, generation guard, clone-on-fetch), B9 auth, consume-once free, retain_on_release, max_resident_bytes backpressure, and max-hold gc. It also re-instates the fallible-free retry seam (_release_pending + gc retry) that SharedDir dropped, since remote remove() is a real RPC. mooncake is imported lazily and the store is injectable, so the data plane imports without the package.
  • tests/test_runtime/test_mooncake_store.py: 15 CPU contract tests against an in-memory fake (including equivalence with LocalFeatureStore and the release-pending retry path), plus a real-Mooncake e2e test gated on the package and the MOONCAKE_* env vars.
  • Example wiring: examples/disagg/run_disagg_eagle3.py selects the backend via DISAGG_BACKEND (default shared_dir); the producer holds its segment open until the consumer signals done (Mooncake objects live in the producer's segment). README + .sh document it.
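For readers unfamiliar with the contract, here is a minimal sketch of three of the items listed above (generation guard, clone-on-fetch, and KeyError after release) against an in-memory fake. It is not the code in mooncake_store.py: FakeObjectStore, SketchFeatureStore, and the (sample_id, generation) ref tuple are hypothetical names chosen for illustration, and auth, backpressure, and gc are left out.

```python
import copy


class FakeObjectStore:
    """Hypothetical in-memory stand-in for a remote object store."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def remove(self, key):
        self._objects.pop(key, None)


class SketchFeatureStore:
    """Illustrative only: generation guard, clone-on-fetch, KeyError after release."""

    def __init__(self, store, retain_on_release=False):
        self._store = store  # injected, so a fake can replace the real backend
        self._retain = retain_on_release
        self._next_gen = {}  # sample_id -> last generation handed out (never reset)
        self._live = {}      # sample_id -> generation that is currently readable

    def put(self, sample_id, feature):
        gen = self._next_gen.get(sample_id, 0) + 1
        self._next_gen[sample_id] = gen
        self._live[sample_id] = gen
        self._store.put(sample_id, {"generation": gen, "feature": feature})
        return (sample_id, gen)

    def get(self, ref):
        sample_id, gen = ref
        # Generation guard: a ref from before a re-put or a release is rejected.
        if self._live.get(sample_id) != gen:
            raise KeyError(ref)
        blob = self._store.get(sample_id)
        # Clone-on-fetch: the caller never aliases the stored object.
        return copy.deepcopy(blob["feature"])

    def release(self, ref):
        sample_id, gen = ref
        if self._live.get(sample_id) != gen:
            return  # stale ref: must not free the current generation
        if not self._retain:
            self._store.remove(sample_id)
            del self._live[sample_id]
```

Because the backend is injected, the same assertions can in principle run against a fake on CPU and against a real store in a gated e2e test.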

Scope / follow-ups

  • Correct for the offline single-consumer path M6 ships (in-process gen/lease index, mirroring SharedDir's documented single-host limitation). The shared lease/generation index for the online multi-node path is a separate follow-up (needs a networked metadata service — not SQLite).
  • Real RDMA cross-node e2e is being validated on a Mooncake-enabled 2-node H200 pod; CPU contract tests + the equivalence gate pass locally (144 runtime tests).

🤖 Generated with Claude Code

@maocheng23 maocheng23 force-pushed the runtime-m6-disagg-example branch from 549c301 to e7bb5e1 Compare June 27, 2026 18:37
@maocheng23 maocheng23 force-pushed the runtime-m6-mooncake-store branch from 60aee1c to 888c8d7 Compare June 27, 2026 18:38
maocheng23 and others added 5 commits June 27, 2026 17:24
…isaggregation

Adds the M6 fast-path FeatureStore backend named by disaggregated.py: backs the
data plane with the Mooncake distributed object store (cross-node RDMA zero-copy)
behind the unchanged FeatureStore API, so producer put() on one node and consumer
get() on another move bytes peer-to-peer instead of via a shared mount.

- One hard-pinned Mooncake object per sample (torch.save blob with embedded
  generation) so Mooncake's cache-LRU never silently evicts a committed-unacked
  feature; SpecForge is the sole lifetime authority via explicit remove().
- Carries the contract: B5 no-use-after-free (KeyError after release/abort,
  generation guard on re-put, clone-on-fetch), B9 shared-secret auth, consume-once
  free, retain_on_release for re-iterable offline epochs, max_resident_bytes
  backpressure, and max-hold gc.
- Re-instates the fallible-free retry seam SharedDirFeatureStore dropped: a failed
  remote remove() parks in _release_pending and gc() retries (Mooncake remove is a
  real RPC). Generation-aware lease accounting (a stale lease never pins the
  current generation).
- store= is injectable; mooncake is imported lazily, so the data plane imports
  without the package and the contract is unit-tested against an in-memory fake.
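
The retry seam described above can be sketched as follows. This is an illustration under assumptions, not the committed code: FlakyRemote and RetryingReleaser are hypothetical names, and the real store may classify failures differently. Only the _release_pending name and the "gc() retries" behavior come from this commit message.

```python
class FlakyRemote:
    """Hypothetical fake whose remove() fails a fixed number of times."""

    def __init__(self, failures):
        self.objects = {}
        self._failures = failures

    def remove(self, key):
        if self._failures > 0:
            self._failures -= 1
            raise ConnectionError("simulated RPC failure")
        self.objects.pop(key, None)


class RetryingReleaser:
    """Parks failed removes and retries them from gc()."""

    def __init__(self, remote):
        self._remote = remote
        self._release_pending = set()

    def release(self, key):
        try:
            self._remote.remove(key)
        except Exception:
            # The free is fallible: remember the key instead of losing it.
            self._release_pending.add(key)
            return False
        return True

    def gc(self):
        reclaimed = 0
        for key in list(self._release_pending):
            try:
                self._remote.remove(key)
            except Exception:
                continue  # still failing; leave it parked for the next gc()
            self._release_pending.discard(key)
            reclaimed += 1
        return reclaimed
```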

Scope: offline single-consumer path (in-process gen/lease index, mirroring
SharedDirFeatureStore's documented single-host limitation). Online multi-node
needs a shared metadata index — separate follow-up. Real RDMA e2e test gated on
the mooncake package; 15 fake-backed contract tests run on CPU.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…=mooncake)

The disagg example can now route features through MooncakeFeatureStore instead of
the shared mount: _store() builds the backend from DISAGG_BACKEND (default
shared_dir) + MOONCAKE_* env. Because a Mooncake object lives in the producer's
memory segment, the producer holds open until the consumer writes
<manifest>.consumed (or DISAGG_PRODUCER_HOLD_S elapses); shared_dir is unchanged.
Documents the backend switch + caveats in the README and the .sh wrapper (opt-in,
commented).
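
A rough sketch of the selection and hold logic described above, under stated assumptions: select_backend and hold_until_consumed are hypothetical helper names (the example's actual function is _store()), the marker is assumed to be the manifest path with ".consumed" appended, and the polling interval is invented for illustration.

```python
import os
import time
from pathlib import Path


def select_backend(env=None):
    """Read DISAGG_BACKEND, defaulting to shared_dir as the commit describes."""
    env = os.environ if env is None else env
    backend = env.get("DISAGG_BACKEND", "shared_dir")
    if backend not in ("shared_dir", "mooncake"):
        raise ValueError(f"unknown DISAGG_BACKEND: {backend!r}")
    return backend


def hold_until_consumed(manifest_path, hold_s, poll_s=0.5):
    """Block until the consumer's marker appears or hold_s elapses.

    Returns True if the marker was seen, False on timeout.
    """
    marker = Path(str(manifest_path) + ".consumed")  # assumed marker location
    deadline = time.monotonic() + hold_s
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll_s)
    return marker.exists()
```

In this sketch the producer would call hold_until_consumed() only when select_backend() returns "mooncake", since shared_dir features outlive the producer process.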

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…NT_SIZE

The default 1 GiB segment only fits tiny feature sets; expose
MOONCAKE_GLOBAL_SEGMENT_SIZE / MOONCAKE_LOCAL_BUFFER_SIZE so the contributed
store memory can hold the hard-pinned feature set for a real run.
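
To make the sizing concern concrete, here is a back-of-the-envelope estimate. The shapes are illustrative assumptions, not measurements from this PR: 2048 tokens per sample, hidden size 4096, three captured hidden-state layers, 2 bytes per element. Serialization overhead and Mooncake's own metadata are ignored.

```python
GIB = 1024 ** 3


def samples_that_fit(segment_bytes, seq_len, hidden_size, num_layers, bytes_per_elem=2):
    """Return (bytes per sample, whole samples that fit in the segment)."""
    per_sample = seq_len * hidden_size * num_layers * bytes_per_elem
    return per_sample, segment_bytes // per_sample


per_sample, count = samples_that_fit(GIB, seq_len=2048, hidden_size=4096, num_layers=3)
print(per_sample // (1024 ** 2), "MiB per sample;", count, "samples fit in 1 GiB")
# prints: 48 MiB per sample; 21 samples fit in 1 GiB
```

Under these assumptions a 1 GiB segment holds about 21 hard-pinned samples, which is why the segment size needs to be configurable for a real run.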

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…deferred remove

Mooncake's remove() is lease-deferred (objects keep a short read-lease), so the
bytes can linger after release/abort and a just-read object's is_exist() stays 1
within the lease window — breaking the B5 'get after release raises' contract
that SharedDir/Local guarantee synchronously. Track (sample_id, generation)
freed in-process and reject get() of a freed ref immediately, while physical
reclamation stays lease-deferred / gc-retried. Adds a lease-defer regression
test (fake remove() reports success but keeps the object).
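
The fix and its regression test can be sketched like this. It is a simplified illustration, not the committed code: LeaseDeferredFake and FreedTrackingStore are hypothetical names, the key format is invented, and the fake's remove() return value of 0 is an assumed success code. Only is_exist() staying 1 after remove and the in-process (sample_id, generation) freed set come from the description above.

```python
class LeaseDeferredFake:
    """Hypothetical fake: remove() reports success but keeps the object."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value

    def get(self, key):
        return self.objects[key]

    def is_exist(self, key):
        return 1 if key in self.objects else 0

    def remove(self, key):
        return 0  # assumed success code; bytes linger, as in a read-lease window


class FreedTrackingStore:
    """Rejects get() of a freed ref even while the remote still has the bytes."""

    def __init__(self, remote):
        self._remote = remote
        self._freed = set()  # (sample_id, generation) pairs freed in-process

    @staticmethod
    def _key(ref):
        sample_id, generation = ref
        return f"{sample_id}@{generation}"  # illustrative key format

    def put(self, sample_id, generation, value):
        ref = (sample_id, generation)
        self._remote.put(self._key(ref), value)
        return ref

    def get(self, ref):
        if ref in self._freed:
            raise KeyError(ref)  # logical free is immediate
        return self._remote.get(self._key(ref))

    def release(self, ref):
        self._freed.add(ref)
        self._remote.remove(self._key(ref))  # physical reclamation may lag
```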

Found via real 2-node Mooncake e2e on sci-h200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the runtime-m6-mooncake-store branch from 888c8d7 to 6b3e016 Compare June 28, 2026 00:24
@maocheng23 maocheng23 changed the base branch from runtime-m6-disagg-example to runtime-hardening June 28, 2026 00:24
@maocheng23

Owner Author

Superseded: upstreamed + merged as sgl-project#612 (mooncake, in up-11); zero-copy follow-up is sgl-project#621. Closing this fork-internal PR.

@maocheng23 maocheng23 closed this Jun 29, 2026