
[DataFlow runtime · M6 4/4] MooncakeFeatureStore — RDMA fast-path backend #612

Merged
jiapingW merged 5 commits into dataflow-up-11-m5-recovery from dataflow-up-15-mooncake
Jun 29, 2026

@maocheng23

Collaborator

Adds MooncakeFeatureStore, which backs the FeatureStore contract with the Mooncake distributed object store (cross-node RDMA/TCP) behind the unchanged API: hard-pin on put, a lease-deferred-free tombstone for B5, and fallible-free retry. Validated on a 2-node H200 setup. mooncake is imported lazily, and the backend is selectable in the disagg example.

Part of the DataFlow runtime M5/M6 stacked series (continues #594#601 / #603). Stacked PRs — merge bottom-up (up-9 first). Lint (pre-commit) + runtime CPU test suite green.

🤖 Generated with Claude Code

maocheng23 and others added 5 commits June 28, 2026 20:41
…isaggregation

Adds the M6 fast-path FeatureStore backend named by disaggregated.py: backs the
data plane with the Mooncake distributed object store (cross-node RDMA zero-copy)
behind the unchanged FeatureStore API, so producer put() on one node and consumer
get() on another move bytes peer-to-peer instead of via a shared mount.

- One hard-pinned Mooncake object per sample (torch.save blob with embedded
  generation) so Mooncake's cache-LRU never silently evicts a committed-unacked
  feature; SpecForge is the sole lifetime authority via explicit remove().
- Carries the contract: B5 no-use-after-free (KeyError after release/abort,
  generation guard on re-put, clone-on-fetch), B9 shared-secret auth, consume-once
  free, retain_on_release for re-iterable offline epochs, max_resident_bytes
  backpressure, and max-hold gc.
- Re-instates the fallible-free retry seam SharedDirFeatureStore dropped: a failed
  remote remove() parks in _release_pending and gc() retries (Mooncake remove is a
  real RPC). Generation-aware lease accounting (a stale lease never pins the
  current generation).
- store= is injectable; mooncake is imported lazily, so the data plane imports
  without the package and the contract is unit-tested against an in-memory fake.
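
The fallible-free retry and generation guard described in the list above can be sketched against an in-memory stand-in. This is a minimal illustration, not the PR's code: `FlakyStore`, `RetryingFeatureStore`, the boolean return value of `remove()`, and the `sample_id@generation` key format are assumptions made for the example. Only the names `_release_pending` and `gc()` and the KeyError-after-release behavior come from the description.

```python
class FlakyStore:
    """Hypothetical in-memory stand-in for a remote object store.

    remove() fails (returns False) the first `fail_removes` times,
    imitating a remote RPC that can error out.
    """

    def __init__(self, fail_removes=0):
        self.objects = {}
        self.fail_removes = fail_removes

    def put(self, key, blob):
        self.objects[key] = blob

    def get(self, key):
        return self.objects[key]

    def remove(self, key):
        if self.fail_removes > 0:
            self.fail_removes -= 1
            return False
        self.objects.pop(key, None)
        return True


class RetryingFeatureStore:
    """Sketch: generation-guarded refs plus retry of failed removes."""

    def __init__(self, store):
        self._store = store
        self._latest = {}              # sample_id -> last generation issued
        self._live = set()             # (sample_id, generation) refs still valid
        self._release_pending = set()  # keys whose remote remove failed

    @staticmethod
    def _key(ref):
        sample_id, generation = ref
        return f"{sample_id}@{generation}"  # assumed key format

    def put(self, sample_id, blob):
        previous = (sample_id, self._latest.get(sample_id, 0))
        if previous in self._live:
            self.release(previous)     # a re-put retires the old generation
        ref = (sample_id, previous[1] + 1)
        self._latest[sample_id] = ref[1]
        self._store.put(self._key(ref), blob)
        self._live.add(ref)
        return ref

    def get(self, ref):
        if ref not in self._live:
            raise KeyError(ref)        # released, aborted, or stale generation
        return self._store.get(self._key(ref))

    def release(self, ref):
        self._live.discard(ref)        # logically freed right away
        if not self._store.remove(self._key(ref)):
            self._release_pending.add(self._key(ref))

    def gc(self):
        """Retry parked removes; return how many are still pending."""
        for key in list(self._release_pending):
            if self._store.remove(key):
                self._release_pending.discard(key)
        return len(self._release_pending)
```

In this sketch a failed remove never makes a released ref readable again: the ref leaves `_live` before the remote call, and only physical reclamation is deferred to `gc()`.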

Scope: offline single-consumer path (in-process gen/lease index, mirroring
SharedDirFeatureStore's documented single-host limitation). Online multi-node
needs a shared metadata index — separate follow-up. Real RDMA e2e test gated on
the mooncake package; 15 fake-backed contract tests run on CPU.
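
One way to get the "injectable store, lazy import" property is to defer the import until the backend is first needed. The sketch below is an assumption about the shape, not the PR's implementation: `LazyStoreHolder`, `FakeStore`, and the default `module_name` / `class_name` values are illustrative names chosen for the example.

```python
import importlib


class FakeStore:
    """Hypothetical in-memory fake used in place of the real client."""

    def __init__(self):
        self.objects = {}


class LazyStoreHolder:
    """Sketch: accept an injected store, else import the backend on first use."""

    def __init__(self, store=None, module_name="mooncake.store",
                 class_name="MooncakeDistributedStore"):
        # module_name and class_name are assumptions for this sketch.
        self._store = store
        self._module_name = module_name
        self._class_name = class_name

    def backend(self):
        if self._store is None:
            # Deferred import: constructing the holder never needs the package.
            module = importlib.import_module(self._module_name)
            self._store = getattr(module, self._class_name)()
        return self._store
```

With an injected fake, `backend()` never touches the import path, so contract tests can run on a CPU-only machine. A test that needs the real package could be skipped when `importlib.util.find_spec("mooncake")` returns `None`.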

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…=mooncake)

The disagg example can now route features through MooncakeFeatureStore instead of
the shared mount: _store() builds the backend from DISAGG_BACKEND (default
shared_dir) + MOONCAKE_* env. Because a Mooncake object lives in the producer's
memory segment, the producer holds open until the consumer writes
<manifest>.consumed (or DISAGG_PRODUCER_HOLD_S elapses); shared_dir is unchanged.
Documents the backend switch + caveats in the README and the .sh wrapper (opt-in,
commented).
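
The backend switch and the producer hold can be sketched as below. This is a hedged illustration: `select_backend` and `hold_until_consumed` are hypothetical helper names, the marker path is assumed to be the manifest path with `.consumed` appended, and the polling interval is an arbitrary choice. Only `DISAGG_BACKEND`, its `shared_dir` default, and the hold-until-consumed-or-timeout behavior come from the commit message.

```python
import os
import time
from pathlib import Path


def select_backend(env=os.environ):
    """Return the backend name, defaulting to shared_dir."""
    backend = env.get("DISAGG_BACKEND", "shared_dir")
    if backend not in ("shared_dir", "mooncake"):
        raise ValueError(f"unknown DISAGG_BACKEND: {backend!r}")
    return backend


def hold_until_consumed(manifest, hold_s, poll_s=0.5):
    """Block until <manifest>.consumed exists or hold_s seconds elapse.

    Returns True if the consumer's marker was seen, False on timeout.
    """
    marker = Path(str(manifest) + ".consumed")  # assumed marker location
    deadline = time.monotonic() + hold_s
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll_s)
    return marker.exists()
```

The hold matters only for the mooncake backend in this sketch, since a shared mount keeps the bytes available after the producer exits.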

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…NT_SIZE

The default 1 GiB segment only fits tiny feature sets; expose
MOONCAKE_GLOBAL_SEGMENT_SIZE / MOONCAKE_LOCAL_BUFFER_SIZE so the contributed
store memory can hold the hard-pinned feature set for a real run.
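
A rough sizing check shows why the default is tight. The sketch assumes the environment variable holds a plain integer byte count (the accepted format is not stated here), and the sample count and per-sample size are made-up numbers for illustration.

```python
import os

GIB = 1 << 30  # 1 GiB in bytes


def global_segment_bytes(env=os.environ):
    """Read the segment size, assuming an integer byte count; default 1 GiB."""
    return int(env.get("MOONCAKE_GLOBAL_SEGMENT_SIZE", GIB))


def fits(num_samples, bytes_per_sample, segment_bytes):
    """True if the hard-pinned feature set fits in the segment."""
    return num_samples * bytes_per_sample <= segment_bytes


# Hypothetical workload: 10,000 samples of 512 KiB each (about 4.88 GiB).
needed = 10_000 * 512 * 1024
```

Because every sample stays pinned until it is explicitly removed, the segment has to hold the whole resident set rather than a working subset.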

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…deferred remove

Mooncake's remove() is lease-deferred (objects keep a short read-lease), so the
bytes can linger after release/abort and a just-read object's is_exist() stays 1
within the lease window — breaking the B5 'get after release raises' contract
that SharedDir/Local guarantee synchronously. Track (sample_id, generation)
freed in-process and reject get() of a freed ref immediately, while physical
reclamation stays lease-deferred / gc-retried. Adds a lease-defer regression
test (fake remove() reports success but keeps the object).
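
The tombstone idea can be sketched with a fake whose `remove()` reports success but keeps the object, as the regression test above is described. `LeaseDeferredFake` and `TombstoningStore` are hypothetical names for this illustration, and passing the generation explicitly to `put()` is a simplification; the real store's method signatures are not shown in this PR text.

```python
class LeaseDeferredFake:
    """Hypothetical fake: remove() claims success while the bytes linger."""

    def __init__(self):
        self.objects = {}

    def put(self, key, blob):
        self.objects[key] = blob

    def get(self, key):
        return self.objects[key]

    def is_exist(self, key):
        return key in self.objects

    def remove(self, key):
        return True  # object deliberately kept, imitating a read-lease window


class TombstoningStore:
    """Sketch: reject reads of freed refs without asking the backend."""

    def __init__(self, store):
        self._store = store
        self._freed = set()  # (sample_id, generation) refs already released

    def put(self, sample_id, generation, blob):
        self._store.put((sample_id, generation), blob)

    def get(self, sample_id, generation):
        ref = (sample_id, generation)
        if ref in self._freed:
            raise KeyError(ref)  # freed in-process, even if bytes still exist
        return self._store.get(ref)

    def release(self, sample_id, generation):
        ref = (sample_id, generation)
        self._freed.add(ref)      # logical free is immediate
        self._store.remove(ref)   # physical free may lag behind
```

The check is keyed on the in-process record rather than on the backend's existence query, which is what keeps `get()` after release failing synchronously in this sketch.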

Found via real 2-node Mooncake e2e on sci-h200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the dataflow-up-14-hardening branch from c39507b to 3a2cc7f Compare June 29, 2026 03:57
@maocheng23 maocheng23 force-pushed the dataflow-up-15-mooncake branch from 6b3e016 to 1a3278b Compare June 29, 2026 03:57
Base automatically changed from dataflow-up-14-hardening to dataflow-up-11-m5-recovery June 29, 2026 16:05
@jiapingW jiapingW merged commit 65e9323 into dataflow-up-11-m5-recovery Jun 29, 2026
1 check passed
@jiapingW jiapingW deleted the dataflow-up-15-mooncake branch June 29, 2026 16:08

2 participants