Skip to content

[DataFlow runtime · M6] MooncakeFeatureStore zero-copy transport (put_from/get_into)#621

Closed
maocheng23 wants to merge 3 commits into
sgl-project:dataflow-up-11-m5-recoveryfrom
maocheng23:dataflow-up-16-zerocopy
Closed

[DataFlow runtime · M6] MooncakeFeatureStore zero-copy transport (put_from/get_into)#621
maocheng23 wants to merge 3 commits into
sgl-project:dataflow-up-11-m5-recoveryfrom
maocheng23:dataflow-up-16-zerocopy

Conversation

@maocheng23

Copy link
Copy Markdown
Collaborator

Zero-copy transport for MooncakeFeatureStore: replaces torch.save/load pickle with native DMA — one hard-pinned object per tensor (key {store_id}/{sid}/g{gen}/{name}); put()put_from(ptr, nbytes, ReplicateConfig), get()→alloc from the ref's FeatureSpec + get_into(ptr, nbytes); shape/dtype travel on the ref. Generation-in-key preserves B5. Falls back to pickle if the backend lacks put_from/get_into (capability checked via callable(getattr(...))).

  • RDMA gotcha (found + fixed on H200): put_from needs the source buffer register_buffer'd and get_into the dst, or you get AddressNotRegistered (-800) / probe -600. TCP doesn't need it. Both ends are registered around the DMA.
  • Rejects zero-copy get_into short reads; dropped dead _tensor_keys.
  • Validated 2×H200 cross-node over both TCP and RDMA; 28 unit tests green.

Stacked on #608 (dataflow-up-11-m5-recovery); merge after it. Supersedes #614, which auto-closed when its base branch dataflow-up-15-mooncake was deleted as the disagg stack merged.

🤖 Generated with Claude Code

maocheng23 and others added 2 commits June 28, 2026 20:41
Replace the torch.save/torch.load pickle round-trip with Mooncake's native
raw-buffer DMA. One hard-pinned object per *tensor*, keyed
{store_id}/{sid}/g{gen}/{name}: put() writes each tensor straight from its
storage via put_from(ptr); get() reads each straight into a tensor allocated
from the ref's FeatureSpec via get_into(ptr). Shape/dtype travel on the ref, so
there is no serialized header. The generation lives in the key (like
SharedDirFeatureStore's filename gen), so a re-put supersedes the old key set
and a stale ref's keys are gone -> get() raises (B5). Hard-pin is preserved
(put_from carries the ReplicateConfig). Falls back to the pickle blob path when
the backend lacks put_from/get_into (zero_copy=False or an older mooncake).

Source + receive buffers are registered with the transfer engine around the DMA:
RDMA rejects an unregistered address (AddressNotRegistered, -800); TCP ignores
the registration. Validated cross-node on 2x H200 over BOTH tcp and rdma --
producer put_from on node 0, consumer get_into + FSDP train on node 1.

Tests: the fake now simulates put_from/get_into via ctypes, so the full contract
runs on the zero-copy path (28 tests, real gated test green on a live master).
Adds zero-copy specifics (per-tensor keys, no-pickle-on-the-wire, re-put
supersede, pickle fallback) + cross-process contract tests (producer/consumer as
separate instances over one backend) + abort-under-lease-defer and
re-put-remove-failure edge cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…or_keys

_store_get_tensor only checked get_into's return for a negative error code, so
a short read (0 <= rc < nb) into the freshly torch.empty'd receive buffer was
accepted, handing the trainer a tensor with an uninitialized garbage tail.
Unlike the pickle path (torch.load reconstructs whole tensors), the raw-buffer
path cannot otherwise detect under-fill. get_into returns the bytes read (a
full read == nb), so require rc == nb and raise KeyError on a short read. Adds
a zero-copy regression test that truncates the stored blob.

Also remove the dead _tensor_keys() helper: it has no call sites, while
_try_physical_free and _sample_exists each inline the same per-tensor-key
comprehension.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 requested a review from FrankLeeeee as a code owner June 29, 2026 19:28
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@jiapingW jiapingW deleted the branch sgl-project:dataflow-up-11-m5-recovery June 30, 2026 11:15
@jiapingW jiapingW closed this Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants