[DataFlow runtime · M6] MooncakeFeatureStore zero-copy transport (put_from/get_into)#621
Closed
maocheng23 wants to merge 3 commits into
Closed
Conversation
Replace the torch.save/torch.load pickle round-trip with Mooncake's native
raw-buffer DMA. One hard-pinned object per *tensor*, keyed
{store_id}/{sid}/g{gen}/{name}: put() writes each tensor straight from its
storage via put_from(ptr); get() reads each straight into a tensor allocated
from the ref's FeatureSpec via get_into(ptr). Shape/dtype travel on the ref, so
there is no serialized header. The generation lives in the key (like
SharedDirFeatureStore's filename gen), so a re-put supersedes the old key set
and a stale ref's keys are gone -> get() raises (B5). Hard-pin is preserved
(put_from carries the ReplicateConfig). Falls back to the pickle blob path when
the backend lacks put_from/get_into (zero_copy=False or an older mooncake).
Source + receive buffers are registered with the transfer engine around the DMA:
RDMA rejects an unregistered address (AddressNotRegistered, -800); TCP ignores
the registration. Validated cross-node on 2x H200 over BOTH tcp and rdma --
producer put_from on node 0, consumer get_into + FSDP train on node 1.
Tests: the fake now simulates put_from/get_into via ctypes, so the full contract
runs on the zero-copy path (28 tests, real gated test green on a live master).
Adds zero-copy specifics (per-tensor keys, no-pickle-on-the-wire, re-put
supersede, pickle fallback) + cross-process contract tests (producer/consumer as
separate instances over one backend) + abort-under-lease-defer and
re-put-remove-failure edge cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…or_keys _store_get_tensor only checked get_into's return for a negative error code, so a short read (0 <= rc < nb) into the freshly torch.empty'd receive buffer was accepted, handing the trainer a tensor with an uninitialized garbage tail. Unlike the pickle path (torch.load reconstructs whole tensors), the raw-buffer path cannot otherwise detect under-fill. get_into returns the bytes read (a full read == nb), so require rc == nb and raise KeyError on a short read. Adds a zero-copy regression test that truncates the stored blob. Also remove the dead _tensor_keys() helper: it has no call sites, while _try_physical_free and _sample_exists each inline the same per-tensor-key comprehension. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
This was referenced Jun 29, 2026
feat(runtime): disaggregated offline EAGLE3 assemble example + 2-node 7B e2e
maocheng23/SpecForge#16
Closed
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Zero-copy transport for
MooncakeFeatureStore: replaces torch.save/load pickle with native DMA — one hard-pinned object per tensor (key{store_id}/{sid}/g{gen}/{name});put()→put_from(ptr, nbytes, ReplicateConfig),get()→alloc from the ref's FeatureSpec +get_into(ptr, nbytes); shape/dtype travel on the ref. Generation-in-key preserves B5. Falls back to pickle if the backend lacksput_from/get_into(capability checked viacallable(getattr(...))).put_fromneeds the source bufferregister_buffer'd andget_intothe dst, or you getAddressNotRegistered (-800)/ probe-600. TCP doesn't need it. Both ends are registered around the DMA.get_intoshort reads; dropped dead_tensor_keys.Stacked on #608 (
dataflow-up-11-m5-recovery); merge after it. Supersedes #614, which auto-closed when its base branchdataflow-up-15-mooncakewas deleted as the disagg stack merged.🤖 Generated with Claude Code