[DataFlow runtime · M6 1/4] Disaggregation seam: SharedDirFeatureStore + resharding + auth#609
Merged
Merged
Conversation
…/B9) Framework-first M6: build the seams a real Mooncake/RDMA backend slots behind, contract-tested on CPU now (no multi-node infra required). - SharedDirFeatureStore: a disaggregated FeatureStore over a shared directory. Producer and consumer are separate processes sharing only the dir; get() resolves from the ref + filesystem alone (true cross-process boundary), control plane still moves only SampleRef metadata. A real MooncakeFeatureStore swaps the shared-dir transport for RDMA behind this same API. - B5 (no use-after-free): get() after release/abort raises; a generation guard rejects a stale ref after re-put; clone-on-fetch is the default. - B9 (auth in disaggregated mode): AuthPolicy shared-secret gate at attach time and on the data path; missing/mismatched token is a PermissionError. - Resharding contract: SampleRefQueue.get(partition=(index, num)) re-partitions a stable committed pool by a consumer-side hash of sample_id, so the same pool redistributes when DP width changes — no sample leased twice or dropped. Numerical resharding equivalence (tp>1 & sp>1, >=4 ranks) is the GPU gate, added next. Real RDMA Mooncake backend + cross-node deploy need infra not available here; the seam + contract are locked down. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spawns a 4-process tp2 x sp2 group; on each rank runs one offline EAGLE3 step through both the legacy path and the new TrainerCore/strategy/FSDP-backend path on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm reduction parity. This is the falsifiable scale-out gate (not FSDP-only). Adds _fixtures.init_rank_distributed for multi-process TP x SP group setup. Runs on the 4xH200 pod via rcli. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
…ase_partition helpers No behavior change: get() now reads as reclaim -> lease(any|partition) -> wait. Clarifies that partition_key (reserved producer-side hint) and partition (consumer-side resharding control) are unrelated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lease (B5)
Replace the {sid}.ckpt + .gen-sidecar / two-step-publish design with a
per-generation filename {sid}.g{gen}.ckpt, a single atomic publish, and a
generation-aware release()/_free_gen_locked so a stale handle can never free a
freshly re-put generation. Generation is derived from disk (monotonic across
instances) instead of a per-process counter. Includes retain_on_release
(offline re-iterable mode) and the stale-reput cross-instance regression test.
This folds the disagg-correctness hardening down into the seam PR so #609 ships
a correct SharedDirFeatureStore rather than one fixed two PRs later.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
maocheng23
added a commit
that referenced
this pull request
Jun 29, 2026
…ility, reconcile gate, USP 4-rank test - LocalFeatureStore: gc() counts freed_bytes on a successful release-pending retry; release() sid hoist + comment; get() lock-scope comment. - SQLiteMetadataStore: synchronous=FULL; all_committed_ids ORDER BY rowid. - controller.reconcile_on_restart gates release on optimizer_durable (not global_step is not None). - test_equiv_4rank uses attention_backend='usp' + a flash-attn skip guard. - regression tests in test_feature_store / test_recovery. (The disaggregated.py per-generation rewrite that was previously bundled here now lives in #609 alongside the SharedDirFeatureStore it hardens.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jiapingW
approved these changes
Jun 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds the M6 disaggregation seam: SharedDirFeatureStore + AuthPolicy (B9) + SampleRefQueue consumer-side resharding (
dp_partition).B5 (no use-after-free) via a per-generation-file design: each generation is a distinct
{sample_id}.g{gen}.ckptpublished with a single atomic rename; a re-putremoves superseded generations (so a stale ref's file is gone and itsget()raises rather than aliasing fresh data);release()is generation-aware (frees only the generation its lease held); clone-on-fetch is the default. Generation is derived from disk, so re-putis monotonic across instances.retain_on_releasesupports offline re-iterable feature sets.Resharding contract: partitioning is a consumer-side hash of
sample_id(dp_partition), so the same committed pool re-distributes when DP width changes — no sample leased twice or dropped.Scope: this is the CPU-testable reference backend that pins the contract; the RDMA fast path is
MooncakeFeatureStore(#612). The generation/lease index is in-process (single-host); true multi-node liveness lifts it into a shared metadata store (later milestone). The per-generation B5 correctness was folded in here (from the M6 hardening pass) so the seam lands correct rather than fixed two PRs later.Part of the DataFlow runtime M5/M6 stacked series (continues the M1–M4 work in #594–#601 / #603). Stacked PRs — merge bottom-up (up-9 first). Lint (pre-commit) + runtime CPU test suite green.
🤖 Generated with Claude Code