Skip to content

[DataFlow runtime · M6 1/4] Disaggregation seam: SharedDirFeatureStore + resharding + auth#609

Merged
jiapingW merged 5 commits into
dataflow-up-11-m5-recoveryfrom
dataflow-up-12-m6-disagg
Jun 29, 2026
Merged

[DataFlow runtime · M6 1/4] Disaggregation seam: SharedDirFeatureStore + resharding + auth#609
jiapingW merged 5 commits into
dataflow-up-11-m5-recoveryfrom
dataflow-up-12-m6-disagg

Conversation

@maocheng23

@maocheng23 maocheng23 commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

Adds the M6 disaggregation seam: SharedDirFeatureStore + AuthPolicy (B9) + SampleRefQueue consumer-side resharding (dp_partition).

B5 (no use-after-free) via a per-generation-file design: each generation is a distinct {sample_id}.g{gen}.ckpt published with a single atomic rename; a re-put removes superseded generations (so a stale ref's file is gone and its get() raises rather than aliasing fresh data); release() is generation-aware (frees only the generation its lease held); clone-on-fetch is the default. Generation is derived from disk, so re-put is monotonic across instances. retain_on_release supports offline re-iterable feature sets.

Resharding contract: partitioning is a consumer-side hash of sample_id (dp_partition), so the same committed pool re-distributes when DP width changes — no sample leased twice or dropped.

Scope: this is the CPU-testable reference backend that pins the contract; the RDMA fast path is MooncakeFeatureStore (#612). The generation/lease index is in-process (single-host); true multi-node liveness lifts it into a shared metadata store (later milestone). The per-generation B5 correctness was folded in here (from the M6 hardening pass) so the seam lands correct rather than fixed two PRs later.

Part of the DataFlow runtime M5/M6 stacked series (continues the M1–M4 work in #594#601 / #603). Stacked PRs — merge bottom-up (up-9 first). Lint (pre-commit) + runtime CPU test suite green.

🤖 Generated with Claude Code

maocheng23 and others added 3 commits June 27, 2026 11:37
…/B9)

Framework-first M6: build the seams a real Mooncake/RDMA backend slots behind,
contract-tested on CPU now (no multi-node infra required).

- SharedDirFeatureStore: a disaggregated FeatureStore over a shared directory.
  Producer and consumer are separate processes sharing only the dir; get()
  resolves from the ref + filesystem alone (true cross-process boundary), control
  plane still moves only SampleRef metadata. A real MooncakeFeatureStore swaps
  the shared-dir transport for RDMA behind this same API.
- B5 (no use-after-free): get() after release/abort raises; a generation guard
  rejects a stale ref after re-put; clone-on-fetch is the default.
- B9 (auth in disaggregated mode): AuthPolicy shared-secret gate at attach time
  and on the data path; missing/mismatched token is a PermissionError.
- Resharding contract: SampleRefQueue.get(partition=(index, num)) re-partitions a
  stable committed pool by a consumer-side hash of sample_id, so the same pool
  redistributes when DP width changes — no sample leased twice or dropped.

Numerical resharding equivalence (tp>1 & sp>1, >=4 ranks) is the GPU gate, added
next. Real RDMA Mooncake backend + cross-node deploy need infra not available
here; the seam + contract are locked down.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spawns a 4-process tp2 x sp2 group; on each rank runs one offline EAGLE3 step
through both the legacy path and the new TrainerCore/strategy/FSDP-backend path
on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm
reduction parity. This is the falsifiable scale-out gate (not FSDP-only).

Adds _fixtures.init_rank_distributed for multi-process TP x SP group setup.
Runs on the 4xH200 pod via rcli.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 requested a review from FrankLeeeee as a code owner June 28, 2026 00:33
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

maocheng23 and others added 2 commits June 28, 2026 20:40
…ase_partition helpers

No behavior change: get() now reads as reclaim -> lease(any|partition) -> wait.
Clarifies that partition_key (reserved producer-side hint) and partition
(consumer-side resharding control) are unrelated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lease (B5)

Replace the {sid}.ckpt + .gen-sidecar / two-step-publish design with a
per-generation filename {sid}.g{gen}.ckpt, a single atomic publish, and a
generation-aware release()/_free_gen_locked so a stale handle can never free a
freshly re-put generation. Generation is derived from disk (monotonic across
instances) instead of a per-process counter. Includes retain_on_release
(offline re-iterable mode) and the stale-reput cross-instance regression test.

This folds the disagg-correctness hardening down into the seam PR so #609 ships
a correct SharedDirFeatureStore rather than one fixed two PRs later.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
maocheng23 added a commit that referenced this pull request Jun 29, 2026
…ility, reconcile gate, USP 4-rank test

- LocalFeatureStore: gc() counts freed_bytes on a successful release-pending
  retry; release() sid hoist + comment; get() lock-scope comment.
- SQLiteMetadataStore: synchronous=FULL; all_committed_ids ORDER BY rowid.
- controller.reconcile_on_restart gates release on optimizer_durable (not
  global_step is not None).
- test_equiv_4rank uses attention_backend='usp' + a flash-attn skip guard.
- regression tests in test_feature_store / test_recovery.

(The disaggregated.py per-generation rewrite that was previously bundled here now
lives in #609 alongside the SharedDirFeatureStore it hardens.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jiapingW jiapingW merged commit e8363b1 into dataflow-up-11-m5-recovery Jun 29, 2026
1 check passed
@jiapingW jiapingW deleted the dataflow-up-12-m6-disagg branch June 29, 2026 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants