runtime-m6 (4/6): disaggregation + resharding + auth (B5/B9) + 4-rank gate #12

Closed
maocheng23 wants to merge 3 commits into runtime-m5-recovery from runtime-m6-disagg

@maocheng23 (Owner)

M6 · Disaggregation seam + resharding contract + security (B5/B9) + ≥4-rank gate

PR 4 of the M5–M7 stack. Framework-first M6: the seams a real Mooncake/RDMA backend slots behind, contract-tested on CPU now (no multi-node infra required), plus the falsifiable scale-out gate.

What's in this PR

  • SharedDirFeatureStore (data_plane/disaggregated.py) — a disaggregated FeatureStore over a shared directory. Producer (rollout) and consumer (trainer) run as separate processes that share only the directory; get() resolves a sample from the SampleRef + filesystem alone (a true cross-process boundary), and the control plane still moves only metadata. A real MooncakeFeatureStore swaps the shared-dir transport for RDMA behind this same API.
    • B5 — no use-after-free: get() after release/abort raises; a generation guard rejects a stale ref after a re-put; clone-on-fetch is the default.
    • B9 — auth in disaggregated mode: AuthPolicy shared-secret gate at attach time and on the data path; a missing/mismatched token is a PermissionError.
  • Resharding contract: SampleRefQueue.get(partition=(index, num)) + dp_partition(sample_id, num). Partitioning is a consumer-side decision over a stable key, so the same committed pool redistributes cleanly when the DP width changes: no ref is leased twice or dropped across a reshard.
  • ≥4-rank tp>1 & sp>1 equivalence test (test_equiv_4rank.py) — spawns a 4-process tp2×sp2 group and, on each rank, runs one offline EAGLE3 step through both the legacy path and the new TrainerCore/strategy/FSDP-backend path on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm parity. The "scale-out claim met, not FSDP-only" gate.
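The resharding contract above can be illustrated with a minimal sketch. The names `dp_partition(sample_id, num)` and the `(index, num)` partition tuple come from this PR; the function body, the SHA-256 hash, and the `select` helper are assumptions made for illustration and are not taken from the actual implementation. The point is only that a process-stable hash over `sample_id` assigns each sample to exactly one partition for any DP width.

```python
import hashlib


def dp_partition(sample_id: str, num: int) -> int:
    """Map a sample id to a partition index in [0, num).

    Assumption: a process-stable hash (SHA-256 here) is used rather than the
    builtin hash(), which is salted per process for strings.
    """
    if num <= 0:
        raise ValueError("num must be positive")
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num


def select(pool, partition):
    """Hypothetical consumer-side filter: keep the ids owned by (index, num)."""
    index, num = partition
    return [sid for sid in pool if dp_partition(sid, num) == index]


# The same committed pool, viewed at two different DP widths.
pool = [f"sample-{i}" for i in range(100)]
shards_by_width = {
    width: [select(pool, (i, width)) for i in range(width)]
    for width in (2, 4)
}
```

Because ownership is a pure function of `sample_id` and `num`, changing the width only changes which consumer claims a sample; the union of the shards is still the whole pool, with no overlap.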

Gate tests

  • tests/test_runtime/test_disaggregated.py: cross-process, B5 use-after-free, B9 auth.
  • tests/test_runtime/test_resharding.py: consumer re-partitions a stable pool.
  • tests/test_runtime/test_equiv_4rank.py: GPU ×4 + flash-attn v2.

Known constraints (documented in specforge/runtime/README.md)

  • The real RDMA MooncakeFeatureStore, cross-node deployment, and the head-to-head accept-length benchmark vs TorchSpec need infra/baselines not present here; the seams are built and contract-tested with the local disaggregated backend.
  • test_equiv_4rank.py needs a free 4-GPU node and flash-attn v2; the cached sglang:dev image ships flash-attn v4 (incompatible API), so it skips there and runs on a training image with v2.
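A skip condition like the flash-attn one above could be expressed roughly as follows. This is a sketch under assumptions, not the check used in test_equiv_4rank.py: the helper names are hypothetical, and the distribution name `flash-attn` is an assumption that may not match how a given image packages v4.

```python
from importlib import metadata
from typing import Optional

REQUIRED_MAJOR = 2  # the PR states the test needs flash-attn v2


def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string."""
    return int(version.split(".")[0])


def skip_reason(installed: Optional[str]) -> Optional[str]:
    """Return a human-readable skip reason, or None if the test may run."""
    if installed is None:
        return "flash-attn is not installed"
    if major_version(installed) != REQUIRED_MAJOR:
        return f"needs flash-attn v{REQUIRED_MAJOR}, found {installed}"
    return None


def installed_version(dist_name: str = "flash-attn") -> Optional[str]:
    """Look up an installed distribution's version; None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

In a pytest module, the result of `skip_reason(installed_version())` could feed `pytest.mark.skipif(reason is not None, reason=...)`; a GPU-count check would sit alongside it and is omitted here.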

Stack

Base: runtime-m5-recovery (PR 3). Note: the consumer-side stale-release/atomic-publish hardening of SharedDirFeatureStore and the usp-backend fix for the 4-rank test land in PR 6 (runtime-hardening).

🤖 Generated with Claude Code

maocheng23 and others added 3 commits June 27, 2026 11:37
…/B9)

Framework-first M6: build the seams a real Mooncake/RDMA backend slots behind,
contract-tested on CPU now (no multi-node infra required).

- SharedDirFeatureStore: a disaggregated FeatureStore over a shared directory.
  Producer and consumer are separate processes sharing only the dir; get()
  resolves from the ref + filesystem alone (true cross-process boundary), control
  plane still moves only SampleRef metadata. A real MooncakeFeatureStore swaps
  the shared-dir transport for RDMA behind this same API.
- B5 (no use-after-free): get() after release/abort raises; a generation guard
  rejects a stale ref after re-put; clone-on-fetch is the default.
- B9 (auth in disaggregated mode): AuthPolicy shared-secret gate at attach time
  and on the data path; missing/mismatched token is a PermissionError.
- Resharding contract: SampleRefQueue.get(partition=(index, num)) re-partitions a
  stable committed pool by a consumer-side hash of sample_id, so the same pool
  redistributes when DP width changes — no sample leased twice or dropped.
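The B5 and B9 guarantees listed above can be sketched with a toy store. Everything here is hypothetical: `ToyRef` and `ToySharedDirStore` are illustration-only stand-ins, not the SampleRef or SharedDirFeatureStore API, and raising KeyError for a stale or released ref is an assumption (the PR only says get() raises). Only the PermissionError on a bad token is taken from the description.

```python
import hmac
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyRef:
    """Metadata-only handle (hypothetical stand-in for SampleRef)."""
    sample_id: str
    generation: int


class ToySharedDirStore:
    """Illustrative store over a shared directory; not the real implementation."""

    def __init__(self, root: str, secret: str, token: Optional[str]):
        self._root = root
        self._secret = secret.encode("utf-8")
        self._check(token)  # B9: gate at attach time

    def _check(self, token: Optional[str]) -> None:
        ok = token is not None and hmac.compare_digest(
            self._secret, token.encode("utf-8")
        )
        if not ok:
            raise PermissionError("missing or mismatched token")

    def _path(self, sample_id: str) -> str:
        return os.path.join(self._root, f"{sample_id}.json")

    def _read(self, sample_id: str):
        try:
            with open(self._path(sample_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def _write(self, sample_id: str, record: dict) -> None:
        with open(self._path(sample_id), "w") as f:
            json.dump(record, f)

    def put(self, sample_id: str, payload, token: Optional[str]) -> ToyRef:
        self._check(token)  # B9: gate on the data path
        old = self._read(sample_id)
        generation = 0 if old is None else old["generation"] + 1
        self._write(sample_id, {"generation": generation, "live": True,
                                "payload": payload})
        return ToyRef(sample_id, generation)

    def get(self, ref: ToyRef, token: Optional[str]):
        self._check(token)
        record = self._read(ref.sample_id)
        if record is None or not record["live"]:
            raise KeyError(f"{ref.sample_id} was released")  # B5
        if record["generation"] != ref.generation:
            raise KeyError(f"stale ref for {ref.sample_id}")  # B5
        # Each call deserializes from disk, so the caller gets a fresh copy.
        return record["payload"]

    def release(self, ref: ToyRef, token: Optional[str]) -> None:
        self._check(token)
        record = self._read(ref.sample_id)
        if record is not None and record["generation"] == ref.generation:
            # Keep a tombstone so the generation counter never rewinds.
            self._write(ref.sample_id, {"generation": ref.generation,
                                        "live": False, "payload": None})


# Two store objects attached to one directory stand in for two processes.
root = tempfile.mkdtemp()
producer = ToySharedDirStore(root, secret="s3cret", token="s3cret")
consumer = ToySharedDirStore(root, secret="s3cret", token="s3cret")
ref = producer.put("sample-0", [1.0, 2.0], token="s3cret")
first = consumer.get(ref, token="s3cret")
```

The tombstone on release is a design choice for the sketch: deleting the file outright would let a later re-put restart at generation 0 and accept an old ref. The sketch also writes files non-atomically; it does not attempt the atomic-publish hardening mentioned elsewhere in this PR.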

Numerical resharding equivalence (tp>1 & sp>1, >=4 ranks) is the GPU gate, added
next. Real RDMA Mooncake backend + cross-node deploy need infra not available
here; the seam + contract are locked down.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Spawns a 4-process tp2 x sp2 group; on each rank runs one offline EAGLE3 step
through both the legacy path and the new TrainerCore/strategy/FSDP-backend path
on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm
reduction parity. This is the falsifiable scale-out gate (not FSDP-only).
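The shape of that per-rank check can be sketched in plain Python. This is not the code in test_equiv_4rank.py: the helper names, the dict layout, and the tolerance are hypothetical, and reading "grad-norm reduction parity" as "the global norm obtained by summing squared shard norms across ranks matches between the two paths" is an assumption.

```python
import math


def reduced_grad_norm(per_rank_shards):
    """Global L2 norm from per-rank gradient shards.

    The inner sum stands in for an all-reduce(SUM) over squared local norms.
    """
    sq_sum = sum(g * g for shard in per_rank_shards for g in shard)
    return math.sqrt(sq_sum)


def mismatched_ranks(legacy, candidate, rel_tol=1e-5):
    """Return the ranks whose loss or grad norm differs between the two paths."""
    bad = []
    for rank, (old, new) in enumerate(zip(legacy, candidate)):
        same_loss = math.isclose(old["loss"], new["loss"], rel_tol=rel_tol)
        same_norm = math.isclose(old["grad_norm"], new["grad_norm"],
                                 rel_tol=rel_tol)
        if not (same_loss and same_norm):
            bad.append(rank)
    return bad


# Four ranks (tp2 x sp2), each holding one shard of a toy gradient.
shards = [[3.0], [4.0], [0.0], [0.0]]
norm = reduced_grad_norm(shards)
legacy = [{"loss": 2.5, "grad_norm": norm} for _ in range(4)]
candidate = [{"loss": 2.5 * (1 + 1e-7), "grad_norm": norm} for _ in range(4)]
```

Reporting the list of failing ranks, rather than a single boolean, makes a divergence that affects only one TP or SP group easier to localize.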

Adds _fixtures.init_rank_distributed for multi-process TP x SP group setup.
Runs on the 4xH200 pod via rcli.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>