runtime-m6 (4/6): disaggregation + resharding + auth (B5/B9) + 4-rank gate by maocheng23 · Pull Request #12 · maocheng23/SpecForge

maocheng23 · 2026-06-19T16:51:43Z

M6 · Disaggregation seam + resharding contract + security (B5/B9) + ≥4-rank gate

PR 4 of the M5–M7 stack. This is a framework-first M6: it builds the seams that a real Mooncake/RDMA backend slots behind and contract-tests them on CPU now (no multi-node infra required), and it adds the falsifiable scale-out gate.

What's in this PR

- SharedDirFeatureStore (data_plane/disaggregated.py): a disaggregated FeatureStore over a shared directory. Producer (rollout) and consumer (trainer) run as separate processes that share only the directory; get() resolves a sample from the SampleRef + filesystem alone (a true cross-process boundary), and the control plane still moves only metadata. A real MooncakeFeatureStore swaps the shared-dir transport for RDMA behind this same API.
  - B5 (no use-after-free): get() after release/abort raises; a generation guard rejects a stale ref after a re-put; clone-on-fetch is the default.
  - B9 (auth in disaggregated mode): AuthPolicy shared-secret gate at attach time and on the data path; a missing/mismatched token is a PermissionError.
- Resharding contract: SampleRefQueue.get(partition=(index, num)) + dp_partition(sample_id, num). Partitioning is a consumer-side decision over a stable key, so the same committed pool re-distributes cleanly when the DP width changes: no ref is leased twice or dropped across a reshard.
- ≥4-rank tp>1 & sp>1 equivalence test (test_equiv_4rank.py): spawns a 4-process tp2×sp2 group and, on each rank, runs one offline EAGLE3 step through both the legacy path and the new TrainerCore/strategy/FSDP-backend path on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm parity. This is the "scale-out claim met, not FSDP-only" gate.
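The store contract in the first item can be illustrated with a toy. This is a minimal sketch, not the code in data_plane/disaggregated.py: `ToySharedDirStore`, `Ref`, the JSON file layout, and every method signature are hypothetical, and the token check here runs only at attach time, whereas the PR describes a check on the data path as well.

```python
import hmac
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for SampleRef: metadata only, no payload."""
    sample_id: str
    generation: int


class ToySharedDirStore:
    """Illustrative only; not the real SharedDirFeatureStore."""

    def __init__(self, root: str, token: str, expected_token: str):
        # Attach-time gate (B9): a missing or mismatched token is rejected.
        ok = bool(token) and hmac.compare_digest(
            token.encode("utf-8"), expected_token.encode("utf-8")
        )
        if not ok:
            raise PermissionError("missing or mismatched token")
        self.root = root

    def _path(self, sample_id: str) -> str:
        return os.path.join(self.root, f"{sample_id}.json")

    def put(self, sample_id: str, features: list) -> Ref:
        path = self._path(sample_id)
        generation = 0
        if os.path.exists(path):
            with open(path) as f:
                generation = json.load(f)["generation"] + 1
        # Write to a temp file, then rename, so a reader never sees a partial file.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"generation": generation, "features": features}, f)
        os.replace(tmp, path)
        return Ref(sample_id, generation)

    def get(self, ref: Ref) -> list:
        # Resolves from the ref plus the filesystem alone; no in-process state.
        path = self._path(ref.sample_id)
        if not os.path.exists(path):
            raise KeyError(f"{ref.sample_id} was released")  # B5: no use-after-free
        with open(path) as f:
            record = json.load(f)
        if record["generation"] != ref.generation:
            raise KeyError(f"stale ref for {ref.sample_id}")  # generation guard
        # Each fetch deserializes a fresh object, so callers never share a buffer.
        return record["features"]

    def release(self, ref: Ref) -> None:
        os.remove(self._path(ref.sample_id))


# Producer and consumer share only the directory (same process here for brevity).
root = tempfile.mkdtemp()
producer = ToySharedDirStore(root, token="s3cret", expected_token="s3cret")
consumer = ToySharedDirStore(root, token="s3cret", expected_token="s3cret")
ref = producer.put("sample-0", [1.0, 2.0])
fetched = consumer.get(ref)
```

In this toy, a ref obtained before a second `put` of the same `sample_id` fails the generation check, and any `get` after `release` fails because the file is gone.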

Gate tests

- tests/test_runtime/test_disaggregated.py: cross-process, B5 use-after-free, B9 auth
- tests/test_runtime/test_resharding.py: consumer re-partitions a stable pool
- tests/test_runtime/test_equiv_4rank.py: GPU ×4 + flash-attn v2

Known constraints (documented in `specforge/runtime/README.md`)

- The real RDMA MooncakeFeatureStore, cross-node deployment, and the head-to-head accept-length benchmark vs TorchSpec need infra/baselines not present here; the seams are built and contract-tested with the local disaggregated backend.
- test_equiv_4rank.py needs a free 4-GPU node and flash-attn v2; the cached sglang:dev image ships flash-attn v4 (incompatible API), so it skips there and runs on a training image with v2.
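The skip behaviour described for test_equiv_4rank.py could be expressed as a small guard. This is a sketch under assumptions: the helper names `flash_attn_major` and `skip_reason` are hypothetical, and the actual test may detect its prerequisites differently.

```python
from importlib import metadata
from typing import Optional


def flash_attn_major() -> Optional[int]:
    """Return the installed flash-attn major version, or None if not installed."""
    try:
        return int(metadata.version("flash-attn").split(".")[0])
    except metadata.PackageNotFoundError:
        return None


def skip_reason(gpu_count: int, fa_major: Optional[int]) -> Optional[str]:
    """Return why the 4-rank test should skip, or None if it can run."""
    if gpu_count < 4:
        return f"needs 4 GPUs, found {gpu_count}"
    if fa_major != 2:
        return f"needs flash-attn v2, found {fa_major}"
    return None
```

In a pytest module the result would typically feed `pytest.mark.skipif`, with `torch.cuda.device_count()` supplying the GPU count.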

Stack

Base: runtime-m5-recovery (PR 3). Note: the consumer-side stale-release/atomic-publish hardening of SharedDirFeatureStore and the usp-backend fix for the 4-rank test land in PR 6 (runtime-hardening).

🤖 Generated with Claude Code

…/B9)

Framework-first M6: build the seams a real Mooncake/RDMA backend slots behind, contract-tested on CPU now (no multi-node infra required).

- SharedDirFeatureStore: a disaggregated FeatureStore over a shared directory. Producer and consumer are separate processes sharing only the dir; get() resolves from the ref + filesystem alone (true cross-process boundary), control plane still moves only SampleRef metadata. A real MooncakeFeatureStore swaps the shared-dir transport for RDMA behind this same API.
- B5 (no use-after-free): get() after release/abort raises; a generation guard rejects a stale ref after re-put; clone-on-fetch is the default.
- B9 (auth in disaggregated mode): AuthPolicy shared-secret gate at attach time and on the data path; missing/mismatched token is a PermissionError.
- Resharding contract: SampleRefQueue.get(partition=(index, num)) re-partitions a stable committed pool by a consumer-side hash of sample_id, so the same pool redistributes when DP width changes, with no sample leased twice or dropped.

Numerical resharding equivalence (tp>1 & sp>1, >=4 ranks) is the GPU gate, added next. Real RDMA Mooncake backend + cross-node deploy need infra not available here; the seam + contract are locked down.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
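The resharding contract in this commit message relies on a hash of sample_id that is identical in every process. A minimal sketch, assuming a SHA-256 based scheme: `dp_partition_sketch` and `select` are hypothetical names, and the hash used by the real dp_partition is not stated in this PR. Python's built-in `hash()` on strings is salted per process by default, so it would not give a stable key across consumers.

```python
import hashlib
from typing import List


def dp_partition_sketch(sample_id: str, num: int) -> int:
    """Map a sample to a DP rank via a process-independent hash (assumed scheme)."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num


def select(pool: List[str], index: int, num: int) -> List[str]:
    """Consumer-side filter: keep only the samples owned by partition `index`."""
    return [s for s in pool if dp_partition_sketch(s, num) == index]


# The same committed pool is re-partitioned for DP width 2 and then 4.
pool = [f"sample-{i}" for i in range(100)]
for num in (2, 4):
    shards = [select(pool, i, num) for i in range(num)]
    flat = sorted(s for shard in shards for s in shard)
    # Every sample lands in exactly one shard: none dropped, none duplicated.
    assert flat == sorted(pool)
```

Because the partition is a pure function of `sample_id` and `num`, a consumer can change `num` without the producer rewriting or re-tagging anything in the pool.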

Spawns a 4-process tp2 x sp2 group; on each rank runs one offline EAGLE3 step through both the legacy path and the new TrainerCore/strategy/FSDP-backend path on identical USP-sharded data, asserting per-rank loss equivalence + grad-norm reduction parity. This is the falsifiable scale-out gate (not FSDP-only).

Adds _fixtures.init_rank_distributed for multi-process TP x SP group setup. Runs on the 4xH200 pod via rcli.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
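The per-rank assertion described in this commit message might look roughly like the following. This is a sketch with hypothetical names (`StepResult`, `check_rank`) and an assumed relative tolerance; the tolerances and the exact quantities compared in test_equiv_4rank.py are not given here.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    """Scalars recorded on one rank after one training step."""
    loss: float
    grad_norm: float


def check_rank(rank: int, legacy: StepResult, new: StepResult,
               rel_tol: float = 1e-4) -> List[str]:
    """Return a list of mismatch messages; an empty list means the paths agree."""
    mismatches = []
    if not math.isclose(legacy.loss, new.loss, rel_tol=rel_tol):
        mismatches.append(f"rank {rank}: loss {legacy.loss} vs {new.loss}")
    if not math.isclose(legacy.grad_norm, new.grad_norm, rel_tol=rel_tol):
        mismatches.append(
            f"rank {rank}: grad norm {legacy.grad_norm} vs {new.grad_norm}"
        )
    return mismatches
```

Each of the four ranks would run this on its own pair of results, so a divergence on a single rank fails the gate even if the other ranks agree.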

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 force-pushed the runtime-m5-recovery branch from dbb4557 to c9f29c4 Compare June 22, 2026 03:47

maocheng23 force-pushed the runtime-m6-disagg branch from cb4449c to ea6eff8 Compare June 22, 2026 03:47

maocheng23 mentioned this pull request Jun 26, 2026

feat(runtime): disaggregated offline EAGLE3 assemble example + 2-node 7B e2e #16

Closed

maocheng23 force-pushed the runtime-m5-recovery branch from c9f29c4 to b82f5be Compare June 27, 2026 03:55

maocheng23 force-pushed the runtime-m6-disagg branch from ea6eff8 to c129b0f Compare June 27, 2026 03:55

maocheng23 and others added 3 commits June 27, 2026 11:37

style: apply pre-commit (black/isort) to M6 disagg files

ce67e57

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 force-pushed the runtime-m6-disagg branch from c129b0f to ce67e57 Compare June 27, 2026 18:37

maocheng23 closed this Jun 29, 2026

maocheng23 wants to merge 3 commits into runtime-m5-recovery from runtime-m6-disagg
