Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
f776a4e
runtime-m5(3/4): durable recovery — SQLiteMetadataStore + restart rec…
maocheng23 Jun 19, 2026
b82f5be
runtime-m5(4/4): fault-injection suite + sustained-lag soak (exit gates)
maocheng23 Jun 19, 2026
0148be5
style: apply pre-commit (black) to M5 durable-recovery files
maocheng23 Jun 27, 2026
549bacd
runtime-m6(1/2): disaggregation seam + resharding contract + auth (B5…
maocheng23 Jun 19, 2026
b3d3997
runtime-m6(2/2): >=4-rank tp>1 & sp>1 equivalence test (GPU gate)
maocheng23 Jun 19, 2026
ce67e57
style: apply pre-commit (black/isort) to M6 disagg files
maocheng23 Jun 27, 2026
df9cdc3
runtime: readability — split SampleRefQueue lease into _lease_any/_le…
maocheng23 Jun 29, 2026
06cb678
runtime(m6): per-generation-file SharedDirFeatureStore + gen-aware re…
maocheng23 Jun 29, 2026
39bc5ee
feat(runtime): disaggregated offline EAGLE3 assemble example
maocheng23 Jun 26, 2026
a5388d4
fix(disagg example): call sanity_check so builders see target_batch_size
maocheng23 Jun 26, 2026
430aba8
feat(disagg example): expose log_interval on the launchers
maocheng23 Jun 26, 2026
5930198
fix(disagg): retain_on_release mode for re-iterable offline feature sets
maocheng23 Jun 26, 2026
f5dd82c
feat(disagg example): add colocated role for head-to-head comparison
maocheng23 Jun 26, 2026
31431db
docs(disagg example): record head-to-head accept-length vs colocated
maocheng23 Jun 26, 2026
31b823f
style: apply pre-commit (black/isort/autoflake) to disagg example files
maocheng23 Jun 26, 2026
7008a2e
feat(runtime/m6): MooncakeFeatureStore — RDMA fast-path backend for d…
maocheng23 Jun 27, 2026
adcda7f
feat(disagg example): env-selectable Mooncake backend (DISAGG_BACKEND…
maocheng23 Jun 27, 2026
1a4d838
feat(disagg example): size Mooncake segment via MOONCAKE_GLOBAL_SEGME…
maocheng23 Jun 27, 2026
3852c3b
fix(mooncake): immediate logical free (tombstone) for B5 under lease-…
maocheng23 Jun 27, 2026
3a2cc7f
runtime: M5/M6 hardening — LocalFeatureStore gc/release, SQLite durab…
maocheng23 Jun 28, 2026
1a3278b
style: apply pre-commit (black/isort) to MooncakeFeatureStore files
maocheng23 Jun 27, 2026
e8363b1
Merge pull request #609 from sgl-project/dataflow-up-12-m6-disagg
jiapingW Jun 29, 2026
2c8c66c
Merge pull request #610 from sgl-project/dataflow-up-13-disagg-example
jiapingW Jun 29, 2026
3af3ec8
Merge pull request #611 from sgl-project/dataflow-up-14-hardening
jiapingW Jun 29, 2026
65e9323
Merge pull request #612 from sgl-project/dataflow-up-15-mooncake
jiapingW Jun 29, 2026
08b6186
docs(plan): fix draft-model naming, dedupe drafts layout, rm scratch …
maocheng23 Jun 29, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
243 changes: 243 additions & 0 deletions docs/concepts/dataflow_refactor_workflow.drawio

Large diffs are not rendered by default.

722 changes: 722 additions & 0 deletions docs/concepts/dataflow_refactor_workflow.md

Large diffs are not rendered by default.

101 changes: 101 additions & 0 deletions examples/disagg/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# Disaggregated offline EAGLE3 example

Runs the offline EAGLE3 training of `scripts/train_eagle3_dataflow.py`, but splits
it across **two pools that share only a filesystem mount** — the M6 disaggregation
seam (`SharedDirFeatureStore`). It is the runnable proof that *disaggregation
changes where features live, not their values*: the training curve matches the
colocated offline run.

## How it works

```
producer pool (node 0) shared mount training pool (node 1)
───────────────────── ────────────── ──────────────────────
ingest_offline_features() ──put()──▶ SharedDirFeatureStore ──get()──▶ FeatureDataLoader
write_ref_manifest() ──json──▶ refs.json (no tensors) ──read──▶ build_disagg_eagle3_runtime
TrainerController.fit()
```

The control plane carries only tensor-free `SampleRef` metadata (the manifest);
feature tensors travel through the shared store. `build_disagg_eagle3_runtime`
reuses the exact offline trainer assembly, so results align by construction.

## Backends

The feature transport is selected by `DISAGG_BACKEND` (default `shared_dir`):

| backend | store | shared *data* mount? |
|---|---|---|
| `shared_dir` (default) | `SharedDirFeatureStore` (`torch.save` on a POSIX mount) | required |
| `mooncake` | `MooncakeFeatureStore` (RDMA/TCP network object store) | not needed |

`mooncake` is the M6 **fast path**: producer `put()`s and consumer `get()`s by key
across nodes peer-to-peer, so feature tensors need no shared *data* mount (only the
small ref manifest still uses `DISAGG_MANIFEST`). Each object is hard-pinned so
Mooncake's cache LRU never drops a committed feature. Because a Mooncake object
lives in the **producer's** memory segment, the producer must stay alive until the
consumer finishes — the example holds it open until the consumer writes
`<manifest>.consumed` (or `DISAGG_PRODUCER_HOLD_S` elapses). Enable with:

```bash
export DISAGG_BACKEND=mooncake
export MOONCAKE_LOCAL_HOSTNAME=<this-node-ip>
export MOONCAKE_METADATA_SERVER=<metadata url>
export MOONCAKE_MASTER_SERVER_ADDR=<master host:port>
export MOONCAKE_PROTOCOL=tcp # or rdma
```

Requires the `mooncake` package and a running Mooncake master/metadata service
(verify on a Mooncake-enabled GPU host). The contract itself is unit-tested
backend-agnostically in `tests/test_runtime/test_mooncake_store.py`.

## Run it (rcli, 2 nodes)

1. Generate offline features on node 0 (any EAGLE3 feature generator), e.g. into
`/root/disagg/features` as `*.ckpt` with keys
`input_ids,loss_mask,hidden_state,aux_hidden_state`.
2. Drive both pools at once — node 0 ingests, node 1 trains:

```bash
rcli exec --per-node <job> 'bash examples/disagg/run_qwen2.5_7b_eagle3_disagg.sh'
```

The wrapper branches on `RCLI_NODE_RANK`. Override paths/steps via env
(`DISAGG_STORE_ROOT`, `FEATURES_DIR`, `MAX_STEPS`, `NPROC`, …). Both pools must
share `DISAGG_STORE_ROOT`/`DISAGG_STORE_ID` and (if set) `DISAGG_AUTH_TOKEN`
(B9 auth).

## Single-host smoke

`DISAGG_ROLE` overrides the rank-derived role, so you can run both halves on one
host sharing a local dir — run the producer once, then the consumer:

```bash
DISAGG_ROLE=producer python examples/disagg/run_disagg_eagle3.py <args>
DISAGG_ROLE=consumer torchrun --standalone --nproc_per_node 1 \
examples/disagg/run_disagg_eagle3.py <args>
```

The bit-exact equivalence to the colocated path is covered by
`tests/test_runtime/test_disagg_launch.py`.

## Head-to-head vs colocated (Qwen2.5-7B, 2-node H200)

`DISAGG_ROLE=colocated` runs the same model build + assembly through
`build_offline_eagle3_runtime` (`LocalFeatureStore`). On identical features/seed,
the disaggregated consumer and the colocated baseline produce the same training
metrics to ~5 significant figures (residual ~1e-6–1e-8 is GPU run-to-run
floating-point noise, not the transport — feature tensors are byte-identical):

| step | metric | disagg | colocated |
|---|---|---|---|
| 20 | acceptance_rate | 0.0013300 | 0.0013300 |
| 20 | ploss | 5.386736 | 5.386740 |
| 20 | acc | 0.0272590 | 0.0272590 |
| 120 | acceptance_rate | 0.0223610 | 0.0223505 |
| 180 | acceptance_rate | 0.0337013 | 0.0336982 |

acc / acceptance_rate climb over training in both (baseline direction). Per-step
values are noisy at `batch_size=1` over 64 diverse samples. Note this is the
training-time acceptance proxy; the serving accept-length (τ via spec-decoding) is
a separate eval gate.
Loading
Loading