[DataFlow runtime · online] Online disaggregated training (StreamingRefChannel + build_disagg_online_*) by maocheng23 · Pull Request #622 · sgl-project/SpecForge

maocheng23 · 2026-06-29T20:22:05Z

Online disaggregated training on top of the Mooncake zero-copy store (#621). Adds StreamingRefChannel/StreamingRefQueue (a cross-process, append-only, tensor-free SampleRef stream with backpressure and EOF) and build_disagg_online_producer/build_disagg_online_consumer in launch.py. Cross-pool consume-once free works via the shared Mooncake remove(); from SampleRef down, the trainer path is identical to colocated online. No hot-switch.
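For readers new to the consume-once contract, here is a minimal sketch of the "get -> release -> remove" idea. It uses an in-memory dict as a stand-in for the shared store; `ConsumeOnceStore` and `read_and_free` are hypothetical names for illustration and are not the Mooncake or SpecForge API.

```python
class ConsumeOnceStore:
    """In-memory stand-in for a shared feature store (NOT the Mooncake API)."""

    def __init__(self):
        self._blobs = {}

    def __len__(self):
        return len(self._blobs)

    def put(self, key, value):
        # Refuse to overwrite: each sample is written exactly once.
        if key in self._blobs:
            raise KeyError(f"duplicate key: {key}")
        self._blobs[key] = value
        # The returned ref is plain data (no tensors), so it can travel
        # over a separate control channel.
        return {"key": key}

    def get(self, ref):
        # Raises KeyError if the sample was already freed.
        return self._blobs[ref["key"]]

    def remove(self, ref):
        # Freeing is idempotent in this sketch.
        self._blobs.pop(ref["key"], None)


def read_and_free(store, ref):
    """Read a sample and free it immediately (consume-once)."""
    value = store.get(ref)
    store.remove(ref)
    return value


store = ConsumeOnceStore()
ref = store.put("sample-0", [0.1, 0.2, 0.3])
features = read_and_free(store, ref)  # the store is empty afterwards
```

In this sketch a second `read_and_free` on the same ref raises `KeyError`, which is the property that lets the consumer pool free memory the producer pool allocated.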

Validated with a 2-node, real-Mooncake online e2e run over RDMA (the rollout pool streams refs → the trainer pool trains FSDP steps cross-node, consume-once). Rebased onto current #621.

Stacked: #608 → #621 → this → up-18 staleness. Supersedes the fork-only PR maocheng23#18.

🤖 Generated with Claude Code

…ocess online ref stream

The offline disagg path hands the consumer a STATIC ref manifest written once. Online disaggregation needs a continuous stream: the rollout producer commits SampleRefs while the trainer consumes them, on another node. StreamingRefChannel is that control-plane channel:

* tensor-free append-only JSONL (asserts no-tensor on publish); feature tensors go through the FeatureStore (Mooncake), so no shared *data* mount is needed.
* poll() tail-reads complete lines from the last offset, buffering a partial trailing line so a half-written record is never parsed.
* mark_consumed()/consumed_remote() give the producer a cross-process backpressure signal (in_flight_remote) with no shared in-process state.
* close() drops an EOF sentinel so stream() terminates once drained; idle_timeout_s guards against a dead producer.

Filesystem-backed (any shared control mount); a networked control plane slots in behind the same publish/poll API later.

7 CPU tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
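The tail-read behaviour in this commit message can be illustrated with a small sketch. It is not the PR's StreamingRefChannel: `TinyRefChannel`, the `__eof__` sentinel format, and the file name are assumptions made for illustration. It shows an append-only JSONL file, a poll() that buffers an incomplete trailing line, and an EOF record written by close().

```python
import json
import os
import tempfile

EOF_MARK = {"__eof__": True}  # hypothetical sentinel; the real format is not shown


class TinyRefChannel:
    """Toy file-backed, append-only JSONL ref stream (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.offset = 0      # byte offset of the next unread byte
        self.partial = b""   # bytes of an incomplete trailing line
        self.closed = False  # set once the EOF sentinel has been read

    def publish(self, ref):
        # json.dumps raises TypeError for tensors or any other object that is
        # not plain JSON data, so nothing is written for a bad ref.
        line = json.dumps(ref) + "\n"
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(line)

    def close(self):
        self.publish(EOF_MARK)

    def poll(self):
        """Return the complete records appended since the last call."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read()
        self.offset += len(chunk)
        # Everything before the last newline is complete; the remainder is
        # kept until the writer finishes the line.
        *complete, self.partial = (self.partial + chunk).split(b"\n")
        refs = []
        for raw in complete:
            if not raw:
                continue
            record = json.loads(raw)
            if record == EOF_MARK:
                self.closed = True
                break
            refs.append(record)
        return refs


path = os.path.join(tempfile.mkdtemp(), "refs.jsonl")
producer, consumer = TinyRefChannel(path), TinyRefChannel(path)

producer.publish({"key": "s0"})
with open(path, "a", encoding="utf-8") as f:
    f.write('{"key": "s1"')   # simulate a half-written record
first = consumer.poll()       # only the complete record is returned

with open(path, "a", encoding="utf-8") as f:
    f.write("}\n")            # the writer finishes the line
second = consumer.poll()

producer.close()
third = consumer.poll()       # no refs; consumer.closed is now True
```

The two `TinyRefChannel` objects share nothing but the file path, which mirrors the "no shared in-process state" property; a real deployment would also need the consumed-counter side channel and an idle timeout, both omitted here.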

…consumer} + StreamingRefQueue

Wires online disaggregated training: a rollout producer pool streams features to a trainer pool on a different node, tensors over Mooncake, refs over the StreamingRefChannel.

* StreamingRefQueue: adapts the channel to the SampleRefQueue protocol (get/ack/fail) the FeatureDataLoader consumes. get() blocks until refs are available or the channel is closed-and-drained; ack() advances the channel's consumed counter (the producer's backpressure signal).
* build_disagg_online_producer: RolloutWorker(s) (HF/SGLang target via SGLangAdapter) put() consume-once features into a Mooncake store and publish refs to the channel. drive_producer() runs until the prompt pool drains, pausing while in_flight_remote() exceeds a high-watermark so a lagging trainer can't overrun the segment, then closes the channel.
* build_disagg_online_consumer: the online trainer assembly (target_head=None) reading refs from a StreamingRefQueue + tensors from a consume-once Mooncake store. The loader frees each sample on read (get -> release -> remote remove).

The cross-pool consume-once free works through the shared Mooncake remove (proven in the mooncake cross-process tests); from SampleRef down the trainer path is identical to colocated online.

Tests: test_disagg_online (CPU integration -- stream -> loader -> consume-once free, backpressure, blocks-until-close) + test_disagg_online_launch (GPU -- producer streams, consumer trains through FSDP end to end). StreamingRefQueue covered via the integration test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
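The interaction between the blocking get(), ack(), and the high-watermark pause can be sketched with threads standing in for the two pools. Everything here is an assumption for illustration: `InMemoryChannel`, `RefQueue`, this `drive_producer` signature, and the choice to pause at `>= high_watermark` are not taken from launch.py.

```python
import threading
import time
from collections import deque


class InMemoryChannel:
    """Thread-safe stand-in for the cross-process channel (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = deque()
        self.published = 0
        self.consumed = 0
        self.closed = False

    def publish(self, ref):
        with self._lock:
            self._refs.append(ref)
            self.published += 1

    def poll(self):
        with self._lock:
            out = list(self._refs)
            self._refs.clear()
            return out

    def mark_consumed(self, n=1):
        with self._lock:
            self.consumed += n

    def in_flight_remote(self):
        with self._lock:
            return self.published - self.consumed

    def close(self):
        with self._lock:
            self.closed = True

    def drained(self):
        with self._lock:
            return self.closed and not self._refs


class RefQueue:
    """Adapts a channel to a get/ack interface (sketch of the queue idea)."""

    def __init__(self, channel, poll_interval_s=0.001):
        self.channel = channel
        self.poll_interval_s = poll_interval_s
        self._buffer = deque()

    def get(self):
        # Block until a ref is available; return None once closed and drained.
        while True:
            if self._buffer:
                return self._buffer.popleft()
            self._buffer.extend(self.channel.poll())
            if self._buffer:
                continue
            if self.channel.drained():
                return None
            time.sleep(self.poll_interval_s)

    def ack(self, ref):
        # Advancing the consumed counter is what releases the producer.
        self.channel.mark_consumed(1)


def drive_producer(channel, prompts, high_watermark, poll_interval_s=0.001):
    """Publish one ref per prompt, pausing while too many are un-acked."""
    peak = 0
    for i, prompt in enumerate(prompts):
        while channel.in_flight_remote() >= high_watermark:
            time.sleep(poll_interval_s)
        channel.publish({"key": f"sample-{i}", "prompt": prompt})
        peak = max(peak, channel.in_flight_remote())
    channel.close()
    return peak


channel = InMemoryChannel()
queue = RefQueue(channel)
result = {}
producer = threading.Thread(
    target=lambda: result.update(peak=drive_producer(channel, range(20), 4))
)
producer.start()

seen = []
while (ref := queue.get()) is not None:
    seen.append(ref["key"])
    queue.ack(ref)  # in a real trainer this would follow the consume-once read
producer.join()
```

Because the producer only publishes when fewer than `high_watermark` refs are un-acked, the number of samples resident in the store is bounded regardless of how slowly the consumer runs. A fail() path is left out because its retry policy is not described in the commit message.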

gemini-code-assist · 2026-06-29T20:22:09Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

maocheng23 and others added 2 commits June 29, 2026 13:20

maocheng23 requested a review from FrankLeeeee as a code owner June 29, 2026 20:22

maocheng23 mentioned this pull request Jun 29, 2026

[DataFlow runtime · online] O1.1 — shared cross-process control plane #624

Merged

jiapingW self-requested a review June 30, 2026 13:26

jiapingW approved these changes Jun 30, 2026

jiapingW merged commit 8adec57 into dataflow-up-16-zerocopy Jun 30, 2026
1 check passed

jiapingW deleted the dataflow-up-17-online-disagg branch June 30, 2026 13:29

This was referenced Jul 4, 2026

Merge DataFlow runtime branch into main #648

Open

[DataFlow runtime] Phase D — training managers (no_sync, full resume, checkpoint/eval) #637

Open

jiapingW merged 2 commits into dataflow-up-16-zerocopy from dataflow-up-17-online-disagg