[DataFlow runtime · online M-O1.1] SQLiteSampleRefQueue: durable cross-process queue#19
Closed
maocheng23 wants to merge 1 commit into
Closed
Conversation
…line O1.1) The in-process SampleRefQueue couples producer and consumer to one OS process (threading.Condition over in-memory dicts), so a disaggregated *run* (rollout pool and trainer pool as separate processes) cannot share lease/ack state. Adds SQLiteSampleRefQueue: a drop-in for DataFlowController(sample_queue=...) that keeps the exact put/get/ack/fail/depth/in_flight contract but holds pending/leased state in SQLite, so producer and consumer processes over one DB file see one queue. - Lease race closed by BEGIN IMMEDIATE around read-then-lease (WAL + busy_timeout let a second process wait out the writer lock) -- two consumers never double-lease. - Wall-clock leases (monotonic is not comparable across processes) + timeout reclaim. - Partitioned (DP-resharding) lease is non-blocking while the global pool is non-empty, matching the in-process queue. - Metadata-only (assert_no_tensors on put), durable across reopen. No controller change: commit/lease/ack already route only through sample_queue + MetadataStore, so a shared SQLite queue + SQLiteMetadataStore makes the train-side control plane cross-process as-is (test_cross_process_commit_lease_ack_then_reconcile). First PR of the online disaggregated stack (refactor/dataflow-online-roadmap.md, O1.1). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Owner
Author
|
Superseded: duplicates the existing StreamingRefChannel/StreamingRefQueue on dataflow-up-17-online-disagg. Pivoting to review+harden the existing online-disagg/staleness branches instead. (Roadmap context: sgl-project#618.) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacks on sgl-project#614 (
dataflow-up-16-zerocopy). Merge bottom-up. First implementation PR of the online disaggregated stack planned in sgl-project#618 (refactor/dataflow-online-roadmap.md, milestone O1.1).What
The in-process
SampleRefQueuecouples producer and consumer to one OS process (athreading.Conditionover in-memory dicts). A disaggregated run puts the rollout pool and the trainer pool in separate processes, so pending/leased state must be shared across processes, not just threads.Adds
SQLiteSampleRefQueue— a drop-in forDataFlowController(sample_queue=...)that keeps the exactput/get/ack/fail/depth/in_flightcontract but holds the queue in SQLite, so a producer process and a consumer process over one DB file see one queue.BEGIN IMMEDIATEaround read-then-lease (WAL +busy_timeoutlet a second process wait out the writer lock) — two consumer processes never double-lease.assert_no_tensorsonput); durable across process restart.No controller change needed
commit/lease/ackalready route only throughsample_queue+MetadataStore, so a sharedSQLiteSampleRefQueue+SQLiteMetadataStoremakes the train-side control plane cross-process as-is — seetest_cross_process_commit_lease_ack_then_reconcile(two controllers, separate instances, share commit → lease → ack and reconcile from the shared durable marker).Tests
tests/test_runtime/test_durable_queue.py(11 tests, CPU, green locally via an isolated import of theruntimemodules): cross-instance commit/lease/ack, no-double-lease, partition disjoint/complete + reshard-remainder-once, non-blocking empty shard, fail retry/drop, lease-timeout reclaim, idempotent put, durable-across-reopen, and the cross-controller integration test.Not in this PR (next in the stack)
O1.2 engine-agnostic
FeatureSource+build_disagg_online_eagle3_runtime(async loop); O1.3 live SGLang capture (via the stable publicreturn_hidden_statesAPI — no version coupling); O2.x Ray actors / backpressure / streaming-pool lifetime.🤖 Generated with Claude Code