[DataFlow runtime] Two-axis staleness gate for online disaggregated training by maocheng23 · Pull Request #615 · sgl-project/SpecForge

maocheng23 · 2026-06-29T00:01:51Z

Stacks on the online-disagg PR (dataflow-up-17-online-disagg) → zero-copy (#614) → #612. Merge bottom-up. Third PR of the series. No hot-switch.

Why

Online training keeps producing fresh draft weights while the rollout pool keeps generating samples. Without hot-swapping the rollout (deliberately out of scope), the pool drifts behind and its samples go stale — training on stale features hurts. So the consumer must drop stale refs before they reach the trainer.

What

The staleness half of the weight lifecycle (no hot-update, no rollback, no serving accept-length gate):

contracts.WeightVersion (+ WeightStatus) — metadata-only published draft version; publish order is the staleness axis.
control_plane/version_policy.py
- WeightRegistry — publish-ordered history; draft_lag(sample) = distance from the newest published version (None = unknown → maximally stale). Optional durable metadata_store backing.
- StalenessPolicy — two axes: draft (max_draft_lag) + target (require_target_match) → accept/reasons assessment.
- DriftMonitor — rolling mean/max draft lag; drifting() for the orchestrator (pause / force-sync / alarm).
- StalenessGatedQueue — wraps a ref queue so the loader only ever sees fresh refs; stale ones are acked (backpressure clears) and their features aborted (freed) instead of trained on.
build_disagg_online_consumer gains optional staleness_policy / weight_registry / current_target_version / drift_monitor. When set, the streamed queue is gated so a drifted rollout can't poison training. Off by default (every streamed ref is trained, unchanged).

Tests

11 CPU tests: registry lag ordering + idempotent publish; both policy axes; unknown-version-is-maximally-stale; drift snapshot/threshold; and the gate (fresh passes, draft-stale + target-stale dropped → features aborted + acked).

Not in scope (deliberate)

Hot-update / weight hot-swap, rollback, and the serving accept-length gate. This PR only decides which streamed samples are fresh enough to train on.

🤖 Generated with Claude Code

gemini-code-assist · 2026-06-29T00:01:54Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

…raining Online training keeps producing fresh draft weights while the rollout pool keeps generating samples; without hot-swapping the rollout (out of scope -- no hot-update), the pool drifts behind and its samples go stale. Training on stale features hurts, so the consumer must drop them. This adds the staleness half of the weight lifecycle (no hot-update / rollback / serving accept-length gate): * contracts: WeightVersion (+ WeightStatus) -- metadata-only published draft version; publish order is the staleness axis. * control_plane/version_policy.py: - WeightRegistry: publish-ordered history; draft_lag(sample) = distance from the newest published version (None = unknown -> maximally stale). Optional durable metadata_store backing. - StalenessPolicy: two axes -- draft (max_draft_lag) + target (require_target_match) -- returning an accept/reasons assessment. - DriftMonitor: rolling mean/max draft lag; drifting() for the orchestrator. - StalenessGatedQueue: wraps a ref queue so the loader only sees fresh refs; stale ones are acked (backpressure clears) and their features aborted (freed) instead of trained on. * build_disagg_online_consumer gains optional staleness_policy / weight_registry / current_target_version / drift_monitor: when set, the streamed queue is gated so a drifted rollout can't poison training. Off by default (every ref trained). 11 CPU tests (registry lag, both policy axes, unknown-version, drift, gate drop+abort+passthrough). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 · 2026-06-29T20:22:02Z

Closing — recreated cleanly on the rebased online stack (up-17/up-18 rebased onto the current zero-copy #621). Superseded by the new up-18 staleness PR below.

maocheng23 force-pushed the dataflow-up-17-online-disagg branch from 57da206 to d2c0382 Compare June 29, 2026 03:57

maocheng23 force-pushed the dataflow-up-18-staleness branch from 4732ae7 to 0ebd0b5 Compare June 29, 2026 03:57

maocheng23 force-pushed the dataflow-up-17-online-disagg branch from d2c0382 to 7cd5452 Compare June 29, 2026 20:21

maocheng23 force-pushed the dataflow-up-18-staleness branch from 0ebd0b5 to 670a3c2 Compare June 29, 2026 20:21

maocheng23 closed this Jun 29, 2026

maocheng23 mentioned this pull request Jun 29, 2026

[DataFlow runtime · online] Two-axis staleness gate for online disaggregated training #623

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DataFlow runtime] Two-axis staleness gate for online disaggregated training#615

[DataFlow runtime] Two-axis staleness gate for online disaggregated training#615
maocheng23 wants to merge 1 commit into
dataflow-up-17-online-disaggfrom
dataflow-up-18-staleness

maocheng23 commented Jun 29, 2026

Uh oh!

gemini-code-assist Bot commented Jun 29, 2026

Uh oh!

maocheng23 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

maocheng23 commented Jun 29, 2026

Why

What

Tests

Not in scope (deliberate)

Uh oh!

gemini-code-assist Bot commented Jun 29, 2026

Uh oh!

maocheng23 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant