[DataFlow runtime] Two-axis staleness gate for online disaggregated training #615

Closed
maocheng23 wants to merge 1 commit into dataflow-up-17-online-disagg from dataflow-up-18-staleness

@maocheng23

Collaborator

Stacks on the online-disagg PR (dataflow-up-17-online-disagg) → zero-copy (#614) → #612. Merge bottom-up. Third PR of the series. No hot-switch.

Why

Online training keeps producing fresh draft weights while the rollout pool keeps generating samples. Without hot-swapping the rollout (deliberately out of scope), the pool drifts behind and its samples go stale — training on stale features hurts. So the consumer must drop stale refs before they reach the trainer.

What

The staleness half of the weight lifecycle (no hot-update, no rollback, no serving accept-length gate):

  • contracts.WeightVersion (+ WeightStatus) — metadata-only published draft version; publish order is the staleness axis.
  • control_plane/version_policy.py
    • WeightRegistry — publish-ordered history; draft_lag(sample) = distance from the newest published version (None = unknown → maximally stale). Optional durable metadata_store backing.
    • StalenessPolicy: two axes, draft (max_draft_lag) + target (require_target_match) → accept/reasons assessment.
    • DriftMonitor — rolling mean/max draft lag; drifting() for the orchestrator (pause / force-sync / alarm).
    • StalenessGatedQueue — wraps a ref queue so the loader only ever sees fresh refs; stale ones are acked (backpressure clears) and their features aborted (freed) instead of trained on.
  • build_disagg_online_consumer gains optional staleness_policy / weight_registry / current_target_version / drift_monitor. When set, the streamed queue is gated so a drifted rollout can't poison training. Off by default (every streamed ref is trained, unchanged).
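
For readers who want a concrete picture of the two axes, here is a minimal sketch of how a publish-ordered registry and a two-axis policy could fit together. It is not the code in control_plane/version_policy.py: the `Assessment` type, the `assess()` signature, the string version ids, and the reason strings are all assumptions made for illustration. Only the names WeightRegistry, StalenessPolicy, draft_lag, max_draft_lag, and require_target_match come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


class WeightRegistry:
    """Publish-ordered history of draft weight versions (illustrative only)."""

    def __init__(self) -> None:
        self._order: list[str] = []  # oldest -> newest

    def publish(self, version: str) -> None:
        # Idempotent: re-publishing a known version keeps its original position.
        if version not in self._order:
            self._order.append(version)

    def draft_lag(self, sample_version: Optional[str]) -> Optional[int]:
        # None means "unknown version"; the policy treats that as maximally stale.
        if sample_version is None or sample_version not in self._order:
            return None
        return len(self._order) - 1 - self._order.index(sample_version)


@dataclass(frozen=True)
class Assessment:  # hypothetical result type
    accept: bool
    reasons: tuple[str, ...] = ()


@dataclass(frozen=True)
class StalenessPolicy:
    max_draft_lag: int = 0
    require_target_match: bool = True

    def assess(self, draft_lag: Optional[int], sample_target: str,
               current_target: str) -> Assessment:
        reasons = []
        # Axis 1: how far the sample's draft weights are behind the newest publish.
        if draft_lag is None:
            reasons.append("draft version unknown")
        elif draft_lag > self.max_draft_lag:
            reasons.append(f"draft lag {draft_lag} > {self.max_draft_lag}")
        # Axis 2: the sample must have been produced against the current target.
        if self.require_target_match and sample_target != current_target:
            reasons.append("target version mismatch")
        return Assessment(accept=not reasons, reasons=tuple(reasons))


registry = WeightRegistry()
for v in ("v1", "v2", "v3"):
    registry.publish(v)
registry.publish("v2")  # no-op: v2 keeps its place in the history

policy = StalenessPolicy(max_draft_lag=1, require_target_match=True)
fresh = policy.assess(registry.draft_lag("v2"), "t0", "t0")    # lag 1, accepted
stale = policy.assess(registry.draft_lag("v1"), "t0", "t0")    # lag 2, rejected
unknown = policy.assess(registry.draft_lag("v9"), "t0", "t0")  # unknown, rejected
wrong_target = policy.assess(registry.draft_lag("v3"), "t-old", "t0")
```

Returning reasons alongside the accept flag lets a caller log why a ref was dropped, which is presumably what the accept/reasons assessment is for.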

Tests

11 CPU tests: registry lag ordering + idempotent publish; both policy axes; unknown-version-is-maximally-stale; drift snapshot/threshold; and the gate (fresh passes, draft-stale + target-stale dropped → features aborted + acked).
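
The gate behaviour those tests exercise can be sketched as follows. This is a simplified stand-in, not the PR's StalenessGatedQueue: the `Ref` fields, the `InMemoryRefQueue` fake, the `abort_features` method name, and the `is_fresh` predicate are hypothetical, chosen only to show that fresh refs pass through while stale ones are acked and have their features aborted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ref:  # hypothetical shape of a streamed sample reference
    ref_id: str
    draft_lag: Optional[int]  # None = unknown draft version
    target_version: str


class InMemoryRefQueue:
    """Tiny fake of the underlying ref queue; records acks and aborts."""

    def __init__(self, refs):
        self._pending = deque(refs)
        self.acked = []
        self.aborted = []

    def get(self) -> Optional[Ref]:
        return self._pending.popleft() if self._pending else None

    def ack(self, ref: Ref) -> None:
        self.acked.append(ref.ref_id)

    def abort_features(self, ref: Ref) -> None:
        self.aborted.append(ref.ref_id)


class StalenessGatedQueue:
    """Wraps a ref queue so the consumer only ever sees fresh refs."""

    def __init__(self, inner: InMemoryRefQueue, is_fresh: Callable[[Ref], bool]):
        self._inner = inner
        self._is_fresh = is_fresh

    def get(self) -> Optional[Ref]:
        while True:
            ref = self._inner.get()
            if ref is None or self._is_fresh(ref):
                return ref
            # Stale: ack so backpressure clears, abort so the features are freed.
            self._inner.ack(ref)
            self._inner.abort_features(ref)

    def ack(self, ref: Ref) -> None:
        self._inner.ack(ref)


def is_fresh(ref: Ref, max_draft_lag: int = 1, current_target: str = "t0") -> bool:
    return (ref.draft_lag is not None
            and ref.draft_lag <= max_draft_lag
            and ref.target_version == current_target)


inner = InMemoryRefQueue([
    Ref("a", 0, "t0"),      # fresh
    Ref("b", 3, "t0"),      # draft-stale
    Ref("c", 0, "t-old"),   # target-stale
    Ref("d", None, "t0"),   # unknown version, treated as maximally stale
    Ref("e", 1, "t0"),      # fresh (lag at the limit)
])
gated = StalenessGatedQueue(inner, is_fresh)

seen = []
ref = gated.get()
while ref is not None:
    seen.append(ref.ref_id)  # the loader would train on this ref
    gated.ack(ref)
    ref = gated.get()
```

Here only refs `a` and `e` reach the loader, while `b`, `c`, and `d` end up in both `inner.acked` and `inner.aborted`.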

Not in scope (deliberate)

Hot-update / weight hot-swap, rollback, and the serving accept-length gate. This PR only decides which streamed samples are fresh enough to train on.

🤖 Generated with Claude Code

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@maocheng23 force-pushed the dataflow-up-17-online-disagg branch from 57da206 to d2c0382 on June 29, 2026 03:57
@maocheng23 force-pushed the dataflow-up-18-staleness branch from 4732ae7 to 0ebd0b5 on June 29, 2026 03:57
…raining

Online training keeps producing fresh draft weights while the rollout pool keeps
generating samples; without hot-swapping the rollout (out of scope -- no
hot-update), the pool drifts behind and its samples go stale. Training on stale
features hurts, so the consumer must drop them. This adds the staleness half of
the weight lifecycle (no hot-update / rollback / serving accept-length gate):

* contracts: WeightVersion (+ WeightStatus) -- metadata-only published draft
  version; publish order is the staleness axis.
* control_plane/version_policy.py:
  - WeightRegistry: publish-ordered history; draft_lag(sample) = distance from
    the newest published version (None = unknown -> maximally stale). Optional
    durable metadata_store backing.
  - StalenessPolicy: two axes -- draft (max_draft_lag) + target
    (require_target_match) -- returning an accept/reasons assessment.
  - DriftMonitor: rolling mean/max draft lag; drifting() for the orchestrator.
  - StalenessGatedQueue: wraps a ref queue so the loader only sees fresh refs;
    stale ones are acked (backpressure clears) and their features aborted (freed)
    instead of trained on.
* build_disagg_online_consumer gains optional staleness_policy / weight_registry
  / current_target_version / drift_monitor: when set, the streamed queue is gated
  so a drifted rollout can't poison training. Off by default (every ref trained).

11 CPU tests (registry lag, both policy axes, unknown-version, drift, gate
drop+abort+passthrough).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 force-pushed the dataflow-up-17-online-disagg branch from d2c0382 to 7cd5452 on June 29, 2026 20:21
@maocheng23 force-pushed the dataflow-up-18-staleness branch from 0ebd0b5 to 670a3c2 on June 29, 2026 20:21
@maocheng23

Collaborator Author

Closing — recreated cleanly on the rebased online stack (up-17/up-18 rebased onto the current zero-copy #621). Superseded by the new up-18 staleness PR below.
