Skip to content

[DataFlow runtime · online] Two-axis staleness gate for online disaggregated training#623

Closed
maocheng23 wants to merge 1 commit into
dataflow-up-16-zerocopyfrom
dataflow-up-18-staleness
Closed

[DataFlow runtime · online] Two-axis staleness gate for online disaggregated training#623
maocheng23 wants to merge 1 commit into
dataflow-up-16-zerocopyfrom
dataflow-up-18-staleness

Conversation

@maocheng23

Copy link
Copy Markdown
Collaborator

Two-axis staleness for the online consumer so it drops features made under a stale draft/target version. Adds contracts.WeightVersion/WeightStatus + control_plane/version_policy.py (WeightRegistry publish/latest/draft_lag, StalenessPolicy two-axis, DriftMonitor, StalenessGatedQueue) + optional gate params on build_disagg_online_consumer (off by default). NO hot_update/rollback/accept-length.

Stacked on the online-disagg PR above. Recreated (rebased) from the closed #615.

🤖 Generated with Claude Code

…raining

Online training keeps producing fresh draft weights while the rollout pool keeps
generating samples; without hot-swapping the rollout (out of scope -- no
hot-update), the pool drifts behind and its samples go stale. Training on stale
features hurts, so the consumer must drop them. This adds the staleness half of
the weight lifecycle (no hot-update / rollback / serving accept-length gate):

* contracts: WeightVersion (+ WeightStatus) -- metadata-only published draft
  version; publish order is the staleness axis.
* control_plane/version_policy.py:
  - WeightRegistry: publish-ordered history; draft_lag(sample) = distance from
    the newest published version (None = unknown -> maximally stale). Optional
    durable metadata_store backing.
  - StalenessPolicy: two axes -- draft (max_draft_lag) + target
    (require_target_match) -- returning an accept/reasons assessment.
  - DriftMonitor: rolling mean/max draft lag; drifting() for the orchestrator.
  - StalenessGatedQueue: wraps a ref queue so the loader only sees fresh refs;
    stale ones are acked (backpressure clears) and their features aborted (freed)
    instead of trained on.
* build_disagg_online_consumer gains optional staleness_policy / weight_registry
  / current_target_version / drift_monitor: when set, the streamed queue is gated
  so a drifted rollout can't poison training. Off by default (every ref trained).

11 CPU tests (registry lag, both policy axes, unknown-version, drift, gate
drop+abort+passthrough).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maocheng23 maocheng23 requested a review from FrankLeeeee as a code owner June 29, 2026 20:22
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@maocheng23 maocheng23 marked this pull request as draft June 29, 2026 21:53
Base automatically changed from dataflow-up-17-online-disagg to dataflow-up-16-zerocopy June 30, 2026 13:29
@maocheng23 maocheng23 closed this Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant