Skip to content

fix(tonic-xds): consume immediate first tick in OD housekeeping actor#2689

Merged
gu0keno0 merged 2 commits into
grpc:masterfrom
LYZJU2019:lyzju2019/fix-od-actor-first-tick
Jun 16, 2026
Merged

fix(tonic-xds): consume immediate first tick in OD housekeeping actor#2689
gu0keno0 merged 2 commits into
grpc:masterfrom
LYZJU2019:lyzju2019/fix-od-actor-first-tick

Conversation

@LYZJU2019

@LYZJU2019 LYZJU2019 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

A flaky test was reported (link) after #2619 was merged.

The root cause is: tokio::time::interval(period) fires immediately on its first poll, so the outlier-detection housekeeping actor was running run_housekeeping at t≈0 — before the data path had recorded anything. That early sweep calls snapshot_and_reset and zeros whatever counters have landed. Therefore, test_outlier_detection_reinsert_preserves_state flakes with (0, 0) != (3, 0) when the actor's first tick races the assertion read.

Fix: consume the immediate tick before entering the loop so the first real sweep fires one full interval after spawn, matching the gRFC A50 semantics.

Testing

`cargo test -p tonic-xds --lib test_outlier_detection_reinsert_preserves_state` — 10/10 pass under stress.

`tokio::time::interval(period)` returns immediately on its first poll, so
the actor was running `run_housekeeping` at t≈0 — before the data path had
recorded anything. That early sweep called `snapshot_and_reset` on whatever
counters had landed, zeroing them.

In tests this surfaces as a race: `test_outlier_detection_reinsert_preserves_state`
flakes with `(0, 0) != (3, 0)` when the actor's first tick fires after the
test records its 3 calls but before the assertion reads counters.

In production this silently drops the first interval's worth of outcomes
on every endpoint.

Fix: consume the immediate tick before entering the loop so the first
real sweep happens one full `interval` after spawn, matching what the
gRFC A50 sweep semantics expect.

Verified the test passes 10/10 under stress.
@gu0keno0

Copy link
Copy Markdown
Contributor

Thanks for the fix. maybe https://docs.rs/tokio/latest/tokio/time/fn.interval_at.html is a better way?

@gu0keno0 gu0keno0 merged commit 75ce9ff into grpc:master Jun 16, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants