adapter/caught-up: ignore beyond_all_hope collections #33803

teskje · 2025-10-08T12:08:23Z

This PR restores the previous behavior where the caught-up checker ignores collections whose live frontier is more than two hours behind. This ensures we don't get stuck on broken sources, or on compute collections that for some reason cannot make progress without OOMing their cluster.

Motivation

This PR fixes a recognized bug.

Fixes https://github.com/MaterializeInc/database-issues/issues/9721

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

teskje · 2025-10-08T12:09:53Z

src/adapter-types/src/dyncfgs.rs

    "with_0dt_caught_up_check_cutoff",
-    Duration::from_secs(10 * 60), // 10 minutes
-    "During a 0dt deployment, if a cluster has only 'problematic' (crash-looping) replicas _and_ any collection that is behind by more than this cutoff, the cluster will be ignored in caught-up checks.",
+    Duration::from_secs(2 * 60 * 60), // 2 hours


Changed this back to the old default, since cutting over too soon is worse than cutting over too late. A value between the two might work as well, but I'm not sure how to choose one.

This commit restores the previous behavior where the caught-up checker ignores collections whose live frontier is more than two hours behind. This ensures we don't get stuck on broken sources, or on compute collections that for some reason cannot make progress without OOMing their cluster.
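
For readers unfamiliar with the check, here is a minimal sketch of the idea described above. It is not the adapter's actual code: `CollectionLag`, `beyond_all_hope`, `all_caught_up`, and the `allowed_lag` parameter are hypothetical names chosen for illustration. Only the two-hour cutoff value comes from the diff.

```rust
use std::time::Duration;

/// Hypothetical stand-in for per-collection state; not the real adapter type.
struct CollectionLag {
    /// How far the collection's live frontier trails the current time.
    lag: Duration,
}

/// The default restored by this PR (2 hours), copied from the diff.
const CUTOFF: Duration = Duration::from_secs(2 * 60 * 60);

/// A collection is "beyond all hope" when it lags by more than the cutoff.
fn beyond_all_hope(c: &CollectionLag, cutoff: Duration) -> bool {
    c.lag > cutoff
}

/// Returns true if every collection that is not beyond all hope is within
/// `allowed_lag` (an assumed parameter). Hopeless collections are skipped,
/// so they cannot block the check. Note that in this sketch an input where
/// every collection is hopeless also passes, because `all` on an empty
/// iterator returns true.
fn all_caught_up(
    collections: &[CollectionLag],
    allowed_lag: Duration,
    cutoff: Duration,
) -> bool {
    collections
        .iter()
        .filter(|c| !beyond_all_hope(c, cutoff))
        .all(|c| c.lag <= allowed_lag)
}
```

Under these assumptions, a collection lagging by three hours is skipped and does not block the check, while one lagging by thirty minutes still has to come within `allowed_lag` before the check passes. That is why a larger cutoff errs toward cutting over too late rather than too soon.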

ggevay

Thank you!

ggevay · 2025-10-08T12:59:06Z

Oh, I just remembered that the doc comments of clusters_caught_up and collections_caught_up need a little bit of updating:

the cluster_has_only_problematic_replicas thing is missing from it;
the collection_hydrated check is missing from it.

But it's also ok if this is considered out of scope for this PR.

teskje force-pushed the caught-up-exclude-beyond_all_hope branch from 77c6421 to 7b72d8f Compare October 8, 2025 12:31

teskje marked this pull request as ready for review October 8, 2025 12:45

teskje requested a review from a team as a code owner October 8, 2025 12:45

teskje requested review from SangJunBak and ggevay October 8, 2025 12:45

ggevay approved these changes Oct 8, 2025
