@teskje teskje commented Oct 8, 2025

This PR restores the previous behavior in which the caught-up checker ignores collections whose live frontier is more than two hours behind. This ensures we don't get stuck on broken sources, or on compute collections that for some reason cannot make progress without OOMing their cluster.
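
To illustrate the rule described above, here is a minimal Rust sketch. It is not the actual Materialize implementation: the `CollectionLag` struct, the `caught_up_sketch` function, and the `allowed_lag` parameter are hypothetical names chosen for this example, and the way lag is modeled is an assumption. Only the two-hour cutoff value comes from this PR.

```rust
use std::time::Duration;

/// Hypothetical per-collection summary; not a real Materialize type.
struct CollectionLag {
    /// How far the live deployment's frontier trails wall-clock time.
    live_behind_now: Duration,
    /// How far the new deployment's frontier trails the live one.
    new_behind_live: Duration,
}

/// Two hours, the default cutoff this PR restores.
const CUTOFF: Duration = Duration::from_secs(2 * 60 * 60);

/// Returns true if every collection that is still considered is within
/// `allowed_lag` (an assumed parameter). Collections whose live frontier is
/// more than `cutoff` behind are ignored, so they cannot block the check.
fn caught_up_sketch(
    collections: &[CollectionLag],
    allowed_lag: Duration,
    cutoff: Duration,
) -> bool {
    collections
        .iter()
        // Skip collections that are beyond the cutoff.
        .filter(|c| c.live_behind_now <= cutoff)
        // All remaining collections must have caught up.
        .all(|c| c.new_behind_live <= allowed_lag)
}
```

In this sketch, a collection whose live frontier is three hours behind is filtered out and never blocks the result, while a collection with a healthy live frontier still has to be within `allowed_lag` for the check to pass.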

Motivation

  • This PR fixes a recognized bug.

Fixes https://github.com/MaterializeInc/database-issues/issues/9721

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

"with_0dt_caught_up_check_cutoff",
Duration::from_secs(10 * 60), // 10 minutes
"During a 0dt deployment, if a cluster has only 'problematic' (crash-looping) replicas _and_ any collection that is behind by more than this cutoff, the cluster will be ignored in caught-up checks.",
Duration::from_secs(2 * 60 * 60), // 2 hours
teskje (Contributor, Author) commented:

Changed this back to the old default, since cutting over too soon is worse than cutting over too late. A value between the two might work as well, but I'm not sure how to choose one.

This commit restores the previous behavior where the caught-up checker
ignores collections whose live frontier is more than two hours behind,
to ensure we don't get stuck on broken sources or compute collections
that for some reason cannot make progress without OOMing their cluster.
@teskje teskje force-pushed the caught-up-exclude-beyond_all_hope branch from 77c6421 to 7b72d8f Compare October 8, 2025 12:31
@teskje teskje marked this pull request as ready for review October 8, 2025 12:45
@teskje teskje requested a review from a team as a code owner October 8, 2025 12:45
@teskje teskje requested review from SangJunBak and ggevay October 8, 2025 12:45
@ggevay ggevay left a comment

Thank you!

ggevay commented Oct 8, 2025

Oh, I just remembered that the doc comments of clusters_caught_up and collections_caught_up need a little bit of updating:

  • the cluster_has_only_problematic_replicas check is missing from them;
  • the collection_hydrated check is missing from them.

But it's also ok if this is considered out of scope for this PR.
