Conversation

johscheuer (Member)

Description

Fix: #2246

Type of change

  • New feature (non-breaking change which adds functionality)

Discussion

Testing

CI will run the e2e tests. I also added some unit tests.

Documentation

Added docs for the new flags (in the flag help).

Follow-up

@johscheuer requested a review from nicmorales9 on October 1, 2025 16:32
@johscheuer marked this pull request as ready for review on October 1, 2025 16:32
@foundationdb-ci (Contributor)

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: 28c9a54
  • Duration: 3:29:56
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log: terminal output (available for 30 days)
  • Build Workspace: zip file of the working directory (available for 30 days)

@johscheuer closed this on Oct 2, 2025
@johscheuer reopened this on Oct 2, 2025
math.MaxFloat64,
"Defines the threshold when a process will be considered to have a high run loop busy value. The value will be between 0.0 and 1.0. Setting it to a higher value will disable the high run loop busy check.",
)
fs.DurationVar(
@johscheuer (Member, Author)

For our e2e tests we probably want to reduce the minimum uptimes a bit.
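
For reference, a minimal sketch of how the two minimum-uptime options from this PR might be wired up as flags. Only the struct field names are taken from the diff; the package name, flag names, defaults, and the bindFlags helper are assumptions. For the e2e tests the defaults could then simply be overridden with lower values:

```go
package controllers

import (
	"time"

	"github.com/spf13/pflag"
)

// options mirrors the two new fields visible in the diff; everything else in
// this sketch (flag names, defaults, the bindFlags helper) is an assumption.
type options struct {
	MinimumUptimeForCoordinatorChangeWithMissingProcess   time.Duration
	MinimumUptimeForCoordinatorChangeWithUndesiredProcess time.Duration
}

func bindFlags(fs *pflag.FlagSet, o *options) {
	fs.DurationVar(
		&o.MinimumUptimeForCoordinatorChangeWithMissingProcess,
		"minimum-uptime-for-coordinator-change-with-missing-process",
		10*time.Minute, // assumed default; e2e tests could pass a lower value
		"Minimum cluster uptime before coordinator changes because of a missing coordinator are allowed.",
	)
	fs.DurationVar(
		&o.MinimumUptimeForCoordinatorChangeWithUndesiredProcess,
		"minimum-uptime-for-coordinator-change-with-undesired-process",
		10*time.Minute, // assumed default; e2e tests could pass a lower value
		"Minimum cluster uptime before coordinator changes because of an undesired (excluded) coordinator are allowed.",
	)
}
```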

@foundationdb-ci (Contributor)

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: 28c9a54
  • Duration: 2:12:59
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log: terminal output (available for 30 days)
  • Build Workspace: zip file of the working directory (available for 30 days)

@nicmorales9 (Contributor) left a comment

The premise seems good, but there are some seemingly important TODOs left, and I have a few questions to make sure I understand the change.


When("the cluster is up for long enough", func() {
It("should change the coordinators", func() {
Expect(requeue).To(BeNil())
@nicmorales9 (Contributor)

Does this guarantee that it is actually changing the coordinators? It feels like there should be a more explicit check, especially with respect to the time that has passed.
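
As a hedged sketch of what a more explicit assertion could look like: getCoordinatorSet is a hypothetical helper that returns the sorted coordinator addresses from the cluster's connection string, and cluster and requeue come from the surrounding test setup.

```go
// Sketch only: getCoordinatorSet is a hypothetical helper, not part of the PR.
var initialCoordinators []string

BeforeEach(func() {
	// Capture the coordinator set before the reconciliation runs.
	initialCoordinators = getCoordinatorSet(cluster)
})

When("the cluster is up for long enough", func() {
	It("should change the coordinators", func() {
		Expect(requeue).To(BeNil())
		// Explicit check: the coordinator set must actually differ from the
		// one captured before the reconciliation, not just "no requeue".
		Expect(getCoordinatorSet(cluster)).NotTo(Equal(initialCoordinators))
	})
})
```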

// changes because of a missing coordinator are allowed.
MinimumUptimeForCoordinatorChangeWithMissingProcess time.Duration
// MinimumUptimeForCoordinatorChangeWithUndesiredProcess defines the minimum uptime of the cluster before coordinator
// changes because of an undesired coordinator are allowed.
@nicmorales9 (Contributor)

Suggested change
// changes because of an undesired coordinator are allowed.
// changes because of an undesired (excluded) coordinator are allowed.

Is there a meaning of "undesired" other than "excluded" here? If not, it feels like we should just use "excluded".

minimumUptimeForExcluded time.Duration,
recoveryStateEnabled bool,
) error {
// TODO double check setting here + true
@nicmorales9 (Contributor)

👀
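
For context, a rough sketch of how the safety check might select the required uptime depending on why the coordinator change is requested. The parameter and helper names are assumptions loosely based on the visible diff (requiredUptime, reason, minimumUptimeForExcluded), not the PR's exact implementation:

```go
package coordinator

// Sketch only: the parameter names, the boolean inputs, and the selection
// logic below are assumptions; the PR's actual implementation may differ.

import (
	"fmt"
	"time"
)

func checkMinimumUptime(
	currentMinimumUptime float64,
	minimumUptime time.Duration,
	minimumUptimeForMissing time.Duration,
	minimumUptimeForExcluded time.Duration,
	hasMissingCoordinator bool,
	hasExcludedCoordinator bool,
) error {
	// Default safety check for "normal" coordinator changes.
	requiredUptime := minimumUptime.Seconds()
	reason := "cluster is not up for long enough"

	if hasMissingCoordinator {
		// A missing coordinator should be replaceable sooner.
		requiredUptime = minimumUptimeForMissing.Seconds()
	} else if hasExcludedCoordinator {
		// An undesired (excluded) coordinator uses its own threshold.
		requiredUptime = minimumUptimeForExcluded.Seconds()
	}

	if currentMinimumUptime < requiredUptime {
		return fmt.Errorf("%s: minimum uptime %.0fs is below required %.0fs",
			reason, currentMinimumUptime, requiredUptime)
	}

	return nil
}
```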

requiredUptime = minimumUptimeForExcluded.Seconds()
reason = "cluster is not up for long enough"

// Perform the default safet checks in case of "normal" coordinator changes or if processes are exclude. If
@nicmorales9 (Contributor)

Suggested change
// Perform the default safet checks in case of "normal" coordinator changes or if processes are exclude. If
// Perform the default safety checks in case of "normal" coordinator changes or if processes are excluded. If

}

// Check that the cluster has been stable for the required time
if currentMinimumUptime < requiredUptime {
@nicmorales9 (Contributor)

This is a bit of an edge case, but could this be thrown off by a crash-looping coordinator?
In the case where a coordinator is missing but will come back up, I think currentMinimumUptime should be (at least) the time since recovery, and hopefully the coordinator comes back before requiredUptime is reached. But if the coordinator (or something in the transaction subsystem) keeps crashing before requiredUptime, couldn't we get stuck here? Alternatively, if something is wrong with the storage servers and one keeps crashing, couldn't that also get us stuck here?
Or are these scenarios just unlikely for FDB?
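
To make the concern concrete, a small sketch of how a minimum uptime taken across all reporting processes behaves; processInfo is a hypothetical, simplified stand-in for the per-process entries in the machine-readable status, and the package name is an assumption. A single crash-looping process keeps the minimum low, so the check would keep deferring the coordinator change:

```go
package coordinator

// Sketch only: processInfo is a simplified, hypothetical stand-in for the
// per-process data in the FDB machine-readable status.

import "math"

type processInfo struct {
	UptimeSeconds float64
}

// getMinimumUptime returns the smallest uptime across all reporting processes.
// A single crash-looping process (repeatedly restarting) keeps this value low,
// so an uptime-based safety check would keep deferring the coordinator change
// until that process stabilizes or stops reporting. With no reporting
// processes, math.MaxFloat64 is returned.
func getMinimumUptime(processes map[string]processInfo) float64 {
	minimumUptime := math.MaxFloat64
	for _, process := range processes {
		if process.UptimeSeconds < minimumUptime {
			minimumUptime = process.UptimeSeconds
		}
	}

	return minimumUptime
}
```

Whether that is acceptable presumably depends on which processes feed into currentMinimumUptime in this PR (for example, only coordinators versus all processes), which is essentially the question above.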

Successfully merging this pull request may close these issues:

Change coordinators should have a safety check