Conversation

johscheuer (Member)

Description

Fix: #2246

Type of change

  • New feature (non-breaking change which adds functionality)

Discussion

Testing

CI will run the e2e tests. I also added some unit tests.

Documentation

Added docs for the new flags (in the flag help).

Follow-up

@johscheuer requested a review from nicmorales9 on October 1, 2025 16:32
@johscheuer marked this pull request as ready for review on October 1, 2025 16:32
@foundationdb-ci (Contributor)

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: 28c9a54
  • Duration: 3:29:56
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log: terminal output (available for 30 days)
  • Build Workspace: zip file of the working directory (available for 30 days)

@johscheuer closed this on Oct 2, 2025
@johscheuer reopened this on Oct 2, 2025
math.MaxFloat64,
"Defines the threshold when a process will be considered to have a high run loop busy value. The value will be between 0.0 and 1.0. Setting it to a higher value will disable the high run loop busy check.",
)
fs.DurationVar(
@johscheuer (Member, Author)

For our e2e tests we probably want to reduce the minimum uptimes a bit.
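
For reference, a minimal sketch of how the two minimum-uptime options from this PR might be wired up as flags. Only the struct field names are taken from the diff; the package name, flag names, defaults, and the bindFlags helper are assumptions. For the e2e tests the defaults could then simply be overridden with lower values:

```go
package controllers

import (
	"time"

	"github.com/spf13/pflag"
)

// options mirrors the two new fields visible in the diff; everything else in
// this sketch (flag names, defaults, the bindFlags helper) is an assumption.
type options struct {
	MinimumUptimeForCoordinatorChangeWithMissingProcess   time.Duration
	MinimumUptimeForCoordinatorChangeWithUndesiredProcess time.Duration
}

func bindFlags(fs *pflag.FlagSet, o *options) {
	fs.DurationVar(
		&o.MinimumUptimeForCoordinatorChangeWithMissingProcess,
		"minimum-uptime-for-coordinator-change-with-missing-process",
		10*time.Minute, // assumed default; e2e tests could pass a lower value
		"Minimum cluster uptime before coordinator changes because of a missing coordinator are allowed.",
	)
	fs.DurationVar(
		&o.MinimumUptimeForCoordinatorChangeWithUndesiredProcess,
		"minimum-uptime-for-coordinator-change-with-undesired-process",
		10*time.Minute, // assumed default; e2e tests could pass a lower value
		"Minimum cluster uptime before coordinator changes because of an undesired (excluded) coordinator are allowed.",
	)
}
```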

@foundationdb-ci (Contributor)

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: 28c9a54
  • Duration: 2:12:59
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log: terminal output (available for 30 days)
  • Build Workspace: zip file of the working directory (available for 30 days)

@nicmorales9 (Contributor) left a comment

The premise seems good, but there are some seemingly important TODOs left, and I have a few questions to make sure I understand the change.


When("the cluster is up for long enough", func() {
It("should change the coordinators", func() {
Expect(requeue).To(BeNil())
@nicmorales9 (Contributor)

Does this guarantee that it is actually changing the coordinators? It feels like there should be a more explicit check, especially with respect to the time that has passed.
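
As a hedged sketch of what a more explicit assertion could look like: getCoordinatorSet is a hypothetical helper that returns the sorted coordinator addresses from the cluster's connection string, and cluster and requeue come from the surrounding test setup.

```go
// Sketch only: getCoordinatorSet is a hypothetical helper, not part of the PR.
var initialCoordinators []string

BeforeEach(func() {
	// Capture the coordinator set before the reconciliation runs.
	initialCoordinators = getCoordinatorSet(cluster)
})

When("the cluster is up for long enough", func() {
	It("should change the coordinators", func() {
		Expect(requeue).To(BeNil())
		// Explicit check: the coordinator set must actually differ from the
		// one captured before the reconciliation, not just "no requeue".
		Expect(getCoordinatorSet(cluster)).NotTo(Equal(initialCoordinators))
	})
})
```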

// changes because of a missing coordinator are allowed.
MinimumUptimeForCoordinatorChangeWithMissingProcess time.Duration
// MinimumUptimeForCoordinatorChangeWithUndesiredProcess defines the minimum uptime of the cluster before coordinator
// changes because of an undesired coordinator are allowed.
@nicmorales9 (Contributor)

Suggested change
// changes because of an undesired coordinator are allowed.
// changes because of an undesired (excluded) coordinator are allowed.

Is there a meaning of "undesired" other than "excluded" here? If not, it feels like we should just use "excluded".

minimumUptimeForExcluded time.Duration,
recoveryStateEnabled bool,
) error {
// TODO double check setting here + true
@nicmorales9 (Contributor)

👀
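
For context, a rough sketch of how the safety check might select the required uptime depending on why the coordinator change is requested. The parameter and helper names are assumptions loosely based on the visible diff (requiredUptime, reason, minimumUptimeForExcluded), not the PR's exact implementation:

```go
package coordinator

// Sketch only: the parameter names, the boolean inputs, and the selection
// logic below are assumptions; the PR's actual implementation may differ.

import (
	"fmt"
	"time"
)

func checkMinimumUptime(
	currentMinimumUptime float64,
	minimumUptime time.Duration,
	minimumUptimeForMissing time.Duration,
	minimumUptimeForExcluded time.Duration,
	hasMissingCoordinator bool,
	hasExcludedCoordinator bool,
) error {
	// Default safety check for "normal" coordinator changes.
	requiredUptime := minimumUptime.Seconds()
	reason := "cluster is not up for long enough"

	if hasMissingCoordinator {
		// A missing coordinator should be replaceable sooner.
		requiredUptime = minimumUptimeForMissing.Seconds()
	} else if hasExcludedCoordinator {
		// An undesired (excluded) coordinator uses its own threshold.
		requiredUptime = minimumUptimeForExcluded.Seconds()
	}

	if currentMinimumUptime < requiredUptime {
		return fmt.Errorf("%s: minimum uptime %.0fs is below required %.0fs",
			reason, currentMinimumUptime, requiredUptime)
	}

	return nil
}
```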

requiredUptime = minimumUptimeForExcluded.Seconds()
reason = "cluster is not up for long enough"

// Perform the default safet checks in case of "normal" coordinator changes or if processes are exclude. If
@nicmorales9 (Contributor)

Suggested change
// Perform the default safet checks in case of "normal" coordinator changes or if processes are exclude. If
// Perform the default safety checks in case of "normal" coordinator changes or if processes are excluded. If

}

// Check that the cluster has been stable for the required time
if currentMinimumUptime < requiredUptime {
@nicmorales9 (Contributor)

This is a bit of an edge case, but could this be thrown off by a crash-looping coordinator?
In the case where a coordinator is missing but will come back up, I think currentMinimumUptime should be (at least) the time since recovery, and hopefully the coordinator comes back before requiredUptime is reached. But if the coordinator (or something in the transaction subsystem) keeps crashing before requiredUptime, couldn't we get stuck here? Alternatively, if something is wrong with the storage servers and one keeps crashing, couldn't that also get us stuck here?
Or are these scenarios just unlikely for FDB?
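
To make the concern concrete, a small sketch of how a minimum uptime taken across all reporting processes behaves; processInfo is a hypothetical, simplified stand-in for the per-process entries in the machine-readable status, and the package name is an assumption. A single crash-looping process keeps the minimum low, so the check would keep deferring the coordinator change:

```go
package coordinator

// Sketch only: processInfo is a simplified, hypothetical stand-in for the
// per-process data in the FDB machine-readable status.

import "math"

type processInfo struct {
	UptimeSeconds float64
}

// getMinimumUptime returns the smallest uptime across all reporting processes.
// A single crash-looping process (repeatedly restarting) keeps this value low,
// so an uptime-based safety check would keep deferring the coordinator change
// until that process stabilizes or stops reporting. With no reporting
// processes, math.MaxFloat64 is returned.
func getMinimumUptime(processes map[string]processInfo) float64 {
	minimumUptime := math.MaxFloat64
	for _, process := range processes {
		if process.UptimeSeconds < minimumUptime {
			minimumUptime = process.UptimeSeconds
		}
	}

	return minimumUptime
}
```

Whether that is acceptable presumably depends on which processes feed into currentMinimumUptime in this PR (for example, only coordinators versus all processes), which is essentially the question above.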

Successfully merging this pull request may close these issues:

Change coordinators should have a safety check