Add a safety check before changing coordinators #2373
base: main
Conversation
Result of fdb-kubernetes-operator-pr on Linux RHEL 9
	math.MaxFloat64,
	"Defines the threshold when a process will be considered to have a high run loop busy value. The value will be between 0.0 and 1.0. Setting it to a higher value will disable the high run loop busy check.",
)
fs.DurationVar(
For our e2e tests we probably want to reduce the minimum uptimes a bit.
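For reference, a minimal sketch of how such minimum-uptime flags could be wired up, assuming they follow the same fs.DurationVar pattern as the surrounding flags. The option struct, flag names, and default values below are hypothetical, not taken from this PR.

// Hypothetical sketch; assumes imports of "time" and "github.com/spf13/pflag".
type coordinatorChangeOptions struct {
	minimumUptimeForCoordinatorChangeWithMissingProcess   time.Duration
	minimumUptimeForCoordinatorChangeWithUndesiredProcess time.Duration
}

// bindFlags registers the hypothetical minimum-uptime flags on the given flag set.
func (o *coordinatorChangeOptions) bindFlags(fs *pflag.FlagSet) {
	fs.DurationVar(
		&o.minimumUptimeForCoordinatorChangeWithMissingProcess,
		"minimum-uptime-for-coordinator-change-with-missing-process",
		10*time.Minute,
		"Defines the minimum uptime of the cluster before coordinator changes because of a missing coordinator are allowed.",
	)
	fs.DurationVar(
		&o.minimumUptimeForCoordinatorChangeWithUndesiredProcess,
		"minimum-uptime-for-coordinator-change-with-undesired-process",
		10*time.Minute,
		"Defines the minimum uptime of the cluster before coordinator changes because of an undesired coordinator are allowed.",
	)
}

An e2e configuration could then pass much smaller values for these flags so the tests do not have to wait out the production defaults.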
Result of fdb-kubernetes-operator-pr on Linux RHEL 9
The premise seems good, but there are some seemingly important TODOs left, and I have a few questions to check my understanding.
When("the cluster is up for long enough", func() { | ||
It("should change the coordinators", func() { | ||
Expect(requeue).To(BeNil()) |
Does this guarantee it is changing the coordinators? It feels like there should be a more explicit check, especially with respect to the time passed.
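As a sketch of what a more explicit assertion could look like; the originalConnectionString variable and the exact status field compared here are assumptions, not code from this PR.

When("the cluster is up for long enough", func() {
	It("should change the coordinators", func() {
		Expect(requeue).To(BeNil())
		// Hypothetical explicit check: the connection string, and therefore the
		// coordinator set, should differ from the value captured before reconciling.
		Expect(cluster.Status.ConnectionString).NotTo(Equal(originalConnectionString))
	})
})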
// changes because of a missing coordinator are allowed.
MinimumUptimeForCoordinatorChangeWithMissingProcess time.Duration
// MinimumUptimeForCoordinatorChangeWithUndesiredProcess defines the minimum uptime of the cluster before coordinator
// changes because of an undesired coordinator are allowed.
Suggested change:
-	// changes because of an undesired coordinator are allowed.
+	// changes because of an undesired (excluded) coordinator are allowed.
Is there a different meaning of undesired other than excluded here? If not, feels like we should just use "excluded"
	minimumUptimeForExcluded time.Duration,
	recoveryStateEnabled bool,
) error {
	// TODO double check setting here + true
👀
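For readers following along, a minimal sketch of the shape this safety check could take, assuming the reason for the coordinator change decides which minimum uptime applies. The branch conditions and the minimumUptimeForNormal/minimumUptimeForMissing names are hypothetical; requiredUptime, reason, and minimumUptimeForExcluded follow the quoted diff below.

// Hypothetical sketch: pick the required uptime based on why the coordinators
// must change; only the minimumUptimeForExcluded branch mirrors the quoted diff.
requiredUptime := minimumUptimeForNormal.Seconds()
reason := "cluster is not up for long enough"
if coordinatorIsMissing {
	requiredUptime = minimumUptimeForMissing.Seconds()
} else if coordinatorIsExcluded {
	requiredUptime = minimumUptimeForExcluded.Seconds()
}

// Reject the change until the cluster has been stable for the required time.
if currentMinimumUptime < requiredUptime {
	return fmt.Errorf("%s, current minimum uptime: %.2f seconds, required: %.2f seconds", reason, currentMinimumUptime, requiredUptime)
}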
		requiredUptime = minimumUptimeForExcluded.Seconds()
		reason = "cluster is not up for long enough"

	// Perform the default safet checks in case of "normal" coordinator changes or if processes are exclude. If
Suggested change:
-	// Perform the default safet checks in case of "normal" coordinator changes or if processes are exclude. If
+	// Perform the default safety checks in case of "normal" coordinator changes or if processes are excluded. If
	}

	// Check that the cluster has been stable for the required time
	if currentMinimumUptime < requiredUptime {
This is a bit of an edge case, but could this get thrown off by a crash-looping coordinator?
In the case where a coordinator is missing but will come back up, I think currentMinimumUptime should be (at least) the time since recovery, and hopefully the coordinator comes back before requiredUptime is reached. But if the coordinator (or something in the transaction subsystem) keeps crashing before requiredUptime, couldn't we get stuck here? Alternatively, if there is something wrong with the storage servers and one of them keeps crashing, couldn't that also get us stuck here?
Or are these just unlikely scenarios for FDB?
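Regarding the crash-loop concern, a small sketch of why the check would behave that way, assuming the cluster-wide uptime is taken as the minimum of the per-process uptime_seconds values from the machine-readable status. The helper below is illustrative, not the operator's actual implementation.

// Assumes an import of "math".
// minimumUptimeSeconds returns the smallest uptime across all processes. Because
// the result is a minimum, a single crash-looping process keeps resetting it, so
// the currentMinimumUptime < requiredUptime check would keep requeueing until
// that process stays up for the required duration.
func minimumUptimeSeconds(processUptimes []float64) float64 {
	minimum := math.MaxFloat64
	for _, uptime := range processUptimes {
		if uptime < minimum {
			minimum = uptime
		}
	}
	return minimum
}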
Description
Fix: #2246
Type of change
Discussion
Testing
CI will run e2e tests. I added some additional unit tests.
Documentation
Added docs for the new flags (in the flag help).
Follow-up