Introduce machine termination queue to handle deletions #964
Conversation
@takoverflow Thank you for your contribution.
Tested scenario: Scale-up/Scale-down of a Machine Deployment
By causing a scale-down in a deployment with a large replica count, numerous machines go to the terminating state simultaneously and clog up the machine queue, preventing any new machine creations (simulated by causing a scale-up in a different Machine Deployment). Because the terminating machines hog all the workers in the machine queue, creation requests cannot go through. When the same scenario is replicated with the new machine termination queue, terminating machines are processed separately and creation requests are no longer blocked.
Stress test: Delete 4 Machine Deployments and scale up a new one
By causing deletions in deployments with a large replica count, numerous machines go to the terminating state simultaneously and clog up the machine queue, preventing any new machine creations (simulated by causing a scale-up in a different Machine Deployment).
Comparing the log timestamps of when machines were scheduled for deletion against when the first machine of the newly scaled deployment was created shows the delay caused by the clogged queue. When the same scenario is replicated with the new machine termination queue, the same timestamp comparison shows that machine creation proceeds without waiting for the deletions to finish.
Force-pushed from 532e31d to 3d5e843
 */
func (c *controller) addMachine(obj interface{}) {
	c.enqueueMachine(obj, "handling machine obj ADD event")
	machine := obj.(*v1alpha1.Machine)
This can unfortunately panic in those cases where obj is tomb-stoned. See deleteMachineDeployment. This is a common mistake in the delete callback. (I also made the same mistake in the past.)
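For reference, a minimal sketch of a delete callback that unwraps a tombstone before the type assertion, using client-go's cache.DeletedFinalStateUnknown. The function name and import paths here are assumptions for illustration, not the exact code in this PR:

```go
import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// deleteMachine is a hypothetical delete callback that tolerates tombstones.
func (c *controller) deleteMachine(obj interface{}) {
	machine, ok := obj.(*v1alpha1.Machine)
	if !ok {
		// When the watch misses the final state of a deleted object, the
		// informer hands us a tombstone instead of the object itself.
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			klog.Errorf("couldn't get object from tombstone %#v", obj)
			return
		}
		machine, ok = tombstone.Obj.(*v1alpha1.Machine)
		if !ok {
			klog.Errorf("tombstone contained object that is not a Machine %#v", tombstone.Obj)
			return
		}
	}
	c.enqueueMachineTermination(machine, "handling machine obj DELETE event")
}
```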
}

func (c *controller) enqueueMachineTermination(obj interface{}, reason string) {
	if key, ok := c.getKeyForObj(obj); ok {
Callers should ensure that tombstone handling is done correctly; otherwise this will fail.
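One way to make the key helper itself tombstone-safe is client-go's cache.DeletionHandlingMetaNamespaceKeyFunc. A hypothetical sketch only; the actual getKeyForObj in this PR may differ:

```go
// getKeyForObj returns the namespace/name key for obj, unwrapping
// DeletedFinalStateUnknown tombstones along the way.
func (c *controller) getKeyForObj(obj interface{}) (string, bool) {
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil {
		klog.Errorf("couldn't get key for object %+v: %v", obj, err)
		return "", false
	}
	return key, true
}
```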
Rather than explicitly setting the LastOperation value and updating the status for multiple error codes, fall back to using the already available ErrorHandler to clean up the creation flow.
Isolated the machine deletion flow into a separate worker queue that only processes terminating machines; the remaining machine reconciliation flow continues as-is in the `machineQueue`. The machine watch functions for `update` and `delete` events are updated to account for the new queue.
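A rough sketch, under assumed names, of what running a dedicated termination queue next to the regular machine queue can look like with client-go's workqueue package; queue names, worker counts, and the reconcile functions are illustrative, not the PR's exact code:

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// Two independent rate-limited queues: one for regular machine
// reconciliation, one that only ever sees terminating machines.
var (
	machineQueue            = workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "machine")
	machineTerminationQueue = workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "machinetermination")
)

// runWorkers starts a separate worker pool per queue, so a burst of
// terminating machines cannot occupy the workers that serve
// create/update reconciliation.
func runWorkers(workers int, reconcileMachine, reconcileTermination func(key string) error, stopCh <-chan struct{}) {
	for i := 0; i < workers; i++ {
		go wait.Until(func() { processNextItem(machineQueue, reconcileMachine) }, time.Second, stopCh)
		go wait.Until(func() { processNextItem(machineTerminationQueue, reconcileTermination) }, time.Second, stopCh)
	}
}

// processNextItem pops one key, runs reconcile, and re-queues on error.
func processNextItem(q workqueue.RateLimitingInterface, reconcile func(key string) error) {
	key, shutdown := q.Get()
	if shutdown {
		return
	}
	defer q.Done(key)
	if err := reconcile(key.(string)); err != nil {
		q.AddRateLimited(key)
		return
	}
	q.Forget(key)
}
```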
If the MCM process stops during a deletion, then upon restart the same terminating machine is added to the normal machine queue (as part of the ADD event); since its deletion timestamp is set, it is not processed any further, so the deletion never proceeds. Checking for a terminating machine on ADD events resolves this.
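A sketch of how the ADD handler could route machines that already carry a deletion timestamp (for example after an MCM restart mid-deletion) to the termination queue; this mirrors the idea above but is not the verbatim PR code:

```go
// addMachine handles ADD events. Machines that are already terminating
// (deletion timestamp set) go to the termination queue so their deletion
// resumes after a controller restart; everything else goes to the
// regular machine queue.
func (c *controller) addMachine(obj interface{}) {
	machine, ok := obj.(*v1alpha1.Machine)
	if !ok {
		klog.Errorf("couldn't convert to Machine object %#v", obj)
		return
	}
	if machine.DeletionTimestamp != nil {
		c.enqueueMachineTermination(machine, "handling terminating machine obj ADD event")
		return
	}
	c.enqueueMachine(machine, "handling machine obj ADD event")
}
```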
Force-pushed from d4475c7 to 747151a
elankath left a comment:
looks good with minor safety re-queue
elankath left a comment:
/lgtm thanks for the PR
What this PR does / why we need it:
Adds a new queue to handle machine objects scheduled for deletion, to unblock machine creation/update requests.
Which issue(s) this PR fixes:
Fixes #943
Special notes for your reviewer:
Release note: