
Conversation

@takoverflow (Member) commented Jan 29, 2025

What this PR does / why we need it:
Adds a new queue to handle machine objects scheduled for
deletion to unblock machine creation/update requests.

Which issue(s) this PR fixes:
Fixes #943

Special notes for your reviewer:

Release note:

A new termination queue that handles machines scheduled for deletion has been introduced to separate deletion from machine creation requests.
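Conceptually (a rough, illustrative sketch only; the queue name and construction below are assumptions, not necessarily what this PR does), terminating machines get their own rate-limited workqueue from client-go's workqueue package, drained independently of the existing machine queue:

// Illustrative sketch: a second, independently drained workqueue for
// terminating machines, so that slow deletions cannot starve creations/updates.
machineTerminationQueue := workqueue.NewNamedRateLimitingQueue(
	workqueue.DefaultControllerRateLimiter(), "machinetermination")
// The controller's Run would shut this queue down on exit and start a set of
// workers that only process items from it.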

@gardener-robot commented

@takoverflow Thank you for your contribution.

@gardener-robot added the needs/review and size/M labels on Jan 29, 2025
@gardener-robot-ci-2 added the needs/ok-to-test label on Jan 29, 2025
@takoverflow (Member, Author) commented Jan 31, 2025

Tested scenario

Scale-up/Scale-down of a Machine Deployment

Scaling down a Machine Deployment with a large replica count sends numerous machines to the Terminating state at once; these clog up the machine queue and prevent any new machine creations (simulated here by scaling up a different Machine Deployment).

kubectl get mcd
NAME                              READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i752152--scale-test-a-z1                                              0s
shoot--i752152--scale-test-b-z1           1                                  0s
1. Increase the replica count of one of the machine deployments (a).
kubectl scale mcd shoot--i752152--scale-test-a-z1 --replicas=100
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-a-z1 scaled
2. Check that all the machines are running, then scale down Machine Deployment (a) to simulate the deletion of numerous machines at once.
kubectl get machine | grep Running | wc -l
101
kubectl scale mcd shoot--i752152--scale-test-a-z1 --replicas=0
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-a-z1 scaled
3. Scale up the other deployment (b) to queue pending machine creation requests, after a short wait so that the deletion requests are all enqueued first and the creation requests cannot slip in between them.
sleep 120
kubectl scale mcd shoot--i752152--scale-test-b-z1 --replicas=100
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-b-z1 scaled
4. Wait for some time and then check how many machines are running. The machine deletion requests were artificially delayed in the virtual provider to simulate a real, time-consuming deletion process.
sleep 180
kubectl get machine | grep 'a-z1.*Terminating' | wc -l
100
kubectl get machine | grep 'b-z1.*Running' | wc -l
1

Because the terminating machines hog all the workers in the machine queue, creation requests cannot go through. When the same scenario is replicated with the new machineTerminationQueue, the time-consuming deletion flow moves to a separate queue, so the creation requests no longer have to wait behind it.

sleep 180
kubectl get machine | grep 'a-z1.*Terminating' | wc -l
100
kubectl get machine | grep 'b-z1.*Running' | wc -l
100

@takoverflow marked this pull request as ready for review on February 7, 2025 04:53
@takoverflow requested a review from a team as a code owner on February 7, 2025 04:53
@takoverflow (Member, Author) commented Feb 7, 2025

Stress Test

Delete 4 machine deployments and scale up a new one

Deleting Machine Deployments with a large replica count sends numerous machines to the Terminating state at once; these clog up the machine queue and prevent any new machine creations (simulated here by scaling up a different Machine Deployment).

+ kubectl get mcd
NAME                              READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i752152--scale-test-a-z1                                              1s
shoot--i752152--scale-test-b-z1                                              1s
shoot--i752152--scale-test-c-z1                                              1s
shoot--i752152--scale-test-d-z1           1                                  1s
shoot--i752152--scale-test-e-z1                                              0s
1. Increase the replica count of four of the machine deployments (a, b, c and d).
+ kubectl scale mcd shoot--i752152--scale-test-a-z1 --replicas=100
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-a-z1 scaled
+ kubectl scale mcd shoot--i752152--scale-test-b-z1 --replicas=100
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-b-z1 scaled
+ kubectl scale mcd shoot--i752152--scale-test-c-z1 --replicas=100
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-c-z1 scaled
+ kubectl scale mcd shoot--i752152--scale-test-d-z1 --replicas=100
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-d-z1 scaled
2. Schedule a pod on each machine, wait for some time, and then delete all four Machine Deployments.
+ kubectl delete mcd shoot--i752152--scale-test-a-z1
machinedeployment.machine.sapcloud.io "shoot--i752152--scale-test-a-z1" deleted
+ kubectl delete mcd shoot--i752152--scale-test-b-z1
machinedeployment.machine.sapcloud.io "shoot--i752152--scale-test-b-z1" deleted
+ kubectl delete mcd shoot--i752152--scale-test-c-z1
machinedeployment.machine.sapcloud.io "shoot--i752152--scale-test-c-z1" deleted
+ kubectl delete mcd shoot--i752152--scale-test-d-z1
machinedeployment.machine.sapcloud.io "shoot--i752152--scale-test-d-z1" deleted
3. Scale up the remaining deployment (e) to queue pending machine creation requests, after a short wait so that the deletion requests are all enqueued first and the creation requests cannot slip in between them.
+ sleep 120
+ kubectl scale mcd shoot--i752152--scale-test-e-z1 --replicas=100
machinedeployment.machine.sapcloud.io/shoot--i752152--scale-test-e-z1 scaled
4. Wait for some time and then check how many machines are running. The machine deletion requests were artificially delayed (by 10m each) in the virtual provider to simulate a real, time-consuming deletion process.
+ sleep 180
+ kubectl get machine | grep 'Terminating' | wc -l
400
+ kubectl get machine | grep 'e-z1.*Running' | wc -l
0

Looking at the logs for the timestamps when the machines were scheduled for deletion and when the first machine from the newly scaled deployment (e) started running:

# Get the first machine that goes to terminating
+ TERM=$(grep -Pi "Machine.*status updated to terminating" old.log | head -n 1 | sed 's|.*"\(.*\)".*|\1|')
shoot--i752152--scale-test-a-z1-6f964-2z6hs

# Check the timestamp for it to go to terminating state
+ grep -Pi 'machine "shoot--i752152--scale-test-a-z1-6f964-2z6hs" status updated to terminating' old.log
I0204 21:45:19.924412   35563 machine_util.go:936] Machine "shoot--i752152--scale-test-a-z1-6f964-2z6hs" status updated to terminating

# Get the first new machine that's created
+ NEW=$(grep -i "adding machine object to queue" old.log | grep 'e-z1' | head -n 1 | sed 's|.*".*/\(.*\)".*|\1|')
shoot--i752152--scale-test-e-z1-d5fc5-crrfj

# Check when it gets to "Running" state
+ grep -Pi 'Start for "shoot--i752152--scale-test-e-z1-d5fc5-crrfj" with phase:"Running"' old.log | head -n 1
I0204 23:23:44.720681   35563 machine.go:116] reconcileClusterMachine: Start for "shoot--i752152--scale-test-e-z1-d5fc5-crrfj" with phase:"Running", description:"Machine shoot--i752152--scale-test-e-z1-d5fc5-crrfj successfully joined the cluster"

So the first machine was scheduled for deletion at 21:45:19, and the first machine of the newly scaled deployment reached the "Running" state at 23:23:44. Because the terminating machines hog all the workers in the machine queue, creation requests cannot go through, and it takes almost 1h40m before new machines are created successfully.

After the change

When the same scenario is replicated with the new machineTerminationQueue, the time-consuming deletion flow moves to a separate queue, so the creation requests no longer have to wait behind it.

+ sleep 180
+ kubectl get machine | grep 'Terminating' | wc -l
400
+ kubectl get machine | grep 'e-z1.*Running' | wc -l
100

Again, looking at the timestamps:

# Get the first machine that goes to terminating
+ TERM=$(grep -i "adding machine object to termination queue" new.log | head -n 1 | sed 's|.*".*/\(.*\)".*|\1|')
shoot--i752152--scale-test-b-z1-64874-b6hjs

# Check the timestamp for it to go to terminating state
+ grep -Pi 'shoot--i752152--scale-test-b-z1-64874-b6hjs.*reason: handling terminating machine object UPDATE event' new.log
I0204 23:46:33.287374   49438 machine.go:92] Adding machine object to termination queue "shoot--i752152--scale-test/shoot--i752152--scale-test-b-z1-64874-b6hjs", reason: handling terminating machine object UPDATE event

# Get the first new machine that's created
+ NEW=$(grep -i "adding machine object to queue" new.log | grep 'e-z1' | head -n 1 | sed 's|.*".*/\(.*\)".*|\1|')
shoot--i752152--scale-test-e-z1-d5fc5-v8dhj

# Check when it gets to "Running" state
+ grep -Pi 'Start for "shoot--i752152--scale-test-e-z1-d5fc5-v8dhj" with phase:"Running"' new.log | head -n 1
I0204 23:51:08.974343   49438 machine.go:140] reconcileClusterMachine: Start for "shoot--i752152--scale-test-e-z1-d5fc5-v8dhj" with phase:"Running", description:"Machine shoot--i752152--scale-test-e-z1-d5fc5-v8dhj successfully joined the cluster"

The first machine was scheduled for deletion at 23:46:33 and the first "Running" machine from the new deployment appeared at 23:51:08, a difference of ~5m; subtracting the 2m from the earlier sleep 120 invocation, it takes close to 3m for the new creation requests to go through.

*/
func (c *controller) addMachine(obj interface{}) {
	c.enqueueMachine(obj, "handling machine obj ADD event")
	machine := obj.(*v1alpha1.Machine)

Contributor (inline review comment):

This can unfortunately panic in those cases where obj is tomb-stoned. See deleteMachineDeployment. This is a common mistake in the delete callback. (I also made the same mistake in the past.)

}

func (c *controller) enqueueMachineTermination(obj interface{}, reason string) {
	if key, ok := c.getKeyForObj(obj); ok {

Contributor (inline review comment):

Callers should ensure that tombstone handling is done correctly, otherwise this will fail.
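For reference, a minimal sketch of the tombstone-safe handling being asked for (the handler name and reason string are assumptions, not this PR's actual code): delete callbacks may receive a cache.DeletedFinalStateUnknown instead of the typed object, so a bare type assertion can panic.

func (c *controller) deleteMachine(obj interface{}) {
	machine, ok := obj.(*v1alpha1.Machine)
	if !ok {
		// The object may be tomb-stoned; unwrap it instead of panicking.
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			klog.Errorf("couldn't get object from tombstone %#v", obj)
			return
		}
		machine, ok = tombstone.Obj.(*v1alpha1.Machine)
		if !ok {
			klog.Errorf("tombstone contained object that is not a Machine %#v", obj)
			return
		}
	}
	c.enqueueMachineTermination(machine, "handling machine obj DELETE event")
}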

@gardener-robot added the needs/changes label on Feb 26, 2025
Rather than explicitly setting the LastOperation value
and updating the status for multiple error codes, fall
back to using the already available ErrorHandler to
clean up the creation flow.
Isolated the machine deletion flow into a separate workqueue that only processes terminating machines; the remaining machine reconciliation flow happens as before in the `machineQueue`. The machine watch functions for `update` and `delete` events are updated to account for the introduction of the new queue.
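A minimal sketch of what the updated `update` watch function could look like (the handler shape and the non-terminating reason string are assumptions; the termination reason string is taken from the log lines quoted in the test results above):

func (c *controller) updateMachine(oldObj, newObj interface{}) {
	machine, ok := newObj.(*v1alpha1.Machine)
	if !ok {
		return
	}
	if machine.DeletionTimestamp != nil {
		// Machines being deleted are handed to the dedicated termination queue.
		c.enqueueMachineTermination(machine, "handling terminating machine object UPDATE event")
		return
	}
	c.enqueueMachine(machine, "handling machine obj UPDATE event")
}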
In case the MCM process stops during deletion, upon restart the same terminating machine would be added to the normal machine queue (as part of the ADD event); since its deletion timestamp is set, it would not be processed any further and the deletion would never proceed. Checking for a terminating machine on ADD events resolves this.
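Sketched below (again as an assumption of the exact shape rather than a copy of the PR), the ADD handler routes a machine that already carries a deletion timestamp to the termination queue so that its deletion resumes after an MCM restart:

func (c *controller) addMachine(obj interface{}) {
	machine, ok := obj.(*v1alpha1.Machine)
	if !ok {
		return
	}
	if machine.DeletionTimestamp != nil {
		// Resume deletion for machines that were already Terminating when the
		// controller restarted.
		c.enqueueMachineTermination(machine, "handling terminating machine object ADD event")
		return
	}
	c.enqueueMachine(machine, "handling machine obj ADD event")
}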
@takoverflow requested a review from unmarshall on March 11, 2025 10:24
@takoverflow requested a review from elankath on March 12, 2025 08:33
@elankath (Member) left a comment

looks good with minor safety re-queue

@takoverflow requested a review from elankath on March 13, 2025 07:41
@elankath (Member) left a comment

/lgtm thanks for the PR

@gardener-robot added the reviewed/lgtm label and removed the needs/changes and needs/review labels on Mar 13, 2025
@aaronfern merged commit 68dfbf4 into gardener:master on Mar 17, 2025
8 checks passed
@gardener-robot added the status/closed label on Mar 17, 2025