
Conversation

sedefsavas

What this PR does / why we need it:
This PR fixes two different issues I observed during a KCP single control plane upgrade.

  1. machine_controller was not allowing machine deletion, thinking that it was the last control plane machine.

  2. Not having a check for machines that are in the deletion process in MachinesNeedingUpgrade() was causing a second machine with the new version to come up. The controller created a machine with the new version and added a DeletionTimestamp to the old one; at the next reconcile it still saw 1 machine to be upgraded (the one with the deletion timestamp) and scaled up one more time (see the sketch below).
    Besides creating a second machine, this was causing ForwardLeaderElection to panic from time to time.
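
As a rough illustration of the second fix (a simplified sketch with made-up types, not the PR's actual code, which uses the machinefilters helpers discussed below), machines that are already being deleted are excluded before deciding what still needs an upgrade:

	// Simplified sketch only; types and names are made up for illustration.
	type machine struct {
		configHash string
		deleting   bool // true once a DeletionTimestamp has been set on the machine
	}

	func machinesNeedingUpgrade(machines []machine, desiredHash string) []machine {
		var result []machine
		for _, m := range machines {
			if m.deleting {
				continue // already going away; not a candidate for another scale up
			}
			if m.configHash != desiredHash {
				result = append(result, m) // still on the old configuration/version
			}
		}
		return result
	}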

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2915

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 24, 2020
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 24, 2020
Comment on lines 103 to 108
		)
		).Filter(machinefilters.Not(machinefilters.HasDeletionTimestamp))
	}

	return c.Machines.Filter(
		machinefilters.Not(machinefilters.MatchesConfigurationHash(c.SpecHash())),
		machinefilters.Not(machinefilters.HasDeletionTimestamp),
@sedefsavas
Author

This can be added in scaleUpControlPlane() similar to scaleDownControlPlane():

	// Wait for any delete in progress to complete before deleting another Machine
	if controlPlane.HasDeletingMachine() {
		return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
	}

I prefer to do it in MachinesNeedingUpgrade(), because Machines that are about to get deleted should not be included in that list logically.
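
For reference, the HasDeletingMachine() helper referred to above can be thought of as roughly the following (a sketch; the receiver and collection types are assumed from the snippets in this thread rather than quoted from the code):

	// Sketch only: assumes ControlPlane keeps its Machines collection in memory and
	// that machinefilters.HasDeletionTimestamp is the same predicate used in the diff.
	func (c *ControlPlane) HasDeletingMachine() bool {
		return len(c.Machines.Filter(machinefilters.HasDeletionTimestamp)) > 0
	}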

@sedefsavas
Author

BTW, the second issue is happening with 3-node control plane upgrades too: it was creating one additional machine and then deleting it.

@vincepri
Member

is_last_control_plane_members test should be updated

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 24, 2020
@benmoss

benmoss commented Apr 24, 2020

I think I need help understanding the chronology here. With a one-node cluster, the way it's supposed to work is (a rough code sketch follows below):

  1. We upgrade, triggering scale up
  2. We continue to call scaleUpControlPlane until the new machine joins the cluster, but it returns early since there is a machine without a nodeRef
  3. Once the machine has a nodeRef we expect status.Nodes to be > spec.Replicas. We call scaleDownControlPlane, triggering deletion of the old machine
  4. We continue to call scaleDownControlPlane, but it returns early since there is a deleting machine
  5. There are no more machines needing upgrade

There must be something untrue about this for what you're seeing to be happening. All I can think of is that there's some mismatch between ClusterStatus.Nodes and the Machines we're seeing.
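
In rough pseudocode, the chronology above boils down to a decision like this (simplified names; this is not the controller's actual code):

	// Rough sketch of the intended decision described in the chronology above.
	func nextAction(machinesNeedingUpgrade, observedNodes, desiredReplicas int) string {
		switch {
		case machinesNeedingUpgrade == 0:
			return "done" // step 5: nothing left to upgrade
		case observedNodes <= desiredReplicas:
			return "scale up" // steps 1-2: create the replacement, or wait for its nodeRef
		default:
			return "scale down" // steps 3-4: more nodes than replicas, delete the old machine
		}
	}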

For the other issue, I don't think we want to just drop the numControlPlaneMachines == 1 check. We don't want to delete the last control plane node, so if numControlPlaneMachines is giving us a wrong number we should fix that.

@benmoss

benmoss commented Apr 24, 2020

What we've learned:

  • The faulty assumption is that scaleUpControlPlane will always return early. reconcileHealth makes a request to the Kubernetes API in TargetClusterControlPlaneIsHealthy, which means that by the time we reach this code the new machine may have successfully joined the cluster. At that point we have already decided to scale up, but what we really intend is to wait for the Provisioning machine to become Running. We pass the reconcileHealth health check and create another new machine (see the sketch after this list).
  • Scale down doesn't have the same race condition because it calls HasDeletingMachine() on the controlPlane, which is state we already have in memory.
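
In other words, scale up would need a guard on the state already held in memory, along these lines (a sketch with assumed field and helper names, not the PR's actual diff):

	// Sketch only: before creating another machine, consult the machines the
	// controller already has in memory, the same way scale down does, instead of
	// relying on a fresh query against the workload cluster API. A machine that is
	// still being deleted or still provisioning (no nodeRef yet) means the previous
	// step has not finished.
	for _, m := range controlPlane.Machines {
		if !m.DeletionTimestamp.IsZero() || m.Status.NodeRef == nil {
			return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
		}
	}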

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 27, 2020
@detiber detiber left a comment
Contributor

If I'm reading this change correctly in the context of the existing state of the code, I believe we have a potential bug where the following operations, performed in quick succession, may cause multiple scale-up events without waiting for each new machine to finish becoming ready:

  1. Initiate an upgrade on a KCP resource with a single replica (either by setting UpgradeAfter or updating the Version)
  2. Scale the kcp to 3 replicas.

Unless I'm missing something, it looks like there is nothing that is preventing upgrade (or scale up itself) from performing multiple scale up operations without waiting for the previous scale up operation to fully complete first.

I also wonder if we have a potential race condition between joining a new control plane node to the cluster and the behavior of ReconcileEtcdMembers(). Since we are using static pods, there are no guarantees that the Node resource is created in the apiserver before the etcd Pod is up and running on the host. It looks like if we call ReconcileEtcdMembers() at just the right time during the join process we could effectively remove the etcd membership for the etcd member we just added to the cluster as part of the control plane join. It doesn't look like either the ControlPlaneIsHealthy or EtcdIsHealthy checks would prevent this, since they are only looking at Nodes on the remote cluster to determine the cluster is healthy.

@sedefsavas
Author

As an example of the problem caused by not having the machine-deletion check during upgrade:

  1. There is 1 control plane node that needs to be upgraded.
  2. scaleUpControlPlane() is called and a new machine is created.
  3. scaleUpControlPlane() keeps being called until the new node is up.
  4. After the node is up, there are 2 existing nodes while 1 is expected, so scaleDownControlPlane() is called and a DeletionTimestamp is added to the old machine.
  5. In the next reconcile, since MachinesNeedingUpgrade() still contains 1 machine even though it has a DeletionTimestamp, the controller enters the upgrade path: the existing node count is 1 (the old node is gone but its machine is still there) and the expected count is 1, so scaleUpControlPlane() is called again, even though the only node present is already the upgraded one.

This scenario happens in an HA cluster as well.

As a result, extra nodes are created during upgrade and then scaled down. Also, ForwardEtcdLeadership() panics because we don't have logic to wait for the second upgraded node to become ready before calling it.

@sedefsavas
Author

If I'm reading this change correctly in the context of the existing state of the code, I believe we have a potential bug where the following operations, performed in quick succession, may cause multiple scale-up events without waiting for each new machine to finish becoming ready:

  1. Initiate an upgrade on a KCP resource with a single replica (either by setting UpgradeAfter or updating the Version)
  2. Scale the kcp to 3 replicas.

Unless I'm missing something, it looks like there is nothing that is preventing upgrade (or scale up itself) from performing multiple scale up operations without waiting for the previous scale up operation to fully complete first.

Having a HasDeletingMachine() check in upgrade actually prevents this, because it means we wait until MachinesNeedingUpgrade() has no machines that are in the deletion process. This ensures that if scale up proceeds, it is because there is a machine with the old version.

I also wonder if we have a potential race condition between joining a new control plane node to the cluster and the behavior of ReconcileEtcdMembers(). Since we are using static pods, there are no guarantees that the Node resource is created in the apiserver before the etcd Pod is up and running on the host. It looks like if we call ReconcileEtcdMembers() at just the right time during the join process we could effectively remove the etcd membership for the etcd member we just added to the cluster as part of the control plane join. It doesn't look like either the ControlPlaneIsHealthy or EtcdIsHealthy checks would prevent this, since they are only looking at Nodes on the remote cluster to determine the cluster is healthy.

ReconcileEtcdMembers() only removes members if they don't have an underlying machine, which should not be the case here. It is basically a cleanup for when a machine is deleted without the node being cleaned up successfully.

@sedefsavas
Author

/test pull-cluster-api-capd-e2e

@detiber
Contributor

detiber commented Apr 27, 2020

ReconcileEtcdMembers() only removes members if they don't have an underlying machine which should not be the case here. It is basically a cleanup if machine is deleted without successfully cleaning up the node.

Does it? From what I'm seeing it looks like it only checks Nodes against the etcd membership here:

// ReconcileEtcdMembers iterates over all etcd members and finds members that do not have corresponding nodes.
// If there are any such members, it deletes them from etcd and removes their nodes from the kubeadm configmap so that kubeadm does not run etcd health checks on them.
func (w *Workload) ReconcileEtcdMembers(ctx context.Context) error {
	controlPlaneNodes, err := w.getControlPlaneNodes(ctx)
	if err != nil {
		return err
	}
	errs := []error{}
	for _, node := range controlPlaneNodes.Items {
		name := node.Name
		// Create the etcd Client for the etcd Pod scheduled on the Node
		etcdClient, err := w.etcdClientGenerator.forNode(ctx, name)
		if err != nil {
			continue
		}
		members, err := etcdClient.Members(ctx)
		if err != nil {
			continue
		}
		// Check if any member's node is missing from workload cluster
		// If any, delete it with best effort
		for _, member := range members {
			isFound := false
			for _, node := range controlPlaneNodes.Items {
				if member.Name == node.Name {
					isFound = true
					break
				}
			}
			// Stop here if we found the member to be in the list of control plane nodes.
			if isFound {
				continue
			}
			if err := w.removeMemberForNode(ctx, member.Name); err != nil {
				errs = append(errs, err)
			}
			if err := w.RemoveNodeFromKubeadmConfigMap(ctx, member.Name); err != nil {
				errs = append(errs, err)
			}
		}
	}
	return kerrors.NewAggregate(errs)
}

@detiber
Contributor

detiber commented Apr 27, 2020

Having a HasDeletingMachine () check in Upgrade actually prevents this because that means it needs to wait until MachinesNeedingUpgrade() has no other machines in deletion process. This ensures if ScaleUp proceeds, it is because there is a machine with old version.

In the scenario I mentioned (a 1-replica control plane scaled to 3 immediately after initiating the upgrade) I wouldn't expect there to be any machines with a DeletionTimestamp set yet, since scaleDown is only called when the Node count > the replica count. As an aside, this also looks to be using Nodes and not Machines here.

@Jaakko-Os
Contributor

I read through the discussion and noted similarities with my tests in metal3-dev-env.
I have two bare metal hosts, one provisioned as the control plane and the other one intentionally left empty.
The k8s version was changed in the KCP; the upgrade starts fine and the empty host is provisioned with the new k8s version.

~/metal3-dev-env$ kubectl get bmh -A -owide
NAMESPACE   NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
metal3      node-0   OK       provisioned           test1-controlplane-dlwp8   ipmi://192.168.111.1:6230   unknown            true     
metal3      node-1   OK       provisioned           test1-controlplane-9jrjv   ipmi://192.168.111.1:6231   unknown            true

At the machine level it is not what I expected: I have two running machines, one with the old and one with the new k8s version. A third machine pops up and, since no resources are available, it hangs.

~/metal3-dev-env$ kubectl get machines -A -owide
NAMESPACE   NAME                       PROVIDERID                                      PHASE          NODENAME
metal3      test1-controlplane-5znpt                                                   Provisioning   
metal3      test1-controlplane-rlsml   metal3://96be1a27-1f8d-4058-83e0-f93af44c3759   Running        node-1
metal3      test1-controlplane-xx2jh   metal3://e8304607-dec9-4ed8-9ac6-bcacb1a0a47e   Running        node-0

I would not expect to see a third machine provisioning; instead I would expect de-provisioning of the old running machine.

@vincepri
Member

/milestone v0.3.4

@k8s-ci-robot k8s-ci-robot added this to the v0.3.4 milestone Apr 28, 2020
@vincepri
Member

@Jaakko-Os do you have some time to try the fix that's in this PR by building a custom image?

@vincepri vincepri left a comment
Member

/approve

@sedefsavas the only thing left to do in this PR is to add the check at the bottom of reconcileHealth as a temporary workaround for what we discussed before, and to document the reasoning in the code.

did I miss anything else? @detiber @benmoss @fabriziopandini
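
For context, the check being asked for at the bottom of reconcileHealth could look roughly like this (a sketch reusing names from earlier in the thread; the placement and the return shape are assumptions, not a quote of the final diff):

	// Sketch only: requeue while any control plane machine is still being deleted,
	// so neither scale up nor scale down proceeds until the deletion finishes
	// (assumes reconcileHealth returns only an error).
	if controlPlane.HasDeletingMachine() {
		return &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
	}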

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sedefsavas, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 28, 2020
@vincepri
Member

/retitle 🐛 KCP should check if it has machines being deleted before proceeding with scale up/down

@k8s-ci-robot k8s-ci-robot changed the title 🐛Fix for single node KCP upgrade 🐛 KCP should check if it has machines being deleted before proceeding with scale up/down Apr 28, 2020
@sedefsavas sedefsavas force-pushed the kcp_upgrade_fix branch 2 times, most recently from 342d83c to 4179882 Compare April 28, 2020 21:33
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 28, 2020
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2020
@vincepri
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7cda00b into kubernetes-sigs:master Apr 28, 2020
@Jaakko-Os
Contributor

Jaakko-Os commented Apr 29, 2020

@Jaakko-Os do you have some time to try the fix that's in this PR by building a custom image?

@vincepri I'll run my test again

@vincepri
Member

@Jaakko-Os We released v0.3.4 which includes this fix, would love to know if that works for you!

@Jaakko-Os
Contributor

@vincepri I repeated my test and now the upgrade was successful!

@Xenwar

Xenwar commented Apr 29, 2020

@vincepri I have also successfully upgraded kubernetes version from 1.18.0 to 1.18.2.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[KCP] Fails to upgrade single node control plane

7 participants