
Conversation

sedefsavas

What this PR does / why we need it:
This PR fixes two different issues I observed during a KCP single control plane upgrade.

  1. machine_controller was not allowing machine deletion, thinking that it was the last control plane machine.

  2. Not having a check for machines that are in the deletion process in MachinesNeedingUpgrade() was causing a second machine with the new version to come up. The controller created a machine with the new version and added a DeletionTimestamp to the old one; at the next reconcile it still saw 1 machine to be upgraded (the one with the deletion timestamp) and scaled up one more time (see the sketch below).
    Besides creating a second machine, this was causing ForwardLeaderElection to panic from time to time.
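
As a rough illustration of the second fix (a simplified sketch with made-up types, not the PR's actual code, which uses the machinefilters helpers discussed below), machines that are already being deleted are excluded before deciding what still needs an upgrade:

	// Simplified sketch only; types and names are made up for illustration.
	type machine struct {
		configHash string
		deleting   bool // true once a DeletionTimestamp has been set on the machine
	}

	func machinesNeedingUpgrade(machines []machine, desiredHash string) []machine {
		var result []machine
		for _, m := range machines {
			if m.deleting {
				continue // already going away; not a candidate for another scale up
			}
			if m.configHash != desiredHash {
				result = append(result, m) // still on the old configuration/version
			}
		}
		return result
	}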

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2915

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 24, 2020
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 24, 2020
Comment on lines 103 to 108
		)
		).Filter(machinefilters.Not(machinefilters.HasDeletionTimestamp))
	}

	return c.Machines.Filter(
		machinefilters.Not(machinefilters.MatchesConfigurationHash(c.SpecHash())),
		machinefilters.Not(machinefilters.HasDeletionTimestamp),
@sedefsavas
Author

This can be added in scaleUpControlPlane() similar to scaleDownControlPlane():

	// Wait for any delete in progress to complete before deleting another Machine
	if controlPlane.HasDeletingMachine() {
		return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
	}

I prefer to do it in MachinesNeedingUpgrade(), because Machines that are about to get deleted should not be included in that list logically.
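
For reference, the HasDeletingMachine() helper referred to above can be thought of as roughly the following (a sketch; the receiver and collection types are assumed from the snippets in this thread rather than quoted from the code):

	// Sketch only: assumes ControlPlane keeps its Machines collection in memory and
	// that machinefilters.HasDeletionTimestamp is the same predicate used in the diff.
	func (c *ControlPlane) HasDeletingMachine() bool {
		return len(c.Machines.Filter(machinefilters.HasDeletionTimestamp)) > 0
	}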

@sedefsavas
Author

BTW, the second issue is happening with 3-node control plane upgrades too: it was creating one additional machine and then deleting it.

@vincepri
Member

is_last_control_plane_members test should be updated

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 24, 2020
@benmoss

benmoss commented Apr 24, 2020

I think I need help understanding the chronology here. With a one-node cluster, the way it's supposed to work is (a rough code sketch follows below):

  1. We upgrade, triggering scale up
  2. We continue to call scaleUpControlPlane until the new machine joins the cluster, but it returns early since there is a machine without a nodeRef
  3. Once the machine has a nodeRef we expect status.Nodes to be > spec.Replicas. We call scaleDownControlPlane, triggering deletion of the old machine
  4. We continue to call scaleDownControlPlane, but it returns early since there is a deleting machine
  5. There are no more machines needing upgrade

There must be something untrue about this for what you're seeing to be happening. All I can think of is that there's some mismatch between ClusterStatus.Nodes and the Machines we're seeing.
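
In rough pseudocode, the chronology above boils down to a decision like this (simplified names; this is not the controller's actual code):

	// Rough sketch of the intended decision described in the chronology above.
	func nextAction(machinesNeedingUpgrade, observedNodes, desiredReplicas int) string {
		switch {
		case machinesNeedingUpgrade == 0:
			return "done" // step 5: nothing left to upgrade
		case observedNodes <= desiredReplicas:
			return "scale up" // steps 1-2: create the replacement, or wait for its nodeRef
		default:
			return "scale down" // steps 3-4: more nodes than replicas, delete the old machine
		}
	}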

For the other issue, I don't think we want to just drop the numControlPlaneMachines == 1 check. We don't want to delete the last control plane node, so if numControlPlaneMachines is giving us a wrong number we should fix that.

@benmoss

benmoss commented Apr 24, 2020

What we've learned:

  • The faulty assumption is that scaleUpControlPlane will always return early. reconcileHealth makes a request to the Kubernetes API in TargetClusterControlPlaneIsHealthy, which means that by the time we reach this code the new machine may have successfully joined the cluster. At that point we have already decided to scale up, but what we really intend is to wait for the Provisioning machine to become Running. We pass the reconcileHealth health check and create another new machine (see the sketch after this list).
  • Scale down doesn't have the same race condition because it calls HasDeletingMachine() on the controlPlane, which is state we already have in memory.
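
In other words, scale up would need a guard on the state already held in memory, along these lines (a sketch with assumed field and helper names, not the PR's actual diff):

	// Sketch only: before creating another machine, consult the machines the
	// controller already has in memory, the same way scale down does, instead of
	// relying on a fresh query against the workload cluster API. A machine that is
	// still being deleted or still provisioning (no nodeRef yet) means the previous
	// step has not finished.
	for _, m := range controlPlane.Machines {
		if !m.DeletionTimestamp.IsZero() || m.Status.NodeRef == nil {
			return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
		}
	}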

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 27, 2020
@detiber detiber left a comment
Contributor

If I'm reading this change correctly in the context of the existing state of the code, I believe we have a potential bug where the following operations, performed in quick succession, may cause multiple scale-up events without waiting for each new machine to finish becoming ready:

  1. Initiate an upgrade on a KCP resource with a single replica (either by setting UpgradeAfter or updating the Version)
  2. Scale the kcp to 3 replicas.

Unless I'm missing something, it looks like there is nothing that is preventing upgrade (or scale up itself) from performing multiple scale up operations without waiting for the previous scale up operation to fully complete first.

I also wonder if we have a potential race condition between joining a new control plane node to the cluster and the behavior of ReconcileEtcdMembers(). Since we are using static pods, there are no guarantees that the Node resource is created in the apiserver before the etcd Pod is up and running on the host. It looks like if we call ReconcileEtcdMembers() at just the right time during the join process we could effectively remove the etcd membership for the etcd member we just added to the cluster as part of the control plane join. It doesn't look like either the ControlPlaneIsHealthy or EtcdIsHealthy checks would prevent this, since they are only looking at Nodes on the remote cluster to determine the cluster is healthy.

@sedefsavas
Author

As an example of the problem caused by not having the machine-deletion check during upgrade:

  1. There is 1 control plane node that needs to be upgraded.
  2. scaleUpControlPlane() is called and a new machine is created.
  3. scaleUpControlPlane() keeps being called until the new node is up.
  4. After the node is up, there are 2 existing nodes while 1 is expected, so scaleDownControlPlane() is called and a DeletionTimestamp is added to the old machine.
  5. In the next reconcile, since MachinesNeedingUpgrade() still contains 1 machine even though it has a DeletionTimestamp, the controller enters the upgrade path: the existing node count is 1 (the old node is gone but its machine is still there) and the expected count is 1, so scaleUpControlPlane() is called again, even though the only node present is already the upgraded one.

This scenario happens in an HA cluster as well.

As a result, extra nodes are created during upgrade and then scaled down. Also, ForwardEtcdLeadership() panics because we don't have logic to wait for the second upgraded node to become ready before calling it.

@sedefsavas
Author

If I'm reading this change correctly in the context of the existing state of the code, I believe we have a potential bug where the following operations, performed in quick succession, may cause multiple scale-up events without waiting for each new machine to finish becoming ready:

  1. Initiate an upgrade on a KCP resource with a single replica (either by setting UpgradeAfter or updating the Version)
  2. Scale the kcp to 3 replicas.

Unless I'm missing something, it looks like there is nothing that is preventing upgrade (or scale up itself) from performing multiple scale up operations without waiting for the previous scale up operation to fully complete first.

Having a HasDeletingMachine() check in upgrade actually prevents this, because it means we wait until MachinesNeedingUpgrade() has no machines that are in the deletion process. This ensures that if scale up proceeds, it is because there is a machine with the old version.

I also wonder if we have a potential race condition between joining a new control plane node to the cluster and the behavior of ReconcileEtcdMembers(). Since we are using static pods, there are no guarantees that the Node resource is created in the apiserver before the etcd Pod is up and running on the host. It looks like if we call ReconcileEtcdMembers() at just the right time during the join process we could effectively remove the etcd membership for the etcd member we just added to the cluster as part of the control plane join. It doesn't look like either the ControlPlaneIsHealthy or EtcdIsHealthy checks would prevent this, since they are only looking at Nodes on the remote cluster to determine the cluster is healthy.

ReconcileEtcdMembers() only removes members if they don't have an underlying machine, which should not be the case here. It is basically a cleanup for when a machine is deleted without the node being cleaned up successfully.

@sedefsavas
Author

/test pull-cluster-api-capd-e2e

@detiber
Contributor

detiber commented Apr 27, 2020

ReconcileEtcdMembers() only removes members if they don't have an underlying machine which should not be the case here. It is basically a cleanup if machine is deleted without successfully cleaning up the node.

Does it? From what I'm seeing it looks like it only checks Nodes against the etcd membership here:

// ReconcileEtcdMembers iterates over all etcd members and finds members that do not have corresponding nodes.
// If there are any such members, it deletes them from etcd and removes their nodes from the kubeadm configmap so that kubeadm does not run etcd health checks on them.
func (w *Workload) ReconcileEtcdMembers(ctx context.Context) error {
	controlPlaneNodes, err := w.getControlPlaneNodes(ctx)
	if err != nil {
		return err
	}
	errs := []error{}
	for _, node := range controlPlaneNodes.Items {
		name := node.Name
		// Create the etcd Client for the etcd Pod scheduled on the Node
		etcdClient, err := w.etcdClientGenerator.forNode(ctx, name)
		if err != nil {
			continue
		}
		members, err := etcdClient.Members(ctx)
		if err != nil {
			continue
		}
		// Check if any member's node is missing from workload cluster
		// If any, delete it with best effort
		for _, member := range members {
			isFound := false
			for _, node := range controlPlaneNodes.Items {
				if member.Name == node.Name {
					isFound = true
					break
				}
			}
			// Stop here if we found the member to be in the list of control plane nodes.
			if isFound {
				continue
			}
			if err := w.removeMemberForNode(ctx, member.Name); err != nil {
				errs = append(errs, err)
			}
			if err := w.RemoveNodeFromKubeadmConfigMap(ctx, member.Name); err != nil {
				errs = append(errs, err)
			}
		}
	}
	return kerrors.NewAggregate(errs)
}

@detiber
Contributor

detiber commented Apr 27, 2020

Having a HasDeletingMachine () check in Upgrade actually prevents this because that means it needs to wait until MachinesNeedingUpgrade() has no other machines in deletion process. This ensures if ScaleUp proceeds, it is because there is a machine with old version.

In the scenario I mentioned (a 1-replica control plane scaled to 3 immediately after initiating the upgrade) I wouldn't expect there to be any machines with a DeletionTimestamp set yet, since scaleDown is only called when the Node count > the replica count. As an aside, this also looks to be using Nodes and not Machines here.

@Jaakko-Os
Contributor

I read through the discussion and noted similarities with my tests in metal3-dev-env.
I have two bare metal hosts, one provisioned as the control plane and the other one intentionally left empty.
The k8s version was changed in the KCP; the upgrade starts fine and the empty host is provisioned with the new k8s version.

~/metal3-dev-env$ kubectl get bmh -A -owide
NAMESPACE   NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
metal3      node-0   OK       provisioned           test1-controlplane-dlwp8   ipmi://192.168.111.1:6230   unknown            true     
metal3      node-1   OK       provisioned           test1-controlplane-9jrjv   ipmi://192.168.111.1:6231   unknown            true

At the machine level it is not what I expected: I have two running machines, one with the old and one with the new k8s version. A third machine pops up and, since no resources are available, it hangs.

~/metal3-dev-env$ kubectl get machines -A -owide
NAMESPACE   NAME                       PROVIDERID                                      PHASE          NODENAME
metal3      test1-controlplane-5znpt                                                   Provisioning   
metal3      test1-controlplane-rlsml   metal3://96be1a27-1f8d-4058-83e0-f93af44c3759   Running        node-1
metal3      test1-controlplane-xx2jh   metal3://e8304607-dec9-4ed8-9ac6-bcacb1a0a47e   Running        node-0

I would not expect to see a third machine provisioning; instead I would expect de-provisioning of the old running machine.

@vincepri
Member

/milestone v0.3.4

@k8s-ci-robot k8s-ci-robot added this to the v0.3.4 milestone Apr 28, 2020
@vincepri
Member

@Jaakko-Os do you have some time to try the fix that's in this PR by building a custom image?

@vincepri vincepri left a comment
Member

/approve

@sedefsavas the only thing left to do in this PR is to add the check at the bottom of reconcileHealth as a temporary workaround for what we discussed before, and to document the reasoning in the code.

did I miss anything else? @detiber @benmoss @fabriziopandini
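
For context, the check being asked for at the bottom of reconcileHealth could look roughly like this (a sketch reusing names from earlier in the thread; the placement and the return shape are assumptions, not a quote of the final diff):

	// Sketch only: requeue while any control plane machine is still being deleted,
	// so neither scale up nor scale down proceeds until the deletion finishes
	// (assumes reconcileHealth returns only an error).
	if controlPlane.HasDeletingMachine() {
		return &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
	}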

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sedefsavas, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 28, 2020
@vincepri
Member

/retitle 🐛 KCP should check if it has machines being deleted before proceeding with scale up/down

@k8s-ci-robot k8s-ci-robot changed the title 🐛Fix for single node KCP upgrade 🐛 KCP should check if it has machines being deleted before proceeding with scale up/down Apr 28, 2020
@sedefsavas sedefsavas force-pushed the kcp_upgrade_fix branch 2 times, most recently from 342d83c to 4179882 Compare April 28, 2020 21:33
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 28, 2020
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2020
@vincepri
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7cda00b into kubernetes-sigs:master Apr 28, 2020
@Jaakko-Os
Contributor

Jaakko-Os commented Apr 29, 2020

@Jaakko-Os do you have some time to try the fix that's in this PR by building a custom image?

@vincepri I'll run my test again

@vincepri
Member

@Jaakko-Os We released v0.3.4 which includes this fix, would love to know if that works for you!

@Jaakko-Os
Contributor

@vincepri I repeated my test and now the upgrade was successful!

@Xenwar

Xenwar commented Apr 29, 2020

@vincepri I have also successfully upgraded kubernetes version from 1.18.0 to 1.18.2.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[KCP] Fails to upgrade single node control plane

7 participants