KCP deleted without deleting the CP Machine #3198

Closed
@jzhoucliqr

Description

When a cluster delete happens before the KCP controller is able to add its finalizer, the KCP object is deleted without deleting the CP Machine.

Seen this with capz. What happened: the first time KCP reconcile runs (doing init), before the finalizer is added, it tries to get the remote cluster status before calling patchHelper. Since the first CP node is not up yet, the call to the remote cluster hangs and times out after 30 seconds.

If a delete cluster request comes in during these 30 seconds, the KCP object does not have a finalizer yet, so it is deleted without waiting for the CP Machine to be deleted. The Cluster object is then also deleted, and the CP Machine object is left behind and never cleaned up.
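To make the finalizer behaviour concrete, here is a minimal sketch of the rule involved: an object with no finalizers is removed as soon as it is deleted, while an object with a finalizer is only marked as deleting. This is a toy in-memory model, not Cluster API or API server code. The `Store` and `Object` types, the `SimulateDelete` helper and the finalizer name are all made up for illustration, and garbage collection and the Machine controller are deliberately left out.

```go
package main

import "fmt"

// Object is a drastically simplified stand-in for a Kubernetes object.
type Object struct {
	Name       string
	Finalizers []string
	Deleting   bool // stands in for a non-nil metadata.deletionTimestamp
}

// Store is a toy in-memory stand-in for the API server (hypothetical).
type Store struct {
	objects map[string]*Object
}

func NewStore() *Store { return &Store{objects: map[string]*Object{}} }

func (s *Store) Create(o *Object) { s.objects[o.Name] = o }

func (s *Store) Exists(name string) bool {
	_, ok := s.objects[name]
	return ok
}

// Delete mimics the finalizer rule: an object without finalizers is removed
// immediately; an object with finalizers is only marked as deleting.
func (s *Store) Delete(name string) {
	o, ok := s.objects[name]
	if !ok {
		return
	}
	if len(o.Finalizers) == 0 {
		delete(s.objects, name)
		return
	}
	o.Deleting = true
}

// SimulateDelete creates a KCP and one control plane Machine, deletes the
// KCP, and reports which of the two objects still exist afterwards.
func SimulateDelete(kcpHasFinalizer bool) (kcpExists, machineExists bool) {
	s := NewStore()
	kcp := &Object{Name: "xgr52-cp"}
	if kcpHasFinalizer {
		// Hypothetical finalizer name, for illustration only.
		kcp.Finalizers = []string{"example.com/kcp-cleanup"}
	}
	s.Create(kcp)
	s.Create(&Object{Name: "xgr52-cp-httdv"}) // the CP Machine

	s.Delete("xgr52-cp") // the delete request arrives
	return s.Exists("xgr52-cp"), s.Exists("xgr52-cp-httdv")
}

func main() {
	kcp, machine := SimulateDelete(false)
	fmt.Printf("no finalizer:   kcp exists=%v, machine exists=%v\n", kcp, machine)

	kcp, machine = SimulateDelete(true)
	fmt.Printf("with finalizer: kcp exists=%v, machine exists=%v\n", kcp, machine)
}
```

In the first case the KCP is gone while the Machine is still there, which is the situation described above. In the second case the KCP stays around (marked as deleting), so its controller would still get a chance to clean up the Machine.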

So either we patch the finalizer first, before getting the remote cluster status, or we use foregroundDelete to delete the KCP: because the Machine has its ownerRef set correctly, the KCP object should still be available even without the finalizer.
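The first option (persist the finalizer before any slow call) can be sketched as follows. This is a simplified model of the ordering only, not the actual KCP controller. The `KCP` struct, both `reconcile...` functions, the `finalizerPresentDuringRemoteCall` helper and the finalizer name are hypothetical, and the remote status call is replaced by a callback that always fails with a timeout.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical finalizer name, for illustration only.
const kcpFinalizer = "example.com/kcp-cleanup"

// KCP models only the persisted finalizers of a KubeadmControlPlane.
type KCP struct {
	Finalizers []string
}

func hasFinalizer(k *KCP) bool {
	for _, f := range k.Finalizers {
		if f == kcpFinalizer {
			return true
		}
	}
	return false
}

func addFinalizer(k *KCP) {
	if !hasFinalizer(k) {
		k.Finalizers = append(k.Finalizers, kcpFinalizer)
	}
}

// reconcileStatusFirst models the problematic order: the finalizer is only
// persisted once the slow remote call has returned.
func reconcileStatusFirst(server *KCP, remoteStatus func() error) error {
	defer addFinalizer(server)
	return remoteStatus() // may block for the full timeout
}

// reconcileFinalizerFirst models the proposed order: the finalizer is
// persisted before anything that can block.
func reconcileFinalizerFirst(server *KCP, remoteStatus func() error) error {
	addFinalizer(server)
	return remoteStatus()
}

// finalizerPresentDuringRemoteCall reports whether the persisted object
// already carries the finalizer while the remote call is in flight, which is
// the window in which a delete request can arrive.
func finalizerPresentDuringRemoteCall(reconcile func(*KCP, func() error) error) bool {
	server := &KCP{}
	present := false
	_ = reconcile(server, func() error {
		present = hasFinalizer(server)
		return errors.New("i/o timeout") // first CP node is not up yet
	})
	return present
}

func main() {
	fmt.Println("status first, finalizer present during call:   ",
		finalizerPresentDuringRemoteCall(reconcileStatusFirst))
	fmt.Println("finalizer first, finalizer present during call:",
		finalizerPresentDuringRemoteCall(reconcileFinalizerFirst))
}
```

With the finalizer persisted first, a delete request that arrives during the timeout only marks the object as deleting, so the controller can still remove the CP Machine before the KCP goes away.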

Let me know what you think. I can put up a fix. @vincepri

What did you expect to happen:

Anything else you would like to add:
kcp log:

I0612 06:12:59.113011       1 controller.go:192] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="xgr52" "kubeadmControlPlane"="xgr52-cp" "namespace"="cluster-5ee31604cb45680ec615dd0b" 
I0612 06:13:05.351425       1 controller.go:250] controllers/KubeadmControlPlane "msg"="Initializing control plane" "cluster"="xgr52" "kubeadmControlPlane"="xgr52-cp" "namespace"="cluster-5ee31604cb45680ec615dd0b" "Desired"=1 "Existing"=0
E0612 06:13:35.634203       1 controller.go:160] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: failed to create client for workload cluster cluster-5ee31604cb45680ec615dd0b/xgr52: Get https://xgr52-7dbad68a.centralus.cloudapp.azure.com:6443/api?timeout=30s: dial tcp 52.154.155.142:6443: i/o timeout" "cluster"="xgr52" "kubeadmControlPlane"="xgr52-cp" "namespace"="cluster-5ee31604cb45680ec615dd0b" 
E0612 06:13:35.813173       1 controller.go:166] controllers/KubeadmControlPlane "msg"="Failed to patch KubeadmControlPlane" "error"="kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"xgr52-cp\" not found" "cluster"="xgr52" "kubeadmControlPlane"="xgr52-cp" "namespace"="cluster-5ee31604cb45680ec615dd0b" 
E0612 06:13:35.813562       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="[failed to create remote cluster client: failed to create client for workload cluster cluster-5ee31604cb45680ec615dd0b/xgr52: Get https://xgr52-7dbad68a.centralus.cloudapp.azure.com:6443/api?timeout=30s: dial tcp 52.154.155.142:6443: i/o timeout, kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"xgr52-cp\" not found]"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"cluster-5ee31604cb45680ec615dd0b","Name":"xgr52-cp"}

capi log:

E0612 06:13:07.862684       1 machine_controller.go:226] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"xgr52-cp-httdv\" in namespace \"cluster-5ee31604cb45680ec615dd0b\" is not ready, requeuing: requeue in 30s" "cluster"="xgr52" "machine"="xgr52-cp-httdv" "namespace"="cluster-5ee31604cb45680ec615dd0b" 
I0612 06:13:12.262613       1 cluster_controller.go:282] controllers/Cluster "msg"="Cluster still has descendants - need to requeue" "cluster"="xgr52" "namespace"="cluster-5ee31604cb45680ec615dd0b" "controlPlaneRef"="xgr52-cp"
I0612 06:13:12.949862       1 machine_controller.go:263] controllers/Machine "msg"="Deleting Kubernetes Node associated with Machine is not allowed" "cluster"="xgr52" "machine"="xgr52-cp-httdv" "namespace"="cluster-5ee31604cb45680ec615dd0b" "cause"={} "node"=null
I0612 06:13:13.296689       1 cluster_controller.go:306] controllers/Cluster "msg"="Cluster still has descendants - need to requeue" "cluster"="xgr52" "namespace"="cluster-5ee31604cb45680ec615dd0b" "infrastructureRef"="xgr52"
I0612 06:13:13.564824       1 cluster_controller.go:306] controllers/Cluster "msg"="Cluster still has descendants - need to requeue" "cluster"="xgr52" "namespace"="cluster-5ee31604cb45680ec615dd0b" "infrastructureRef"="xgr52"
I0612 06:13:13.655777       1 machine_controller.go:263] controllers/Machine "msg"="Deleting Kubernetes Node associated with Machine is not allowed" "cluster"="xgr52" "machine"="xgr52-cp-httdv" "namespace"="cluster-5ee31604cb45680ec615dd0b" "cause"={} "node"=null
.....
I0612 06:16:49.643898       1 machine_controller.go:263] controllers/Machine "msg"="Deleting Kubernetes Node associated with Machine is not allowed" "cluster"="xgr52" "machine"="xgr52-cp-httdv" "namespace"="cluster-5ee31604cb45680ec615dd0b" "cause"={} "node"=null
E0612 06:17:30.564624       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to get cluster \"xgr52\" for machine \"xgr52-cp-httdv\" in namespace \"cluster-5ee31604cb45680ec615dd0b\": Cluster.cluster.x-k8s.io \"xgr52\" not found"  "controller"="machine" "request"={"Namespace":"cluster-5ee31604cb45680ec615dd0b","Name":"xgr52-cp-httdv"}

Environment:

  • Cluster-api version: 0.3.6
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]
