Open
Description
What steps did you take and what happened?
- Create a cluster with any number of machines.
- Once the cluster is ready, take down the pod(s) running CAPI controllers (I just set their image tag so some invalid tag and waited for them to go into
ImagePullBackoff
). - Bring the controller pods back up, and while they're booting, issue the command to delete the cluster until the request actually gets a successful response – usually this happens before all the controllers/workers/eventsources have been fully initialized. (Our system that uses CAPI does this – it retries the cluster delete command until it gets a successful response).
- Wait for CAPI to delete the cluster.
You'll notice that CAPI actually creates new machines while deleting the existing ones in the cluster, and then it stalls for a while (maybe 15-20mins, but certainly far longer than usual) before it finally deletes the cluster.
What did you expect to happen?
The cluster is deleted after some reasonable period of time (just after the underlying infrastructure is deleted).
Cluster API version
v1.4.4
Kubernetes version
v1.26.9+k3s1
Anything else you would like to add?
Here's an annotated version of the CAPI logs from a case where this happened to me.
-
At this point, capi-controller-manager has just started. All the logs before this point are just CAPI saying that it's registering webhooks and starting workers and controllers.
-
Seems like CAPI just noticed that we're deleting a cluster.
18:18:36.495104 1 cluster_controller.go:241] "Cluster still has children - deleting them first" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="fleet-default/bruno1" namespace="fleet-default" name="bruno1" reconcileID=d3cb487a-4f72-49f3-a5d0-5ae9ada54284 count =1
- CAPI deletes the ONLY MachineDeployment in this cluster
18:18:36.495116 1 cluster_controller.go:252] "Deleting child object" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="fleet-default/bruno1" namespace="fleet-default" reconcileID=d3cb487a-4f72-49f3-a5d0-5ae9ada54284 gvk="cluster.x-k8s.io/v1beta1, Kind=MachineDeployment" name="bruno1-pool1"
- Clearly the capi-controller-manager hasn't fully started up yet because some controllers aren't running yet. Not sure if this matters.
18:18:36.495368 1 controller.go:228] "Starting workers" controller="machinedeployment" controllerGroup="cluster.x-k8s.io" controllerKind="MachineDeployment" worker count=10
18:18:36.495392 1 controller.go:228] "Starting workers" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" worker count=10
18:18:36.495832 1 controller.go:228] "Starting workers" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" worker count=10
- At this point there is still only one MachineDeployment and one MachineSet for the cluster. This is correct.
18:18:36.509157 1 cluster_controller.go:267] "Cluster still has descendants - need to requeue" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="fleet-default/bruno1" namespace="fleet-default" name="bruno1" reconcileID=d3cb487a-4f72-49f3-a5d0-5ae9ada54284 descendants="Machine deployments: bruno1-pool1;Machine sets: bruno1-pool1-598fbdc96dxb6jwb" indirect descendants count=1
- At this point it seems like the MachineDeployment deletion has taken effect as we no longer see the it in this log (unlike in the one above). This seems fine so far.
18:18:36.522988 1 cluster_controller.go:267] "Cluster still has descendants - need to requeue" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="fleet-default/bruno1" namespace="fleet-default" name="bruno1" reconcileID=294849c4-d9da-4417-b6d0-92debb9b445c descendants="Machine sets: bruno1-pool1-598fbdc96dxb6jwb" indirect descendants count=1
- CAPI is adding a watcher to the existing machine in the cluster.
18:18:36.601989 1 tracker.go:57] "Adding watcher on external object" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=02c2078a-3f71-4ac4-86cc-23119523605b MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" MachineDeployment="fleet-default/bruno1-pool1" Cluster="fleet-default/bruno1" groupVersionKind="rke.cattle.io/v1, Kind=RKEBootstrap"
18:18:36.602007 1 controller.go:151] "Starting EventSource" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" source="kind source: *unstructured.Unstructured"
18:18:36.611476 1 tracker.go:57] "Adding watcher on external object" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=02c2078a-3f71-4ac4-86cc-23119523605b MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" MachineDeployment="fleet-default/bruno1-pool1" Cluster="fleet-default/bruno1" groupVersionKind="rke-machine.cattle.io/v1, Kind=DigitaloceanMachine"
18:18:36.611537 1 controller.go:151] "Starting EventSource" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" source="kind source: *unstructured.Unstructured"
- CAPI is refusing to delete the machine because the associated cluster has a non-zero deletion timestamp. This is expected, I think. See https://github.dev/kubernetes-sigs/cluster-api/blob/00dbf7b9f6322d7ebd06ae2efa703b23354dd37d/internal/controllers/machine/machine_controller.go#L318.
18:18:36.620957 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=468785aa-80f4-4c88-8428-32b1f98eb133 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" Node="bruno1-pool1-af648dc8-gpjrk" cause="cluster is being deleted"
- CAPI's MachineDelpoyment controller is trying to reconcile the MachineDeployment "bruno1-pool1", but it can't because it already deleted it above. I think this is intended/normal.
18:18:36.641444 1 machinedeployment_controller.go:167] "Failed to reconcile MachineDeployment" err="failed to update revision annotation on MachineDeployment: MachineDeployment.cluster.x-k8s.io \"bruno1-pool1\" not found" controller="machinedeployment" controllerGroup="cluster.x-k8s.io" controllerKind="MachineDeployment" MachineDeployment="fleet-default/bruno1-pool1" namespace="fleet-default" name="bruno1-pool1" reconcileID=b4115e4e-a299-4e15-88fe-74a7698fc3fb Cluster="fleet-default/bruno1"
18:18:36.641731 1 controller.go:329] "Reconciler error" err="failed to update revision annotation on MachineDeployment: MachineDeployment.cluster.x-k8s.io \"bruno1-pool1\" not found" controller="machinedeployment" controllerGroup="cluster.x-k8s.io" controllerKind="MachineDeployment" MachineDeployment="fleet-default/bruno1-pool1" namespace="fleet-default" name="bruno1-pool1" reconcileID=b4115e4e-a299-4e15-88fe-74a7698fc3fb
- Here's where bad things really start to happen. I think CAPI is trying to reconcile the MachineSet for the cluster and sees that, because it already deleted the only machine in the set, it's running 0 machines when it should be running 1. So it tries to create a new machine even though other controllers are trying to delete the cluster.
18:18:36.650528 1 machineset_controller.go:456] "MachineSet is scaling up to 1 replicas by creating 1 machines" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="fleet-default/bruno1-pool1-598fbdc96dxcl6q6" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxcl6q6" reconcileID=79055edb-235b-457f-ba5d-34657880a4db MachineDeployment="fleet-default/bruno1-pool1" Cluster="fleet-default/bruno1" replicas=1 machineCount=0
- The machine controller is waiting for the DigitaloceanMachine corresponding to the machine it deleted earlier to be deleted.
18:18:36.668726 1 machine_controller.go:413] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=468785aa-80f4-4c88-8428-32b1f98eb133 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" DigitaloceanMachine="fleet-default/bruno1-pool1-af648dc8-gpjrk"
- Same error as before.
18:18:36.721011 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=92761182-3aa3-4f34-abed-c3a54c198b24 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" Node="bruno1-pool1-af648dc8-gpjrk" cause="cluster is being deleted"
- The MachineSet controller just noticed that the new machine has been created successfully. By my logic, this machine (
bruno1-pool1-598fbdc96dxcl6q6-hkbtj
) should not have been created.
18:18:36.742123 1 machineset_controller.go:548] "Created machine 1 of 1" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="fleet-default/bruno1-pool1-598fbdc96dxcl6q6" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxcl6q6" reconcileID=79055edb-235b-457f-ba5d-34657880a4db MachineDeployment="fleet-default/bruno1-pool1" Cluster="fleet-default/bruno1" RKEBootstrap="fleet-default/bruno1-bootstrap-template-stfgx" DigitaloceanMachine="fleet-default/bruno1-pool1-af648dc8-k6jjd" Machine="fleet-default/bruno1-pool1-598fbdc96dxcl6q6-hkbtj"
- Same error as before.
18:18:36.810935 1 machine_controller.go:413] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=92761182-3aa3-4f34-abed-c3a54c198b24 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" DigitaloceanMachine="fleet-default/bruno1-pool1-af648dc8-gpjrk"
18:18:36.831737 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=34dae327-e65b-4fd9-bb98-a6e9f4fb7917 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" Node="bruno1-pool1-af648dc8-gpjrk" cause="cluster is being deleted"
- The machine controller is erroring because the newly-created machine is not associated with any infra machine (
DigitaloceanMachine
). See https://github.dev/kubernetes-sigs/cluster-api/blob/00dbf7b9f6322d7ebd06ae2efa703b23354dd37d/internal/controllers/machine/machine_controller_phases.go#L101.
18:18:36.861993 1 machine_controller_phases.go:101] "could not find external ref, requeuing" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxcl6q6-hkbtj" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxcl6q6-hkbtj" reconcileID=89032bfa-b1c3-4551-a83d-4434d4fab5c6 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxcl6q6" Cluster="fleet-default/bruno1" DigitaloceanMachine="fleet-default/bruno1-pool1-af648dc8-k6jjd"
18:18:36.863625 1 controller.go:329] "Reconciler error" err="[MachineSet.cluster.x-k8s.io \"bruno1-pool1-598fbdc96dxcl6q6\" not found, machinesets.cluster.x-k8s.io \"bruno1-pool1-598fbdc96dxcl6q6\" not found]" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="fleet-default/bruno1-pool1-598fbdc96dxcl6q6" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxcl6q6" reconcileID=79055edb-235b-457f-ba5d-34657880a4db
- CAPI is still waiting for the infra machine (
DigitaloceanMachine
) for the original CAPI machine to be deleted.
18:18:36.897830 1 machine_controller.go:413] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=34dae327-e65b-4fd9-bb98-a6e9f4fb7917 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" DigitaloceanMachine="fleet-default/bruno1-pool1-af648dc8-gpjrk"
- Same error as before.
18:18:36.899317 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxcl6q6-hkbtj" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxcl6q6-hkbtj" reconcileID=33332c51-5ff8-4727-9a0c-c0e66d8d409c MachineSet="fleet-default/bruno1-pool1-598fbdc96dxcl6q6" Cluster="fleet-default/bruno1" Node="" cause="cluster is being deleted"
18:18:36.921383 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=841e8ca5-f316-4cff-aed6-bfeebf23bdc3 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" Node="bruno1-pool1-af648dc8-gpjrk" cause="cluster is being deleted"
18:18:36.946397 1 machine_controller.go:413] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=841e8ca5-f316-4cff-aed6-bfeebf23bdc3 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" DigitaloceanMachine="fleet-default/bruno1-pool1-af648dc8-gpjrk"
18:18:36.980253 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=faa124f0-86d1-4465-b2c5-8992307a87f5 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" Node="bruno1-pool1-af648dc8-gpjrk" cause="cluster is being deleted"
- Still waiting.
18:18:37.023953 1 machine_controller.go:413] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=faa124f0-86d1-4465-b2c5-8992307a87f5 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" DigitaloceanMachine="fleet-default/bruno1-pool1-af648dc8-gpjrk"
- Same reconciler error.
18:18:37.033105 1 controller.go:329] "Reconciler error" err="machines.cluster.x-k8s.io \"bruno1-pool1-598fbdc96dxcl6q6-hkbtj\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxcl6q6-hkbtj" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxcl6q6-hkbtj" reconcileID=33332c51-5ff8-4727-9a0c-c0e66d8d409c
18:18:37.759150 1 controller.go:329] "Reconciler error" err="machinesets.cluster.x-k8s.io \"bruno1-pool1-598fbdc96dxb6jwb\" not found" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb" reconcileID=507bde1a-7b8e-40c5-8264-00bdb4c42905
18:18:41.524210 1 cluster_controller.go:298] "Cluster still has descendants - need to requeue" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="fleet-default/bruno1" namespace="fleet-default" name="bruno1" reconcileID=077874dd-e00a-47f2-b8d8-2f0e58162aca controlPlaneRef="bruno1"
18:18:51.926860 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/bruno1-pool1-598fbdc96dxb6jwb-v94d2" namespace="fleet-default" name="bruno1-pool1-598fbdc96dxb6jwb-v94d2" reconcileID=c2c364d1-0775-4f7a-b079-1ce4060ecb17 MachineSet="fleet-default/bruno1-pool1-598fbdc96dxb6jwb" Cluster="fleet-default/bruno1" Node="bruno1-pool1-af648dc8-gpjrk" cause="cluster is being deleted"
Label(s) to be applied
/kind bug
/area machineset
Activity