Operator unable to delete Kubernetes Deployment 

With the default settings from the docs, the Operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. It uses the worker Pod name as the Deployment name, and no Deployment with that name exists.

**Reproducing steps:**
1. Install the Operator with `helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator`, i.e. [this quick start step](https://kubernetes.dask.org/en/latest/index.html#quickstart).
2. Create the cluster using the default yaml available from [this guide](https://kubernetes.dask.org/en/latest/operator_resources.html#daskcluster) as is. At this stage, two workers are available, one from each of two deployments.
3. Create an autoscaler manifest with the minimum number of workers set to 0:

```
# autoscaler.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: simple
spec:
  cluster: "simple"
  minimum: 0  # we recommend always having a minimum of 1 worker so that an idle cluster can start working on tasks immediately
  maximum: 10 # you can place a hard limit on the number of workers regardless of what the scheduler requests
```
4. Apply the autoscaler settings:
```
kubectl apply -f autoscaler.yaml
daskautoscaler.kubernetes.dask.org/simple created
```
5. At this stage, the operator already tries to remove a deployment, but it attempts to delete a Deployment resource named after the Pod, which doesn't exist:
```
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Autoscaler updated simple worker count from 2 to 1
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-14 09:22:42,662] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/kubernetes.dask.org/v1/namespaces/default/daskclusters?fieldSelector=metadata.name%3Dsimple "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,668] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/apps/v1/namespaces/default/deployments?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,673] kopf.objects         [INFO    ] [default/simple-default] Scaled worker group simple-default up to 1 workers.
[2024-10-14 09:22:42,677] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,687] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,693] kopf.objects         [WARNING ] [default/simple-default] Scaling simple-default failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html.
[2024-10-14 09:22:42,697] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/pods?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,701] kopf.objects         [INFO    ] [default/simple-default] Workers to close: ['simple-default-worker-057ae426b6-79bcbdb84b-vlcn7']
[2024-10-14 09:22:42,705] httpx                [INFO    ] HTTP Request: DELETE https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 "HTTP/1.1 404 Not Found"
[2024-10-14 09:22:42,705] kopf.objects         [ERROR   ] [default/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 336, in delete
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: deployments.apps "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7" not found
``` 

If I check the pods, the name `simple-default-worker-057ae426b6-79bcbdb84b-vlcn7` that the operator used for the Deployment deletion does exist, but it belongs to a worker pod:
```
kubectl get pods -l dask.org/cluster-name=simple
NAME                                                READY   STATUS    RESTARTS   AGE
simple-default-worker-057ae426b6-79bcbdb84b-vlcn7   1/1     Running   0          9m36s
simple-default-worker-54afdedac5-6bdb8f746b-7lzsg   1/1     Running   0          9m36s
simple-scheduler-78db7fbfd8-zmwgr                   1/1     Running   0          9m36s
```

However, the deployment that controls this pod has a different name:
```
kubectl get deployments -l dask.org/cluster-name=simple
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
simple-default-worker-057ae426b6   1/1     1            1           15m
simple-default-worker-54afdedac5   1/1     1            1           15m
simple-scheduler                   1/1     1            1           15m
```

As you can see, the deployment that controls that worker pod is actually named `simple-default-worker-057ae426b6` instead of `simple-default-worker-057ae426b6-79bcbdb84b-vlcn7`. As a result, the operator is unable to delete the deployments and the workers are never removed from the namespace. It could be coming from [this line](https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L709), where the deletion appears to use the worker name as the expected Deployment name.
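To illustrate the mismatch, here is a minimal sketch of how the Deployment name could be derived from the worker Pod name. It is not the operator's code: `deployment_name_from_pod_name` is a hypothetical helper, and it assumes the Pod was created by a Deployment, so that its name has the form `<deployment>-<replicaset-hash>-<pod-suffix>`, as in the `kubectl` output above.

```python
def deployment_name_from_pod_name(pod_name: str) -> str:
    """Hypothetical helper: drop the ReplicaSet hash and the Pod suffix.

    Assumes the name looks like <deployment>-<replicaset-hash>-<pod-suffix>,
    which is what the kubectl output in this issue shows.
    """
    # Split at most twice from the right so hyphens inside the
    # Deployment name itself are preserved.
    parts = pod_name.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected pod name format: {pod_name!r}")
    return parts[0]


pod = "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7"
print(deployment_name_from_pod_name(pod))
# prints: simple-default-worker-057ae426b6
```

Parsing names like this is brittle, so it is only meant to show which name the DELETE request would need. Following `metadata.ownerReferences` from the Pod to its ReplicaSet and then to the Deployment would likely be a more robust way to find the owning Deployment.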

**Anything else we need to know?**:
This may be related to #855

**Environment**:
- Dask version: 2024.9.1
- Python version: 3.11
- Operating System: Mac/Linux
- Install method (conda, pip, source): pip

