Operator unable to delete Kubernetes Deployment  #910

@thaisarcanjo-ow

There is an issue with the default settings from the docs: the Operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. The name it uses is the worker Pod's name, and no Deployment with that name exists.

Reproducing steps:

  1. Install the Operator with helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator, i.e. this quick start step.
  2. Create the cluster using the default yaml from this guide, unmodified. At this stage, two workers are available, one per Deployment.
  3. Create an autoscaler manifest with the minimum number of workers set to 0:
# autoscaler.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: simple
spec:
  cluster: "simple"
  minimum: 0  # we recommend always having a minimum of 1 worker so that an idle cluster can start working on tasks immediately
  maximum: 10 # you can place a hard limit on the number of workers regardless of what the scheduler requests
  4. Apply the autoscaler settings:
kubectl apply -f autoscaler.yaml
daskautoscaler.kubernetes.dask.org/simple created
  5. At this stage, the operator already tries to scale down, but it attempts to delete a Deployment resource named after the Pod, and no such Deployment exists:
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Autoscaler updated simple worker count from 2 to 1
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-14 09:22:42,662] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/kubernetes.dask.org/v1/namespaces/default/daskclusters?fieldSelector=metadata.name%3Dsimple "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,668] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/apps/v1/namespaces/default/deployments?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,673] kopf.objects         [INFO    ] [default/simple-default] Scaled worker group simple-default up to 1 workers.
[2024-10-14 09:22:42,677] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,687] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,693] kopf.objects         [WARNING ] [default/simple-default] Scaling simple-default failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html.
[2024-10-14 09:22:42,697] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/pods?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,701] kopf.objects         [INFO    ] [default/simple-default] Workers to close: ['simple-default-worker-057ae426b6-79bcbdb84b-vlcn7']
[2024-10-14 09:22:42,705] httpx                [INFO    ] HTTP Request: DELETE https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 "HTTP/1.1 404 Not Found"
[2024-10-14 09:22:42,705] kopf.objects         [ERROR   ] [default/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 336, in delete
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: deployments.apps "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7" not found

If I check the pods, the name simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 that the operator used for the Deployment deletion does exist, but as a worker Pod:

kubectl get pods -l dask.org/cluster-name=simple
NAME                                                READY   STATUS    RESTARTS   AGE
simple-default-worker-057ae426b6-79bcbdb84b-vlcn7   1/1     Running   0          9m36s
simple-default-worker-54afdedac5-6bdb8f746b-7lzsg   1/1     Running   0          9m36s
simple-scheduler-78db7fbfd8-zmwgr                   1/1     Running   0          9m36s

However, the Deployment that controls this Pod has a different name:

kubectl get deployments -l dask.org/cluster-name=simple
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
simple-default-worker-057ae426b6   1/1     1            1           15m
simple-default-worker-54afdedac5   1/1     1            1           15m
simple-scheduler                   1/1     1            1           15m
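
The relationship between the two listings above (Pod, then Deployment) can also be resolved from metadata.ownerReferences instead of from names. Below is a minimal Python sketch that uses plain dictionaries in place of API objects. The manifests are trimmed, illustrative stand-ins (the ReplicaSet name is inferred from the Pod name, not taken from real output), and owner_name and owning_deployment are hypothetical helpers, not part of dask-kubernetes or kr8s.

```python
from typing import Optional


def owner_name(manifest: dict, kind: str) -> Optional[str]:
    """Return the name of the first owner of the given kind, if any."""
    for ref in manifest.get("metadata", {}).get("ownerReferences", []):
        if ref.get("kind") == kind:
            return ref.get("name")
    return None


def owning_deployment(pod: dict, replicasets: dict) -> Optional[str]:
    """Walk Pod -> ReplicaSet -> Deployment using ownerReferences."""
    rs_name = owner_name(pod, "ReplicaSet")
    if rs_name is None or rs_name not in replicasets:
        return None
    return owner_name(replicasets[rs_name], "Deployment")


# Illustrative manifests only: trimmed to the fields the lookup needs.
pod = {
    "metadata": {
        "name": "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7",
        "ownerReferences": [
            {"kind": "ReplicaSet",
             "name": "simple-default-worker-057ae426b6-79bcbdb84b"},
        ],
    }
}
replicasets = {
    "simple-default-worker-057ae426b6-79bcbdb84b": {
        "metadata": {
            "ownerReferences": [
                {"kind": "Deployment",
                 "name": "simple-default-worker-057ae426b6"},
            ],
        }
    }
}

print(owning_deployment(pod, replicasets))
```

With these stand-in manifests the lookup yields simple-default-worker-057ae426b6, the Deployment name shown by kubectl.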

As you can see, the Deployment that controls that worker Pod is actually named simple-default-worker-057ae426b6, not simple-default-worker-057ae426b6-79bcbdb84b-vlcn7. As a result, the operator is unable to delete the Deployments and the workers are never removed from the namespace. It could be coming from this line here, where the deletion uses the worker name as the expected Deployment name.
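
To illustrate the mismatch, here is a minimal Python sketch that derives the Deployment name from a worker Pod name. It assumes the Pod was created through a Deployment's ReplicaSet, so its name has the form <deployment>-<pod-template-hash>-<random-suffix>. deployment_name_from_pod is a hypothetical helper for illustration only and is not the operator's code or a proposed patch.

```python
def deployment_name_from_pod(pod_name: str) -> str:
    """Drop the ReplicaSet hash and the Pod suffix from a Pod name.

    Assumes the naming pattern <deployment>-<pod-template-hash>-<suffix>,
    which holds for Pods managed by a Deployment.
    """
    parts = pod_name.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected Pod name: {pod_name!r}")
    return parts[0]


# Pod names taken from the kubectl output in this report.
for pod_name in (
    "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7",
    "simple-default-worker-54afdedac5-6bdb8f746b-7lzsg",
):
    print(pod_name, "->", deployment_name_from_pod(pod_name))
```

Parsing names is only a heuristic. Reading the Pod's ownerReferences, or having the scale-down logic work with Deployment names from the start, would likely be more robust.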

Anything else we need to know?:
This may be related to #855.

Environment:

  • Dask version: 2024.9.1
  • Python version: 3.11
  • Operating System: Mac/Linux
  • Install method (conda, pip, source): pip
