Description
With the default settings from the docs, the Operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. The name it uses is the worker Pod's name, and no Deployment with that name exists.
Reproducing steps:
- Install the Operator as described in this quick start step:
helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator
- Create the cluster using the default YAML from this guide, as is. At this stage, two workers are available, backed by two Deployments.
- Define an autoscaler with the minimum number of workers set to 0:
# autoscaler.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: simple
spec:
  cluster: "simple"
  minimum: 0  # we recommend always having a minimum of 1 worker so that an idle cluster can start working on tasks immediately
  maximum: 10  # you can place a hard limit on the number of workers regardless of what the scheduler requests
- Apply these autoscaler settings:
kubectl apply -f autoscaler.yaml
daskautoscaler.kubernetes.dask.org/simple created
- At this stage, the operator already tries to scale down, but it attempts to delete a Deployment resource whose name matches the Pod name, and no such Deployment exists:
[2024-10-14 09:22:42,559] kopf.objects [INFO ] [default/simple] Autoscaler updated simple worker count from 2 to 1
[2024-10-14 09:22:42,559] kopf.objects [INFO ] [default/simple] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-14 09:22:42,662] httpx [INFO ] HTTP Request: GET https://10.96.0.1/apis/kubernetes.dask.org/v1/namespaces/default/daskclusters?fieldSelector=metadata.name%3Dsimple "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,668] httpx [INFO ] HTTP Request: GET https://10.96.0.1/apis/apps/v1/namespaces/default/deployments?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,673] kopf.objects [INFO ] [default/simple-default] Scaled worker group simple-default up to 1 workers.
[2024-10-14 09:22:42,677] httpx [INFO ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,687] httpx [INFO ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,693] kopf.objects [WARNING ] [default/simple-default] Scaling simple-default failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html.
[2024-10-14 09:22:42,697] httpx [INFO ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/pods?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,701] kopf.objects [INFO ] [default/simple-default] Workers to close: ['simple-default-worker-057ae426b6-79bcbdb84b-vlcn7']
[2024-10-14 09:22:42,705] httpx [INFO ] HTTP Request: DELETE https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 "HTTP/1.1 404 Not Found"
[2024-10-14 09:22:42,705] kopf.objects [ERROR ] [default/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
response.raise_for_status()
File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 336, in delete
async with self.api.call_api(
File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
raise ServerError(
kr8s._exceptions.ServerError: deployments.apps "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7" not found
If I check the pods, the name the operator used for the Deployment it tried to delete, simple-default-worker-057ae426b6-79bcbdb84b-vlcn7, does exist, but as a worker pod:
kubectl get pods -l dask.org/cluster-name=simple
NAME READY STATUS RESTARTS AGE
simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 1/1 Running 0 9m36s
simple-default-worker-54afdedac5-6bdb8f746b-7lzsg 1/1 Running 0 9m36s
simple-scheduler-78db7fbfd8-zmwgr 1/1 Running 0 9m36s
However, the deployment that controls this pod has a different name:
kubectl get deployments -l dask.org/cluster-name=simple
NAME READY UP-TO-DATE AVAILABLE AGE
simple-default-worker-057ae426b6 1/1 1 1 15m
simple-default-worker-54afdedac5 1/1 1 1 15m
simple-scheduler 1/1 1 1 15m
As you can see, the deployment that controls that worker pod is actually named simple-default-worker-057ae426b6 instead of simple-default-worker-057ae426b6-79bcbdb84b-vlcn7. As a result, the operator is unable to delete the deployments, and the workers are never removed from the namespace. It could be coming from this line, where the deletion uses the worker name as the expected Deployment name.
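To illustrate the mismatch, here is a minimal sketch of how the Deployment name could be recovered from the Pod name. It is not the operator's code: the helper deployment_name_from_pod is hypothetical. It assumes the usual Deployment -> ReplicaSet -> Pod naming, in which a Pod name is the Deployment name followed by a pod-template hash and a random suffix, as in the listings above. Following the Pod's ownerReferences would probably be more robust than parsing names.

```python
def deployment_name_from_pod(pod_name: str) -> str:
    """Hypothetical helper: strip the ReplicaSet hash and the Pod suffix.

    Assumes the Pod name looks like
    <deployment-name>-<pod-template-hash>-<random-suffix>.
    """
    parts = pod_name.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected pod name format: {pod_name!r}")
    # parts[0] is everything before the last two dash-separated segments
    return parts[0]


# Pod name taken from the "Workers to close" log line above
pod = "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7"
print(deployment_name_from_pod(pod))
```

With the names from this report, the helper maps the Pod the operator selected to the Deployment shown by kubectl get deployments, which is the resource the DELETE request would need to target.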
Anything else we need to know?:
This may be related to #855
Environment:
- Dask version: 2024.9.1
- Python version: 3.11
- Operating System: Mac/Linux
- Install method (conda, pip, source): pip