
Conversation

@jacobtomlinson (Member) commented Mar 10, 2023:

Adds an optional idleTimeout config option to the DaskCluster spec, which instructs the controller to delete clusters that have been idle for longer than the timeout (in seconds).

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: foo
spec:
  idleTimeout: 60
  scheduler:
    ...
  worker:
    ...

Exposed via the Python API as the idle_timeout kwarg.

from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="foo", n_workers=1, idle_timeout=60)

In the above examples, if the scheduler is idle for more than 60 seconds, the DaskCluster resource will be deleted automatically by the controller, regardless of whether the KubeCluster object still exists in Python.
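
For readers curious how such a shutdown could be driven from the controller side, here is a minimal sketch, not the code in this PR, of a kopf-based timer. The handler and helper names mirror those that show up in a traceback later in this thread; the resource selector, the in-cluster service URL, the HTTP response shape, and the delete call are illustrative assumptions.

import time

import aiohttp
import kopf
from kubernetes_asyncio import client, config


async def check_scheduler_idle(name: str, namespace: str) -> float:
    # Ask the scheduler when it became idle (0.0 meaning "not idle").
    # Assumes the scheduler HTTP API is enabled and reachable on the dashboard
    # service port; the real controller also falls back to the Dask RPC.
    url = f"http://{name}-scheduler.{namespace}:8787/api/v1/check_idle"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.json()  # response shape assumed for illustration
            return float(data.get("idle_since") or 0.0)


# Resource selector based on the apiVersion/kind shown above; illustrative only.
@kopf.timer("daskcluster.kubernetes.dask.org", interval=5.0)
async def daskcluster_autoshutdown(spec, name, namespace, logger, **kwargs):
    idle_timeout = spec.get("idleTimeout")
    if not idle_timeout:
        return  # opt-in feature: do nothing unless idleTimeout is set

    idle_since = await check_scheduler_idle(name, namespace)
    if idle_since and time.time() > idle_since + idle_timeout:
        logger.info("Cluster %s idle for over %ss, deleting", name, idle_timeout)
        config.load_incluster_config()
        api = client.CustomObjectsApi()
        await api.delete_namespaced_custom_object(
            group="kubernetes.dask.org",
            version="v1",
            namespace=namespace,
            plural="daskclusters",  # CRD plural assumed
            name=name,
        )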

There was a challenge with this implementation which meant I didn't want to merge it in its current state; this has since been solved with dask/distributed#7642.

Closes #667

@jacobtomlinson (Member Author) commented:

I wonder if another API here could be:

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: foo
spec:
  idleTimeout: 60
  idleAction: "terminate"  # Or could be "scale" to just scale worker groups to zero
  scheduler:
    ...
  worker:
    ...

from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="foo", n_workers=1, idle_timeout=60, idle_action="scale")

@jacobtomlinson (Member Author) commented Mar 13, 2023:

On second thoughts, idle_action="scale" is pretty similar to adaptive scaling, so it's probably not the best road to go down. Renaming the spec item to idleTimeout still makes sense, though.

@jacobtomlinson (Member Author) commented:

I found a bug in distributed which is breaking this. Raised dask/distributed#7781 to resolve.

@jacobtomlinson (Member Author) commented:

Tested this PR again today now that dask/distributed#7781 has been merged and can confirm that it works as expected.

from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="idle", n_workers=10, idle_timeout=120, env={"EXTRA_PIP_PACKAGES": "git+https://github.com/dask/distributed.git"})

# Cluster is deleted automatically after 2 mins

This PR will have to wait until the next Dask release but then we should be good to go.
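
As a side note, one way to confirm the cleanup from a test script is sketched below, assuming a configured kubeconfig, the default namespace, and that the CRD plural is daskclusters:

import time

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Poll for the DaskCluster custom resource and report once the controller has
# deleted it. "idle" matches the cluster name used in the example above.
while True:
    clusters = api.list_namespaced_custom_object(
        group="kubernetes.dask.org",
        version="v1",
        namespace="default",
        plural="daskclusters",
    )
    names = [c["metadata"]["name"] for c in clusters["items"]]
    if "idle" not in names:
        print("DaskCluster 'idle' was cleaned up by the controller")
        break
    time.sleep(10)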

@vitawasalreadytaken commented:

Hi, may I ask if there's an ETA for merging and releasing this change? We were so happy to find that you were working on this and now we're really looking forward to this feature 😊 Thank you so much for your great work!

@jacobtomlinson (Member Author) commented:

The release we were waiting for happened last Friday; I just haven't had time to circle back here. I'll probably do that next week.

@HynekBlaha commented May 15, 2023:

Hello @jacobtomlinson, we are also very excited about this feature. :)
Would you mind sharing what the missing pieces are? From an external watcher perspective it looks solid. :)
Is there any meaningful work I could help you with on this feature?
Thanks!

@jacobtomlinson (Member Author) commented:

Thanks @HynekBlaha, but we were just waiting on an upstream release to fix a bug. I think we should be good now, so I've triggered CI here again; hopefully it will pass.

@HynekBlaha commented:

Amazing :) 🤞

@preneond commented May 30, 2023:

Hi @jacobtomlinson 👋, can we get an estimate of when this will be done?

@Artimi (Contributor) commented Jun 1, 2023:

Hello, thanks for your work on this feature. We're looking forward to it.

I was trying to test it. This is my setup:

  • Built the dask-kubernetes operator image from this branch and deployed it to Kubernetes
  • Using dask and distributed version 2023.3.2
  • Set idle_timeout to 10 both in the KubeCluster init and in the cluster_spec that KubeCluster takes as an argument
  • Deployed a Dask cluster which works for about a minute and then stops

Findings:

  1. When I start the cluster I can see this error in the dask operator log:
[2023-06-01 07:09:45,811] kopf.objects         [ERROR   ] [ana-ixian/ixian-petrsebek-36-violent-otheyms] Timer 'daskcluster_autoshutdown' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/operator/controller/controller.py", line 936, in daskcluster_autoshutdown
    idle_since = await check_scheduler_idle(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/operator/controller/controller.py", line 433, in check_scheduler_idle
    dashboard_address = await get_scheduler_address(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/common/networking.py", line 177, in get_scheduler_address
    service = await api.read_namespaced_service(service_name, namespace)
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
    raise e
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
    return (await self.request("GET", url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 187, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: <CIMultiDictProxy('Audit-Id': 'e926835b-5361-4f25-b05a-6fedae1d547e', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'a3ef744b-c3e6-4bf2-9c05-b8decfa0b75a', 'X-Kubernetes-Pf-Prioritylevel-Uid': '2d88b8ba-ee58-4770-a7b4-2342c139b0cc', 'Date': 'Thu, 01 Jun 2023 07:09:45 GMT', 'Content-Length': '264')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services \"ixian-petrsebek-36-violent-otheyms-scheduler\" not found","reason":"NotFound","details":{"name":"ixian-petrsebek-36-violent-otheyms-scheduler","kind":"services"},"code":404}

This is probably because we are trying to get the scheduler address before the dask cluster has started. It is not a big issue, but it would be nice to surface it to the user differently because it can be misleading.

  2. After the dask cluster is running I can see that daskcluster_autoshutdown is called repeatedly. However, the dask cluster is not stopped after it finishes its work and no new jobs are assigned. I can see this log:
[2023-06-01 07:18:23,624] kopf.objects         [INFO    ] [ana-ixian/ixian-petrsebek-36-violent-otheyms] Checking ixian-petrsebek-36-violent-otheyms-scheduler idleness failed via the HTTP API, falling back to the Dask RPC
[2023-06-01 07:18:23,636] kopf.objects         [INFO    ] [ana-ixian/ixian-petrsebek-36-violent-otheyms] Timer 'daskcluster_autoshutdown' succeeded.

So it seems that the HTTP API method fails, but the fallback method using Dask RPC is successful. Nonetheless, the cluster is never shut down.

  3. When I port-forward port 8786 and try the HTTP API request, it returns a pickle payload instead of some JSON with the idle time as I'd expect. Am I doing it right?
$ curl --http0.9 http://localhost:8786/api/v1/check_idle --output check_idle
$ xxd check_idle
00000000: 3a00 0000 0000 0000 0100 0000 0000 0000  :...............
00000010: 2a00 0000 0000 0000 83ab 636f 6d70 7265  *.........compre
00000020: 7373 696f 6ec0 a670 7974 686f 6e93 030b  ssion..python...
00000030: 03af 7069 636b 6c65 2d70 726f 746f 636f  ..pickle-protoco
00000040: 6c05                                     l.

@jacobtomlinson (Member Author) commented:

Thanks @Artimi.

> When I start the cluster I can see this error in the dask operator log

It would be nice to clean that up. The controller will automatically retry but it's untidy and we should fix that.
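
One possible way to quieten this, sketched here rather than taken from the actual fix, is for the idle check to treat a missing scheduler Service as "not idle yet" instead of letting the 404 bubble up to kopf. check_scheduler_idle here refers to the idle-check helper sketched earlier in this thread, not to the controller's real function.

from kubernetes_asyncio.client.exceptions import ApiException


async def check_scheduler_idle_safe(name: str, namespace: str) -> float:
    try:
        return await check_scheduler_idle(name, namespace)
    except ApiException as e:
        if e.status == 404:
            # The scheduler Service has not been created yet; report "never
            # idle" and let the timer retry on its next tick rather than
            # logging a retryable error.
            return 0.0
        raise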

> but the fallback method using Dask RPC is successful

The HTTP API is not enabled by default, so falling back to the RPC API is expected.
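
For completeness, here is a sketch of what an RPC-based check can look like from outside the cluster, assuming the scheduler comm port 8786 is port-forwarded and a distributed version new enough to expose check_idle over the RPC (the upstream change referenced earlier in this thread):

import asyncio

from distributed.core import rpc


async def main():
    # Query the scheduler over its TCP comm rather than the HTTP API.
    async with rpc("tcp://localhost:8786") as scheduler_comm:
        idle_since = await scheduler_comm.check_idle()
        print("Scheduler idle since:", idle_since)


asyncio.run(main())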

> When I port-forward port 8786 and try the HTTP API request, it returns a pickle payload instead of some JSON with the idle time as I'd expect. Am I doing it right?

The scheduler has two ports; it looks like you've forwarded the TCP comm instead of the web server.
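
And for the HTTP route, a sketch assuming you forward the scheduler's web server port instead (8787 by default) and have the scheduler HTTP API routes enabled, which, as noted above, they are not by default:

import requests

# Query the idle-check endpoint on the dashboard/web server port rather than
# the TCP comm port (8786), which speaks the Dask protocol, not HTTP.
resp = requests.get("http://localhost:8787/api/v1/check_idle", timeout=5)
print(resp.status_code)
print(resp.text)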

@jacobtomlinson (Member Author) commented:

OK, it looks like I've managed to correct the CI problems that were blocking this PR. Tests are passing, which gives me confidence we can merge this. I'll give it another look over at the start of next week, check that it is working as expected, and then hopefully hit the green button.

@jacobtomlinson (Member Author) commented:

I've run through and tested this locally again. Things seem to be working as expected and I'm seeing clusters get cleaned up.

I made a couple of tweaks to the logging to reduce noise, but I think this should be good to merge on passing CI.

@jacobtomlinson marked this pull request as ready for review June 12, 2023 11:08
@jacobtomlinson changed the title from "POC Auto shutdown idle clusters" to "Auto shutdown idle clusters" on Jun 12, 2023
@jacobtomlinson merged commit 68d6c96 into dask:main on Jun 12, 2023
@jacobtomlinson deleted the autocleanup branch June 12, 2023 11:24