Auto shutdown idle clusters #672
Conversation
I wonder if another API here could be:

```yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: foo
spec:
  idleTimeout: 60
  idleAction: "terminate"  # Or could be "scale" to just scale worker groups to zero
  scheduler:
    ...
  worker:
    ...
```

```python
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(name="foo", n_workers=1, idle_timeout=60, idle_action="scale")
```
On second thoughts
I found a bug in distributed which is breaking this. Raised dask/distributed#7781 to resolve.
Tested this PR again today now that dask/distributed#7781 has been merged and can confirm that it works as expected.

```python
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(
    name="idle",
    n_workers=10,
    idle_timeout=120,
    env={"EXTRA_PIP_PACKAGES": "git+https://github.com/dask/distributed.git"},
)
# Cluster is deleted automatically after 2 mins
```

This PR will have to wait until the next Dask release but then we should be good to go.
Hi, may I ask if there's an ETA for merging and releasing this change? We were so happy to find that you were working on this and now we're really looking forward to this feature 😊 Thank you so much for your great work!
The release we were waiting for happened last Friday; I just haven't had time to circle back here. Probably will do next week.
Hello @jacobtomlinson, we are also very excited about this feature. :)
Thanks @HynekBlaha but we were just waiting on an upstream release to fix a bug. I think we should be good, so I've triggered CI here again; hopefully it will pass now.
Amazing :) 🤞
@jacobtomlinson Hi 👋, can we get an estimate of when it will be done?
Hello, thanks for your work on this feature. We're looking forward to it. I was trying to test it. This is my setup:
Findings:
This is probably because we are trying to get the scheduler address before the Dask cluster is started. It is not a big issue, but it'd be nice to show it to the user because it can be misleading.
So it seems that the HTTP API method fails but the fallback method using the Dask RPC succeeds. Nonetheless, the cluster is never shut down.
Thanks @Artimi.
It would be nice to clean that up. The controller will automatically retry but it's untidy and we should fix that.
The HTTP API is not enabled by default, so falling back to the RPC API is expected.
The scheduler has two ports; it looks like you've forwarded the TCP comm port instead of the web server.
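For anyone reproducing this locally, a rough sketch of the difference between the two ports, assuming the scheduler's default ports (8786 for the TCP comm, 8787 for the web server) have both been forwarded to localhost:

```python
# Sketch only: assumes both default scheduler ports are forwarded to localhost,
# e.g. via `kubectl port-forward`: 8786 (Dask TCP comm) and 8787 (web server).
import requests
from distributed import Client

# Port 8786 speaks the Dask comm protocol, not HTTP.
client = Client("tcp://localhost:8786")
print(client.scheduler_info()["address"])

# Port 8787 serves the dashboard, and the HTTP API when it is enabled.
print(requests.get("http://localhost:8787/status").status_code)  # expect 200
```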
Ok, it looks like I've managed to correct the CI problems that were blocking this PR. Tests are passing, which gives me confidence we can merge this. I'll give this another look over at the start of next week and check that it is working as expected, and then hopefully hit the green button.
I've run through and tested this locally again. Things seem to be working as expected and I'm seeing clusters get cleaned up. I made a couple of tweaks to the logging to reduce noise, but I think this should be good to merge on passing CI.
Adds an optional `idleTimeout` config option to the `DaskCluster` spec which instructs the controller to delete idle clusters after a certain timeout. Exposed via the Python API as the `idle_timeout` kwarg.

In the above examples, if the scheduler is idle for more than 60 seconds the `DaskCluster` resource will be deleted automatically by the controller. This is regardless of whether the `KubeCluster` object still exists in Python.

~~There is a challenge with this implementation which means I don't want to merge it in its current state~~ Solved with dask/distributed#7642

Closes #667
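For reference, a minimal sketch of the Python-side usage described above; the cluster name, worker count, and timeout value are illustrative only:

```python
from dask_kubernetes.operator import KubeCluster

# idle_timeout is given in seconds. Once the scheduler has been idle for longer
# than this, the controller deletes the DaskCluster resource, regardless of
# whether this KubeCluster object still exists in Python.
cluster = KubeCluster(name="example", n_workers=2, idle_timeout=60)
```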