
Conversation

@jacobtomlinson (Member) commented Mar 10, 2023:

Adds an optional idleTimeout config option to the DaskCluster spec, which instructs the controller to delete clusters that have been idle for longer than the timeout (in seconds).

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: foo
spec:
  idleTimeout: 60
  scheduler:
    ...
  worker:
    ...

Exposed via the Python API as the idle_timeout kwarg.

from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="foo", n_workers=1, idle_timeout=60)

In the above examples, if the scheduler is idle for more than 60 seconds, the DaskCluster resource will be deleted automatically by the controller, regardless of whether the KubeCluster object still exists in Python.
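
For readers curious how such a shutdown could be driven from the controller side, here is a minimal sketch, not the code in this PR, of a kopf-based timer. The handler and helper names mirror those that show up in a traceback later in this thread; the resource selector, the in-cluster service URL, the HTTP response shape, and the delete call are illustrative assumptions.

import time

import aiohttp
import kopf
from kubernetes_asyncio import client, config


async def check_scheduler_idle(name: str, namespace: str) -> float:
    # Ask the scheduler when it became idle (0.0 meaning "not idle").
    # Assumes the scheduler HTTP API is enabled and reachable on the dashboard
    # service port; the real controller also falls back to the Dask RPC.
    url = f"http://{name}-scheduler.{namespace}:8787/api/v1/check_idle"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.json()  # response shape assumed for illustration
            return float(data.get("idle_since") or 0.0)


# Resource selector based on the apiVersion/kind shown above; illustrative only.
@kopf.timer("daskcluster.kubernetes.dask.org", interval=5.0)
async def daskcluster_autoshutdown(spec, name, namespace, logger, **kwargs):
    idle_timeout = spec.get("idleTimeout")
    if not idle_timeout:
        return  # opt-in feature: do nothing unless idleTimeout is set

    idle_since = await check_scheduler_idle(name, namespace)
    if idle_since and time.time() > idle_since + idle_timeout:
        logger.info("Cluster %s idle for over %ss, deleting", name, idle_timeout)
        config.load_incluster_config()
        api = client.CustomObjectsApi()
        await api.delete_namespaced_custom_object(
            group="kubernetes.dask.org",
            version="v1",
            namespace=namespace,
            plural="daskclusters",  # CRD plural assumed
            name=name,
        )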

There was a challenge with this implementation which meant I didn't want to merge it in its current state; this has since been solved with dask/distributed#7642.

Closes #667

@jacobtomlinson (Member Author) commented:

I wonder if another API here could be:

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: foo
spec:
  idleTimeout: 60
  idleAction: "terminate"  # Or could be "scale" to just scale worker groups to zero
  scheduler:
    ...
  worker:
    ...

from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="foo", n_workers=1, idle_timeout=60, idle_action="scale")

@jacobtomlinson (Member Author) commented Mar 13, 2023:

On second thoughts, idle_action="scale" is pretty similar to adaptive scaling, so it's probably not the best road to go down. Renaming the spec item to idleTimeout still makes sense, though.

@jacobtomlinson (Member Author) commented:

I found a bug in distributed which is breaking this. Raised dask/distributed#7781 to resolve.

@jacobtomlinson (Member Author) commented:

Tested this PR again today now that dask/distributed#7781 has been merged and can confirm that it works as expected.

from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="idle", n_workers=10, idle_timeout=120, env={"EXTRA_PIP_PACKAGES": "git+https://github.com/dask/distributed.git"})

# Cluster is deleted automatically after 2 mins

This PR will have to wait until the next Dask release but then we should be good to go.
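
As a side note, one way to confirm the cleanup from a test script is sketched below, assuming a configured kubeconfig, the default namespace, and that the CRD plural is daskclusters:

import time

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Poll for the DaskCluster custom resource and report once the controller has
# deleted it. "idle" matches the cluster name used in the example above.
while True:
    clusters = api.list_namespaced_custom_object(
        group="kubernetes.dask.org",
        version="v1",
        namespace="default",
        plural="daskclusters",
    )
    names = [c["metadata"]["name"] for c in clusters["items"]]
    if "idle" not in names:
        print("DaskCluster 'idle' was cleaned up by the controller")
        break
    time.sleep(10)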

@vitawasalreadytaken commented:

Hi, may I ask if there's an ETA for merging and releasing this change? We were so happy to find that you were working on this and now we're really looking forward to this feature 😊 Thank you so much for your great work!

@jacobtomlinson (Member Author) commented:

The release we were waiting for happened last Friday; I just haven't had time to circle back here. I'll probably do that next week.

@HynekBlaha commented May 15, 2023:

Hello @jacobtomlinson, we are also very excited about this feature. :)
Would you mind sharing what the missing pieces are? From an external watcher perspective it looks solid. :)
Is there any meaningful work I could help you with on this feature?
Thanks!

@jacobtomlinson (Member Author) commented:

Thanks @HynekBlaha, but we were just waiting on an upstream release to fix a bug. I think we should be good now, so I've triggered CI here again; hopefully it will pass.

@HynekBlaha commented:

Amazing :) 🤞

@preneond commented May 30, 2023:

Hi @jacobtomlinson 👋, can we get an estimate of when this will be done?

@Artimi (Contributor) commented Jun 1, 2023:

Hello, thanks for your work on this feature. We're looking forward to it.

I was trying to test it. This is my setup:

  • Built the dask-kubernetes operator image from this branch and deployed it to Kubernetes
  • Using dask and distributed version 2023.3.2
  • Set idle_timeout to 10 both in the KubeCluster init and in the cluster_spec that KubeCluster takes as an argument
  • Deployed a Dask cluster which works for about a minute and then stops

Findings:

  1. When I start the cluster I can see this error in the dask operator log:
[2023-06-01 07:09:45,811] kopf.objects         [ERROR   ] [ana-ixian/ixian-petrsebek-36-violent-otheyms] Timer 'daskcluster_autoshutdown' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/operator/controller/controller.py", line 936, in daskcluster_autoshutdown
    idle_since = await check_scheduler_idle(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/operator/controller/controller.py", line 433, in check_scheduler_idle
    dashboard_address = await get_scheduler_address(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/common/networking.py", line 177, in get_scheduler_address
    service = await api.read_namespaced_service(service_name, namespace)
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
    raise e
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
    return (await self.request("GET", url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 187, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: <CIMultiDictProxy('Audit-Id': 'e926835b-5361-4f25-b05a-6fedae1d547e', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'a3ef744b-c3e6-4bf2-9c05-b8decfa0b75a', 'X-Kubernetes-Pf-Prioritylevel-Uid': '2d88b8ba-ee58-4770-a7b4-2342c139b0cc', 'Date': 'Thu, 01 Jun 2023 07:09:45 GMT', 'Content-Length': '264')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services \"ixian-petrsebek-36-violent-otheyms-scheduler\" not found","reason":"NotFound","details":{"name":"ixian-petrsebek-36-violent-otheyms-scheduler","kind":"services"},"code":404}

This is probably because we are trying to get the scheduler address before the dask cluster has started. It is not a big issue, but it would be nice to surface it to the user differently because it can be misleading.

  2. After the dask cluster is running I can see that daskcluster_autoshutdown is called repeatedly. However, the dask cluster is not stopped after it finishes its work and no new jobs are assigned. I can see this log:
[2023-06-01 07:18:23,624] kopf.objects         [INFO    ] [ana-ixian/ixian-petrsebek-36-violent-otheyms] Checking ixian-petrsebek-36-violent-otheyms-scheduler idleness failed via the HTTP API, falling back to the Dask RPC
[2023-06-01 07:18:23,636] kopf.objects         [INFO    ] [ana-ixian/ixian-petrsebek-36-violent-otheyms] Timer 'daskcluster_autoshutdown' succeeded.

So it seems that the HTTP API method fails, but the fallback method using Dask RPC is successful. Nonetheless, the cluster is never shut down.

  3. When I port-forward port 8786 and try the HTTP API request, it returns a pickle payload instead of some JSON with the idle time as I'd expect. Am I doing it right?
$ curl --http0.9 http://localhost:8786/api/v1/check_idle --output check_idle
$ xxd check_idle
00000000: 3a00 0000 0000 0000 0100 0000 0000 0000  :...............
00000010: 2a00 0000 0000 0000 83ab 636f 6d70 7265  *.........compre
00000020: 7373 696f 6ec0 a670 7974 686f 6e93 030b  ssion..python...
00000030: 03af 7069 636b 6c65 2d70 726f 746f 636f  ..pickle-protoco
00000040: 6c05                                     l.

@jacobtomlinson (Member Author) commented:

Thanks @Artimi.

> When I start the cluster I can see this error in the dask operator log

It would be nice to clean that up. The controller will automatically retry but it's untidy and we should fix that.
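
One possible way to quieten this, sketched here rather than taken from the actual fix, is for the idle check to treat a missing scheduler Service as "not idle yet" instead of letting the 404 bubble up to kopf. check_scheduler_idle here refers to the idle-check helper sketched earlier in this thread, not to the controller's real function.

from kubernetes_asyncio.client.exceptions import ApiException


async def check_scheduler_idle_safe(name: str, namespace: str) -> float:
    try:
        return await check_scheduler_idle(name, namespace)
    except ApiException as e:
        if e.status == 404:
            # The scheduler Service has not been created yet; report "never
            # idle" and let the timer retry on its next tick rather than
            # logging a retryable error.
            return 0.0
        raise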

> but the fallback method using Dask RPC is successful

The HTTP API is not enabled by default, so falling back to the RPC API is expected.
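
For completeness, here is a sketch of what an RPC-based check can look like from outside the cluster, assuming the scheduler comm port 8786 is port-forwarded and a distributed version new enough to expose check_idle over the RPC (the upstream change referenced earlier in this thread):

import asyncio

from distributed.core import rpc


async def main():
    # Query the scheduler over its TCP comm rather than the HTTP API.
    async with rpc("tcp://localhost:8786") as scheduler_comm:
        idle_since = await scheduler_comm.check_idle()
        print("Scheduler idle since:", idle_since)


asyncio.run(main())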

> When I port-forward port 8786 and try the HTTP API request, it returns a pickle payload instead of some JSON with the idle time as I'd expect. Am I doing it right?

The scheduler has two ports; it looks like you've forwarded the TCP comm instead of the web server.
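
And for the HTTP route, a sketch assuming you forward the scheduler's web server port instead (8787 by default) and have the scheduler HTTP API routes enabled, which, as noted above, they are not by default:

import requests

# Query the idle-check endpoint on the dashboard/web server port rather than
# the TCP comm port (8786), which speaks the Dask protocol, not HTTP.
resp = requests.get("http://localhost:8787/api/v1/check_idle", timeout=5)
print(resp.status_code)
print(resp.text)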

@jacobtomlinson (Member Author) commented:

OK, it looks like I've managed to correct the CI problems that were blocking this PR. Tests are passing, which gives me confidence we can merge this. I'll give it another look over at the start of next week, check that it is working as expected, and then hopefully hit the green button.

@jacobtomlinson (Member Author) commented:

I've run through and tested this locally again. Things seem to be working as expected and I'm seeing clusters get cleaned up.

I made a couple of tweaks to the logging to reduce noise, but I think this should be good to merge on passing CI.

@jacobtomlinson marked this pull request as ready for review June 12, 2023 11:08
@jacobtomlinson changed the title from "POC Auto shutdown idle clusters" to "Auto shutdown idle clusters" on Jun 12, 2023
@jacobtomlinson merged commit 68d6c96 into dask:main on Jun 12, 2023
@jacobtomlinson deleted the autocleanup branch June 12, 2023 11:24