Helm cluster #255

Merged
jacobtomlinson merged 18 commits into dask:master from jacobtomlinson:helm-cluster
Aug 7, 2020
Conversation

@jacobtomlinson
Member

@jacobtomlinson jacobtomlinson commented Jun 2, 2020

I seem to find myself continuously explaining to people the difference between the Dask Helm chart and dask-kubernetes, and the fact that they are incompatible because of their underlying designs. It is not obvious, and we should fix that.

In an attempt to resolve this I've added a HelmCluster cluster manager to dask-kubernetes which provides a thin wrapper around an existing Dask cluster which has been deployed via the Helm Chart.

```shell
# Install the chart using Helm
helm repo add dask https://helm.dask.org
helm repo update

helm install dask/dask --name myrelease  # Helm 2 syntax; on Helm 3: helm install myrelease dask/dask
```

```python
# Connect to the cluster from a Python session and manually scale
from dask_kubernetes import HelmCluster

cluster = HelmCluster(release_name="myrelease")

cluster.scale(10)

from dask.distributed import Client
client = Client(cluster)

# Do some Dask work

cluster.scale(1)
```

Motivations

  • Provide Helm Chart users with a cluster manager that has some useful features such as being able to call cluster.scale(n) and cluster.get_logs() from within their Python or Jupyter session.
  • Update the documentation to provide a clearer explanation of the differences between KubeCluster and HelmCluster.
  • Have a documentation page which can be referred to when explaining to users the differences.

Challenges

While there isn't much code to HelmCluster there are a few design challenges around how the cluster manager works.

I have not made it possible to install the Helm Chart using HelmCluster
Users can only connect to an existing deployment. This is because I do not want to encourage people to install the Helm chart this way. Helm charts should be installed and managed natively using helm; handling config, lifecycle, upgrades, etc. in the class would add bloat.
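For instance, day-to-day lifecycle operations stay in helm itself. A sketch using the release name from the example above (`worker.replicas` is assumed to be the chart's value path for worker count; `helm uninstall` is the Helm 3 spelling, `helm delete` on Helm 2):

```shell
# Manage the release natively with helm, not via HelmCluster
helm upgrade myrelease dask/dask --set worker.replicas=5
helm uninstall myrelease
```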

Scaling down the Helm Chart is not a safe operation
Kubernetes assumes all pods are stateless and equal. When scaling down a Dask cluster, the scheduler instructs empty/idle workers to exit, or rearranges work in order to free up workers. Kubernetes, meanwhile, adjusts the deployment replica count to the desired number, starting or killing containers to satisfy it. This results in a race condition where Kubernetes may kill pods before they are ready, or restart them if Dask closes them too soon. To work around this I've added documentation instructing users to scale down only when they know it is safe.
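A manual safety check along these lines is one way to follow that advice. This is only a sketch, not part of the PR; it assumes `cluster` and `client` are connected as in the example above:

```python
# Sketch (not in this PR): count unfinished tasks on the scheduler
# before scaling down, so no in-flight work is lost.
def n_scheduler_tasks(dask_scheduler=None):
    # When passed to client.run_on_scheduler, distributed runs this on
    # the scheduler and injects it as the `dask_scheduler` argument.
    return len(dask_scheduler.tasks)

# With a connected client (assumed, as in the example above):
# if client.run_on_scheduler(n_scheduler_tasks) == 0:
#     cluster.scale(1)  # safe: nothing left for workers to lose
```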

Adaptive scaling will be unstable
As scaling down is not safe, adaptive scaling will likely be unstable, with clusters thrashing as futures are lost and recalculated. I have left adaptivity in just in case there are valid use cases that I haven't thought of; however, calling cluster.adapt() will emit a warning against doing so.

Outstanding tasks

Future work

  • Update the lab extension with the ability to configure and auto-create cluster objects. This would be useful so that the Jupyter session that comes with the Helm Chart already has the HelmCluster object listed.

@raybellwaves
Member

Thanks for working on this. Am I right that this closes #185?

@jacobtomlinson
Member Author

No problem @raybellwaves. Do you have any interest in testing this out and code reviewing?

Am I right that this closes #185?

That's a tough one. I'm tempted to say yes, but there is a little nuance here.

I had intended #185 to complement #186. In #186 the idea would be to be able to detach from a running KubeCluster instance and connect to it again later. This PR is a little different in that you are connecting to a running Helm cluster, which is not the same as a KubeCluster.

I think #186 would still be valuable, and we would need a way to attach too.


@gforsyth gforsyth left a comment


Hey @jacobtomlinson -- this looks like a nice way to bridge the gap between the two deployment strategies.

The docstring for HelmCluster implies that a user might run this in the jupyter notebook that is spun up by the helm chart, but I'm not sure that can work, since that notebook is inside the kubernetes cluster itself and I don't know that it can connect to kubernetes from there, can it?

I've tried this out on minikube and it can't find the release:

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-841556317686> in <module>
----> 1 cluster = HelmCluster(release_name="dask")

/opt/conda/lib/python3.8/site-packages/dask_kubernetes/helm.py in __init__(self, release_name, auth, namespace, port_forward_cluster_ip, loop, asynchronous)
     86         )
     87         if status.returncode != 0:
---> 88             raise RuntimeError(f"No such helm release {self.release_name}.")
     89         self.auth = auth
     90         self.namespace

RuntimeError: No such helm release dask.
```

Port forwarding to svc/dask-scheduler, though, works great! I can scale up / scale down the workers using the HelmCluster object and pods are created and terminated accordingly.
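For reference, the manual equivalent of that port forward (service name `dask-scheduler` from the default chart with release name `dask`, as above; port numbers are the scheduler's defaults) is roughly:

```shell
# Forward the scheduler's ClusterIP service to localhost by hand
kubectl port-forward svc/dask-scheduler 8786:8786
```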

I tried to call cluster.adapt() per your warning (where cluster is a HelmCluster) and got

```
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f4ff4407f40>>, <Task finished name='Task-1290' coro=<AdaptiveCore.adapt() done, defined at /opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py:170> exception=AttributeError("'HelmCluster' object has no attribute 'workers'")>)
Traceback (most recent call last):
  File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 183, in adapt
    recommendations = await self.recommendations(target)
  File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive.py", line 148, in recommendations
    if len(self.plan) != len(self.requested):
  File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive.py", line 116, in plan
    return self.cluster.plan
  File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 398, in plan
    return set(self.workers)
AttributeError: 'HelmCluster' object has no attribute 'workers'
```

in a never-ending loop.

Comment on lines +36 to +40

```python
port_forward_cluster_ip: bool (optional)
    If the chart uses ClusterIP type services, forward the ports locally.
    If you are using ``HelmCluster`` from the Jupyter session that was installed
    by the helm chart this should be ``False``. If you are running it locally it should
    be ``True``.
```

I don't think it's possible to query the status of a helm release if you're on a pod inside of that release, is it?

Member Author

If you copy your local ~/.kube/config to the Jupyter pod you should be able to. Perhaps this should be documented?

Perhaps we should also consider adding a service role to the chart which allows modifying the worker deployment replica count and getting pod logs. Although that could have security implications.

@jacobtomlinson
Member Author

Thanks for the review @gforsyth

The docstring for HelmCluster implies that a user might run this on the jupyter notebook that is spun up by the helm chart, but I'm not sure that can work, since that will be within the kubernetes cluster itself and I don't know that it can connect kubernetes from there, can it?

If you copy your local ~/.kube/config to the Jupyter pod you should be able to. Perhaps this should be documented?
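Something like the following, with an illustrative pod name (look yours up with `kubectl get pods`) and the Jupyter image's home directory assumed:

```shell
# Hypothetical: copy local credentials into the chart's Jupyter pod
kubectl cp ~/.kube/config myrelease-jupyter-abc123:/home/jovyan/.kube/config
```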

I tried to call cluster.adapt() per your warning (where cluster is a HelmCluster)

Hmm, that's frustrating. I'm hesitant to put the effort in here as adapting will not be a good experience with this cluster type. Perhaps instead of warning, it should raise a NotImplementedError and explicitly deny this behaviour.
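That change could look roughly like this standalone sketch (the stub class only mirrors the real `HelmCluster` for illustration; the message wording is mine):

```python
# Sketch of the proposed behaviour: refuse adaptive scaling outright
# instead of warning, since scaling down is not safe for this cluster type.
class HelmCluster:
    def adapt(self, *args, **kwargs):
        raise NotImplementedError(
            "HelmCluster does not support adaptive scaling; "
            "call cluster.scale(n) instead, and only scale down "
            "when the cluster is idle."
        )
```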

@gforsyth

Ahh, of course. I'll give that a shot.

I think an explicit not implemented is the way to go for the adapt issue.

@Timost

Timost commented Aug 25, 2020

Unless I'm wrong, this implicitly makes helm a new mandatory dependency of dask-kubernetes. If that's the case, I think the docs should be updated to mention it.
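A hypothetical early check, not in the PR, of the sort such docs could point at:

```python
import shutil

# Hypothetical guard (not in this PR): fail fast with a clear message
# when the helm CLI that HelmCluster shells out to is not installed.
def check_helm_available():
    if shutil.which("helm") is None:
        raise RuntimeError(
            "HelmCluster requires the `helm` CLI to be installed and on PATH"
        )
```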
