Conversation
Thanks for working on this. Am I right that this closes #185?
No problem @raybellwaves. Do you have any interest in testing this out and code reviewing?
That's a tough one. I'm tempted to say yes, but there is a little nuance here. I had intended #185 to complement #186. In #186 the idea would be to be able to detach from a running cluster. I think #186 would still be valuable, and we would need a way to attach too.
gforsyth left a comment
Hey @jacobtomlinson -- this looks like a nice way to bridge the gap between the two deployment strategies.
The docstring for HelmCluster implies that a user might run this from the Jupyter notebook that is spun up by the helm chart, but I'm not sure that can work, since that will be within the kubernetes cluster itself and I don't know that it can connect to kubernetes from there, can it?
I've tried this out on minikube and it can't find the release:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-841556317686> in <module>
----> 1 cluster = HelmCluster(release_name="dask")
/opt/conda/lib/python3.8/site-packages/dask_kubernetes/helm.py in __init__(self, release_name, auth, namespace, port_forward_cluster_ip, loop, asynchronous)
86 )
87 if status.returncode != 0:
---> 88 raise RuntimeError(f"No such helm release {self.release_name}.")
89 self.auth = auth
90 self.namespace
RuntimeError: No such helm release dask.
Port forwarding to svc/dask-scheduler, though, works great! I can scale up / scale down the workers using the HelmCluster object and pods are created and terminated accordingly.
I tried to call cluster.adapt() per your warning (where cluster is a HelmCluster) and got
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f4ff4407f40>>, <Task finished name='Task-1290' coro=<AdaptiveCore.adapt() done, defined at /opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py:170> exception=AttributeError("'HelmCluster' object has no attribute 'workers'")>)
Traceback (most recent call last):
File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 183, in adapt
recommendations = await self.recommendations(target)
File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive.py", line 148, in recommendations
if len(self.plan) != len(self.requested):
File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/adaptive.py", line 116, in plan
return self.cluster.plan
File "/opt/miniconda3/envs/daskkube2/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 398, in plan
return set(self.workers)
AttributeError: 'HelmCluster' object has no attribute 'workers'
in a never-ending loop.
port_forward_cluster_ip: bool (optional)
    If the chart uses ClusterIP type services, forward the ports locally.
    If you are using ``HelmCluster`` from the Jupyter session that was installed
    by the helm chart this should be ``False``. If you are running it locally it should
    be ``True``.
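For illustration only, a hedged sketch of how that parameter might be used in each situation; it assumes `HelmCluster` is importable from the top-level `dask_kubernetes` namespace and that the release is named `dask`, as in the traceback above:

```python
from dask_kubernetes import HelmCluster

# Running locally, outside the Kubernetes cluster: forward the ClusterIP
# services so the scheduler (and dashboard) are reachable from this machine.
cluster = HelmCluster(release_name="dask", port_forward_cluster_ip=True)

# Running from the Jupyter pod deployed by the helm chart: the services are
# resolvable directly inside the cluster, so no forwarding is needed.
# cluster = HelmCluster(release_name="dask", port_forward_cluster_ip=False)
```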
I don't think it's possible to query the status of a helm release if you're on a pod inside of that release, is it?
If you copy your local ~/.kube/config to the Jupyter pod you should be able to. Perhaps this should be documented?
Perhaps we should also consider adding a service role to the chart which allows modifying the worker deployment replica count and getting pod logs. Although that could have security implications.
Thanks for the review @gforsyth
> If you copy your local ~/.kube/config to the Jupyter pod you should be able to. Perhaps this should be documented?
Hmm, that's frustrating. I'm hesitant to put the effort in here as adapting will not be a good experience with this cluster type. Perhaps instead of warning, it should raise a NotImplementedError.
Ahh, of course. I'll give that a shot. I think an explicit NotImplementedError is the way to go for the adapt issue.
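For illustration, a minimal sketch of that behaviour, not the PR's actual implementation: the class here just subclasses distributed's `Cluster` for context, and the error message is made up.

```python
from distributed.deploy import Cluster


class HelmCluster(Cluster):
    """Sketch only -- the real class lives in dask_kubernetes.helm."""

    def adapt(self, *args, **kwargs):
        # Raise explicitly rather than warn, because adaptive scaling cannot
        # safely drive the worker Deployment managed by the Helm chart.
        raise NotImplementedError(
            "adapt() is not supported for HelmCluster; "
            "use cluster.scale(n) to resize the worker deployment instead."
        )
```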
Unless I'm wrong, this implicitly makes
I seem to find myself continuously explaining to people the difference between the Dask Helm chart and dask-kubernetes, and the fact that they are incompatible because of their underlying designs. It is not obvious, and we should fix that.
In an attempt to resolve this I've added a `HelmCluster` cluster manager to dask-kubernetes which provides a thin wrapper around an existing Dask cluster which has been deployed via the Helm Chart.

```sh
# Install the chart using Helm
helm repo add dask https://helm.dask.org
helm repo update
helm install dask/dask --name myrelease
```

Motivations
- Let users call `cluster.scale(n)` and `cluster.get_logs()` from within their Python or Jupyter session (see the sketch after this list).
- Clear up the difference between `KubeCluster` and `HelmCluster`.
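A hedged sketch of the workflow this enables, assuming `HelmCluster` is importable from the top-level `dask_kubernetes` namespace and that the release installed above is named `myrelease`:

```python
from dask.distributed import Client
from dask_kubernetes import HelmCluster

# Attach to the existing Helm release rather than creating a new cluster.
cluster = HelmCluster(release_name="myrelease")
client = Client(cluster)

cluster.scale(10)          # resize the worker Deployment from Python
print(cluster.get_logs())  # fetch logs from the scheduler and worker pods
```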
Challenges

While there isn't much code to `HelmCluster`, there are a few design challenges around how the cluster manager works.

I have not made it possible to install the Helm Chart using `HelmCluster`

Users can only connect to an existing deployment. This is because I do not want to encourage people to install the Helm chart this way. Helm charts should be installed and managed natively using `helm`. Handling config, lifecycle, upgrades, etc would add bloat to the class.

Scaling down the Helm Chart is not a safe operation
Kubernetes assumes all pods are stateless and equal. When scaling down a Dask cluster the scheduler will instruct empty/idle workers to exit, or rearrange work in order to make workers empty/idle. Kubernetes will reduce the deployment count to the desired number, killing or starting containers to satisfy this. This results in a race condition where Kubernetes may kill pods before they are ready, or restart them if Dask closes them too soon. To work around this I've added documentation instructing users only to scale down when they know it is safe.
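As a hedged illustration of what "safe" can mean here (the calls are standard `dask.distributed`; the release name is a placeholder): let outstanding work finish and gather the results before shrinking the worker deployment.

```python
from dask.distributed import Client, wait
from dask_kubernetes import HelmCluster

cluster = HelmCluster(release_name="myrelease")  # placeholder release name
client = Client(cluster)

futures = client.map(lambda x: x ** 2, range(1000))
wait(futures)                     # block until no work is left in flight
results = client.gather(futures)  # pull results off the workers

# Only now, with the workers idle and nothing depending on their memory,
# is it reasonably safe to scale the worker deployment down.
cluster.scale(2)
```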
Adaptive scaling will be unstable
As scaling down is not safe this means that adaptive scaling will likely be unstable and result in clusters thrashing as futures are lost and recalculated. I have left adaptivity in just in case there are valid use cases that I haven't thought of; however, if you call `cluster.adapt()` you will get a warning advising against it.

Outstanding tasks
- `HelmCluster` docs (Add HelmCluster docs helm-chart#66)

Future work
`HelmCluster` object listed.