Client hangs in case of unsatisfiable resources #8256
Hi, @sjdv1982! Thanks for reporting your issue. The behavior you describe is, in fact, the expected behavior. The reason is that Dask clusters are dynamic, and a new worker satisfying the resource requirements might join the cluster after the tasks have been submitted. One typical use case for this is running in a Cloud environment where it might take a moment to acquire GPU instances. See also #7170.

I've checked the documentation at https://distributed.dask.org/en/stable/resources.html, and it looks like the expected behavior for this particular situation is not explained there. Would you be interested in submitting a PR that describes the behavior for (currently) unsatisfiable resource constraints?
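For illustration, here is a minimal sketch of that behavior (my own example, not from the thread; the `GPU` resource name and the manually started `Worker` are assumptions): a future whose resource requirements are unmet simply stays pending until a worker advertising that resource joins.

```python
# Sketch (assumed, not from the thread): a task requiring a resource that no
# current worker offers stays pending until a matching worker joins.
import asyncio
from dask.distributed import Client, LocalCluster, Worker

async def main():
    async with LocalCluster(n_workers=0, asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # No worker advertises "GPU" yet, so this future stays pending.
            fut = client.submit(lambda: 1, resources={"GPU": 1})
            # As soon as a worker offering the resource joins, the task runs.
            async with Worker(cluster.scheduler_address, resources={"GPU": 1}):
                assert await fut == 1

asyncio.run(main())
```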
Hi @hendrikmakait. However, the resources of a `LocalCluster` are fixed: it runs on a single machine, and no new worker with additional resources will ever join on its own.

Thus, IMO, the current behaviour is correct for a Cloud cluster but inappropriate for a `LocalCluster`.
The scheduling does not distinguish how Dask was deployed, and we don't have the information available at runtime whether or not new workers with additional resources can be added. I'm open to suggestions on how to improve this, but I wouldn't want to couple cluster-implementation-specific knowledge to the scheduler. There are two things I could see happening:
Hmm... in principle, I would suggest resource verification by the client, using the existing API. However, to my surprise, it seems that resources are currently not stored in the worker spec. With

```python
with dask.config.set({"distributed.worker.resources.ncores": 3}):
    cluster = LocalCluster(n_workers=1)
cluster.scale_up(2)
```

the second worker does not have any resources attached to it! Is this correct and expected behaviour? If yes, I would suggest modifying the docstring to document it. I would prefer that SpecCluster instead read the resources config when scaling.

Thank you for your time.
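For what it's worth, a rough sketch of such client-side verification could look like the following (my own illustration; it assumes each entry in `Client.scheduler_info()["workers"]` carries a `"resources"` mapping, which should be treated as an assumption about the API):

```python
# Hypothetical client-side check (illustration only): raise instead of
# hanging when no currently connected worker can satisfy the request.
from dask.distributed import Client

def check_resources(client: Client, required: dict) -> None:
    workers = client.scheduler_info()["workers"]
    for name, amount in required.items():
        # Assumes each worker entry exposes a "resources" dict.
        if not any(w.get("resources", {}).get(name, 0) >= amount
                   for w in workers.values()):
            raise RuntimeError(f"no connected worker offers {name} >= {amount}")

# Usage: check_resources(client, {"GPU": 1}) before submitting with
# resources={"GPU": 1}.
```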
In the same vein, I was very surprised to see the following code hang. It seems to me the obvious way to set up a cluster for a 2-GPU machine. Am I doing something wrong?

```python
import dask
from dask.distributed import LocalCluster, Client

if __name__ == "__main__":
    with dask.config.set({"distributed.worker.resources.GPU": 2}):
        cluster = LocalCluster(n_workers=0, threads_per_worker=1)
    cluster.adapt(minimum_jobs=0, maximum_jobs=2)  # minimum_jobs=1 makes no difference
    client = Client(cluster)
    task = client.submit(lambda: 1, resources={"GPU": 1})
    print(task.result())  # hangs
```
Probably not correct, but kind of expected. Since you are using the dask config contextmanager, the setting is reverted once the cluster is up, but as you already found out, the cluster is not caching it. I think this should be fixed. Are you interested in contributing a patch for this?
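Until that is fixed, one possible workaround (my sketch, untested against this exact scenario) is to pass `resources` directly as a worker keyword argument, so workers created later by `adapt()` also receive it, instead of relying on the config contextmanager:

```python
# Sketch of a possible workaround (assumed): extra keyword arguments to
# LocalCluster are forwarded to each Worker, so every worker, including
# adaptively created ones, advertises one GPU.
from dask.distributed import LocalCluster, Client

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=0, threads_per_worker=1,
                           resources={"GPU": 1})
    cluster.adapt(minimum=0, maximum=2)
    client = Client(cluster)
    fut = client.submit(lambda: 1, resources={"GPU": 1})
    print(fut.result())
```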
Sure, I will be happy to contribute a patch.
Describe the issue:

When `client.submit` asks for resources that do not exist in that quantity (or that do not exist at all), `client.result()` hangs forever. Expected behaviour: a Python exception.

Minimal Complete Verifiable Example:
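A minimal reproducer along the lines described above might look like this (a sketch of my own, not the reporter's original example):

```python
# Sketch of a reproducer (not the original MCVE): no worker defines a "GPU"
# resource, so the future below never completes instead of raising.
from dask.distributed import LocalCluster, Client

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=1, threads_per_worker=1)  # no resources set
    client = Client(cluster)
    fut = client.submit(lambda: 1, resources={"GPU": 1})
    print(fut.result())  # hangs forever; expected: an exception
```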
Anything else we need to know?:
I am rather a beginner with Dask. I wouldn't mind writing a bugfix for this.
Environment: