Support multiple workers per VM #173
I think for this to work we need to change the worker command construction in `dask_cloudprovider/gcp/instances.py` (lines 318 to 330 at d5dfd99) to handle a spec which looks like the following:

```python
workers_options = {
    'worker-1': {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 1}},
    'worker-2': {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 2}},
}
```
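For context, a rough sketch of how such a spec could be fed to the `distributed.cli.dask_spec` entry point that the cloud-init script already invokes; the `--spec` usage here is an assumption based on that CLI, not current dask-cloudprovider behaviour:

```python
import json

# Proposed multi-worker spec from above: two named workers on one VM,
# each wrapped in a Nanny with its own thread count.
spec = {
    "worker-1": {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 1}},
    "worker-2": {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 2}},
}

# The cloud-init worker command could then become something like:
command = (
    "python -m distributed.cli.dask_spec tcp://scheduler:8786 "
    f"--spec '{json.dumps(spec)}'"
)
print(command)
```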
When running on a local machine we already inspect the system and size workers automatically. Perhaps we can reuse this logic?
You mean, have dask-cloudprovider inspect the system and assign workers/threads accordingly?
Yeah. We do this for local clusters already.
dask/distributed#4377 allows us to detect and create workers/threads automatically. Once a release is out we can update things here to use this new option.
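A small sketch of what that detection looks like, assuming the helper exposed in `distributed.deploy.utils` as part of that work:

```python
from distributed.deploy.utils import nprocesses_nthreads

# Split this machine's CPUs into a (processes, threads-per-process)
# pair, the same heuristic LocalCluster uses by default.
nprocs, nthreads = nprocesses_nthreads()
print(nprocs, nthreads)
```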
Any progress here?
@eric-valente with the latest release of `distributed` you should be able to pass `worker_options={"nprocs": "auto"}`. If you could give it a go and report back it would be much appreciated!
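A minimal sketch of the suggested call, assuming the `dask_cloudprovider.aws` import path and omitting all other cluster arguments:

```python
from dask_cloudprovider.aws import EC2Cluster

# Let distributed pick the number of worker processes and threads
# from the VM's CPU count, as suggested above.
cluster = EC2Cluster(worker_options={"nprocs": "auto"})
```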
Hi @jacobtomlinson. Tried it with the latest packages and no luck. The workers start up in EC2 but won't connect to the scheduler, and then proceed to die. When I remove `nprocs` it works fine, but again only has 1 worker per VM with threads = # of cores.

```
worker_options={'nprocs': 'auto'}
dask  2021.4.1  pyhd8ed1ab_0  conda-forge
```
Strange! Are you able to get any logs from the EC2 instances? They tend to get dumped in the cloud-init logs.
Note, this works if I simply remove `nprocs` from the `worker_options`. Thanks for your help!
Thanks @eric-valente. You also need to set the worker class to `dask.distributed.Nanny`. The traceback you got above looks like a bug, I'll raise that back in distributed. @quasiben what do you think about making `dask.distributed.Nanny` the default worker class?
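A sketch of the combined suggestion, with the class path and option names taken from this thread (the replies below show it still failed at the time):

```python
from dask_cloudprovider.aws import EC2Cluster

# Use Nanny as the worker class and ask for automatic process sizing.
cluster = EC2Cluster(
    worker_class="dask.distributed.Nanny",
    worker_options={"nprocs": "auto"},
)
```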
Thanks again for your help here @jacobtomlinson. Tried setting the worker class to Nanny but still the same issue:
dask/distributed#4640 seems like it may be related.
Seems `dask.distributed.Nanny` does not accept `nprocs` either. Passing it fails with the init error above; leaving it out works.
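A small sketch illustrating the point; it assumes a scheduler is reachable at the placeholder address, and shows that `Nanny` takes `nthreads` but has no `nprocs` keyword:

```python
import asyncio
from dask.distributed import Nanny

async def main():
    # Works: a Nanny supervises exactly one worker process,
    # here with two threads.
    async with Nanny("tcp://scheduler:8786", nthreads=2) as nanny:
        print(nanny.address)

    # Fails at construction: there is no nprocs keyword on Nanny.
    # Nanny("tcp://scheduler:8786", nprocs=4)  # TypeError

asyncio.run(main())
```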
Seems like the cloud-init to add a worker uses the `python -m distributed.cli.dask_spec` style of starting a worker, vs. the `dask-worker` CLI style.
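For reference, a sketch contrasting the two invocation styles; addresses and option values are placeholders:

```python
# dask_spec style, as used by the cloud-init script per the comment above:
spec_style = [
    "python", "-m", "distributed.cli.dask_spec", "tcp://scheduler:8786",
    "--spec", '{"cls": "dask.distributed.Nanny", "opts": {"nthreads": 2}}',
]

# dask-worker CLI style, where --nprocs spawns several worker processes
# on the same machine:
cli_style = ["dask-worker", "tcp://scheduler:8786", "--nprocs", "4"]

print(" ".join(spec_style))
print(" ".join(cli_style))
```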
+1 to making `dask.distributed.Nanny` the default.
@jacobtomlinson Yeah, it seems like EC2Cluster uses `python -m distributed.cli.dask_spec` and passes in `worker_class`. As suggested above, I think you might need to accept multiple workers defined in `worker_options` and skip `nprocs`?
I think I could use `worker_module` but it is not a valid parameter for EC2Cluster:
It should be, have you tried it?
Hello. I currently start workers manually with `nprocs` set to n, where n is the number of CPUs on that EC2 instance. It would be great if this could be done by passing `nprocs` and `nthreads` as `worker_options` to EC2Cluster. If I can help in any way please let me know. Thank you.
@shireenrao have you tried it? I would expect the following to work.

```python
cluster = EC2Cluster(..., worker_class="distributed.nanny.Nanny", worker_options={"nprocs": "n"})
```
@jacobtomlinson - I tried that and it fails. This is the stack trace I see:
Sorry, my bad, looks like you should be setting `ncores`/`nthreads` instead.
@jacobtomlinson - I am trying this on an EC2 instance with 36 vCPUs. Setting `ncores` and `nthreads` only starts 1 Nanny process with 36 threads. Here are the logs from the worker:

From a Jupyter notebook session this is what one sees for the cluster:

Whereas when I start the worker manually, the Jupyter notebook shows the cluster as:

Not all of the CPUs are being utilized.
Right, I'm with you, sorry. It seems that `nprocs` is handled by the `dask-worker` CLI rather than by the worker class itself.
Could we please request/create an issue in distributed upstream to integrate this work? For my use case it makes more sense to have a few very large machines than lots of small VMs.
The spec format in `distributed.cli.dask_spec` can already describe multiple named workers, so it should be possible to support this here.
Has anybody been able to achieve this?
Is there any news on this issue?
@jacobtomlinson I tried this:

But it errors out with a JSON encode error in VMCluster. Any ideas? Will this require some changes in cloudprovider?
@kumarprabhu1988 I think the spec has to be JSON serializable, so something in those options is likely tripping up the encoder.
@jacobtomlinson So I tried this, and now EC2 worker machine creation fails and there are no workers created, but it is a silent failure - no errors.
Could you set `debug=True` so the instances aren't cleaned up, and grab the logs from the VM?
Sure, I can do that. In the meantime I looked at the code and tried to pass in worker configuration with some code changes, and it worked. Here's the change I made. Let me know what you think.
It may be more efficient to run many workers on a few large VMs, as opposed to single workers on larger numbers of small VMs, presumably to avoid inter-VM communication. This was a suggestion that resulted from this thread on optimizing some dask array workflows: https://github.com/pystatgen/sgkit/issues/390.

@quasiben mentioned that threads per worker can be controlled with `worker_options={"nthreads": 2}`, but there appears to be no way to run more than one worker on a single VM.
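For reference, a minimal sketch of the control that does exist today per the comment above, with all other cluster arguments omitted:

```python
from dask_cloudprovider.aws import EC2Cluster

# Threads per worker can be set, but this still yields only one
# worker process per VM.
cluster = EC2Cluster(worker_options={"nthreads": 2})
```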