Support multiple workers per VM #173

eric-czech · 2020-11-17T22:00:06Z

It may be more efficient to run many workers on larger VMs as opposed to larger numbers of worker processes on small VMs, presumably to avoid inter-VM communication. This was a suggestion that resulted from this thread on optimizing some dask array workflows: https://github.com/pystatgen/sgkit/issues/390.

@quasiben mentioned that threads per worker can be controlled with worker_options={"nthreads": 2 }, but there appears to be no way to run more than one worker on a single VM.

The text was updated successfully, but these errors were encountered:

quasiben · 2020-11-17T22:10:04Z

I think for this to work we need to change:

dask-cloudprovider/dask_cloudprovider/gcp/instances.py

Lines 318 to 330 in d5dfd99

    
                   "--spec", 
        
                   "''%s''"  # in yaml double single quotes escape the single quote 
        
                   % json.dumps( 
        
                       { 
        
                           "cls": self.worker_class, 
        
                           "opts": { 
        
                               **worker_options, 
        
                               "name": self.name, 
        
                           }, 
        
                       } 
        
                   ), 
        
               ] 
        
           )

To handle a spec which looks like the following:

workers_options = {
    'worker-1': {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 1}},
    'worker-2': {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 2}},
}

jacobtomlinson · 2020-11-18T09:45:12Z

When running on a local machine LocalCluster optimises this for you right? On my laptop with 12 cores it runs three workers with four threads each.

Perhaps we can reuse this logic?

quasiben · 2020-11-18T15:12:57Z

You mean, have dask-cloudprovider inspect the system and assign worker/threads accordingly ?

jacobtomlinson · 2020-11-18T15:51:54Z

Yeah. We do this for LocalCluster here.

jacobtomlinson · 2021-01-12T16:07:35Z

dask/distributed#4377 allows us to detect and create worker/threads automatically.

dask-worker --nprocs=auto

Once a release is out we can update things here to use this new option.

eric-valente · 2021-05-03T04:44:28Z

any progress here?

jacobtomlinson · 2021-05-04T09:15:38Z

@eric-valente with the latest release of dask and distributed you should be able to set worker_options={"nprocs": "auto"}. Although I haven't managed to come back here and test this yet.

If you could give it a go and report back it would be much appreciated!

eric-valente · 2021-05-05T00:33:45Z

Hi @jacobtomlinson Tried it with packages and no luck. The workers will turn on in EC2 and won't connect to the scheduler and then proceed to die. When I remove the nprocs it works fine but again only has 1 worker per VM with threads = # of cores

worker_options={'nprocs':'auto'}

dask 2021.4.1 pyhd8ed1ab_0 conda-forge
dask-cloudprovider 2021.3.1 pyhd8ed1ab_0 conda-forge
dask-core 2021.4.1 pyhd8ed1ab_0 conda-forge
distributed 2021.4.1 py38h578d9bd_0 conda-forge

jacobtomlinson · 2021-05-05T14:56:11Z

Strange! Are you able to get any logs from the EC2 instances? They tend to get dumped in var/log/cloud-init-output.log.

eric-valente · 2021-05-05T15:12:31Z

@jacobtomlinson

This error then worker dies:

Here are my settings:

Note, this works if i simply remove nprocs from the worker_options. Thanks for your help!

jacobtomlinson · 2021-05-06T13:51:20Z

Thanks @eric-valente. You also need to set the worker_class kwarg to "dask.distributed.Nanny".

The traceback you got above looks like a bug, I'll raise that back in distributed.

@quasiben what do you think about making dask.distributed.Nanny the default instead of dask.distributed.Worker?

eric-valente · 2021-05-06T17:58:18Z

Thanks again for your help here @jacobtomlinson

Tried setting the worker class to Nanny but still same issue:

eric-valente · 2021-05-06T18:16:28Z

dask/distributed#4640 seems maybe related

eric-valente · 2021-05-06T19:53:24Z

Seems dask.distributed.Nanny does not accept nprocs either
https://distributed.dask.org/en/latest/_modules/distributed/nanny.html#Nanny

Fails with above init error:
--spec '{"cls": "dask.distributed.Nanny", "opts": {"nprocs": 4, "name": "dask-4ffc56fd-worker-dd6509e3"}}'

Works:
--spec '{"cls": "dask.distributed.Nanny", "opts": {"ncores": 4, "name": "dask-4ffc56fd-worker-dd6509e3"}}'

eric-valente · 2021-05-06T20:19:52Z

Seems like the cloud-init to add a worker uses this style of starting a worker:
python -m distributed.cli.dask_spec tcp://x.x.x.x:8786 --spec {"cls": "distributed.nanny.Nanny", "opts": {"nprocs": 4, "name": "dask-4ffc56fd-worker-dd6509e3"}}

vs. this style:
dask-worker --nprocs=auto

quasiben · 2021-05-06T20:20:05Z

+1 to making dask.distributed.Nanny the default.

eric-valente · 2021-05-06T23:57:09Z

@jacobtomlinson Yeah it seems like EC2Cluster uses python -m distributed.cli.dask_spec and passing in worker_class

As suggested above, I think you might need to accept multiiple workers defined in worker_options and skip nprocs?

workers_options = {
    'worker-1': {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 1}},
    'worker-2': {"cls": "dask.distributed.Nanny", "opts": {"nthreads": 2}},
}

I think I could use worker_module but it is not a valid parameter for EC2Cluster:

This would allow me to use dask-worker with --nprocs=x

jacobtomlinson · 2021-05-07T08:47:04Z

I think I could use worker_module but it is not a valid parameter for EC2Cluster

It should be, have you tried it?

shireenrao · 2021-10-19T17:27:42Z

Hello
Is there an update on this issue? I am trying to do something similar. Currently I am creating my cluster manually where I start the scheduler on one EC2 and then on another EC2 I create the workers
like this:

dask-worker  tcp://XX.X.X.XXX:8786 --nprocs n --nthreads 1

Where n is the number of CPU's on that EC2.

It would be great if this can be done by passing nprocs and nthreads as worker_options to EC2Cluster. If I can help in anyway please let me know.

Thank you

jacobtomlinson · 2021-10-21T08:48:01Z

@shireenrao have you tried it?

I would expect the following to work.

cluster = EC2Cluster(..., worker_class="distributed.nanny.Nanny", worker_options={"nprocs": "n"})

shireenrao · 2021-10-21T11:11:55Z

@jacobtomlinson - I tried that and it fails. This is the stack trace I see

python -m distributed.cli.dask_spec tcp://XXX.X.X.XXX:8786 --spec '{"cls": "distributed.nanny.Nanny", "opts": {"nprocs":"4", "name": "dask-44bad6fe-worker-ecaffb24"}}'
no environment.yml
distributed.nanny - INFO -         Start Nanny at: 'tcp://XXX.17.0.X:41233'
distributed.nanny - ERROR - Failed to initialize Worker
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 844, in _run
    worker = Worker(**worker_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 733, in __init__
    super().__init__(
TypeError: __init__() got an unexpected keyword argument 'nprocs'
distributed.nanny - ERROR - Failed while trying to start worker process: __init__() got an unexpected keyword argument 'nprocs'
distributed.nanny - INFO - Closing Nanny at 'tcp://XXX.17.0.X:41233'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_spec.py", line 43, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_spec.py", line 39, in main
    asyncio.get_event_loop().run_until_complete(run())
  File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_spec.py", line 33, in run
    servers = await run_spec(_spec, *args)
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 659, in run_spec
    await asyncio.gather(*workers.values())
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 690, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 283, in _
    await self.start()
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 335, in start
    response = await self.instantiate()
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 418, in instantiate
    result = await self.process.start()
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 692, in start
    msg = await self._wait_until_connected(uid)
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 811, in _wait_until_connected
    raise msg["exception"]
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 844, in _run
    worker = Worker(**worker_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 733, in __init__
    super().__init__(
TypeError: __init__() got an unexpected keyword argument 'nprocs'

jacobtomlinson · 2021-10-21T11:56:22Z

Sorry my bad looks like you should be setting nthreads and ncores.

https://github.com/dask/distributed/blob/3afc6703fb480b9fe5c983c84d6ea115810dd1ff/distributed/nanny.py#L88-L89

shireenrao · 2021-10-21T12:45:58Z

@jacobtomlinson - I am trying this on a ec2 instance with 36 vCPU's. Setting ncores and nthreads only starts 1 Nanny process with 36 threads. Here are the logs from the worker

python -m distributed.cli.dask_spec tcp://XX.XXX.XX.XX:8786 --spec '{"cls": "distributed.nanny.Nanny", "opts": {"ncores":36, "nthreads":1, "name": "dask-44bad6fe-worker-ecaffb24"}}'
/opt/conda/lib/python3.8/site-packages/distributed/nanny.py:172: UserWarning: the ncores= parameter has moved to nthreads=
  warnings.warn("the ncores= parameter has moved to nthreads=")
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.17.0.2:45419'
distributed.worker - INFO -       Start worker at:     tcp://172.17.0.2:42719
distributed.worker - INFO -          Listening to:     tcp://172.17.0.2:42719
distributed.worker - INFO -          dashboard at:           172.17.0.2:45887
distributed.worker - INFO - Waiting to connect to:    tcp://XX.XXX.XX.XX:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         36
distributed.worker - INFO -                Memory:                  68.59 GiB
distributed.worker - INFO -       Local Directory: /dask-worker-space/worker-2ywv7v5h
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://XX.XXX.XX.XX:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

From a jupyter notebook session this is what one sees for the cluster

<Client: 'tcp://172.17.0.2:8786' processes=1 threads=36, memory=68.59 GiB>

Where as when I start the worker manually with

dask-worker tcp://XX.XXX.XX.XX:8786 --nprocs=36

Jupyter Notebooks shows the cluster to be

<Client: 'tcp://172.17.0.2:8786' processes=36 threads=36, memory=68.59 GiB>

All the CPU's are not being utilized.

jacobtomlinson · 2021-10-21T14:24:31Z

Right I'm with you sorry. It seems that dask_spec doesn't support that today. Implementing that would have to be done upstream in distributed.

ansar-sa · 2021-12-02T01:16:03Z

Could we please request / create issue in distributed upstream to integrate this work.

For my use case it makes more sense to have a few very large machines than lots of small VMs.

jacobtomlinson · 2021-12-02T11:14:57Z

The spec in distributes now accepts a list of specs. This does assume that nodes are regular as we would need to specify number of workers and cores/memory for each worker. But we should be able to implement this here now without needing any upstream changes.

cspagemarine · 2022-06-10T15:16:28Z

Has anybody been able to achieve this?

emkademy · 2022-10-05T09:56:40Z

Are there any news on this issue?

kumarprabhu1988 · 2022-11-04T05:28:42Z

@jacobtomlinson I tried this on EC2Cluster like this:

ec2_cluster = EC2Cluster(
    env_vars=env_vars,
    extra_bootstrap=EC2_BOOTSTRAP_COMMANDS,
    filesystem_size=cluster_config["volume_size"],
    instance_tags=cluster_config["ec2_instance_tags"],
    n_workers=cluster_config["n_workers"],
    #worker_class="distributed.nanny.Nanny", 
    worker_options={i: {"cls": Nanny, "options": {"nthreads": 2}} for i in range(4)},
    scheduler_instance_type=cluster_config["scheduler_instance_type"],
    auto_shutdown=False,
    shutdown_on_close=False,
    security=False,  # https://github.com/dask/dask-cloudprovider/issues/249,
    volume_tags=cluster_config["ec2_instance_tags"],
    worker_instance_type=cluster_config["worker_instance_type"],
)

but this errors out with a json encode error in VMCluster

~/miniconda3/envs/pricing/lib/python3.7/site-packages/dask_cloudprovider/generic/vmcluster.py in __init__(self, scheduler, worker_module, worker_class, worker_options, *args, **kwargs)
    138                             "opts": {
    139                                 **worker_options,
--> 140                                 "name": self.name,
    141                             },
    142                         }

~/miniconda3/envs/pricing/lib/python3.7/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    229         cls is None and indent is None and separators is None and
    230         default is None and not sort_keys and not kw):
--> 231         return _default_encoder.encode(obj)
    232     if cls is None:
    233         cls = JSONEncoder

Any ideas? Will this require some changes in cloudprovider?

jacobtomlinson · 2022-11-04T15:51:24Z

@kumarprabhu1988 I think the cls need to be a string, and it probably needs to be "distributed.nanny.Nanny".

kumarprabhu1988 · 2022-11-04T16:28:35Z

@jacobtomlinson So I tried this, and no ec2 worker machine creation fails and there are no workers created, but it is a silent failure - no errors.

jacobtomlinson · 2022-11-07T10:52:17Z

Could you set debug=True and grab the /var/log/cloud-init-output.log logs from the instances so we can see what went wrong?

kumarprabhu1988 · 2022-11-08T04:28:13Z

Sure, I can do that. In the meanwhile I looked at the code and tried a to pass in worker configuration with some code changes and it worked. Here's the change I made. Let me know what you think.

eric-czech mentioned this issue Nov 17, 2020

Identify lack of scalability in gwas_linear_regression sgkit-dev/sgkit#390

Open

jacobtomlinson added the provider/gcp/vm Cluster provider for GCP Instances label Nov 20, 2020

tomwhite mentioned this issue Jan 12, 2021

GCP worker disk size is too low #235

Closed

kumarprabhu1988 mentioned this issue Nov 8, 2022

add new param to allow multiple workers in vmcluster, worker_mixin #389

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support multiple workers per VM #173

Support multiple workers per VM #173

eric-czech commented Nov 17, 2020

quasiben commented Nov 17, 2020

jacobtomlinson commented Nov 18, 2020

quasiben commented Nov 18, 2020

jacobtomlinson commented Nov 18, 2020

jacobtomlinson commented Jan 12, 2021

eric-valente commented May 3, 2021

jacobtomlinson commented May 4, 2021

eric-valente commented May 5, 2021

jacobtomlinson commented May 5, 2021

eric-valente commented May 5, 2021 •

edited

Loading

jacobtomlinson commented May 6, 2021

eric-valente commented May 6, 2021

eric-valente commented May 6, 2021

eric-valente commented May 6, 2021 •

edited

Loading

eric-valente commented May 6, 2021

quasiben commented May 6, 2021

eric-valente commented May 6, 2021 •

edited

Loading

jacobtomlinson commented May 7, 2021

shireenrao commented Oct 19, 2021 •

edited

Loading

jacobtomlinson commented Oct 21, 2021

shireenrao commented Oct 21, 2021

jacobtomlinson commented Oct 21, 2021

shireenrao commented Oct 21, 2021

jacobtomlinson commented Oct 21, 2021

ansar-sa commented Dec 2, 2021

jacobtomlinson commented Dec 2, 2021

cspagemarine commented Jun 10, 2022

emkademy commented Oct 5, 2022

kumarprabhu1988 commented Nov 4, 2022 •

edited

Loading

jacobtomlinson commented Nov 4, 2022

kumarprabhu1988 commented Nov 4, 2022 •

edited

Loading

jacobtomlinson commented Nov 7, 2022

kumarprabhu1988 commented Nov 8, 2022

Support multiple workers per VM #173

Support multiple workers per VM #173

Comments

eric-czech commented Nov 17, 2020

quasiben commented Nov 17, 2020

jacobtomlinson commented Nov 18, 2020

quasiben commented Nov 18, 2020

jacobtomlinson commented Nov 18, 2020

jacobtomlinson commented Jan 12, 2021

eric-valente commented May 3, 2021

jacobtomlinson commented May 4, 2021

eric-valente commented May 5, 2021

jacobtomlinson commented May 5, 2021

eric-valente commented May 5, 2021 • edited Loading

jacobtomlinson commented May 6, 2021

eric-valente commented May 6, 2021

eric-valente commented May 6, 2021

eric-valente commented May 6, 2021 • edited Loading

eric-valente commented May 6, 2021

quasiben commented May 6, 2021

eric-valente commented May 6, 2021 • edited Loading

jacobtomlinson commented May 7, 2021

shireenrao commented Oct 19, 2021 • edited Loading

jacobtomlinson commented Oct 21, 2021

shireenrao commented Oct 21, 2021

jacobtomlinson commented Oct 21, 2021

shireenrao commented Oct 21, 2021

jacobtomlinson commented Oct 21, 2021

ansar-sa commented Dec 2, 2021

jacobtomlinson commented Dec 2, 2021

cspagemarine commented Jun 10, 2022

emkademy commented Oct 5, 2022

kumarprabhu1988 commented Nov 4, 2022 • edited Loading

jacobtomlinson commented Nov 4, 2022

kumarprabhu1988 commented Nov 4, 2022 • edited Loading

jacobtomlinson commented Nov 7, 2022

kumarprabhu1988 commented Nov 8, 2022

eric-valente commented May 5, 2021 •

edited

Loading

eric-valente commented May 6, 2021 •

edited

Loading

eric-valente commented May 6, 2021 •

edited

Loading

shireenrao commented Oct 19, 2021 •

edited

Loading

kumarprabhu1988 commented Nov 4, 2022 •

edited

Loading

kumarprabhu1988 commented Nov 4, 2022 •

edited

Loading