About performing distributed training #35

Open

jry-king opened this issue Feb 2, 2023 · 5 comments

@jry-king commented Feb 2, 2023

Hello, and thanks for sharing this great code. Is it possible to use this trainer on multiple GPUs? I see that it is based on DeepSpeed, but I can't find any configuration files for distributed training. Could you help me with this? Thanks!

@eschmidbauer commented:

python -m torch.distributed.launch --nproc_per_node 2 -m vall_e.train yaml=config/test/ar.yml

This worked for me
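
(A side note I haven't verified against this repo, so treat it as an assumption: recent PyTorch versions deprecate torch.distributed.launch in favor of torchrun, so the equivalent launch should be

torchrun --nproc_per_node 2 -m vall_e.train yaml=config/test/ar.yml
)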

@jry-king commented Feb 3, 2023

@eschmidbauer Thanks for your help. I tried this command with 2 GPUs, but the training process doesn't seem to be any faster than with 1 GPU; the time elapsed per step is almost the same. How can I check whether both GPUs are actually being used?

@eschmidbauer commented:

Do you have the nvtop tool? It should show both GPUs in use. I believe nvidia-smi can also give you that info.
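
For example, while training is running, something like

watch -n 1 nvidia-smi

should list a python process on both GPUs with non-zero GPU utilization (the exact output layout varies by driver version).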

@cantabile-kwok commented:

> @eschmidbauer Thanks for your help. I tried this command with 2 GPUs, but the training process doesn't seem to be any faster than with 1 GPU; the time elapsed per step is almost the same. How can I check whether both GPUs are actually being used?

I am experiencing the same issue. However, I would guess that convergence still speeds up, because this is equivalent to training with a batch twice as large. Even though the elapsed time per step may be the same or slightly higher, training overall should become faster. That's only my guess, so it would be great if you could share your experience: does it work, and do multiple GPUs train faster than a single one?
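
To put rough, made-up numbers on it: if each GPU processes a batch of 8 and one step takes about 1 second, a single GPU needs 1000 steps (~1000 s) to see 8000 samples, while two GPUs consume 16 samples per step and need only 500 steps (~500 s plus some communication overhead) for the same 8000 samples. The per-step time barely changes, but each epoch finishes in roughly half the steps.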

@JaejinCho commented Mar 19, 2023

> python -m torch.distributed.launch --nproc_per_node 2 -m vall_e.train yaml=config/test/ar.yml
>
> This worked for me

@eschmidbauer How does this command work if you want to use multiple nodes with more GPUs?

It seems that "deepspeed --num_nodes 1 --num_gpus 2 --module vall_e.train yaml=config/test/ar.yml" works similarly for me, but lets me specify --num_nodes easily (I have not tested it with more than one node, though).
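
In case it helps (untested beyond a single node; the hostnames below are placeholders): for multiple nodes the DeepSpeed launcher reads a hostfile listing each machine and its GPU count, and it needs passwordless SSH between the nodes, e.g.

# hostfile
node1 slots=2
node2 slots=2

deepspeed --hostfile hostfile --module vall_e.train yaml=config/test/ar.yml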
