About performing distributed training #35

Open

jry-king opened this issue Feb 2, 2023 · 5 comments

@jry-king commented Feb 2, 2023

Hello, and thanks for sharing this great code. Is it possible to use this trainer on multiple GPUs? I see that it is based on DeepSpeed, but I can't find any configuration files for distributed training. Could you help me with this? Thanks!

@eschmidbauer commented:

python -m torch.distributed.launch --nproc_per_node 2 -m vall_e.train yaml=config/test/ar.yml

This worked for me
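
(A side note I haven't verified against this repo, so treat it as an assumption: recent PyTorch versions deprecate torch.distributed.launch in favor of torchrun, so the equivalent launch should be

torchrun --nproc_per_node 2 -m vall_e.train yaml=config/test/ar.yml
)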

@jry-king commented Feb 3, 2023

@eschmidbauer Thanks for your help. I tried this command with 2 GPUs, but the training process doesn't seem to be any faster than with 1 GPU; the time elapsed per step is almost the same. How can I check whether both GPUs are actually being used?

@eschmidbauer commented:

Do you have the nvtop tool? It should show both GPUs in use. I believe nvidia-smi can also give you that info.
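
For example, while training is running, something like

watch -n 1 nvidia-smi

should list a python process on both GPUs with non-zero GPU utilization (the exact output layout varies by driver version).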

@cantabile-kwok commented:

> @eschmidbauer Thanks for your help. I tried this command with 2 GPUs, but the training process doesn't seem to be any faster than with 1 GPU; the time elapsed per step is almost the same. How can I check whether both GPUs are actually being used?

I am experiencing the same issue. However, I would guess that convergence still speeds up, because this is equivalent to training with a batch twice as large. Even though the elapsed time per step may be the same or slightly higher, training overall should become faster. That's only my guess, so it would be great if you could share your experience: does it work, and do multiple GPUs train faster than a single one?
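
To put rough, made-up numbers on it: if each GPU processes a batch of 8 and one step takes about 1 second, a single GPU needs 1000 steps (~1000 s) to see 8000 samples, while two GPUs consume 16 samples per step and need only 500 steps (~500 s plus some communication overhead) for the same 8000 samples. The per-step time barely changes, but each epoch finishes in roughly half the steps.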

@JaejinCho commented Mar 19, 2023

> python -m torch.distributed.launch --nproc_per_node 2 -m vall_e.train yaml=config/test/ar.yml
>
> This worked for me

@eschmidbauer How does this command work if you want to use multiple nodes with more GPUs?

It seems that "deepspeed --num_nodes 1 --num_gpus 2 --module vall_e.train yaml=config/test/ar.yml" works similarly for me, but lets me specify --num_nodes easily (I have not tested it with more than one node, though).
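
In case it helps (untested beyond a single node; the hostnames below are placeholders): for multiple nodes the DeepSpeed launcher reads a hostfile listing each machine and its GPU count, and it needs passwordless SSH between the nodes, e.g.

# hostfile
node1 slots=2
node2 slots=2

deepspeed --hostfile hostfile --module vall_e.train yaml=config/test/ar.yml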
