About performing distributed training #35
Comments
This worked for me
@eschmidbauer Thanks for your help. I tried this command with 2 GPUs, but the training process doesn't seem to be any faster than with 1 GPU. The time elapsed per step is almost the same. How can I check whether my GPUs are being used correctly?
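A quick generic check (not specific to this repo) is to watch GPU utilization and memory while a training step runs; if only one GPU shows a busy process, the launcher isn't actually spreading work across devices:

```
# Watch utilization/memory on all visible GPUs while training runs.
# With working data parallelism, both GPUs should show a process and similar memory use.
watch -n 1 nvidia-smi

# Or log just the numbers once per second:
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```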
do you have the …
I am experiencing the same issue. However, I guess convergence will be faster, since it is equivalent to training the model with a batch twice as large. Although the elapsed time per batch may be the same or even slightly higher, training generally becomes faster overall. That's only my guess, and it would be great if you could share your training experience, i.e. does it work, and do multiple GPUs train faster than a single one?
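For intuition, under data parallelism the effective batch per optimizer step scales with the number of GPUs. The numbers below are assumed for illustration, not taken from this repo's configs:

```
# Hypothetical numbers, not from this repo's configs:
per_gpu_batch=8      # micro-batch processed on each GPU
grad_accum=1         # gradient accumulation steps
num_gpus=2           # data-parallel world size
echo $(( per_gpu_batch * grad_accum * num_gpus ))   # 16 samples per optimizer step
```

So per-step wall-clock time staying flat is expected; the win is more samples per step.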
@eschmidbauer How does this command work if you want to use multiple nodes with more GPUs? "deepspeed --num_nodes 1 --num_gpus 2 --module vall_e.train yaml=config/test/ar.yml" works similarly for me, and it lets you specify --num_nodes easily (I have not tested it with more than one node, though).
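For more than one node, the DeepSpeed launcher usually takes a hostfile listing each worker and its GPU count. A sketch of what that could look like here, with placeholder hostnames and untested against this repo:

```
# hostfile: one line per node, "slots" = number of GPUs on that node (hostnames are placeholders)
cat > hostfile <<EOF
worker-1 slots=2
worker-2 slots=2
EOF

# Launch across 2 nodes x 2 GPUs; passwordless SSH between the nodes is assumed.
deepspeed --hostfile=hostfile --num_nodes 2 --num_gpus 2 \
  --module vall_e.train yaml=config/test/ar.yml
```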
Hello, and thanks for sharing this great code. Is it possible to use this trainer on multiple GPUs? I see that it is based on DeepSpeed, but I can't find any configuration files for distributed training. Could you help me with this? Thanks!