Multi node - Multi GPU capability #962
Conversation
Hey @vince62s - nice changes! I personally find …
Well, I agree, except that this is exactly the vocabulary used by PyTorch distributed (and subsequently Fairseq...). Sebastian, while you're here :) we really need you on the copy stuff that is broken (see #749 for instance).
I see, haven't really used those. Wonder why they thought it'd be a good idea. Will have a look at the copy things now that I am back.
For the record, when using multi-node training, if like most people you have a regular 1 Gbps network, you really need to use an accum_count above 8 or 16 to avoid too many inter-node communications.
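To illustrate why a larger accum_count helps on a slow interconnect, here is a minimal sketch of the general gradient-accumulation idea (not OpenNMT-py's actual trainer code): gradients are synchronized once per accum_count batches instead of once per batch, so the 1 Gbps link is hit far less often.

```python
# Sketch of gradient accumulation: one inter-node all_reduce every
# `accum_count` batches instead of every batch (illustrative only,
# not the project's actual training loop).
import torch.distributed as dist

def accumulate_and_step(model, optimizer, batches, accum_count):
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        loss = model(batch)              # hypothetical forward pass returning a loss
        (loss / accum_count).backward()  # scale so the accumulated gradient averages the batches
        if (i + 1) % accum_count == 0:
            # Single synchronization point for the accumulated gradients.
            for p in model.parameters():
                if p.grad is not None:
                    dist.all_reduce(p.grad)
                    p.grad /= dist.get_world_size()
            optimizer.step()
            optimizer.zero_grad()
```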
Hi,
I think Facebook's fairseq will give you more ideas. It supports multiple GPUs within a node and GPUs across nodes.
It does work.
Sorry, it was my mistake. I read your code carefully this morning and it did work. Because I use Slurm to control my jobs, I made some mistakes. Your code is mostly perfect for my work. Thanks very much.
FYI one thing: …
Thanks, reducing the number of updates is needed. I have a test result for NCCL2 (all_reduce) on my nodes: about 5.09 GB/s across four nodes with 4 GPUs each.
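For reference, a rough throughput probe can be written directly against torch.distributed. This is a minimal sketch, assuming one process per GPU and the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables; it is not the script used to obtain the figure above.

```python
# Hypothetical all_reduce throughput probe; tensor size and iteration count
# are illustrative, not the setup that produced the 5.09 GB/s number.
import os
import time
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # 256M floats ~= 1 GiB per all_reduce.
    tensor = torch.ones(256 * 1024 * 1024, device="cuda")
    n_iters = 10

    # Warm-up so setup cost is excluded from the timing.
    dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(n_iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    if rank == 0:
        gb = tensor.numel() * tensor.element_size() * n_iters / 1e9
        print(f"approx. all_reduce throughput: {gb / elapsed:.2f} GB/s")

if __name__ == "__main__":
    main()
```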
Does anyone have an sbatch/Slurm example to run experiments on multi-node multi-GPU? I am not sure how to set the master_ip and port when Slurm assigns nodes on the fly. Help appreciated.
@mjc14 That is a different codebase, right? I am interested in running this codebase in a multi-node environment using Slurm. I tried the following one, but it doesn't produce anything.
For a single node, the following command works fine.
For multi-node, I added the following lines in onmt/utils/distributed.py so that I don't need to pass master_ip in the sbatch submission.
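The exact lines are not shown above. One plausible way to derive the master address and port from Slurm's environment is sketched below; this assumes SLURM_JOB_NODELIST, SLURM_JOB_ID, and the scontrol utility are available on the compute nodes, and is not necessarily what was added to distributed.py here.

```python
# Hypothetical helper: derive the rendezvous address from Slurm's environment
# instead of passing -master_ip/-master_port explicitly. Illustrative only.
import os
import subprocess

def slurm_master_addr(default_port=10000):
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    # Expand the compact nodelist (e.g. "node[01-04]") and take the first host.
    hostnames = subprocess.check_output(
        ["scontrol", "show", "hostnames", nodelist]).decode().split()
    master_ip = hostnames[0]
    # Derive a port from the job id so concurrent jobs do not collide.
    master_port = default_port + int(os.environ["SLURM_JOB_ID"]) % 1000
    return master_ip, master_port
```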
Yes, it is a different codebase. You cannot run this codebase in a multi-node environment using Slurm directly; you need to change some code. You can read the fairseq project, it will be useful.
This PR allows training on several nodes with several GPUs, introducing the options below (a sketch of how they map onto torch.distributed follows the list):
-master_ip: IP address of the master node
-master_port: port number of the master node
-world_size: total number of processes to be run (total GPUs across all nodes)
-gpu_ranks: list of indices of processes across all nodes
-gpuid is deprecated
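As a rough illustration only (not the PR's exact code), the new options correspond to PyTorch's distributed initialization roughly as follows, with one process per entry of gpu_ranks:

```python
# Illustrative sketch: how -master_ip, -master_port, -world_size and -gpu_ranks
# relate to torch.distributed initialization (not the PR's exact code).
import torch
import torch.distributed as dist

def init_distributed(master_ip, master_port, world_size, rank, local_gpu):
    # Each process drives one GPU; `rank` is this process's entry in gpu_ranks.
    torch.cuda.set_device(local_gpu)
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_ip}:{master_port}",
        world_size=world_size,
        rank=rank,
    )

# Example: 2 nodes x 2 GPUs -> world_size=4.
# Node 0 runs ranks [0, 1] on its local GPUs 0 and 1;
# node 1 runs ranks [2, 3] on its local GPUs 0 and 1.
```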