
Multi node - Multi GPU capability #962

Merged: 7 commits into OpenNMT:master on Sep 27, 2018

Conversation

vince62s
Member

This PR allows training on several nodes with several GPUs each, introducing:
-master_ip: IP address of the master node
-master_port: port number of the master node
-world_size: total number of processes to run (total GPUs across all nodes)
-gpu_ranks: list of process indices across all nodes

-gpuid is deprecated
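
For example, launching 4 processes on a single 4-GPU node could look like this (an illustrative sketch, not a command taken from this PR; the address, port and data/model paths are placeholders):

    python train.py -data data/demo -save_model demo-model \
        -master_ip localhost -master_port 10000 \
        -world_size 4 -gpu_ranks 0 1 2 3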

@sebastianGehrmann
Contributor

Hey @vince62s - nice changes! I personally find world_size a little ambiguous. Made me think of the world models paper. Could we change that to something more expressive like total_gpu_process_count?

@vince62s
Member Author

Well, I agree, except that this is exactly the vocabulary used by PyTorch distributed (and subsequently fairseq):
https://pytorch.org/docs/stable/distributed.html

Sebastian, while you're here :) we really need you on the copy stuff that is broken (see #749 for instance)

@sebastianGehrmann
Contributor

I see, haven't really used those. Wonder why they thought it'd be a good idea.

Will have a look at the copy things now that I am back.

@vince62s
Member Author

For the record: when using multiple nodes over a regular 1 Gbps network (as most people have), you really need to use an accum_count of 8 or 16 or more to avoid too many inter-node communications.

@vince62s vince62s merged commit 94d2187 into OpenNMT:master Sep 27, 2018
@vince62s vince62s deleted the newdistrib branch October 8, 2018 18:29
@mjc14

mjc14 commented Nov 3, 2018

Hi @vince62s, I think the code only supports multiple GPUs within a single node: if I have two nodes with four GPUs each, the list of indices can't be 0-7, because the main process can't spawn 8 child processes on each node, and I think the data also needs to be split across the different nodes.

    # Spawn one training process per GPU visible on this node.
    procs = []
    for device_id in range(nb_gpu):
        procs.append(mp.Process(target=run, args=(
            opt, device_id, error_queue, ), daemon=True))
        procs[device_id].start()
        logger.info(" Starting process pid: %d  " % procs[device_id].pid)
        error_handler.add_child(procs[device_id].pid)
    # Wait for all local workers to finish.
    for p in procs:
        p.join()

I think Facebook's fairseq will give you more ideas; it supports multiple GPUs both within a node and across nodes.
https://github.com/pytorch/fairseq
https://fairseq.readthedocs.io/en/latest/getting_started.html#advanced-training-options

@vince62s
Member Author

vince62s commented Nov 3, 2018

It does work.
On the master node you run with -world_size 8 -gpu_ranks 0 1 2 3, and on node 2 you run with -world_size 8 -gpu_ranks 4 5 6 7.
Make sure CUDA_VISIBLE_DEVICES is set to 0,1,2,3 on each node.
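
For concreteness, a sketch of the two launch commands (the master address, port and data/model paths are placeholders, not values from this PR):

    # on the master node (ranks 0-3)
    CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py -data data/demo -save_model demo-model \
        -master_ip <master_node_ip> -master_port 10000 \
        -world_size 8 -gpu_ranks 0 1 2 3

    # on the second node (ranks 4-7)
    CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py -data data/demo -save_model demo-model \
        -master_ip <master_node_ip> -master_port 10000 \
        -world_size 8 -gpu_ranks 4 5 6 7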

@mjc14

mjc14 commented Nov 3, 2018

Sorry, it was my mistake. I read your code carefully this morning and it does work; I use slurm to control my jobs, which is where my mistake came from. Your code works almost perfectly for my use case. Thanks very much.

@vince62s
Member Author

vince62s commented Nov 3, 2018

FYI, one thing:
if your network is slow (e.g. 1 Gbps), each update will take time to synchronize across the 2 nodes, depending on the size of your model.
For instance, if your model is about 500MB, each update takes 5-6 seconds,
so you'd better use a high -accum value (like 16 or 32),
but it all depends on your use case.
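
A rough back-of-the-envelope check on those numbers (assuming the full parameter set is synchronized on every update; a simplification, since an all-reduce actually moves somewhat more data):

    500 MB of gradients   ≈ 4,000 Mb
    4,000 Mb / 1,000 Mbps ≈ 4 s on the wire per synchronization
                            (plus overhead, hence the 5-6 s above)
    with -accum 16        → one synchronization per 16 batches instead of per batch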

@mjc14

mjc14 commented Nov 3, 2018

Thanks, reducing the number of updates is what's needed. I have a test result for NCCL2 all_reduce on my nodes: about 5.09 GB/s across four nodes with 4 GPUs each.

@mpatwary

Does anyone have an sbatch/slurm example for running experiments on multi-node multi-GPU? I am not sure how to set the master_ip and port when slurm assigns nodes on the fly. Help appreciated.

@mpatwary

mpatwary commented Jan 8, 2019

@mjc14 That is a different codebase, right?

I am interested in running this codebase in a multi-node environment using slurm. I tried the following, but it doesn't produce anything.

sbatch -N 2 -p TitanXx8 --ntasks 2 --gres=gpu:8 --cpus-per-task 2 --job-name=smt --signal=USR1@600 --wrap "srun stdbuf -i0 -o0 -e0 python train.py -data data/demo -save_model demo-model-16 -world_size 16 -gpu_ranks 0 1 2 3 4 5 6 7" --output output-16.out

For single node, using the following command works fine.

sbatch -N 1 -p TitanXx8 --ntasks 1 --gres=gpu:8 --cpus-per-task 2 --job-name=smt --signal=USR1@600 --wrap "srun stdbuf -i0 -o0 -e0 python train.py -data data/demo -save_model demo-model-08 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7" --output output-08.out

For multi-node, I added the following lines to onmt/utils/distributed.py so that I don't need to pass master_ip in the sbatch submission.

    import os
    import subprocess

    def multi_init(opt, device_id):
        # Use the first host in the SLURM allocation as the rendezvous master.
        node_list = os.environ.get('SLURM_JOB_NODELIST')
        hostnames = subprocess.check_output(['scontrol', 'show', 'hostnames', node_list])
        host = hostnames.split()[0].decode('utf-8')
        dist_init_method = 'tcp://{master_ip}:{master_port}'.format(
            master_ip=host, master_port=opt.master_port)  # host instead of opt.master_ip

@mjc14

mjc14 commented Jan 9, 2019

Yes, it is a different codebase. You can't run this codebase in a multi-node environment using slurm directly; you need to change some code. Reading the fairseq project will be useful.
The following two files are the key:
https://github.com/pytorch/fairseq/blob/master/train.py
https://github.com/pytorch/fairseq/blob/master/fairseq/distributed_utils.py
