
Multi node - Multi GPU capability #962

Merged: 7 commits into OpenNMT:master on Sep 27, 2018

Conversation

vince62s
Member

This PR allows training on several nodes with several GPUs each, introducing:
-master_ip: IP address of the master node
-master_port: port number of the master node
-world_size: total number of processes to run (total GPUs across all nodes)
-gpu_ranks: list of process indices across all nodes

-gpuid is deprecated
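
For example, launching 4 processes on a single 4-GPU node could look like this (an illustrative sketch, not a command taken from this PR; the address, port and data/model paths are placeholders):

    python train.py -data data/demo -save_model demo-model \
        -master_ip localhost -master_port 10000 \
        -world_size 4 -gpu_ranks 0 1 2 3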

@sebastianGehrmann
Contributor

Hey @vince62s - nice changes! I personally find world_size a little ambiguous. Made me think of the world models paper. Could we change that to something more expressive like total_gpu_process_count?

@vince62s
Member Author

Well, I agree, except that this is exactly the vocabulary used by PyTorch distributed (and subsequently fairseq):
https://pytorch.org/docs/stable/distributed.html

Sebastian, while you're here :) we really need you on the copy stuff that is broken (see #749 for instance)

@sebastianGehrmann
Contributor

I see, haven't really used those. Wonder why they thought it'd be a good idea.

Will have a look at the copy things now that I am back.

@vince62s
Member Author

For the record: when using multiple nodes over a regular 1 Gbps network (as most people have), you really need to use an accum_count of 8 or 16 or more to avoid too many inter-node communications.

@vince62s vince62s merged commit 94d2187 into OpenNMT:master Sep 27, 2018
@vince62s vince62s deleted the newdistrib branch October 8, 2018 18:29
@mjc14

mjc14 commented Nov 3, 2018

Hi @vince62s, I think the code only supports multiple GPUs within a single node: if I have two nodes with four GPUs each, the list of indices can't be 0-7, because the main process can't spawn 8 child processes on each node, and I think the data also needs to be split across the different nodes.

    # Spawn one training process per GPU visible on this node.
    procs = []
    for device_id in range(nb_gpu):
        procs.append(mp.Process(target=run, args=(
            opt, device_id, error_queue, ), daemon=True))
        procs[device_id].start()
        logger.info(" Starting process pid: %d  " % procs[device_id].pid)
        error_handler.add_child(procs[device_id].pid)
    # Wait for all local workers to finish.
    for p in procs:
        p.join()

I think Facebook's fairseq will give you more ideas; it supports multiple GPUs both within a node and across nodes.
https://github.com/pytorch/fairseq
https://fairseq.readthedocs.io/en/latest/getting_started.html#advanced-training-options

@vince62s
Member Author

vince62s commented Nov 3, 2018

It does work.
On the master node you run with -world_size 8 -gpu_ranks 0 1 2 3, and on node 2 you run with -world_size 8 -gpu_ranks 4 5 6 7.
Make sure CUDA_VISIBLE_DEVICES is set to 0,1,2,3 on each node.
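
For concreteness, a sketch of the two launch commands (the master address, port and data/model paths are placeholders, not values from this PR):

    # on the master node (ranks 0-3)
    CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py -data data/demo -save_model demo-model \
        -master_ip <master_node_ip> -master_port 10000 \
        -world_size 8 -gpu_ranks 0 1 2 3

    # on the second node (ranks 4-7)
    CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py -data data/demo -save_model demo-model \
        -master_ip <master_node_ip> -master_port 10000 \
        -world_size 8 -gpu_ranks 4 5 6 7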

@mjc14

mjc14 commented Nov 3, 2018

Sorry, it was my mistake. I read your code carefully this morning and it does work; I use slurm to control my jobs, which is where my mistake came from. Your code works almost perfectly for my use case. Thanks very much.

@vince62s
Member Author

vince62s commented Nov 3, 2018

FYI, one thing:
if your network is slow (e.g. 1 Gbps), each update will take time to synchronize across the 2 nodes, depending on the size of your model.
For instance, if your model is about 500MB, each update takes 5-6 seconds,
so you'd better use a high -accum value (like 16 or 32),
but it all depends on your use case.
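
A rough back-of-the-envelope check on those numbers (assuming the full parameter set is synchronized on every update; a simplification, since an all-reduce actually moves somewhat more data):

    500 MB of gradients   ≈ 4,000 Mb
    4,000 Mb / 1,000 Mbps ≈ 4 s on the wire per synchronization
                            (plus overhead, hence the 5-6 s above)
    with -accum 16        → one synchronization per 16 batches instead of per batch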

@mjc14

mjc14 commented Nov 3, 2018

Thanks, reducing the number of updates is what's needed. I have a test result for NCCL2 all_reduce on my nodes: about 5.09 GB/s across four nodes with 4 GPUs each.

@mpatwary

Does anyone have an sbatch/slurm example for running experiments on multi-node multi-GPU? I am not sure how to set the master_ip and port when slurm assigns nodes on the fly. Help appreciated.

@mpatwary

mpatwary commented Jan 8, 2019

@mjc14 That is a different codebase, right?

I am interested in running this codebase in a multi-node environment using slurm. I tried the following, but it doesn't produce anything.

sbatch -N 2 -p TitanXx8 --ntasks 2 --gres=gpu:8 --cpus-per-task 2 --job-name=smt --signal=USR1@600 --wrap "srun stdbuf -i0 -o0 -e0 python train.py -data data/demo -save_model demo-model-16 -world_size 16 -gpu_ranks 0 1 2 3 4 5 6 7" --output output-16.out

For single node, using the following command works fine.

sbatch -N 1 -p TitanXx8 --ntasks 1 --gres=gpu:8 --cpus-per-task 2 --job-name=smt --signal=USR1@600 --wrap "srun stdbuf -i0 -o0 -e0 python train.py -data data/demo -save_model demo-model-08 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7" --output output-08.out

For multi-node, I added the following lines to onmt/utils/distributed.py so that I don't need to pass master_ip in the sbatch submission.

    import os
    import subprocess

    def multi_init(opt, device_id):
        # Use the first host in the SLURM allocation as the rendezvous master.
        node_list = os.environ.get('SLURM_JOB_NODELIST')
        hostnames = subprocess.check_output(['scontrol', 'show', 'hostnames', node_list])
        host = hostnames.split()[0].decode('utf-8')
        dist_init_method = 'tcp://{master_ip}:{master_port}'.format(
            master_ip=host, master_port=opt.master_port)  # host instead of opt.master_ip

@mjc14

mjc14 commented Jan 9, 2019

Yes, it is a different codebase. You can't run this codebase in a multi-node environment using slurm directly; you need to change some code. Reading the fairseq project will be useful.
The following two files are the key:
https://github.com/pytorch/fairseq/blob/master/train.py
https://github.com/pytorch/fairseq/blob/master/fairseq/distributed_utils.py
