
Add Multi-Node support for DDP Training #504

Merged
merged 4 commits into ultralytics:master from NanoCode012:muti_node on Aug 6, 2020

Conversation

NanoCode012
Contributor

@NanoCode012 NanoCode012 commented Jul 24, 2020

Fix #501.

  • Unit test (CPU, Single GPU, DP mode, 2 GPU DDP mode)

  • Single machine, two terminal

  • Multiple machines, one terminal each

It works on my single machine from two different terminals.

Command:

# node 1
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 0,1
# node 2
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" --master_port=1234 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 2,3

Please set the IP and port to an address and port that are not blocked (e.g. by a firewall) if you are running from two machines.
If you are running from one machine, you can omit the master_addr and port.
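
For example, a single-machine, two-terminal run could look like the following (a sketch only; when the flags are omitted, torch.distributed.launch falls back to its default master address and port):

# terminal 1 (node_rank 0)
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 0,1
# terminal 2 (node_rank 1)
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 2,3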

Would it be necessary to run it on COCO2017 to see the training time decrease?

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved Distributed Data Parallel (DDP) support with rank handling in training code.

📊 Key Changes

  • Changed opt.local_rank to opt.global_rank to better represent the variable's intent.
  • Updated DDP model device assignment to use opt.local_rank instead of rank.
  • Modified rank verification for git status checking and Tensorboard initialization to accommodate a broader range of environments.
  • Initialized opt.global_rank with a default value and updated it within the DDP setup.

🎯 Purpose & Impact

  • Ensures more understandable and maintainable code by using global_rank, which reflects its usage more accurately than local_rank.
  • The changes improve support for various distributed training setups, potentially offering a smoother experience for users training models on multiple GPUs.
  • May lead to fewer errors and easier debugging when users configure distributed training environments.
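
To illustrate the local_rank / global_rank distinction above, here is a minimal sketch (simplified, assumed code rather than the exact train.py implementation): local_rank is supplied by torch.distributed.launch and selects the GPU on the current node, while global_rank uniquely identifies the process across all nodes.

import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)  # filled in by torch.distributed.launch
opt = parser.parse_args()

# -1 means "no DDP"; the launcher exports RANK for each spawned process
opt.global_rank = int(os.environ.get('RANK', -1))

if opt.local_rank != -1:
    torch.cuda.set_device(opt.local_rank)                           # local_rank picks the GPU on this node
    dist.init_process_group(backend='nccl', init_method='env://')   # reads MASTER_ADDR/MASTER_PORT etc.
    opt.global_rank = dist.get_rank()                               # unique rank across all nodes

# rank-0-only work (git status check, TensorBoard, checkpointing) keys off the global rank
if opt.global_rank in [-1, 0]:
    print('single-process run or global rank-0 process')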

@glenn-jocher
Member

@NanoCode012 wow these use cases are getting complicated. So this is if someone wants to run multiple multi-gpu trainings on a single machine?

Should I merge or should we wait for further testing?

@NanoCode012
Contributor Author

Ok. If they want to run multiple multi-GPU trainings on a single machine, they only need to choose a different master_port.
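
For instance, something like this should work (a sketch; the ports are arbitrary free ports, and the remaining train.py arguments are the usual ones):

# training job 1 on GPUs 0,1
$ python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 train.py --weights yolov5s.pt --device 0,1
# training job 2 on GPUs 2,3
$ python -m torch.distributed.launch --nproc_per_node=2 --master_port=1235 train.py --weights yolov5s.pt --device 2,3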

It's getting harder to test with this many configs.

Currently, I assume it should work for multiple machines, but I have not tested it, so I cannot say for sure.

There is one issue open on it; I hope they test it. Magic also said he had the machine available, so we should see.

@glenn-jocher
Member

@NanoCode012 ok, let's wait then and see if we can get test confirmation.

@NanoCode012
Contributor Author

NanoCode012 commented Jul 24, 2020

Sorry, I misread your question; it's a bit late here.

This code is for people wanting to train across multiple machines (nodes). Let's say two PCs with two 2080 Tis each: we can utilize them to train as if we had four GPUs on one machine. I'm just trying to replicate that by using two terminals.

They have to give an IP (belonging to the master machine, which holds GPU 0) and its port. This way all the other machines can connect to it.

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

@MagicFrogSJTU confirmed that the code works. I think this PR is good to go!

Edit: I set my own unit tests to run again in case the merge from master created problems.
Edit2: Unit tests passed.
Edit3: Only train.py has been changed; the other files show up only because of commit history.

@NanoCode012 NanoCode012 marked this pull request as ready for review July 25, 2020 06:59
@glenn-jocher
Member

I was looking through this, and realized most of the changes shown are already in master, which makes for a confusing review. I found this: https://stackoverflow.com/questions/16306012/github-pull-request-showing-commits-that-are-already-in-target-branch

@glenn-jocher glenn-jocher changed the base branch from master to dev July 25, 2020 17:51
@glenn-jocher glenn-jocher changed the base branch from dev to master July 25, 2020 17:51
@glenn-jocher glenn-jocher changed the base branch from master to dev July 25, 2020 17:51
@glenn-jocher glenn-jocher changed the base branch from dev to master July 25, 2020 17:52
@glenn-jocher
Member

glenn-jocher commented Jul 25, 2020

@NanoCode012 I tried the edit button solution, switching the base from master to dev and back again, but I see no changes. The other main solution is to rebase. Can you try this?

git fetch origin
git checkout feature-01
git rebase origin/master
git push --force

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

Sorry for the mess of files. It happened when I updated this branch from master, and it keeps doing this; I'm not sure how to fix it. My current reasoning is that when I opened a PR from your repo to my master, I did a squash merge, and from then on it always detected the two branches to be on different commits.

As stated in my previous comment, the only changes are in train.py.

Edit: Will try your method!

@glenn-jocher
Member

Yes, I know. I don't understand why GitHub handles diffs this way, but it appears to be on purpose based on the Stack Overflow responses. It makes it super complicated to understand what's actually being changed and what's not.

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

Hello, I tried it on a test branch first. I don't think it works; nothing happens.

master...NanoCode012:muti_node_test

@glenn-jocher
Member

glenn-jocher commented Jul 25, 2020

If I use the normal PR branch I still see 7 changes:
NanoCode012/yolov5@muti_node...ultralytics:master

If I use your test branch now I only see 1: EDIT: this diff is from your master to your muti_node_test, I think?
NanoCode012/yolov5@master...muti_node_test

EDIT2: When I use the diff to ultralytics:master I still see 7 changes, from multi_node and multi_node_master. I think you might want to try running the rebase code I pasted before and see if that works.
master...NanoCode012:muti_node_test

EDIT3: The main problem seems to be that there are 6 diffs between my master and your master. I think running git pull, or merging or rebasing will solve this.
master...NanoCode012:master

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

If I use your test branch now I only see 1: EDIT: this diff is from your master to your muti_node_test, I think?
NanoCode012/yolov5@master...muti_node_test

Yes, that is from my master to my branch.

Edit: Right now, I think there was a mistake with that rebase code. I needed to set it to rebase onto your repo's master and not mine.

@glenn-jocher
Member

Edit: Right now, I think there was a mistake with that rebase code. I needed to set it to rebase onto your repo's master and not mine.

I assume this is what git fetch origin solves. Note that git pull = git fetch origin + git merge.
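
Roughly, the following two commands together do what a plain git pull would (a sketch, assuming the current branch tracks origin/master):

git fetch origin
git merge origin/master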

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

Hello, I finally got it working. I had to set origin to your repo and then back to mine. Good lesson for me, thanks.

@glenn-jocher
Member

@NanoCode012 hey perfect! What were your steps so we can advise others how to do this in the future?

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

The code you supplied above (from SO) is for within the same repo; I am working from a fork. What I did was change the remote origin, but a better way would be to create a new remote:

git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature-01
git rebase upstream/master
git push -u origin -f

Change feature-01 to the name of the branch in your fork.

@NanoCode012
Contributor Author

Hi @glenn-jocher, would you still like this PR? How about #500?

Tested and working in the unit tests.

@glenn-jocher
Member

@NanoCode012 yes. Have you tested this with multi-gpu?

@NanoCode012
Contributor Author

Yes: DDP from one terminal and DDP from two (to mimic two different nodes). Also confirmed working by Magic and in #501.

@glenn-jocher
Member

@NanoCode012 let's do both the #500 and #504 merges, since they are closely related. Do you have any preference on which to merge first? This one appears clear to merge, with no conflicts. I'll re-run the CI checks now. Can you rebase this one to origin/master as well?

@glenn-jocher
Member

glenn-jocher commented Aug 6, 2020

CI re-passed now. PR is ready to merge.

EDIT: waiting on thumbs up from @NanoCode012 to merge.

@glenn-jocher glenn-jocher added the TODO High priority items label Aug 6, 2020
@NanoCode012
Contributor Author

Good to go, @glenn-jocher!

@glenn-jocher glenn-jocher merged commit 886b984 into ultralytics:master Aug 6, 2020
@glenn-jocher
Member

@NanoCode012 awesome, all done. Thanks for your hard work!

@NanoCode012 NanoCode012 deleted the muti_node branch August 7, 2020 19:15
@glenn-jocher glenn-jocher removed the TODO High priority items label Nov 12, 2020
burglarhobbit pushed a commit to burglarhobbit/yolov5 that referenced this pull request Jan 1, 2021
* Add support for multi-node DDP

* Remove local_rank confusion

* Fix spacing
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
* Add support for multi-node DDP

* Remove local_rank confusion

* Fix spacing
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
* Add support for multi-node DDP

* Remove local_rank confusion

* Fix spacing
Development

Successfully merging this pull request may close these issues.

subprocess.CalledProcessError with Multi-GPU Training