
Add Multi-Node support for DDP Training #504

Merged
merged 4 commits into ultralytics:master from NanoCode012:muti_node on Aug 6, 2020

Conversation

NanoCode012
Contributor

@NanoCode012 NanoCode012 commented Jul 24, 2020

Fix #501.

  • Unit test (CPU, Single GPU, DP mode, 2 GPU DDP mode)

  • Single machine, two terminal

  • Multiple machines, one terminal each

It works on my single machine from two different terminals.

Command:

# node 1
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 0,1
# node 2
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" --master_port=1234 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 2,3

Please set the IP and port to an address and port that are not blocked (e.g. by a firewall) if you are running from two machines.
If you are running from one machine, you can omit the master_addr and port.
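
For example, a single-machine, two-terminal run could look like the following (a sketch only; when the flags are omitted, torch.distributed.launch falls back to its default master address and port):

# terminal 1 (node_rank 0)
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 0,1
# terminal 2 (node_rank 1)
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 2,3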

Would it be necessary to run it on COCO2017 to see the training time decrease?

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved Distributed Data Parallel (DDP) support with rank handling in training code.

📊 Key Changes

  • Changed opt.local_rank to opt.global_rank to better represent the variable's intent.
  • Updated DDP model device assignment to use opt.local_rank instead of rank.
  • Modified rank verification for git status checking and Tensorboard initialization to accommodate a broader range of environments.
  • Initialized opt.global_rank with a default value and updated it within the DDP setup.

🎯 Purpose & Impact

  • Ensures more understandable and maintainable code by using global_rank, which reflects its usage more accurately than local_rank.
  • The changes improve support for various distributed training setups, potentially offering a smoother experience for users training models on multiple GPUs.
  • May lead to fewer errors and easier debugging when users configure distributed training environments.
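
To illustrate the local_rank / global_rank distinction above, here is a minimal sketch (simplified, assumed code rather than the exact train.py implementation): local_rank is supplied by torch.distributed.launch and selects the GPU on the current node, while global_rank uniquely identifies the process across all nodes.

import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)  # filled in by torch.distributed.launch
opt = parser.parse_args()

# -1 means "no DDP"; the launcher exports RANK for each spawned process
opt.global_rank = int(os.environ.get('RANK', -1))

if opt.local_rank != -1:
    torch.cuda.set_device(opt.local_rank)                           # local_rank picks the GPU on this node
    dist.init_process_group(backend='nccl', init_method='env://')   # reads MASTER_ADDR/MASTER_PORT etc.
    opt.global_rank = dist.get_rank()                               # unique rank across all nodes

# rank-0-only work (git status check, TensorBoard, checkpointing) keys off the global rank
if opt.global_rank in [-1, 0]:
    print('single-process run or global rank-0 process')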

@glenn-jocher
Member

@NanoCode012 wow these use cases are getting complicated. So this is if someone wants to run multiple multi-gpu trainings on a single machine?

Should I merge or should we wait for further testing?

@NanoCode012
Contributor Author

Ok. If they want to run multiple multi-GPU trainings on a single machine, they only need to choose a different master_port.
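
For instance, something like this should work (a sketch; the ports are arbitrary free ports, and the remaining train.py arguments are the usual ones):

# training job 1 on GPUs 0,1
$ python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 train.py --weights yolov5s.pt --device 0,1
# training job 2 on GPUs 2,3
$ python -m torch.distributed.launch --nproc_per_node=2 --master_port=1235 train.py --weights yolov5s.pt --device 2,3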

It's getting harder to test with this many configs.

Currently, I assume it should work for multiple machines, but I have not tested it, so I cannot say for sure.

There is one issue open on it; I hope they test it. Magic also said he had the machine available, so we should see.

@glenn-jocher
Member

@NanoCode012 ok, let's wait then and see if we can get test confirmation.

@NanoCode012
Contributor Author

NanoCode012 commented Jul 24, 2020

Sorry, I misread your question; it's a bit late here.

This code is for people wanting to train across multiple machines (nodes). Let's say two PCs with two 2080 Tis each: we can utilize them to train as if we had four GPUs on one machine. I'm just trying to replicate that by using two terminals.

They have to give an IP (belonging to the master machine, which holds GPU 0) and its port. This way all the other machines can connect to it.

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

@MagicFrogSJTU confirmed that the code works. I think this PR is good to go!

Edit: I set my own unit tests to run again in case the merge from master created problems.
Edit2: Unit tests passed.
Edit3: Only train.py has been changed; the other files show up only because of commit history.

@NanoCode012 NanoCode012 marked this pull request as ready for review July 25, 2020 06:59
@glenn-jocher
Member

I was looking through this, and realized most of the changes shown are already in master, which makes for a confusing review. I found this: https://stackoverflow.com/questions/16306012/github-pull-request-showing-commits-that-are-already-in-target-branch

@glenn-jocher glenn-jocher changed the base branch from master to dev July 25, 2020 17:51
@glenn-jocher glenn-jocher changed the base branch from dev to master July 25, 2020 17:51
@glenn-jocher glenn-jocher changed the base branch from master to dev July 25, 2020 17:51
@glenn-jocher glenn-jocher changed the base branch from dev to master July 25, 2020 17:52
@glenn-jocher
Member

glenn-jocher commented Jul 25, 2020

@NanoCode012 I tried the edit button solution, switching the base from master to dev and back again, but I see no changes. The other main solution is to rebase. Can you try this?

git fetch origin
git checkout feature-01
git rebase origin/master
git push --force

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

Sorry for the mess of files. It happened when I updated this branch from master, and it keeps doing this; I'm not sure how to fix it. My current reasoning is that when I opened a PR from your repo to my master, I did a squash merge, and from then on it always detected the two branches to be on different commits.

As stated in my previous comment, the only changes are in train.py.

Edit: Will try your method!

@glenn-jocher
Member

Yes, I know. I don't understand why GitHub handles diffs this way, but it appears to be on purpose based on the Stack Overflow responses. It makes it super complicated to understand what's actually being changed and what's not.

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

Hello, I tried it on a test branch first. I don't think it works; nothing happens.

master...NanoCode012:muti_node_test

@glenn-jocher
Member

glenn-jocher commented Jul 25, 2020

If I use the normal PR branch I still see 7 changes:
NanoCode012/yolov5@muti_node...ultralytics:master

If I use your test branch now I only see 1: EDIT: this diff is from your master to your muti_node_test, I think?
NanoCode012/yolov5@master...muti_node_test

EDIT2: When I use the diff to ultralytics:master I still see 7 changes, from multi_node and multi_node_master. I think you might want to try running the rebase code I pasted before and see if that works.
master...NanoCode012:muti_node_test

EDIT3: The main problem seems to be that there are 6 diffs between my master and your master. I think running git pull, or merging or rebasing will solve this.
master...NanoCode012:master

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

If I use your test branch now I only see 1: EDIT: this diff is from your master to your muti_node_test, I think?
NanoCode012/yolov5@master...muti_node_test

Yes, that is from my master to my branch.

Edit: Right now, I think there was a mistake with that rebase code. I needed to set it to rebase onto your repo's master and not mine.

@glenn-jocher
Member

Edit: Right now, I think there was a mistake with that rebase code. I needed to set it to rebase onto your repo's master and not mine.

I assume this is what git fetch origin solves. Note that git pull = git fetch origin + git merge.
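
Roughly, the following two commands together do what a plain git pull would (a sketch, assuming the current branch tracks origin/master):

git fetch origin
git merge origin/master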

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

Hello, I finally got it working. I had to set origin to your repo and then back to mine. Good lesson for me, thanks.

@glenn-jocher
Member

@NanoCode012 hey perfect! What were your steps so we can advise others how to do this in the future?

@NanoCode012
Contributor Author

NanoCode012 commented Jul 25, 2020

The code you supplied above (from SO) is for within the same repo; I am working from a fork. What I did was change the remote origin, but a better way would be to create a new remote:

git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature-01
git rebase upstream/master
git push -u origin -f

Change feature-01 to the name of the branch in your fork.

@NanoCode012
Contributor Author

Hi @glenn-jocher, would you still like this PR? How about #500?

Tested and working in the unit tests.

@glenn-jocher
Member

@NanoCode012 yes. Have you tested this with multi-gpu?

@NanoCode012
Contributor Author

Yes: DDP from one terminal and DDP from two (to mimic two different nodes). Also confirmed working by Magic and in #501.

@glenn-jocher
Member

@NanoCode012 let's do both the #500 and #504 merges, since they are closely related. Do you have any preference on which to merge first? This one appears clear to merge, with no conflicts. I'll re-run the CI checks now. Can you rebase this one to origin/master as well?

@glenn-jocher
Member

glenn-jocher commented Aug 6, 2020

CI re-passed now. PR is ready to merge.

EDIT: waiting on thumbs up from @NanoCode012 to merge.

@glenn-jocher glenn-jocher added the TODO High priority items label Aug 6, 2020
@NanoCode012
Contributor Author

Good to go, @glenn-jocher!

@glenn-jocher glenn-jocher merged commit 886b984 into ultralytics:master Aug 6, 2020
@glenn-jocher
Member

@NanoCode012 awesome, all done. Thanks for your hard work!

@NanoCode012 NanoCode012 deleted the muti_node branch August 7, 2020 19:15
@glenn-jocher glenn-jocher removed the TODO High priority items label Nov 12, 2020
burglarhobbit pushed a commit to burglarhobbit/yolov5 that referenced this pull request Jan 1, 2021
* Add support for multi-node DDP

* Remove local_rank confusion

* Fix spacing
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
* Add support for multi-node DDP

* Remove local_rank confusion

* Fix spacing
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
* Add support for multi-node DDP

* Remove local_rank confusion

* Fix spacing
Development

Successfully merging this pull request may close these issues.

subprocess.CalledProcessError with Multi-GPU Training