Add Multi-Node support for DDP Training #504
Conversation
@NanoCode012 wow these use cases are getting complicated. So this is if someone wants to run multiple multi-GPU trainings on a single machine? Should I merge or should we wait for further testing?
OK. If they want to run multiple multi-GPU trainings on a single machine, they only need to choose a different `master_port`. It's getting harder to test with this many configs. Currently, I assume it should work for multiple machines, but I have not tested it, so I cannot say for sure. There is one issue open on it; I hope they test it. Magic also said he had the machine available, so we should see.
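For example, a rough sketch of what I mean (ports, GPU indices, and train.py arguments here are placeholders, not commands from this PR): two independent 2-GPU trainings on one machine just pick different ports.
# first training on GPUs 0,1 using port 9990
python -m torch.distributed.launch --nproc_per_node 2 --master_port 9990 train.py --device 0,1
# second training on GPUs 2,3 using a different port, 9991
python -m torch.distributed.launch --nproc_per_node 2 --master_port 9991 train.py --device 2,3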
@NanoCode012 OK, let's wait then and see if we can get test confirmation.
Sorry, I misread your question; it's a bit late here. This code is for people wanting to train from multiple machines (nodes). Let's say, two PCs with two 2080 Tis each. We can utilize them to train as if we had 4 GPUs. I'm just trying to replicate it by using two terminals. They have to give an IP (belonging to the master machine, GPU 0) and its port. This way all the other machines can connect to it.
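As a concrete sketch of that setup (the IP, port, and arguments below are made-up placeholders, not the exact commands): each machine launches the same training, and only the node rank differs.
# on the master machine (node 0), whose IP the other machine can reach
python -m torch.distributed.launch --nnodes 2 --node_rank 0 --nproc_per_node 2 --master_addr 192.168.1.10 --master_port 9999 train.py
# on the second machine (node 1), pointing at the master's IP and port
python -m torch.distributed.launch --nnodes 2 --node_rank 1 --nproc_per_node 2 --master_addr 192.168.1.10 --master_port 9999 train.py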
@MagicFrogSJTU confirmed the code works. I think this PR is good to go! Edit: I set my own UnitTest to run again in case the merge from master created problems.
I was looking through this, and realized most of the changes shown are already in master, which makes for a confusing review. I found this: https://stackoverflow.com/questions/16306012/github-pull-request-showing-commits-that-are-already-in-target-branch
@NanoCode012 I tried the edit button solution, switched from master to dev branch and back again but I see no changes. The other main solution is to rebase. Can you try this?
git fetch origin
git checkout feature-01
git rebase origin/master
git push --force
Sorry for the mess of files. It happened when I updated this branch to master. It keeps doing this, and I'm not sure how to fix it. My current reasoning is that when I PR'd from your repo to my master, I did a squash merge, and from there on it always detected the two branches to be on different commits. As stated in the previous comment, the only changes are in
Edit: Will try your method!
Yes I know. I don't understand why GitHub handles diffs this way, but it appears to be on purpose based on the Stack Overflow responses. It makes it super complicated to understand what's actually being changed and what's not.
Hello, I did it on a test branch first. I don't think it works. Nothing happens.
If I use the normal PR branch I still see 7 changes. If I use your test branch now I only see 1.
EDIT: this diff is from your master to your multi_node_test I think?
EDIT2: When I use the diff to ultralytics:master I still see 7 changes, from multi_node and multi_node_master. I think you might want to try running the rebase code I pasted before and see if that works.
EDIT3: The main problem seems to be that there are 6 diffs between my master and your master. I think running git pull, or merging or rebasing, will solve this.
Yes, that is from my master to my branch.
Edit: Right now, I think there was a mistake with that rebase code. I needed to set it to rebase to your repo's master and not mine.
I assume with this what
Hello, I finally got it working. I had to set origin to your repo and then back to mine. Good lesson for me, thanks.
@NanoCode012 hey perfect! What were your steps so we can advise others how to do this in the future?
The code you supplied above/from SO is for within the same repo. I am working from a fork. What I did was change the remote origin, but a better way would be to create a new remote:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature-01
git rebase upstream/master
git push -u origin -f
Change
Hi @glenn-jocher, would you still like this PR? How about #500? Tested working on my unit test.
@NanoCode012 yes. Have you tested this with multi-GPU?
Yes. DDP from one terminal and DDP from two (to mimic two different nodes). Also confirmed working by Magic and #501.
@NanoCode012 let's do the #500 and #504 merges both, since they are closely related. Do you have any preference on which to merge first? This one appears clear to merge, no conflicts. I'll re-run the CI checks now. Can you rebase this one also to origin/master?
CI re-passed now. PR is ready to merge. EDIT: waiting on thumbs up from @NanoCode012 to merge.
Good to go @glenn-jocher!
@NanoCode012 awesome, all done. Thanks for your hard work!
* Add support for multi-node DDP
* Remove local_rank confusion
* Fix spacing
Fix #501.
* Unit test (CPU, Single GPU, DP mode, 2 GPU DDP mode)
* Single machine, two terminals
* Multiple machines, one terminal each

It works on my single machine from two different terminals.
Command:
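A minimal sketch of the two-terminal launch I mean (the address, port, and train.py arguments are placeholders, not the exact command used here):
# terminal 1, acting as node 0 (the master)
python -m torch.distributed.launch --nnodes 2 --node_rank 0 --nproc_per_node 1 --master_addr 127.0.0.1 --master_port 9999 train.py --device 0
# terminal 2, acting as node 1, pointing at the same address and port
python -m torch.distributed.launch --nnodes 2 --node_rank 1 --nproc_per_node 1 --master_addr 127.0.0.1 --master_port 9999 train.py --device 1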
Please set the IP and port to an address and port that are not blocked if you are running from two machines. If you are running from one machine, you can omit the `master_addr` and `port`.

Would it be necessary to run it on COCO2017 to see the training time decrease?
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Improved Distributed Data Parallel (DDP) support with rank handling in training code.
📊 Key Changes
* Renamed `opt.local_rank` to `opt.global_rank` to better represent the variable's intent.
* The device is now set based on `opt.local_rank` instead of `rank`.
* Initialized `opt.global_rank` with a default value and updated it within the DDP setup.

🎯 Purpose & Impact
* Clearer rank handling through `global_rank`, which reflects its usage more accurately than `local_rank`.
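For context, here is a minimal sketch of the local-rank/global-rank distinction in a DDP setup (illustrative only; variable names and defaults are assumptions, not the exact yolov5 code):
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)  # filled in by torch.distributed.launch
opt = parser.parse_args()

opt.global_rank = -1  # default for non-distributed runs
if opt.local_rank != -1:
    torch.cuda.set_device(opt.local_rank)                      # local_rank = GPU index on this machine
    dist.init_process_group(backend='nccl', init_method='env://')
    opt.global_rank = dist.get_rank()                          # global_rank = unique rank across all nodes

# only one process in the whole job should log, save checkpoints, etc.
if opt.global_rank in [-1, 0]:
    print('this is the main process (single-GPU/CPU run or global rank 0)')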