Update DDP for torch.distributed.run with gloo backend #3680

Merged: 35 commits, Jun 19, 2021
007902e
Update DDP for `torch.distributed.run`
glenn-jocher Jun 18, 2021
9bcb4ad
Add LOCAL_RANK
glenn-jocher Jun 18, 2021
b32bae0
remove opt.local_rank
glenn-jocher Jun 18, 2021
b467501
backend="gloo|nccl"
glenn-jocher Jun 18, 2021
c886538
print
glenn-jocher Jun 18, 2021
5d847dc
print
glenn-jocher Jun 18, 2021
26d0ecf
debug
glenn-jocher Jun 18, 2021
832ba4c
debug
glenn-jocher Jun 18, 2021
9a1bb01
os.getenv
glenn-jocher Jun 18, 2021
0e912df
gloo
glenn-jocher Jun 18, 2021
5f5e428
gloo
glenn-jocher Jun 18, 2021
e8493c6
gloo
glenn-jocher Jun 18, 2021
fb342fc
cleanup
glenn-jocher Jun 18, 2021
382ce4f
fix getenv
glenn-jocher Jun 18, 2021
b09b415
cleanup
glenn-jocher Jun 18, 2021
9c4ac05
cleanup destroy
glenn-jocher Jun 18, 2021
8ae9ea1
try nccl
glenn-jocher Jun 18, 2021
a18f933
merge master
glenn-jocher Jun 19, 2021
2435775
return opt
glenn-jocher Jun 19, 2021
56a4ab4
add --local_rank
glenn-jocher Jun 19, 2021
c4d839b
add timeout
glenn-jocher Jun 19, 2021
0584e7e
add init_method
glenn-jocher Jun 19, 2021
d917341
gloo
glenn-jocher Jun 19, 2021
6a1cc64
move destroy
glenn-jocher Jun 19, 2021
3581c76
move destroy
glenn-jocher Jun 19, 2021
5f5d122
move print(opt) under if RANK
glenn-jocher Jun 19, 2021
5451fc2
destroy only RANK 0
glenn-jocher Jun 19, 2021
9aa229e
move destroy inside train()
glenn-jocher Jun 19, 2021
94363ce
restore destroy outside train()
glenn-jocher Jun 19, 2021
9647379
update print(opt)
glenn-jocher Jun 19, 2021
cb8395d
merge master
glenn-jocher Jun 19, 2021
96686fd
cleanup
glenn-jocher Jun 19, 2021
446c610
nccl
glenn-jocher Jun 19, 2021
49bb0b7
gloo with 60 second timeout
glenn-jocher Jun 19, 2021
b5decde
update namespace printing
glenn-jocher Jun 19, 2021
destroy only RANK 0
glenn-jocher committed Jun 19, 2021
commit 5451fc2493ece7133d26532f008b968c7b4279b0
2 changes: 1 addition & 1 deletion train.py
@@ -547,7 +547,7 @@ def main(opt):
     logger.info(opt)
     if not opt.evolve:
         train(opt.hyp, opt, device)
-        if WORLD_SIZE > 1:
+        if WORLD_SIZE > 1 and RANK == 0:
             _ = [print('Destroying process group... ', end=''), dist.destroy_process_group(), print('Done.')]

     # Evolve hyperparameters (optional)
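To make the changed condition easier to follow, here is a minimal sketch of how `RANK` and `WORLD_SIZE` might be read and checked. It assumes the values come from the `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` environment variables that `torch.distributed.run` exports to each worker. The helper names `ddp_env` and `should_destroy` and the single-process defaults (-1, -1, 1) are illustrative assumptions, not code taken from this PR.

```python
import os


def ddp_env(environ=os.environ):
    """Read the rank variables that torch.distributed.run exports to each worker.

    The fallback values (-1, -1, 1) are an assumption for a plain
    single-process run where none of the variables are set.
    """
    local_rank = int(environ.get('LOCAL_RANK', -1))
    rank = int(environ.get('RANK', -1))
    world_size = int(environ.get('WORLD_SIZE', 1))
    return local_rank, rank, world_size


def should_destroy(world_size, rank):
    """Mirror the condition in this commit: multi-process run, global rank 0 only."""
    return world_size > 1 and rank == 0


if __name__ == '__main__':
    LOCAL_RANK, RANK, WORLD_SIZE = ddp_env()
    if should_destroy(WORLD_SIZE, RANK):
        # In train.py this is the point where dist.destroy_process_group() runs.
        print(f'rank {RANK} of {WORLD_SIZE} would destroy the process group')
```

With two workers, only the process whose `RANK` is 0 passes the check. A single-process run, where `WORLD_SIZE` falls back to 1, skips it entirely.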