forked from PufferAI/PufferLib
Open
Labels
bug: Something isn't working
Description
The docs say multi-GPU runs should be started with
torchrun --standalone --nnodes=1 --nproc-per-node=6 -m puffer train puffer_drive
but it appears the command should be
torchrun --standalone --nnodes=1 --nproc-per-node=6 -m pufferlib.pufferl train puffer_drive
Otherwise the run fails with an error saying puffer is not found.
Separately, there's a bug in ensure_drive_binary() in pufferl.py that causes a race condition in multi-GPU runs, presumably because every process deletes and rebuilds the same binary at the same time. We should probably guard the function with something like:
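import torch

def ensure_drive_binary():
    """Delete existing visualize binary and rebuild it. This ensures the
    binary is always up-to-date with the latest code changes.
    """
    if torch.distributed.is_initialized():
        if torch.distributed.get_rank() != 0:
            # Non-zero ranks wait here instead of rebuilding the binary.
            torch.distributed.barrier()
            return
    # Rank 0 (or a single-process run) continues with the existing rebuild.
    # In a distributed run it must also call torch.distributed.barrier()
    # after the rebuild so the waiting ranks are released.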
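The guard above only shows the non-zero ranks. For the barrier to complete, rank 0 has to reach a matching barrier once the rebuild is done. Below is a minimal sketch of that full pattern, not the project's actual code: run_on_rank_zero and its build and dist parameters are hypothetical names introduced here, and build stands in for whatever the existing delete-and-rebuild logic does. The dist parameter exists only so the helper can be exercised with a stub; by default it uses torch.distributed.

```python
from typing import Callable


def run_on_rank_zero(build: Callable[[], None], dist=None) -> None:
    """Run `build` on rank 0 only; every other rank waits until it has finished.

    `build` is a hypothetical stand-in for the existing rebuild logic.
    """
    if dist is None:
        # Imported lazily so the helper can be tested with a stub object.
        import torch.distributed as dist

    # Single-process run: no other ranks to coordinate with.
    if not (dist.is_available() and dist.is_initialized()):
        build()
        return

    try:
        if dist.get_rank() == 0:
            build()
    finally:
        # Every rank calls barrier() exactly once, so no rank is left waiting,
        # even if the build raised on rank 0.
        dist.barrier()
```

With this shape, ensure_drive_binary() could keep its current body and be invoked as run_on_rank_zero(ensure_drive_binary) from the training entry point. One caveat: get_rank() is the global rank, which is enough for the single-node command above. A multi-node run where each node needs its own copy of the binary would probably have to key on the LOCAL_RANK environment variable that torchrun sets, which this sketch does not attempt.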