
[BUG] Multi-GPU runs #281

@JinkaiQiu

Description


The docs say multi-GPU runs can be started with torchrun --standalone --nnodes=1 --nproc-per-node=6 -m puffer train puffer_drive, but it appears the command should be torchrun --standalone --nnodes=1 --nproc-per-node=6 -m pufferlib.pufferl train puffer_drive; otherwise the run fails with an error saying the puffer module is not found.

Meanwhile, there's a bug in ensure_drive_binary() in pufferl.py that causes a race condition in multi-GPU runs: every rank tries to delete and rebuild the binary at the same time. We should probably guard the function with something like:

def ensure_drive_binary():
    """Delete existing visualize binary and rebuild it. This ensures the
    binary is always up-to-date with the latest code changes.
    """
    if torch.distributed.is_initialized():
        if torch.distributed.get_rank() != 0:
            # Non-zero ranks wait here until rank 0 has rebuilt the binary
            torch.distributed.barrier()
            return
    # ... existing delete-and-rebuild logic runs on rank 0 only ...
    if torch.distributed.is_initialized():
        # Release the waiting ranks now that the binary is rebuilt
        torch.distributed.barrier()
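As a self-contained sketch of the rank-0-rebuilds pattern (build_fn here is a hypothetical stand-in for the actual delete-and-rebuild logic, not the real pufferl.py code), the guard pairs an early-return barrier on non-zero ranks with a matching barrier on rank 0:

```python
import torch.distributed as dist

def ensure_drive_binary(build_fn):
    """Run build_fn on rank 0 only; other ranks block until it finishes.

    build_fn is a hypothetical placeholder for the real delete-and-rebuild
    logic; the real function takes no arguments.
    """
    if dist.is_initialized() and dist.get_rank() != 0:
        dist.barrier()  # wait for rank 0 to finish rebuilding
        return
    build_fn()  # rank 0, or a plain single-process run, rebuilds here
    if dist.is_initialized():
        dist.barrier()  # release the waiting ranks
```

In a single-process run (no torch.distributed init, e.g. launched without torchrun) both barriers are skipped and build_fn simply runs once, so the guard is a no-op outside distributed training.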

Metadata

Labels: bug (Something isn't working)
Status: Backlog
