use all_gather to gather results from all gpus #383
Conversation
What's the advantage of using one over the other?
return data_list


def reduce_dict(input_dict, average=True):
nit: I believe this is not used anywhere (it probably is due to a refactor in the engine/trainer?)
for _ in size_list:
    tensor_list.append(torch.ByteTensor(size=(max_size,)).to("cuda"))
if local_size != max_size:
    padding = torch.ByteTensor(size=(max_size - local_size,)).to("cuda")
nit: this could generate NaN because the data is uninitialized. This per se doesn't affect the overall results, because we remove the padded values, but I'm not sure whether it could cause problems with dist.all_gather.
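For what it's worth, a minimal sketch of zero-initializing the padding instead, using the variable names from the diff above (untested against this PR):

```python
# a zero-filled pad avoids sending uninitialized memory through dist.all_gather
pad_len = max_size - int(local_size)  # local_size is the 1-element IntTensor above
if pad_len > 0:
    padding = torch.zeros((pad_len,), dtype=torch.uint8, device="cuda")
    tensor = torch.cat((tensor, padding), dim=0)
```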
# gathering tensors of different shapes
tensor_list = []
for _ in size_list:
    tensor_list.append(torch.ByteTensor(size=(max_size,)).to("cuda"))
nit: it would be good to use the new API for this:
device = torch.device("cuda")
...
torch.empty((max_size,), dtype=torch.uint8, device=device)
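Put together, the loop from the diff could look roughly like this with the newer factory function (a sketch, not tested against this PR):

```python
device = torch.device("cuda")
# one receive buffer per rank, allocated with torch.empty instead of torch.ByteTensor
tensor_list = [
    torch.empty((max_size,), dtype=torch.uint8, device=device)
    for _ in size_list
]
```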
tensor = torch.ByteTensor(storage).to("cuda")

# obtain Tensor size of each rank
local_size = torch.IntTensor([tensor.numel()]).to("cuda")
nit: It would be good to replace the usages of IntTensor/ByteTensor with their new constructs. In this case (because it's initialized):
device = torch.device("cuda")
...
local_size = torch.tensor([tensor.numel()], dtype=torch.int32, device=device)
# or, if 0d tensors work with dist.all_gather
local_size = torch.tensor(tensor.numel(), dtype=torch.int32, device=device)
@qianyizhang the advantage is that it makes it possible to do multi-machine testing, which was not possible before.
@fmassa I got the rendezvous complaining "RuntimeError: Address already in use"; how do I make it work?
@qianyizhang yes, it's possible, but you need to change the
# serialized to a Tensor
buffer = pickle.dumps(data)
storage = torch.ByteStorage.from_buffer(buffer)
tensor = torch.ByteTensor(storage).to("cuda")
Actually, @ppwwyyxx this will probably be problematic for large datasets, as we will run out of memory on the GPU when trying to perform this communication.
The idea I had was to use shared memory on the CPU and communicate the address of the shared memory, but this doesn't work in the multi-machine case.
I agree. But is there any way to do all-gather on CPUs (given that the dist backend was initialized with "nccl")?
With c10d, it is now possible to have more than one dist backend at a time. So one could potentially have one nccl backend and one mpi backend?
Yes, I am using this communication code in another task. Because of the large size of each data item, I get an OOM error.
@yelantingfeng I'd recommend either:
- reverting this change locally for now, or
- trying to create a new process group with c10d that lives on the CPU, and communicating this data on the CPU instead (a sketch follows below).
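For the second option, here is a minimal sketch of what a CPU-side gather could look like, assuming a PyTorch version where dist.new_group accepts a backend argument and that every process calls it; the helper name all_gather_cpu is made up for illustration and is not part of this PR:

```python
import pickle

import torch
import torch.distributed as dist

# A gloo-backed group lets CPU tensors be all-gathered even when the default
# process group was initialized with "nccl". Must be created on every process.
cpu_group = dist.new_group(backend="gloo")

def all_gather_cpu(data, group=cpu_group):
    """Gather arbitrary picklable data via CPU tensors (illustrative sketch)."""
    world_size = dist.get_world_size(group=group)

    # serialize to a CPU uint8 tensor
    buffer = pickle.dumps(data)
    tensor = torch.tensor(bytearray(buffer), dtype=torch.uint8)

    # exchange sizes, pad to the max, then all_gather on the CPU group
    local_size = torch.tensor([tensor.numel()], dtype=torch.int64)
    size_list = [torch.zeros(1, dtype=torch.int64) for _ in range(world_size)]
    dist.all_gather(size_list, local_size, group=group)
    sizes = [int(s.item()) for s in size_list]
    max_size = max(sizes)

    if tensor.numel() < max_size:
        padding = torch.zeros(max_size - tensor.numel(), dtype=torch.uint8)
        tensor = torch.cat((tensor, padding))
    tensor_list = [torch.empty(max_size, dtype=torch.uint8) for _ in range(world_size)]
    dist.all_gather(tensor_list, tensor, group=group)

    # trim the padding back off before unpickling
    return [pickle.loads(t[:s].numpy().tobytes()) for t, s in zip(tensor_list, sizes)]
```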
> Yes, I am using this communication code in another task. Because of the large size of each data item, I get an OOM error.

I got the OOM error. Have you implemented the second method recommended by @fmassa?
> @yelantingfeng I'd recommend either:
> - reverting this change locally for now
> - trying to create a new process group with c10d that lives on the CPU, and communicating this data on the CPU instead

I think we could just set a memory limit for this all_gather function. I implemented this by splitting those ByteTensors into chunks. After some tests, I found my implementation can keep the total memory usage at the MiB level. The limit is not 100% precise, but I think it should be useful enough. I would be glad to send a PR if you think this is a good improvement for this repository.
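A rough sketch of the chunking idea described above (this is not the commenter's actual implementation; the chunk_size value and the choice to stage results on the CPU are assumptions made here for illustration, and the input tensor is assumed to be padded to the same length on every rank):

```python
import torch
import torch.distributed as dist

def chunked_all_gather(tensor, chunk_size=64 * 1024 * 1024):
    """All-gather a padded uint8 CUDA tensor chunk by chunk, staging the results on
    the CPU so that only about world_size * chunk_size bytes live on the GPU at once."""
    world_size = dist.get_world_size()
    max_size = tensor.numel()
    gathered_cpu = [torch.empty(max_size, dtype=torch.uint8) for _ in range(world_size)]
    for start in range(0, max_size, chunk_size):
        end = min(start + chunk_size, max_size)
        chunk = tensor[start:end].contiguous()
        recv = [torch.empty(end - start, dtype=torch.uint8, device=tensor.device)
                for _ in range(world_size)]
        dist.all_gather(recv, chunk)
        for rank, piece in enumerate(recv):
            gathered_cpu[rank][start:end] = piece.cpu()
    return gathered_cpu
```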
size_list = [torch.IntTensor([0]).to("cuda") for _ in range(world_size)]
dist.all_gather(size_list, local_size)
size_list = [int(size.item()) for size in size_list]
max_size = max(size_list)
@wat3rBro You can use dist.all_reduce(local_size, op=dist.ReduceOp.MAX) here for a little less code.
The sizes of all ranks are needed later.
Ah, I see: pickle requires the exact size and doesn't tolerate additional NULs. Thanks for clarifying.
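In other words, the per-rank sizes let each padded buffer be trimmed back before unpickling, roughly like this (a sketch using the names from the diffs above, not necessarily the PR's exact code):

```python
# strip each gathered buffer back to its original length before unpickling,
# since pickle does not tolerate the trailing padding bytes
data_list = []
for size, tensor in zip(size_list, tensor_list):
    buffer = tensor.cpu().numpy().tobytes()[:size]
    data_list.append(pickle.loads(buffer))
```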
Hi, can I use all_gather to gather the weights from all GPUs? I want each GPU to output different weights in each batch, and the loss to be computed using all of the weights. When I use all_gather, I found that the gathered weights lose their grad_fn.
from @ppwwyyxx
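Regarding the grad_fn question above: one common workaround (not from this PR; the function name here is made up) is to splice the local tensor, which still carries its grad_fn, back into the gathered list so that gradients flow for the current rank's contribution:

```python
import torch
import torch.distributed as dist

def all_gather_keep_local_grad(t):
    """dist.all_gather returns tensors detached from autograd; re-insert the local
    tensor so the current rank's entry keeps its grad_fn (illustrative sketch)."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)
    gathered[dist.get_rank()] = t  # the autograd-tracked local tensor
    return torch.cat(gathered, dim=0)
```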