This repository was archived by the owner on Sep 19, 2022. It is now read-only.

MPI backend mnist gpu example error: "No space left on device" #91

@jwwandy

Description


def average_gradients(model):
    """ Gradient averaging. """
    size = float(dist.get_world_size())
    group = dist.new_group([0])
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM, group=group)

When running the MPI GPU example for more epochs, average_gradients calls dist.new_group in every backward iteration. This creates a new ProcessGroupMPI each time and raises a "No space left on device" error after a few epochs, as shown below. Also, the original code creates the group with group = dist.new_group([0]), which contains only rank 0, instead of using the whole MPI world for backpropagation.

[image: screenshot of the "No space left on device" error]
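One possible fix, sketched under a few assumptions: reuse the default (world) process group instead of calling dist.new_group on every iteration, and divide the summed gradients by the world size. The `group` parameter and the skip for parameters without gradients are illustrative additions, not part of the original example. The sketch also assumes a recent PyTorch in which `dist.ReduceOp` replaces the deprecated `dist.reduce_op`, and that `dist.init_process_group` has already been called.

```python
import torch.distributed as dist


def average_gradients(model, group=None):
    """Average gradients across all ranks of `group`.

    `group=None` selects the default (world) process group, so no new
    process group is created per call. If a sub-group is really needed,
    create it once with dist.new_group(...) outside the training loop
    and pass it in here (hypothetical parameter, for illustration).
    """
    size = float(dist.get_world_size(group=group))
    for param in model.parameters():
        if param.grad is None:
            # Parameter did not take part in this backward pass.
            continue
        # Sum the gradient over all ranks, then divide to get the mean.
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=group)
        param.grad.data /= size
```

Called as average_gradients(model) after loss.backward() and before optimizer.step(), this would not allocate any per-iteration ProcessGroupMPI objects, which is what the report above identifies as the cause of the error.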
