Trouble with the backward pass in ZeRO 3 #846

Closed
@StellaAthena

Description

I have a custom Megatron model and a correspondingly customized DeepSpeed. I believe I have incorporated your recent update correctly, but when I try to train a ZeRO 3 model I get this error: `RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0`

When I turn off CPU Adam, I instead get this error: `RuntimeError: start (0) + length (174763) exceeds dimension size (174761)`

I notice that in both cases the size of a tensor seems to be off by 2, but I have no idea what's causing this. My code is overall extremely similar to yours, though, as I note at microsoft/DeepSpeedExamples#92, I cannot get your code to run either (for different reasons).
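
To make the off-by-2 observation concrete, here is a small sketch that pulls the sizes out of the two error messages quoted above and computes the gap in each. The helper names `mismatch_gap` and `overrun_gap` are hypothetical and are not part of DeepSpeed or PyTorch. The sketch only checks the arithmetic; it does not diagnose the cause.

```python
import re

# Error messages copied verbatim from the report above.
CPU_ADAM_ERROR = (
    "The size of tensor a (171) must match the size of tensor b (169) "
    "at non-singleton dimension 0"
)
NO_CPU_ADAM_ERROR = "start (0) + length (174763) exceeds dimension size (174761)"


def mismatch_gap(message: str) -> int:
    """Hypothetical helper: size of tensor a minus size of tensor b."""
    match = re.search(r"tensor a \((\d+)\).*tensor b \((\d+)\)", message)
    if match is None:
        raise ValueError("not a size-mismatch message")
    size_a, size_b = (int(group) for group in match.groups())
    return size_a - size_b


def overrun_gap(message: str) -> int:
    """Hypothetical helper: how far start + length runs past the dimension."""
    match = re.search(
        r"start \((\d+)\) \+ length \((\d+)\) exceeds dimension size \((\d+)\)",
        message,
    )
    if match is None:
        raise ValueError("not a slice-overrun message")
    start, length, dim_size = (int(group) for group in match.groups())
    return start + length - dim_size


if __name__ == "__main__":
    print("gap with CPU Adam:", mismatch_gap(CPU_ADAM_ERROR))
    print("gap without CPU Adam:", overrun_gap(NO_CPU_ADAM_ERROR))
```

Both gaps come out the same, so the same two extra (or missing) elements may be showing up on both code paths, although the messages alone do not confirm that.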
