Closed
opened on Mar 10, 2021
I have a custom Megatron model and a correspondingly customized DeepSpeed. I believe I have incorporated your recent update correctly, but when I try to train a ZeRO-3 model I get the error `RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0`.
When I turn off CPU Adam, I instead get the error `RuntimeError: start (0) + length (174763) exceeds dimension size (174761)`.
In both cases the size of a tensor is off by exactly 2, but I have no idea what is causing this. My code is very similar to yours overall, though as I noted in microsoft/DeepSpeedExamples#92, I cannot get your code to run either (for different reasons).
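To make the off-by-2 observation concrete, here is a minimal sketch that only does the arithmetic on the sizes quoted in the two error messages. The dictionary keys and variable names are hypothetical labels chosen for illustration; nothing here comes from DeepSpeed or Megatron, and it does not identify the cause.

```python
# Sizes copied from the two RuntimeError messages above.
# The labels are illustrative only and are not DeepSpeed option names.
reported_sizes = {
    "cpu_adam_on": (171, 169),          # tensor a vs. tensor b
    "cpu_adam_off": (174763, 174761),   # start + length vs. dimension size
}

# Difference between the requested size and the available size in each case.
gaps = {name: wanted - available
        for name, (wanted, available) in reported_sizes.items()}

print(gaps)  # {'cpu_adam_on': 2, 'cpu_adam_off': 2}
```

The gap is the same absolute number of elements in both configurations even though the sizes differ by three orders of magnitude, which may point to a fixed number of extra elements rather than a proportional error, though that is only a guess.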