Loss abruptly becomes 'nan' during Self-Supervised training #2

Open
@hrishi508

Description

In the ResNet9_Barlow_Twins.ipynb notebook, the model trained successfully for ~13 epochs, after which the loss abruptly became nan. This happened because the gradients themselves had become nan.

Debugging:

  1. torch.autograd.set_detect_anomaly(True) was used to trace which part of the code produced the nan values.

  2. The error was traced to: RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

  3. A simpler model (AlexNet) was used in the AlexNet_Barlow_Twins.ipynb notebook. The error persisted.

  4. Gradient clipping was used to ensure that the gradients do not explode. Division by zero was also prevented at all stages by adding a small positive constant wherever required.

  5. We also tried Facebook Research's implementation of the Barlow Twins loss function together with the LARS optimizer.
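
For anyone trying to reproduce this class of error outside the notebooks, here is a minimal sketch of steps 1, 2 and 4 on a toy case. It is not the notebook code and does not establish the cause of the failure there: `l2_norm`, `backward_error` and `clipped_grad_norm` are hypothetical helpers, and the all-zero input is an assumption chosen only because it makes a power's derivative infinite. It shows how anomaly detection can report a nan from a pow backward, how a small positive constant avoids it in this toy case, and that gradient clipping returns a nan norm when the gradients are already nan.

```python
import torch
from torch import nn


def l2_norm(x: torch.Tensor, eps: float = 0.0) -> torch.Tensor:
    # Hypothetical helper: with eps > 0 the base of the 0.5 power stays
    # strictly positive, so its derivative stays finite.
    return ((x ** 2).sum() + eps) ** 0.5


def backward_error(eps: float):
    """Run backward under anomaly detection; return the error text, or None."""
    x = torch.zeros(4, requires_grad=True)  # all-zero input: the norm is exactly 0
    with torch.autograd.detect_anomaly():
        try:
            l2_norm(x, eps).backward()
        except RuntimeError as err:
            return str(err)
    return None


def clipped_grad_norm(loss_scale: float, max_norm: float = 1.0) -> torch.Tensor:
    """Backpropagate a toy loss and return the pre-clipping gradient norm."""
    torch.manual_seed(0)
    model = nn.Linear(3, 1)
    loss = model(torch.ones(2, 3)).sum() * loss_scale
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns their total norm.
    # A nan norm stays nan, so clipping alone cannot repair nan gradients.
    return torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)


if __name__ == "__main__":
    print(backward_error(eps=0.0))   # error text naming the failing backward function
    print(backward_error(eps=1e-6))  # None: backward completes
    print(torch.isfinite(clipped_grad_norm(float("nan"))))
```

In this toy case the derivative of the square root is infinite at zero, and multiplying that by the zero derivative of the square gives nan. Anomaly detection only flags nan values in backward outputs, so checking the forward tensors for inf or nan separately may still be worthwhile.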

Labels: bug, help wanted