Loss abruptly becomes 'nan' during Self-Supervised training #2
Description
In the ResNet9_Barlow_Twins.ipynb notebook, the model and the training function compiled and trained successfully for ~13 epochs; however, the loss then abruptly became `nan`. This in turn was occurring because the gradients became `nan`.
Debugging:
- `torch.autograd.set_detect_anomaly(True)` was used to trace which part of the code was producing the `nan` values (see the first sketch after this list).
- The error was traced to: `RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.`
- A simpler model (AlexNet) was used in the AlexNet_Barlow_Twins.ipynb notebook; the error persisted.
- Gradient clipping was used to ensure that the gradients do not explode, and division by zero was prevented at all stages by adding a small positive constant wherever required (see the second sketch below).
- We also tried Facebook Research's implementation of the Barlow Twins loss function and the LARS optimizer (see the last sketch below).
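A minimal sketch of the anomaly-detection step, assuming a generic two-view training loop; `model`, `loss_fn`, `loader`, and `optimizer` are placeholders, not the actual objects from the notebooks:

```python
import torch

# Annotate every backward op so PyTorch reports which function produced nan.
torch.autograd.set_detect_anomaly(True)

def debug_one_epoch(model, loss_fn, loader, optimizer, device="cpu"):
    model.train()
    for x1, x2 in loader:  # two augmented views per batch (Barlow Twins setup)
        x1, x2 = x1.to(device), x2.to(device)
        z1, z2 = model(x1), model(x2)
        loss = loss_fn(z1, z2)
        optimizer.zero_grad()
        # With anomaly detection enabled, this raises a RuntimeError naming the
        # offending backward function (e.g. 'PowBackward0') instead of silently
        # propagating nan gradients.
        loss.backward()
        optimizer.step()
```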
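A sketch of the two mitigations described above (epsilon in the normalization denominator plus gradient clipping); the `EPS` value, the clipping threshold, and the function names are illustrative choices, not the exact values used in the notebooks:

```python
import torch

EPS = 1e-6  # small positive constant to keep denominators away from zero

def normalize_safely(z: torch.Tensor) -> torch.Tensor:
    # Standardise each embedding dimension across the batch; adding EPS to the
    # denominator guards against division by zero when a feature collapses.
    return (z - z.mean(dim=0)) / (z.std(dim=0) + EPS)

def training_step(model, loss_fn, x1, x2, optimizer):
    z1, z2 = model(x1), model(x2)
    loss = loss_fn(normalize_safely(z1), normalize_safely(z2))
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm so one bad batch cannot blow up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```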
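For reference, a sketch of the Barlow Twins loss along the lines of the Facebook Research implementation (facebookresearch/barlowtwins); `projector_dim` and `lambd` are illustrative hyperparameters, and the LARS optimizer is not reproduced here:

```python
import torch
import torch.nn as nn

def off_diagonal(x: torch.Tensor) -> torch.Tensor:
    # Flattened view of all off-diagonal elements of a square matrix.
    n, m = x.shape
    assert n == m
    return x.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

class BarlowTwinsLoss(nn.Module):
    def __init__(self, projector_dim: int = 128, lambd: float = 5e-3):
        super().__init__()
        self.lambd = lambd
        # Batch norm standardises each feature before the cross-correlation,
        # which is exactly the step where a hand-rolled (z - mean) / std can
        # divide by zero and feed nan into the subsequent pow operations.
        self.bn = nn.BatchNorm1d(projector_dim, affine=False)

    def forward(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        batch_size = z1.shape[0]
        # Empirical cross-correlation matrix between the two views.
        c = self.bn(z1).T @ self.bn(z2) / batch_size
        # Push the diagonal towards 1 and the off-diagonal towards 0.
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()
        off_diag = off_diagonal(c).pow(2).sum()
        return on_diag + self.lambd * off_diag
```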