Loss abruptly becomes 'nan' during Self-Supervised training #2
Description
In the ResNet9_Barlow_Twins.ipynb notebook, the model and the training function compiled and trained successfully for ~13 epochs; however, the loss then abruptly became `nan`. This in turn was occurring because the gradients became `nan`.
Debugging:
- `torch.autograd.set_detect_anomaly(True)` was used to trace which part of the code was producing the `nan` values (see the first sketch after this list).
- The error was traced to: `RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.`
- A simpler model (AlexNet) was used in the AlexNet_Barlow_Twins.ipynb notebook; the error persisted.
- Gradient clipping was used to ensure that the gradients do not explode, and division by zero was prevented at all stages by adding a small positive constant wherever required (see the second sketch below).
- We also tried Facebook Research's implementation of the Barlow Twins loss function and the LARS optimizer (see the last sketch below).
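A minimal sketch of the anomaly-detection step, assuming a generic two-view training loop; `model`, `loss_fn`, `loader`, and `optimizer` are placeholders, not the actual objects from the notebooks:

```python
import torch

# Annotate every backward op so PyTorch reports which function produced nan.
torch.autograd.set_detect_anomaly(True)

def debug_one_epoch(model, loss_fn, loader, optimizer, device="cpu"):
    model.train()
    for x1, x2 in loader:  # two augmented views per batch (Barlow Twins setup)
        x1, x2 = x1.to(device), x2.to(device)
        z1, z2 = model(x1), model(x2)
        loss = loss_fn(z1, z2)
        optimizer.zero_grad()
        # With anomaly detection enabled, this raises a RuntimeError naming the
        # offending backward function (e.g. 'PowBackward0') instead of silently
        # propagating nan gradients.
        loss.backward()
        optimizer.step()
```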
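A sketch of the two mitigations described above (epsilon in the normalization denominator plus gradient clipping); the `EPS` value, the clipping threshold, and the function names are illustrative choices, not the exact values used in the notebooks:

```python
import torch

EPS = 1e-6  # small positive constant to keep denominators away from zero

def normalize_safely(z: torch.Tensor) -> torch.Tensor:
    # Standardise each embedding dimension across the batch; adding EPS to the
    # denominator guards against division by zero when a feature collapses.
    return (z - z.mean(dim=0)) / (z.std(dim=0) + EPS)

def training_step(model, loss_fn, x1, x2, optimizer):
    z1, z2 = model(x1), model(x2)
    loss = loss_fn(normalize_safely(z1), normalize_safely(z2))
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm so one bad batch cannot blow up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```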
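For reference, a sketch of the Barlow Twins loss along the lines of the Facebook Research implementation (facebookresearch/barlowtwins); `projector_dim` and `lambd` are illustrative hyperparameters, and the LARS optimizer is not reproduced here:

```python
import torch
import torch.nn as nn

def off_diagonal(x: torch.Tensor) -> torch.Tensor:
    # Flattened view of all off-diagonal elements of a square matrix.
    n, m = x.shape
    assert n == m
    return x.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

class BarlowTwinsLoss(nn.Module):
    def __init__(self, projector_dim: int = 128, lambd: float = 5e-3):
        super().__init__()
        self.lambd = lambd
        # Batch norm standardises each feature before the cross-correlation,
        # which is exactly the step where a hand-rolled (z - mean) / std can
        # divide by zero and feed nan into the subsequent pow operations.
        self.bn = nn.BatchNorm1d(projector_dim, affine=False)

    def forward(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        batch_size = z1.shape[0]
        # Empirical cross-correlation matrix between the two views.
        c = self.bn(z1).T @ self.bn(z2) / batch_size
        # Push the diagonal towards 1 and the off-diagonal towards 0.
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()
        off_diag = off_diagonal(c).pow(2).sum()
        return on_diag + self.lambd * off_diag
```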