NaN loss after some epochs #2637
-
Hi @talhaanwarch, thanks for your interest here.
-
Hi, I customized a model for BraTS21 segmentation and trained it with DiceCELoss under PyTorch Lightning with AMP. After training for some time, my checks reported NaN in the softmax calculation, but running the model in MONAI did not show these problems. Do you have any solution? My model already uses MONAI components for the most part.
-
@talhaanwarch Hi, did you manage to work it out?
-
1e-3 is quite a large learning rate. Try something like 1e-5 first; if that works, you can increase it slightly.
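A minimal sketch of that suggestion, assuming a plain PyTorch training loop: the `Linear` model, the Adam optimizer, and the `training_step` helper are placeholders for illustration, not the setup from the original post. It starts at a learning rate of 1e-5 and skips the weight update whenever the loss is not finite, which makes it easier to see at which step the NaN first appears.

```python
import torch

torch.manual_seed(0)

# Placeholder model; substitute your own segmentation network.
model = torch.nn.Linear(4, 2)

# Start with the small learning rate suggested above, then raise it gradually.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)


def training_step(inputs, targets):
    """Run one optimization step; return the loss, or None if it was not finite."""
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    if not torch.isfinite(loss):
        # Skip the update so a NaN/Inf loss cannot corrupt the weights.
        return None
    loss.backward()
    optimizer.step()
    return loss.item()


targets = torch.randint(0, 2, (8,))
loss_value = training_step(torch.randn(8, 4), targets)
skipped = training_step(torch.full((8, 4), float("nan")), targets)
```

If the loss stays finite at 1e-5 but diverges at 1e-3, the learning rate is the likely cause; if steps are skipped even at 1e-5, the inputs are worth checking next.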
-
tl;dr Add the SignalFillEmpty transform after the input transforms to remove any NaNs.

We have discovered one possible reason for this problem. In our case the NaN errors were only triggered when AMP was on. Switching it off made the error disappear, and training worked as expected. Nevertheless, my mentor @Zrrr1997 dug a bit deeper and found that our input already contained NaNs.

The solution: I created SignalFillEmptyd, analogous to SignalFillEmpty, which sets all NaN values in the input to 0 before the input is passed to the network. I will open a PR for the transform in the next few days and hope this helps other people as well.

As to why this bug only triggers with AMP, more debugging is necessary; so far we don't understand it either.

Some additional information that may also help: I only saw this error on the A100, and I have no clue why. On the RTX 3090 Ti, A6000, and even H100 the error did not occur (at least not regularly). This could be a hardware bug for all I know; at the very least, the occurrence of this problem is highly hardware related. If we gather any more information I will update this post.

Update: SignalFillEmptyd got accepted and should be available in the next MONAI release, for anyone who wants to try it out.
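The idea behind the fix can be sketched in plain PyTorch. The `fill_nans_d` helper below is hypothetical and is not the MONAI implementation of SignalFillEmptyd; it only illustrates the same step, assuming dictionary-style samples with an `"image"` key: count the NaNs in the input, then replace them with 0 before the data reaches the network.

```python
import torch


def fill_nans_d(sample, keys, replacement=0.0):
    """Return a shallow copy of `sample` with NaNs in the given keys replaced.

    Hypothetical stand-in for a dictionary transform; only NaN entries
    are touched, all other values (including +/-inf) are left as they are.
    """
    out = dict(sample)
    for key in keys:
        tensor = out[key]
        out[key] = torch.where(
            torch.isnan(tensor), torch.full_like(tensor, replacement), tensor
        )
    return out


# Toy sample with two NaNs in the image, as might come out of preprocessing.
sample = {
    "image": torch.tensor([[1.0, float("nan")], [float("nan"), 4.0]]),
    "label": torch.tensor([[0.0, 1.0], [1.0, 0.0]]),
}

n_nans_before = int(torch.isnan(sample["image"]).sum())
cleaned = fill_nans_d(sample, keys=["image"])
n_nans_after = int(torch.isnan(cleaned["image"]).sum())
print(f"NaNs before: {n_nans_before}, after: {n_nans_after}")
```

Counting NaNs before filling them is worth keeping during debugging, since it shows which samples are affected and whether an upstream transform is producing them.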
-
After some epochs, the loss turned to NaN while doing 3D segmentation.