🐛 Bug

When the first batch is skipped (by returning `None` from `training_step`) with multiple optimizers (different model parameters belong to different optimizers) and half-precision training, the following error appears (you can find the full traceback in the Colab notebook linked below):
```
/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py in step(self, optimizer, *args, **kwargs)
    335             self.unscale_(optimizer)
    336
--> 337         assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
    338
    339         retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)

AssertionError: No inf checks were recorded for this optimizer.
```
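For context, the assertion comes from `torch.cuda.amp.GradScaler.step()`, which requires that the preceding `unscale_()` recorded at least one inf/NaN check for that optimizer. Lightning's internals differ, but a minimal sketch of the same assertion in plain PyTorch (assuming a CUDA device) is calling `step()` for an optimizer whose parameters never received scaled gradients:

```python
import torch

if torch.cuda.is_available():
    param_a = torch.zeros(1, device="cuda", requires_grad=True)  # will receive a gradient
    param_b = torch.zeros(1, device="cuda", requires_grad=True)  # never receives one
    opt_b = torch.optim.SGD([param_b], lr=0.1)
    scaler = torch.cuda.amp.GradScaler()

    # scale() initializes the scaler's scale; backward() only populates param_a.grad
    loss = (param_a * 2).sum()
    scaler.scale(loss).backward()

    # step() calls unscale_(opt_b) internally, which finds no gradients to check,
    # so optimizer_state["found_inf_per_device"] stays empty and the assert fires:
    scaler.step(opt_b)  # AssertionError: No inf checks were recorded for this optimizer.
```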
Note that:

- there are no errors when I skip any other batch (not the first),
- there are no errors when I skip the first batch in a single-optimizer setting,
- there are no errors when I skip the first batch in a full-precision setting.

So all three conditions (multiple optimizers, half precision, first batch) are necessary to reproduce this bug.
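For reference, a condensed sketch of the kind of setup that hits the assertion (class and layer names are illustrative, Lightning 1.x multiple-optimizer API assumed; the linked Colab is the authoritative repro):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class TwoOptimizerModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer_a = torch.nn.Linear(32, 32)
        self.layer_b = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx, optimizer_idx):
        if batch_idx == 0:
            return None  # skipping the very first batch triggers the assertion
        return self.layer_b(self.layer_a(batch)).sum()

    def configure_optimizers(self):
        # different parameters assigned to different optimizers
        opt_a = torch.optim.SGD(self.layer_a.parameters(), lr=0.1)
        opt_b = torch.optim.SGD(self.layer_b.parameters(), lr=0.1)
        return [opt_a, opt_b]


trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
trainer.fit(TwoOptimizerModel(), DataLoader(RandomDataset(), batch_size=8))
```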
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1a4XCOSumDxy2B3ywu6TrIgx5j1COcSfu?usp=sharing
Expected behavior
Users should be able to skip an optimization step (by returning `None` from `training_step`) whenever they want, including on the first batch.
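For context, the typical reason to return `None` is conditional skipping, e.g. dropping a step when the loss is non-finite (`compute_loss` is a hypothetical helper):

```python
def training_step(self, batch, batch_idx, optimizer_idx):
    loss = self.compute_loss(batch, optimizer_idx)  # hypothetical helper
    if not torch.isfinite(loss):
        return None  # skip this optimization step
    return loss
```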
Environment