
Multiple optimizers + half precision + skip first optimization step Bug #7792

Closed
matyushinleonid opened this issue Jun 1, 2021 · 2 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

🐛 Bug

When the first batch is skipped (by returning None from training_step) in a setup with multiple optimizers (different model parameters belong to different optimizers) and half precision, the following error appears (the full traceback is in the Colab linked below):

/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py in step(self, optimizer, *args, **kwargs)
    335             self.unscale_(optimizer)
    336 
--> 337         assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
    338 
    339         retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)

AssertionError: No inf checks were recorded for this optimizer.

Note that

  • There is no error when I skip any other batch (not the first).
  • There is no error when I skip the first batch with a single optimizer.
  • There is no error when I skip the first batch in full precision.

So all three conditions (multiple optimizers, half precision, first batch) are necessary to reproduce this bug (see the sketch below).

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1a4XCOSumDxy2B3ywu6TrIgx5j1COcSfu?usp=sharing
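
The Colab notebook above is the authoritative reproduction. For readers without access to it, here is a rough self-contained sketch of the same setup, reconstructed only from the description in this issue (the class and attribute names, e.g. TwoOptimizerModel, layer_a, layer_b, are illustrative and not taken from the notebook):

    import torch
    from torch.utils.data import DataLoader, Dataset
    import pytorch_lightning as pl


    class RandomDataset(Dataset):
        def __len__(self):
            return 64

        def __getitem__(self, idx):
            return torch.randn(32)


    class TwoOptimizerModel(pl.LightningModule):
        """Two disjoint parameter groups, each with its own optimizer."""

        def __init__(self):
            super().__init__()
            self.layer_a = torch.nn.Linear(32, 2)
            self.layer_b = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx, optimizer_idx):
            # Returning None here skips the very first optimization step,
            # which is what triggers the assertion under precision=16.
            if batch_idx == 0:
                return None
            layer = self.layer_a if optimizer_idx == 0 else self.layer_b
            return layer(batch).sum()

        def configure_optimizers(self):
            opt_a = torch.optim.SGD(self.layer_a.parameters(), lr=0.1)
            opt_b = torch.optim.SGD(self.layer_b.parameters(), lr=0.1)
            return [opt_a, opt_b]


    trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
    trainer.fit(TwoOptimizerModel(), DataLoader(RandomDataset(), batch_size=8))

Running this on a GPU should hit the assertion above on the first optimizer step; removing any one of the three conditions should make it pass.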

Expected behavior

Users should be able to skip an optimization step whenever they want.

Environment

  • CUDA:
    • GPU:
      • Tesla K80
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.8.1+cu101
    • pytorch-lightning: 1.3.3
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Tue Apr 20 19:55:43 PDT 2021
yifuwang (Contributor) commented Jun 1, 2021

Seems like the same problem as #4524.

carmocca (Contributor) commented Jun 1, 2021

Correct, closing as duplicate.

This is briefly mentioned in the first note of https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#training-step
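
For anyone hitting this in the meantime, one possible workaround is sketched below. This is an illustrative pattern, not the approach described in the linked note: instead of returning None for the step to be skipped, return the loss multiplied by zero. The backward pass then still runs, so the GradScaler records inf checks for every optimizer, while the all-zero gradients make the update a no-op (with plain SGD, no momentum or weight decay). Names reuse the illustrative sketch above, not the Colab code:

    def training_step(self, batch, batch_idx, optimizer_idx):
        layer = self.layer_a if optimizer_idx == 0 else self.layer_b
        loss = layer(batch).sum()
        if batch_idx == 0:
            # Zero-scale instead of returning None: backward still runs,
            # so the grad scaler records its inf checks, but the gradients
            # are all zeros and the parameter update is a no-op.
            return loss * 0
        return loss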
