🐛 Bug
When training is resumed from a checkpoint, the following happens for the first epoch of the resumed run:
- Validation
- Training
- Validation
To Reproduce
https://colab.research.google.com/drive/1UxXoTVFusy8xnFW-ZhodLbjzewSdKsHq?usp=sharing
The model in the notebook is trained for 2 epochs.
The printed output shows that in the original run both epochs execute correctly, with a single validation run per epoch, performed after the training phase completes.
When training is resumed from the checkpoint of the first epoch, the resumed epoch runs validation twice: once before and once after training.
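The call order above can be illustrated with a schematic simulation of the epoch loop. This is plain Python, not actual Lightning internals: `run_epochs` and its `resumed` flag are hypothetical names used only to model the observed behavior.

```python
def run_epochs(max_epochs, start_epoch=0, resumed=False):
    """Model of Trainer.fit: one training phase followed by one
    validation phase per epoch. When `resumed` is True, mimic the
    buggy restart that replays validation before any training."""
    log = []
    if resumed:
        # Observed bug: the restarted loop re-runs validation for the
        # checkpointed epoch before training begins.
        log.append("validation")
    for _ in range(start_epoch, max_epochs):
        log.append("training")
        log.append("validation")
    return log

# Fresh run of 2 epochs: one validation per epoch, after training.
fresh = run_epochs(max_epochs=2)
# Resume from the epoch-1 checkpoint: an extra validation appears first.
resumed = run_epochs(max_epochs=2, start_epoch=1, resumed=True)
print(fresh)    # ['training', 'validation', 'training', 'validation']
print(resumed)  # ['validation', 'training', 'validation']
```

The fresh run matches the expected behavior; the resumed run reproduces the double validation seen in the notebook prints.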
Expected behavior
There should be only one validation run per epoch, and it should be after the training.
Environment
- CUDA:
  - GPU:
    - Tesla T4
  - available: True
  - version: 11.1
- Packages:
  - numpy: 1.19.5
  - pyTorch_debug: False
  - pyTorch_version: 1.10.0+cu111
  - pytorch-lightning: 1.6.0dev
  - tqdm: 4.62.3
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.12
cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7