Redundant Validation When Resuming Training #11504

Closed
@eladsegal

Description

🐛 Bug

When training is resumed from a checkpoint, the following happens for the first epoch of the resumed run:

  1. Validation
  2. Training
  3. Validation

To Reproduce

https://colab.research.google.com/drive/1UxXoTVFusy8xnFW-ZhodLbjzewSdKsHq?usp=sharing
The model in the notebook is trained for 2 epochs.
The printed output shows that in the original run both epochs behave as expected, with a single validation pass per epoch, run after the training epoch completes.
When training is resumed from the first-epoch checkpoint, the resumed epoch runs validation twice: once before training and once after.
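
For convenience, here is a minimal self-contained sketch along the lines of the notebook. It assumes a BoringModel-style toy model, the default ModelCheckpoint, and resuming through the ckpt_path argument of trainer.fit; the first run is shortened to one epoch so that its checkpoint is the first-epoch checkpoint:

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """A few random 32-dim vectors, enough for toy batches."""

    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        print(f"training step (epoch {self.current_epoch})")
        return self.layer(batch).sum()

    def validation_step(self, batch, batch_idx):
        print(f"validation step (epoch {self.current_epoch})")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loaders():
    return DataLoader(RandomDataset(), batch_size=32)


# First run: one epoch, so the saved checkpoint is the first-epoch checkpoint.
# num_sanity_val_steps=0 keeps sanity-check prints out of the output.
trainer = pl.Trainer(max_epochs=1, limit_train_batches=1, limit_val_batches=1,
                     num_sanity_val_steps=0, enable_progress_bar=False)
trainer.fit(BoringModel(), train_dataloaders=loaders(), val_dataloaders=loaders())

# Resume for a second epoch from that checkpoint. The resumed epoch prints a
# validation step *before* its first training step and another one after it,
# instead of validating only once, after training.
resumed = pl.Trainer(max_epochs=2, limit_train_batches=1, limit_val_batches=1,
                     num_sanity_val_steps=0, enable_progress_bar=False)
resumed.fit(BoringModel(), train_dataloaders=loaders(), val_dataloaders=loaders(),
            ckpt_path=trainer.checkpoint_callback.best_model_path)
```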

Expected behavior

There should be only one validation run per epoch, and it should run after the training epoch.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.6.0dev
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12

cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7

Labels

  • bug: Something isn't working
  • good first issue: Good for newcomers
  • loops: Related to the Loop API
