`log_every_n_steps` is broken for `Trainer.{validate,test,predict}` #10436
Labels: `bug` (Something isn't working), `logging` (Related to the `LoggerConnector` and `log()`), `priority: 1` (Medium priority task)
The huge issue today with `log_every_n_steps` is that, with high probability, it is broken for `Trainer.validate`, `Trainer.test`, and `Trainer.predict`.

`log_every_n_steps` works with the trainer's `global_step` to determine whether data is going to be logged: https://github.com/PyTorchLightning/pytorch-lightning/blob/f9b9cdb0d1d4e26d25fe13f19f12ea88690aa0a8/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L74-L77

`global_step` is defined as the number of parameter updates that have occurred. It is incremented only during fitting; it is never incremented during validation, test, or prediction routines.

Therefore, if your LightningModule calls `self.log` inside a validation, test, or predict step, it is highly likely that your data will not be logged! You would have to get lucky that the `global_step` left over from a prior fitting routine happens to be a multiple of `log_every_n_steps`. It is not at all obvious to an end user why calling `self.log` results in no data being logged. This has also been a consistent complaint among Lightning users that I'm aware of, especially when trying to log metrics with `Trainer.test`.

The workaround we have is to set `Trainer(log_every_n_steps=1)` and gate `LightningModule.log` with another flag controlling the log frequency. This way, users can at least control the granularity at which they log, without surprising behavior from the Lightning framework interfering.

We need to update the logger connector code to take the `batch_idx` into account when validate, test, or predict were called, instead of the global step.
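For reference, the gating behavior can be sketched roughly as follows. This is a simplified stand-in, not the actual Lightning source; the modulo check is assumed from the linked connector lines:

```python
# Simplified sketch of the connector's gating: logging fires only when
# global_step + 1 is a multiple of log_every_n_steps, regardless of
# which trainer routine is running.

def should_update_logs(global_step: int, log_every_n_steps: int) -> bool:
    """Log only on every Nth global step."""
    return (global_step + 1) % log_every_n_steps == 0

# During fitting, global_step advances, so this fires periodically:
fit_steps = [should_update_logs(s, log_every_n_steps=50) for s in range(100)]
assert fit_steps.count(True) == 2  # fires at steps 49 and 99

# During validate/test/predict, global_step is frozen. If the last fit
# left it at, say, 123, the check is False for every single eval batch:
assert not should_update_logs(123, log_every_n_steps=50)
```

This is why eval-time `self.log` calls silently drop data unless the frozen `global_step` happens to line up with `log_every_n_steps`.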
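The workaround above can be sketched in plain Python (no Lightning import). The names `my_log_frequency`, `maybe_log`, and `records` are illustrative, not Lightning APIs; the idea is that with `Trainer(log_every_n_steps=1)` the framework's own gate always passes, and the module gates on `batch_idx` itself:

```python
# Hypothetical sketch of the workaround: gate logging on batch_idx,
# which advances on every eval batch, unlike the frozen global_step.

class MyEvalLogger:
    def __init__(self, my_log_frequency: int = 10):
        self.my_log_frequency = my_log_frequency
        self.records = []  # stands in for self.log's logging backend

    def maybe_log(self, name: str, value: float, batch_idx: int) -> None:
        # Log every Nth eval batch, deterministically.
        if batch_idx % self.my_log_frequency == 0:
            self.records.append((name, value, batch_idx))

logger = MyEvalLogger(my_log_frequency=10)
for batch_idx in range(25):  # e.g. 25 test batches
    logger.maybe_log("test_loss", 0.0, batch_idx)

assert [idx for (_, _, idx) in logger.records] == [0, 10, 20]
```

The proposed connector fix follows the same shape: key the modulo check off `batch_idx` when running under validate/test/predict, and off `global_step` during fitting.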
Originally posted by @ananthsub in #9726 (comment)
cc @tchaton @carmocca