log_every_n_steps is broken for Trainer.{validate,test,predict} #10436

Open
carmocca opened this issue Nov 9, 2021 · 1 comment
Labels
bug (Something isn't working) · logging (Related to the `LoggerConnector` and `log()`) · priority: 1 (Medium priority task)
Milestone

Comments

carmocca commented Nov 9, 2021

The big issue with log_every_n_steps today is that, with high probability, it is broken for Trainer.validate, Trainer.test, and Trainer.predict.

log_every_n_steps works with the trainer's global_step to determine if data is going to be logged: https://github.com/PyTorchLightning/pytorch-lightning/blob/f9b9cdb0d1d4e26d25fe13f19f12ea88690aa0a8/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L74-L77
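For reference, the gist of the linked check can be paraphrased as a standalone function (this is a paraphrase for illustration, not the actual LoggerConnector code):

```python
def should_update_logs(global_step: int, log_every_n_steps: int, should_stop: bool = False) -> bool:
    """Paraphrase of the gating in LoggerConnector.should_update_logs: logging is keyed
    purely off the trainer's global_step, regardless of which Trainer entry point runs."""
    return (global_step + 1) % log_every_n_steps == 0 or should_stop


# During Trainer.validate/test/predict the global_step never advances, so the check
# returns the same answer for every batch, e.g. always False here:
print(should_update_logs(global_step=0, log_every_n_steps=50))  # False
```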

global_step is defined as the number of parameter updates that occur. This number is incremented only during fitting. It is never incremented for validation, test, or prediction routines.

Therefore, if your LightningModule calls self.log inside the validation, test, or predict steps, it is highly likely that your data will not be logged! You would have to get lucky that the global step left over from a prior fitting routine happens to be a multiple of log_every_n_steps. It is not at all obvious to an end user why calling self.log results in no data being logged. This has also been a consistent complaint among the Lightning users I'm aware of, especially when trying to log metrics with Trainer.test.
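A minimal sketch of the failure mode (the module and data are illustrative, and the on_step/on_epoch flags are set explicitly to isolate step-level logging):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def test_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Nothing has been fit, so the trainer's global_step stays at 0; per the
        # gating described above, with log_every_n_steps=50 this step-level log
        # is silently dropped for every test batch.
        self.log("test_loss", loss, on_step=True, on_epoch=False)


data = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8)
trainer = pl.Trainer(log_every_n_steps=50)
trainer.test(ToyModule(), dataloaders=data)  # no "test_loss" ever reaches the logger
```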

The workaround we have is to set Trainer(log_every_n_steps=1) and gate LightningModule.log with another flag for the log frequency. This way, users can at least control the granularity with which they want to log without any surprising behavior from the Lightning framework interfering with this.
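A sketch of that workaround; the `my_log_frequency` attribute name is made up for illustration:

```python
import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    def __init__(self, my_log_frequency: int = 10):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.my_log_frequency = my_log_frequency  # module-level logging frequency

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Gate the call ourselves, so the framework's global_step-based check
        # (neutralized by log_every_n_steps=1 below) cannot silently drop it.
        if batch_idx % self.my_log_frequency == 0:
            self.log("val_loss", loss)


# log_every_n_steps=1 disables the framework-side filtering; the module's own
# flag above controls the effective logging granularity.
trainer = pl.Trainer(log_every_n_steps=1)
```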

We need to update the logger connector code to take the batch_idx into account when validate, test, or predict is called, instead of using the global step.
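One possible shape for that change, purely as an illustrative sketch (the real LoggerConnector carries more state than shown here):

```python
def should_update_logs(fitting: bool, global_step: int, batch_idx: int, log_every_n_steps: int) -> bool:
    """Hypothetical gating: key off global_step while fitting, but off the current
    batch_idx when Trainer.validate/test/predict is running."""
    step = global_step if fitting else batch_idx
    return (step + 1) % log_every_n_steps == 0
```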

Originally posted by @ananthsub in #9726 (comment)

cc @tchaton @carmocca

@carmocca carmocca added the bug (Something isn't working) and logging (Related to the `LoggerConnector` and `log()`) labels Nov 9, 2021
@tchaton tchaton added the priority: 0 (High priority task) and priority: 1 (Medium priority task) labels and removed the priority: 0 (High priority task) label Nov 15, 2021
@rohitgr7 rohitgr7 self-assigned this Sep 2, 2022
@rohitgr7 rohitgr7 added this to the pl:1.8 milestone Sep 2, 2022
@carmocca carmocca modified the milestones: v1.8, future Oct 13, 2022
@Borda Borda self-assigned this Nov 7, 2022
@signalprime

Looks like this has been a persistent issue since 2021. Any thoughts on how it can be fixed? I'm sure we would all like more frequent updates to our validation curves when using PyTorch Lightning. Is there a PR started?
