log_every_n_steps is broken for Trainer.{validate,test,predict} #10436

Open
carmocca opened this issue Nov 9, 2021 · 1 comment
Labels
bug (Something isn't working) · logging (Related to the `LoggerConnector` and `log()`) · priority: 1 (Medium priority task)
Milestone

Comments

carmocca commented Nov 9, 2021

The big issue with log_every_n_steps today is that, with high probability, it is broken for Trainer.validate, Trainer.test, and Trainer.predict.

log_every_n_steps works with the trainer's global_step to determine if data is going to be logged: https://github.com/PyTorchLightning/pytorch-lightning/blob/f9b9cdb0d1d4e26d25fe13f19f12ea88690aa0a8/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L74-L77
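For reference, the gist of the linked check can be paraphrased as a standalone function (this is a paraphrase for illustration, not the actual LoggerConnector code):

```python
def should_update_logs(global_step: int, log_every_n_steps: int, should_stop: bool = False) -> bool:
    """Paraphrase of the gating in LoggerConnector.should_update_logs: logging is keyed
    purely off the trainer's global_step, regardless of which Trainer entry point runs."""
    return (global_step + 1) % log_every_n_steps == 0 or should_stop


# During Trainer.validate/test/predict the global_step never advances, so the check
# returns the same answer for every batch, e.g. always False here:
print(should_update_logs(global_step=0, log_every_n_steps=50))  # False
```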

global_step is defined as the number of parameter updates that occur. This number is incremented only during fitting. It is never incremented for validation, test, or prediction routines.

Therefore, if your LightningModule calls self.log inside the validation, test, or predict steps, it is highly likely that your data will not be logged! You would have to get lucky that the global step left over from a prior fitting routine happens to be a multiple of log_every_n_steps. It is not at all obvious to an end user why calling self.log results in no data being logged. This has also been a consistent complaint among the Lightning users I'm aware of, especially when trying to log metrics with Trainer.test.
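A minimal sketch of the failure mode (the module and data are illustrative, and the on_step/on_epoch flags are set explicitly to isolate step-level logging):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def test_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Nothing has been fit, so the trainer's global_step stays at 0; per the
        # gating described above, with log_every_n_steps=50 this step-level log
        # is silently dropped for every test batch.
        self.log("test_loss", loss, on_step=True, on_epoch=False)


data = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8)
trainer = pl.Trainer(log_every_n_steps=50)
trainer.test(ToyModule(), dataloaders=data)  # no "test_loss" ever reaches the logger
```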

The workaround we have is to set Trainer(log_every_n_steps=1) and gate LightningModule.log with another flag for the log frequency. This way, users can at least control the granularity with which they want to log without any surprising behavior from the Lightning framework interfering with this.
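A sketch of that workaround; the `my_log_frequency` attribute name is made up for illustration:

```python
import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    def __init__(self, my_log_frequency: int = 10):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.my_log_frequency = my_log_frequency  # module-level logging frequency

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Gate the call ourselves, so the framework's global_step-based check
        # (neutralized by log_every_n_steps=1 below) cannot silently drop it.
        if batch_idx % self.my_log_frequency == 0:
            self.log("val_loss", loss)


# log_every_n_steps=1 disables the framework-side filtering; the module's own
# flag above controls the effective logging granularity.
trainer = pl.Trainer(log_every_n_steps=1)
```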

We need to update the logger connector code to take the batch_idx into account when validate, test, or predict is called, instead of using the global step.
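One possible shape for that change, purely as an illustrative sketch (the real LoggerConnector carries more state than shown here):

```python
def should_update_logs(fitting: bool, global_step: int, batch_idx: int, log_every_n_steps: int) -> bool:
    """Hypothetical gating: key off global_step while fitting, but off the current
    batch_idx when Trainer.validate/test/predict is running."""
    step = global_step if fitting else batch_idx
    return (step + 1) % log_every_n_steps == 0
```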

Originally posted by @ananthsub in #9726 (comment)

cc @tchaton @carmocca

@carmocca carmocca added the bug (Something isn't working) and logging (Related to the `LoggerConnector` and `log()`) labels Nov 9, 2021
@tchaton tchaton added the priority: 0 (High priority task) and priority: 1 (Medium priority task) labels and removed the priority: 0 (High priority task) label Nov 15, 2021
@rohitgr7 rohitgr7 self-assigned this Sep 2, 2022
@rohitgr7 rohitgr7 added this to the pl:1.8 milestone Sep 2, 2022
@carmocca carmocca modified the milestones: v1.8, future Oct 13, 2022
@Borda Borda self-assigned this Nov 7, 2022
@signalprime

Looks like this has been a persistent issue since 2021. Any thoughts on how it can be fixed? I'm sure we would all like more frequent updates to our validation curves when using PyTorch Lightning. Is there a PR started?
