Resuming from checkpoint, mid epoch gives a very distorted time estimate #18220

Open
PicoCreator opened this issue Aug 3, 2023 · 1 comment
Labels
bug Something isn't working help wanted Open to be worked on priority: 1 Medium priority task progress bar: tqdm ver: 2.0.x

PicoCreator commented Aug 3, 2023

Bug description

This might be related to #13124.

Currently, when resuming from a DeepSpeed checkpoint, it seems the time estimate is computed from the "current running time" (the time since the resume) against the "total dataset steps" (including steps completed before the checkpoint). This gives heavily distorted numbers when resuming mid-epoch on long runs of a day or more: runs that would take hours show estimates of only minutes.

[Image: Screenshot 2023-08-03 at 4 00 40 PM]

This can be observed even with small datasets and models: the it/s rate and the remaining-time estimate are wildly off at the start and improve over time, but never fall back in line with a realistic estimate (especially if the run resumed past the 50% mark).
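
To make the suspected arithmetic concrete, here is a minimal sketch in plain Python. It assumes the estimate behaves as described above (steps restored from the checkpoint are counted as if they had been done since the resume); this has not been confirmed against the Lightning source. The function names and all numbers are hypothetical.

```python
def naive_eta(current_step, total_steps, elapsed_s):
    """ETA when the rate counts every step, including those done before the resume."""
    rate = current_step / elapsed_s
    return (total_steps - current_step) / rate


def resume_aware_eta(current_step, total_steps, elapsed_s, resumed_at):
    """ETA when the rate counts only the steps completed since the resume."""
    rate = (current_step - resumed_at) / elapsed_s
    return (total_steps - current_step) / rate


# Hypothetical run: 10,000-step epoch, resumed at step 6,000,
# 100 steps completed in the 200 seconds since the resume (0.5 it/s).
naive = naive_eta(6_100, 10_000, 200.0)
aware = resume_aware_eta(6_100, 10_000, 200.0, resumed_at=6_000)
print(f"naive: {naive / 60:.1f} min, resume-aware: {aware / 3600:.1f} h")
```

Under this assumption the naive estimate improves only slowly, because the restored steps keep inflating the average rate for the rest of the epoch. That would match the observation that the estimate never recovers when resuming past the halfway mark.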

I do not have full repro steps here, but I am filing this so that others might be able to confirm or follow up on it.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @tchaton @awaelchli

@PicoCreator PicoCreator added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Aug 3, 2023
@awaelchli awaelchli added progress bar: tqdm help wanted Open to be worked on priority: 1 Medium priority task and removed needs triage Waiting to be triaged by maintainers labels Aug 3, 2023
@awaelchli awaelchli changed the title [minor bug] Resuming from checkpoint, mid epoch gives a very distorted time estimate Resuming from checkpoint, mid epoch gives a very distorted time estimate Aug 3, 2023
@awaelchli
Contributor

To resolve this, we probably need to tweak a setting on the tqdm progress bar when resuming. Need to investigate, thanks for reporting!
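
The comment above does not say which setting is meant. One candidate is tqdm's `initial` argument, which tqdm documents as useful when restarting a progress bar. The sketch below calls tqdm's static `format_meter` directly to compare the remaining-time string with and without `initial`. Whether Lightning's progress bar can pass this value on resume is an assumption, and all numbers are hypothetical.

```python
from tqdm import tqdm

TOTAL_STEPS = 10_000      # hypothetical epoch length
RESUMED_AT = 6_000        # hypothetical step restored from the checkpoint
STEPS_SINCE_RESUME = 100  # steps completed after resuming
ELAPSED_S = 200.0         # seconds since the resume

n = RESUMED_AT + STEPS_SINCE_RESUME

# Without `initial`, the average rate is n / elapsed,
# so restored steps are counted as fresh work.
eta_without_initial = tqdm.format_meter(
    n, TOTAL_STEPS, ELAPSED_S, bar_format="{remaining}"
)

# With `initial`, the average rate is (n - initial) / elapsed.
eta_with_initial = tqdm.format_meter(
    n, TOTAL_STEPS, ELAPSED_S, bar_format="{remaining}", initial=RESUMED_AT
)

print("without initial:", eta_without_initial)
print("with initial:   ", eta_with_initial)
```

On a live bar the equivalent would be `tqdm(total=TOTAL_STEPS, initial=RESUMED_AT)`. A live bar also smooths its rate over recent updates, so the numbers it displays can differ from this static calculation.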
