Resuming from a checkpoint mid-epoch gives a very distorted time estimate #18220
Labels
bug (Something isn't working)
help wanted (Open to be worked on)
priority: 1 (Medium priority task)
progress bar: tqdm
ver: 2.0.x
Bug description
This might be related to #13124.
Currently, when resuming from a DeepSpeed checkpoint, the time estimate seems to use the "current running time" (time since the resume) against the "total dataset steps". This gives incredibly warped numbers when resuming mid-epoch for long (1 day+) runs.
Estimates for runs that would take hours are shown as minutes.
This can be observed even with small datasets / models: the it/s rate and the remaining-time estimate are wildly off at the start, and although they improve over time, they never fall back in line with a realistic estimate (especially if the run resumed past the 50% mark).
I do not have full repro steps here, but I am filing this so that others might be able to confirm / follow up on it.
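To illustrate the suspected arithmetic (this is not a repro and does not use Lightning or tqdm), here is a minimal sketch. It assumes the rate is computed as "absolute step index / time since resume" instead of "steps done since resume / time since resume". The function `eta_seconds` and all the numbers are hypothetical, chosen only to show the size of the distortion.

```python
def eta_seconds(total_steps, current_step, elapsed_s, resumed_at=0):
    """Estimate remaining seconds from the rate observed in this session.

    resumed_at is the step index the run was resumed from. Leaving it at 0
    models the suspected bug: pre-resume steps are credited to the short
    elapsed time of the current session, which inflates the rate.
    """
    steps_this_session = current_step - resumed_at
    if steps_this_session <= 0 or elapsed_s <= 0:
        return float("inf")
    rate = steps_this_session / elapsed_s  # it/s
    return (total_steps - current_step) / rate


# Hypothetical run: 100k steps per epoch, resumed at step 60k,
# 100 steps done since the resume at a true speed of 2 s/step.
total_steps = 100_000
resumed_at = 60_000
current_step = 60_100
elapsed_s = 200.0

# Suspected behaviour: rate = 60_100 / 200 s, so the ETA is a couple of minutes.
naive = eta_seconds(total_steps, current_step, elapsed_s)

# Expected behaviour: rate = 100 / 200 s = 0.5 it/s, so the ETA is about 22 hours.
corrected = eta_seconds(total_steps, current_step, elapsed_s, resumed_at=resumed_at)

print(f"naive ETA:     {naive / 60:.1f} min")
print(f"corrected ETA: {corrected / 3600:.1f} h")
```

Under this assumption the inflated rate equals the true rate plus `resumed_at / elapsed_s`, so it decays only slowly as the session runs longer, which would match the estimate improving over time without ever becoming realistic.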
What version are you seeing the problem on?
v2.0
How to reproduce the bug
No response
Error messages and logs
No response
Environment
Current environment
More info
No response
cc @tchaton @awaelchli