auto resume from latest checkpoint for FBLearner retry job #1046

hudeven · 2019-10-11T21:08:17Z

Summary:

Problem:

When a job fails at epoch X, FBLearner retries it 3 times, each time from scratch (epoch 1).
This wastes time and resources.

Solution:

1. The "retry" shares the same FBLearner job_id as the original run.
2. We save all checkpoints at "manifold://pytext_training/tree/jobs/<JOB_ID>/checkpoint-<EPOCH>".
3. So there is a mapping (job_id --> latest checkpoint).
4. Before training starts, we check whether checkpoints have already been saved for the current job_id.
   Yes: load the latest checkpoint and resume training.
   No: start training from scratch.
5. Add config "auto_resume_from_snapshot" to turn this feature on/off.
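
The lookup described above can be sketched roughly as follows. This is only an illustration, not the code in this diff: it assumes a local directory stands in for the Manifold path, and the helper names `find_latest_checkpoint` and `resolve_resume_path` are hypothetical. Only the `checkpoint-<EPOCH>` naming and the `auto_resume_from_snapshot` flag come from the description.

```python
import os
import re
from typing import Optional

# Assumed naming scheme, mirroring ".../jobs/<JOB_ID>/checkpoint-<EPOCH>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(job_dir: str) -> Optional[str]:
    """Return the checkpoint path with the highest epoch, or None if there is none."""
    if not os.path.isdir(job_dir):
        return None
    latest_epoch = -1
    latest_path = None
    for name in os.listdir(job_dir):
        match = _CHECKPOINT_RE.match(name)
        if match is None:
            continue  # ignore files that are not checkpoints
        epoch = int(match.group(1))  # compare numerically, so epoch 10 beats epoch 9
        if epoch > latest_epoch:
            latest_epoch = epoch
            latest_path = os.path.join(job_dir, name)
    return latest_path


def resolve_resume_path(job_dir: str, auto_resume_from_snapshot: bool) -> Optional[str]:
    """Return a checkpoint to resume from, or None to train from scratch."""
    if not auto_resume_from_snapshot:
        return None  # feature turned off: always start from scratch
    return find_latest_checkpoint(job_dir)
```

A caller would load the returned checkpoint when the result is not `None` and otherwise start at epoch 1. Parsing the epoch as an integer matters here, since a plain string sort would rank `checkpoint-9` above `checkpoint-10`.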

Differential Revision: D17874151

facebook-github-bot · 2019-10-11T21:08:33Z

This pull request was exported from Phabricator. Differential Revision: D17874151

…esearch#1046)

Summary:
Pull Request resolved: facebookresearch#1046

## Problem:
When a job fails at epoch X, FBLearner retries it 3 times, each time from scratch (epoch 1).
This wastes time and resources.

## Solution:
1. The "retry" shares the same FBLearner job_id as the original run.
2. We save all checkpoints at "manifold://pytext_training/tree/jobs/<JOB_ID>/checkpoint-<EPOCH>".
3. So there is a mapping (job_id --> latest checkpoint).
4. Before training starts, we check whether checkpoints have already been saved for the current job_id.
   Yes: load the latest checkpoint and resume training.
   No: start training from scratch.
5. Add config "auto_resume_from_snapshot" to turn this feature on/off.

Reviewed By: mwu1993

Differential Revision: D17874151

fbshipit-source-id: 6778e68e90523e982c141b76e916cc7d3038de8f

facebook-github-bot · 2019-10-16T03:38:17Z

This pull request was exported from Phabricator. Differential Revision: D17874151

…esearch#1046)

Summary:
Pull Request resolved: facebookresearch#1046

## Problem:
When a job fails at epoch X, FBLearner retries it 3 times, each time from scratch (epoch 1).
This wastes time and resources.

## Solution:
1. The "retry" shares the same FBLearner job_id as the original run.
2. We save all checkpoints at "manifold://pytext_training/tree/jobs/<JOB_ID>/checkpoint-<EPOCH>".
3. So there is a mapping (job_id --> latest checkpoint).
4. Before training starts, we check whether checkpoints have already been saved for the current job_id.
   Yes: load the latest checkpoint and resume training.
   No: start training from scratch.
5. Add config "auto_resume_from_snapshot" to turn this feature on/off.

Reviewed By: mwu1993

Differential Revision: D17874151

fbshipit-source-id: ec17cbd1a5f5022b47e2371e9063c738b67d9018

facebook-github-bot · 2019-10-16T17:26:13Z

This pull request was exported from Phabricator. Differential Revision: D17874151

facebook-github-bot · 2019-10-17T02:43:05Z

This pull request has been merged in 8f93ce1.

facebook-github-bot added the CLA Signed label Oct 11, 2019

hudeven force-pushed the export-D17874151 branch from ae495fe to 904c3f7 Compare October 16, 2019 03:38

hudeven force-pushed the export-D17874151 branch from 904c3f7 to 3490927 Compare October 16, 2019 17:26

facebook-github-bot closed this in 8f93ce1 Oct 17, 2019

facebook-github-bot added the Merged label Oct 17, 2019
