This repository was archived by the owner on Nov 22, 2022. It is now read-only.

auto resume from latest checkpoint for FBLearner retry job #1046

Closed
wants to merge 1 commit into from


Contributor

@hudeven hudeven commented Oct 11, 2019

Summary:

Problem:

When a job fails at epoch X, FBLearner retries it 3 times, each time from scratch (epoch 1).
This wastes time and resources.

Solution:

  1. A retry shares the same FBLearner job_id as the original run.
  2. We save all checkpoints at "manifold://pytext_training/tree/jobs/<JOB_ID>/checkpoint-".
  3. This gives a mapping (job_id --> latest checkpoint).
  4. Before training starts, we check whether checkpoints have already been saved for the current job_id.
    Yes: load the latest checkpoint and resume training.
    No: start training from scratch.
  5. Add a config option "auto_resume_from_snapshot" to turn this feature on or off.
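
The lookup in steps 2 to 4 can be sketched roughly as below. This is a minimal illustration, not the code in this diff: it assumes each checkpoint path ends in a numeric epoch suffix, and `latest_checkpoint` and `resolve_start` are hypothetical helper names. Listing the paths from Manifold is left out, so the functions take a plain list of path strings.

```python
import re
from typing import Iterable, Optional

# Assumption: checkpoint paths end in "checkpoint-" followed by an integer epoch.
CHECKPOINT_RE = re.compile(r"/checkpoint-(\d+)$")


def latest_checkpoint(paths: Iterable[str]) -> Optional[str]:
    """Return the path with the highest epoch suffix, or None if none match."""
    best_epoch, best_path = -1, None
    for path in paths:
        match = CHECKPOINT_RE.search(path)
        if match is None:
            continue  # not a checkpoint path, ignore it
        epoch = int(match.group(1))
        if epoch > best_epoch:
            best_epoch, best_path = epoch, path
    return best_path


def resolve_start(
    paths: Iterable[str], auto_resume_from_snapshot: bool
) -> Optional[str]:
    """Pick the checkpoint to load; None means train from scratch."""
    if not auto_resume_from_snapshot:
        return None  # feature switched off: always start from scratch
    return latest_checkpoint(paths)
```

Comparing the epoch as an integer matters here: a plain string sort would rank `checkpoint-9` above `checkpoint-10`.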

Differential Revision: D17874151

@facebook-github-bot added the CLA Signed label on Oct 11, 2019
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D17874151

hudeven added a commit to hudeven/pytext that referenced this pull request Oct 16, 2019
…esearch#1046)

Summary:
Pull Request resolved: facebookresearch#1046

## Problem:
When a job fails at epoch X, FBLearner retries it 3 times, each time from scratch (epoch 1).
This wastes time and resources.

## Solution:
1. A retry shares the same FBLearner job_id as the original run.
2. We save all checkpoints at "manifold://pytext_training/tree/jobs/<JOB_ID>/checkpoint-<EPOCH>".
3. This gives a mapping (job_id --> latest checkpoint).
4. Before training starts, we check whether checkpoints have already been saved for the current job_id.
Yes: load the latest checkpoint and resume training.
No: start training from scratch.
5. Add a config option "auto_resume_from_snapshot" to turn this feature on or off.

Reviewed By: mwu1993

Differential Revision: D17874151

fbshipit-source-id: 6778e68e90523e982c141b76e916cc7d3038de8f
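
To make the saving concrete, here is a toy simulation of one failed run followed by one retry. It is only a sketch under stated assumptions: `train` and `run_once_with_retry` are hypothetical, the checkpoint store is an in-memory dict keyed by job_id, and nothing here is the PyText or FBLearner API.

```python
from typing import Dict, List, Optional


def train(
    job_id: str,
    num_epochs: int,
    store: Dict[str, int],
    epochs_run: List[int],
    auto_resume_from_snapshot: bool = True,
    fail_at: Optional[int] = None,
) -> None:
    # Hypothetical trainer: `store` maps job_id -> last checkpointed epoch.
    start_epoch = 1
    if auto_resume_from_snapshot and job_id in store:
        start_epoch = store[job_id] + 1  # resume after the latest checkpoint
    for epoch in range(start_epoch, num_epochs + 1):
        if epoch == fail_at:
            raise RuntimeError(f"simulated failure at epoch {epoch}")
        epochs_run.append(epoch)  # stands in for one epoch of real training
        store[job_id] = epoch  # checkpoint once the epoch has finished


def run_once_with_retry(auto_resume_from_snapshot: bool) -> List[int]:
    store: Dict[str, int] = {}
    epochs_run: List[int] = []
    try:
        train("job-1", 5, store, epochs_run, auto_resume_from_snapshot, fail_at=4)
    except RuntimeError:
        # The retry reuses the same job_id, so it can see earlier checkpoints.
        train("job-1", 5, store, epochs_run, auto_resume_from_snapshot)
    return epochs_run
```

In this toy setup, a 5-epoch job that fails at epoch 4 runs 5 epochs in total when the retry resumes, and 8 epochs when the retry starts over from epoch 1.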
@facebook-github-bot
Contributor

This pull request has been merged in 8f93ce1.
