Don't do state cleanup if training not converged and return trainer's state #581

artemsok · 2018-11-26T16:43:55Z

By default, the training_state folder is deleted even if training did not converge, i.e. the number of checkpoints without improvement did not reach --max-num-not-improved and training stopped because it reached --max-samples, --max-num-epochs, or --max-updates. This change keeps the folder in that case, so that training can be continued later, and no longer requires the above parameters to match those of the old run. Additionally, the full training state is returned (with metrics, number of epochs, convergence status, etc.), so that calling code can process it.
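The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `TrainState` and `finish_training` are hypothetical names, and the fields shown are assumptions about what a training state might carry.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrainState:
    """Hypothetical stand-in for the trainer's state (not Sockeye's actual class)."""
    epoch: int = 0
    updates: int = 0
    num_not_improved: int = 0
    converged: bool = False
    metrics: List[Dict[str, float]] = field(default_factory=list)


def finish_training(state: TrainState, state_dir: str) -> TrainState:
    """Remove the training-state folder only after convergence, then return the state."""
    if state.converged:
        # Converged: the saved state is no longer needed for continuation.
        shutil.rmtree(state_dir, ignore_errors=True)
    # Not converged (e.g. an update or epoch limit was hit): keep the folder.
    return state


# Usage: a run that stopped on an update limit without converging.
state_dir = tempfile.mkdtemp(prefix="training_state_")
state = finish_training(TrainState(epoch=3, updates=1000, converged=False), state_dir)
print("converged:", state.converged, "| state folder kept:", os.path.isdir(state_dir))
shutil.rmtree(state_dir)  # tidy up the temporary folder used by this example
```

Returning the whole state rather than a single boolean lets calling code inspect metrics and epoch counts as well as the convergence status.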

Pull Request Checklist

Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
Unit tests pass (pytest)
System tests pass (pytest test/system)
Passed code style checking (./style-check.sh)
You have considered writing a test
Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

fhieber

Thanks for the change, makes a lot of sense to have this!

sockeye/constants.py

sockeye/training.py

sockeye/train.py

CHANGELOG.md

fhieber

LGTM

CHANGELOG.md

sockeye/training.py

artemsok requested review from davvil, fhieber, mjdenkowski and tdomhan as code owners November 26, 2018 16:43

artemsok force-pushed the cleanup branch from 2acfc77 to 6d4ad78 Compare November 26, 2018 16:52

No training state cleanup if not converged and returning a flag to indicate convergence.

b282b9f

artemsok force-pushed the cleanup branch from 6d4ad78 to b282b9f Compare November 26, 2018 16:53

fhieber reviewed Nov 26, 2018

sockeye/constants.py

sockeye/training.py

sockeye/training.py

sockeye/train.py

CHANGELOG.md

Artem Sokolov added 2 commits November 27, 2018 08:28

Addressing comments

17df14c

Updating the CHANGELOG message.

70f1a11

artemsok changed the title ~~Don't do state cleanup if training not converged and return a flag to indicate convergence~~ Don't do state cleanup if training not converged and return trainer's state Nov 27, 2018

fhieber approved these changes Nov 27, 2018

tdomhan reviewed Nov 27, 2018

CHANGELOG.md

sockeye/training.py

Wording in CHANGELOG and no forward referencing.

568a439

tdomhan approved these changes Nov 27, 2018

fhieber merged commit 4cb7f69 into awslabs:master Nov 27, 2018

artemsok mentioned this pull request Dec 19, 2018

Fixing arguments that are allowed to differ for training continuation #610

Merged

8 tasks

fhieber pushed a commit that referenced this pull request Dec 19, 2018

Fixing arguments that are allowed to differ for training continuation (#610)

390acde

Fixing typos introduced in #581 and adding a small test to prevent similar errors.
