Don't do state cleanup if training not converged and return trainer's state #581

Merged 4 commits on Nov 27, 2018

Conversation

Contributor

@artemsok artemsok commented Nov 26, 2018

By default, the training_state folder is deleted even if training did not converge, i.e. the number of checkpoints without improvement did not reach --max-num-not-improved and training stopped only because it reached --max-samples, --max-num-epochs, or --max-updates. This change keeps the folder in that case, so training can be continued later without requiring those parameters to match the old run. Additionally, the full training state is returned (with metrics, number of epochs, convergence status, etc.), so that calling code can process it.
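The behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: `TrainState`, `finish_training`, and their fields are hypothetical names chosen for the example, and only the `training_state` folder name and the not-improved limit come from the description.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class TrainState:
    """Hypothetical stand-in for the trainer's state (not Sockeye's actual class)."""
    epoch: int = 0
    updates: int = 0
    num_not_improved: int = 0
    converged: bool = False
    metrics: List[Dict[str, float]] = field(default_factory=list)


def finish_training(state: TrainState, output_dir: Path,
                    max_num_not_improved: int) -> TrainState:
    """Clean up the training state only on convergence and return the full state.

    Assumption for this sketch: "converged" means the count of checkpoints
    without improvement reached the configured limit. Stopping for any other
    reason (max samples, epochs, or updates) leaves the folder in place so
    that training can be resumed later.
    """
    state.converged = state.num_not_improved >= max_num_not_improved
    state_dir = output_dir / "training_state"
    if state.converged and state_dir.exists():
        shutil.rmtree(state_dir)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        (out / "training_state").mkdir()
        # Stopped early by an update limit: not converged, folder is kept.
        result = finish_training(TrainState(epoch=3, num_not_improved=2), out, 8)
        print(result.converged, (out / "training_state").exists())
```

Returning the whole state object rather than a single flag lets calling code inspect the metrics and epoch count as well as the convergence status, which matches the intent stated in the description.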

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Contributor

@fhieber fhieber left a comment


Thanks for the change, makes a lot of sense to have this!

sockeye/constants.py (outdated, resolved)
sockeye/training.py (outdated, resolved)
sockeye/training.py (outdated, resolved)
sockeye/train.py (outdated, resolved)
CHANGELOG.md (outdated, resolved)
@artemsok changed the title from "Don't do state cleanup if training not converged and return a flag to indicate convergence" to "Don't do state cleanup if training not converged and return trainer's state" on Nov 27, 2018
Contributor

@fhieber fhieber left a comment


LGTM

CHANGELOG.md (outdated, resolved)
sockeye/training.py (outdated, resolved)
@fhieber fhieber merged commit 4cb7f69 into awslabs:master Nov 27, 2018
fhieber pushed a commit that referenced this pull request Dec 19, 2018
…#610)

Fixing typos introduced in #581 and adding a small test to prevent similar errors.