Stop training on Inf/NaN loss #2070

Merged
merged 6 commits into master from mcabbott-patch-3 on Oct 16, 2022

Conversation

@mcabbott (Member) commented Sep 27, 2022

Closes #1981, by altering train! so that when it encounters an infinite or NaN loss, it throws an error and stops. It stops before updating the model, because such an update would usually make everything NaN.

Not sure it should stop, in fact. Maybe it should skip that datapoint & continue?
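To make the ordering concrete, here is a minimal sketch of the behaviour described above, in plain Julia with no Flux dependency (the function name and the hard-coded step size are illustrative, not the PR's actual code): the loss is checked before the parameter update, so a single bad batch never corrupts the model.

```julia
# Sketch: guard the update on a finite loss, as this PR does inside train!.
# `params`, `grad`, and the 0.1 step size are stand-ins for the real
# model/optimiser machinery.
function guarded_step!(params, grad, l)
    isfinite(l) || throw(DomainError(l, "loss is $l; stopping before the update"))
    params .-= 0.1 .* grad   # stand-in for the optimiser update
    return params
end

guarded_step!([1.0, 2.0], [0.1, 0.1], 0.5)    # updates and returns params
# guarded_step!([1.0, 2.0], [0.1, 0.1], NaN)  # throws DomainError; params untouched
```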

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable

@ToucheSir (Member)

It would be good to have some way for user code to tell that training has stopped prematurely. Not sure what that would look like: a return value, callback, flag, etc.?

@mcabbott (Member, Author) commented Oct 4, 2022

One way would be to upgrade this to an error. It does depend a little on how complex/comprehensive we think train! ought to be.

I found #821, in which all the options were discussed. The idea was to write isfinite(l) || Flux.skip(), except that this got documented as something to do in a callback, which never worked: the callback runs too late. And nobody noticed, presumably because this was sufficiently confusing. So my opinion is that anything that complicated should be an ordinary for loop.
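For reference, the "ordinary for loop" alternative might look like the sketch below. It is self-contained plain Julia with a toy one-parameter model (the quadratic loss, step size, and data are all made up for illustration); in real Flux code the marked lines would be withgradient and update! calls. The point is that skipping a bad datapoint is just a continue, no callback machinery needed.

```julia
# Skip non-finite losses and keep training, instead of stopping.
w = 1.0                                      # toy model: a single weight
data = [(1.0, 2.0), (NaN, 0.0), (2.0, 4.0)]  # (x, y) pairs; one bad point
for (x, y) in data
    l = (w * x - y)^2                        # loss (stand-in for the model's loss)
    isfinite(l) || continue                  # what Flux.skip() was meant to do
    g = 2 * (w * x - y) * x                  # gradient dl/dw by hand
    global w -= 0.1 * g                      # stand-in for the optimiser step
end
# Two good points applied, one NaN point skipped; w stays finite.
```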

@ToucheSir (Member)

Raising StopException from the inner loop doesn't seem like the worst idea. Unlike calling skip or stop in callbacks, this would be able to intercept things in time. I guess the only question is whether making StopException part of the public API for this purpose is palatable.

@mcabbott (Member, Author) commented Oct 5, 2022

StopException leads directly to break, so I think there's nothing gained by implementing an auto-check that way compared to this PR's current code?

The point of the exception, I now see, was that you could throw it from inside the loss function passed to gradient and still get to break. But documenting it as part of the callback was a mistake.

@ToucheSir (Member)

I missed that StopException was still being caught by the inner loop of train!. In that case, it'd have to be another exception type. Whether adding one just for this makes sense is a good question.

@mcabbott (Member, Author) commented Oct 5, 2022

Ah OK, so you weren't proposing to throw and catch an error. It could just throw a DomainError, which seems the closest Base type.

@ToucheSir (Member) left a review

LGTM, but if @darsnack or @CarloLucibello wouldn't mind having a once-over that would be great.

@mcabbott mcabbott merged commit 4c38c8a into master Oct 16, 2022
@mcabbott mcabbott deleted the mcabbott-patch-3 branch October 16, 2022 14:19
@KronosTheLate (Contributor)

Thanks for taking the issue I raised so seriously; it is so nice to encounter active developers who listen to user requests. Thanks for your work <3

Development

Successfully merging this pull request may close these issues.

Warn on NaN loss
3 participants