Stop training on Inf/NaN loss #2070

Merged
merged 6 commits into master from mcabbott-patch-3 on Oct 16, 2022

Conversation

@mcabbott (Member) commented Sep 27, 2022

Closes #1981, by altering train! so that when it encounters an infinite or NaN loss, it throws an error and stops. It stops before updating the model, because such an update would usually make everything NaN.

Not sure it should stop, in fact. Maybe it should skip that datapoint & continue?
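To make the ordering concrete, here is a minimal sketch of the behaviour described above, in plain Julia with no Flux dependency (the function name and the hard-coded step size are illustrative, not the PR's actual code): the loss is checked before the parameter update, so a single bad batch never corrupts the model.

```julia
# Sketch: guard the update on a finite loss, as this PR does inside train!.
# `params`, `grad`, and the 0.1 step size are stand-ins for the real
# model/optimiser machinery.
function guarded_step!(params, grad, l)
    isfinite(l) || throw(DomainError(l, "loss is $l; stopping before the update"))
    params .-= 0.1 .* grad   # stand-in for the optimiser update
    return params
end

guarded_step!([1.0, 2.0], [0.1, 0.1], 0.5)    # updates and returns params
# guarded_step!([1.0, 2.0], [0.1, 0.1], NaN)  # throws DomainError; params untouched
```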

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable

@ToucheSir (Member)

It would be good to have some way for user code to tell that training has stopped prematurely. Not sure what that would look like: a return value, callback, flag, etc.?

@mcabbott (Member, Author) commented Oct 4, 2022

One way would be to upgrade this to an error. It does depend a little on how complex/comprehensive we think train! ought to be.

I found #821, in which all the options were discussed. The idea was to write isfinite(l) || Flux.skip(), except that this got documented as something to do in a callback, which never worked: the callback runs too late. And nobody noticed, presumably because this was sufficiently confusing. So my opinion is that anything that complicated should be an ordinary for loop.
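For reference, the "ordinary for loop" alternative might look like the sketch below. It is self-contained plain Julia with a toy one-parameter model (the quadratic loss, step size, and data are all made up for illustration); in real Flux code the marked lines would be withgradient and update! calls. The point is that skipping a bad datapoint is just a continue, no callback machinery needed.

```julia
# Skip non-finite losses and keep training, instead of stopping.
w = 1.0                                      # toy model: a single weight
data = [(1.0, 2.0), (NaN, 0.0), (2.0, 4.0)]  # (x, y) pairs; one bad point
for (x, y) in data
    l = (w * x - y)^2                        # loss (stand-in for the model's loss)
    isfinite(l) || continue                  # what Flux.skip() was meant to do
    g = 2 * (w * x - y) * x                  # gradient dl/dw by hand
    global w -= 0.1 * g                      # stand-in for the optimiser step
end
# Two good points applied, one NaN point skipped; w stays finite.
```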

@ToucheSir (Member)

Raising StopException from the inner loop doesn't seem like the worst idea. Unlike calling skip or stop in callbacks, this would be able to intercept things in time. I guess the only question is whether making StopException part of the public API for this purpose is palatable.

@mcabbott (Member, Author) commented Oct 5, 2022

StopException leads directly to break, so I think there's nothing gained by implementing an auto-check that way compared to this PR's current code?

The point of the exception, I now see, was that you could throw it from inside the loss function passed to gradient and still get to break. But documenting it as part of the callback was a mistake.

@ToucheSir (Member)

I missed that StopException was still being caught by the inner loop of train!. In that case, it'd have to be another exception type. Whether adding one just for this makes sense is a good question.

@mcabbott (Member, Author) commented Oct 5, 2022

Ah OK, so you weren't proposing to throw and catch an error. It could just throw a DomainError, which seems the closest Base type.

@ToucheSir (Member) left a review

LGTM, but if @darsnack or @CarloLucibello wouldn't mind having a once-over that would be great.

@mcabbott mcabbott merged commit 4c38c8a into master Oct 16, 2022
@mcabbott mcabbott deleted the mcabbott-patch-3 branch October 16, 2022 14:19
@KronosTheLate (Contributor)

Thanks for taking the issue I raised so seriously; it is so nice to encounter active developers who listen to user requests. Thanks for your work <3

Development

Successfully merging this pull request may close these issues.

Warn on NaN loss
3 participants