
Optimizer handling of infinite loss #821

Open

ChrisRackauckas opened this issue Jul 28, 2019 · 21 comments

Comments

@ChrisRackauckas
Member

ChrisRackauckas commented Jul 28, 2019

Ref SciML/DiffEqFlux.jl#70. As discussed at JuliaCon, nicer strategies can be used so that the training won't just explode if you get an infinite loss. You can discard the point and take another random draw, and things like that. @oxinabox @pkofod probably know more, since I know Optim.jl and other libraries handle this.

(@freemin7)

@oxinabox
Member

Just dropping it feels like a risky move, but I think it might actually be the right one.
Under SGD, a random minibatch is just sampling the loss space.
If most of it is not a pole, then we can just sample again.

There are loss functions that have a degenerate case of -Inf when you are doing very well
(e.g. anything given by maximizing an expected value parameter and dividing by a dispersion parameter).

OTOH, this is a bit risky since this could also be a result of a user mistake in the loss function definition.

@MikeInnes
Member

Dropping everything on Inf seems pretty expensive (we'd have to check for the presence of Inf across all gradients). But just treating Inf as 0 would be easy to do.

@oxinabox
Member

Not gradients, losses.
Different.

@MikeInnes
Member

If it's really losses you're concerned about, that's independent of the optimisers, which only ever see gradients. It's less clear what the action item is here since normally it's up to the user to decide what to do with a loss. We could perhaps add a check in train!, but that'd only apply if you used that, and this seems like a relatively advanced use case.

@freemin7

I used DiffEqFlux to estimate parameters of a differential equation whose solution has a finite escape time in some areas of the parameter space.
This means 3 things:

  • any good ODE solver will fail to integrate the solution past the singularity in the solution
  • the only knowledge gained by this result is: this parameter value is bad, and there is nothing more that can be done.
  • the difference between the training data and the solution is infinite in some areas and not defined in others.

So even from a mathematical point of view there is no well-behaved way to handle this. If I currently return an infinity in my loss function, DiffEqFlux (and the underlying machinery) throws an error, halting the training.

It would be really helpful if there were a way to tell Flux to ignore this loss and possibly stay away from this area (optional, but nice to have in my case where bad parameters form a region). I am not here to debate whether this should be a +Inf or a NaN or something similar.

If I had to give an action item it would be:

  • Give basic control over the optimizer from the loss function.

My proposal would look like this:

  • +Inf : ignore this loss and, if you can and have it implemented, do something smart to prevent going there again in the future (scenario: something is wrong with that region of the parameter space). Doing the smart thing is not required, but it is explicitly asked for.
  • NaN : ignore this loss, but no measures need to be taken (scenario: the loss function behaves funky sometimes, but it knows when it is funky; the user decided to pick this function anyway and should not have to explain why to the optimizer).
  • -Inf : stop what you are doing at this moment and return the current state as faithfully as possible (scenario: within this evaluation of the loss function we have realized that the solution is perfect; don't spoil it, optimizer, just return with your result).

@oxinabox
Member

Dropping NaN would be bad.
In general it shows up all the time in classification when the internals of activation or loss functions are implemented wrong, so dropping it silently would hide real bugs.

I am coming round to the argument that you should just use a custom training loop
and not handle this in the optimizer.
Then you can easily do things like say that 20% of runs giving NaN is OK, but any more than that is an error.
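For illustration, a rough sketch of what such a loop could look like (untested; the nan_tolerance keyword and the names loss, ps, data, opt are placeholders):

using Flux

function tolerant_train!(loss, ps, data, opt; nan_tolerance = 0.2)
    nan_count, total = 0, 0
    for (x, y) in data
        total += 1
        if isnan(loss(x, y))                      # cheap forward pass first
            nan_count += 1
            nan_count > nan_tolerance * total && error("too many NaN losses")
            continue                              # skip this minibatch, leave the parameters untouched
        end
        gs = Flux.gradient(() -> loss(x, y), ps)  # second forward pass, this time with gradients
        Flux.Optimise.update!(opt, ps, gs)
    end
end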

-Inf doesn't always mean perfect.
It could just be that your loss is pathological,
e.g. -5/0 is good, but -6/0 is better, even though the loss function doesn't capture that (under IEEE math) because it is the degenerate case.

@freemin7

freemin7 commented Jul 29, 2019

This counter-argument makes sense. I also get the position "if there are weird values, throw an error". It is a design decision.
However, I also find the approach "if you are using DiffEqFlux on a differential equation for which you cannot guarantee convergence over all of the parameter space, just write your own custom loop" a little hostile.
Adding a check that ignores +Inf would satisfy me and satisfy the need of not crashing.
However, this raises the question of how to prevent running into the same error repeatedly. If you are just trying to fit a curve to a complete data set without randomization, it is likely that the next step runs into the same problem. I have no solution for this, but maybe you can come up with one. Maybe a callback on a non-real-valued loss?
If the solution turns out to be that DiffEqFlux gets a custom loop, and any user of this package who wants to change that loop has to include some boilerplate that checks for the pathological behavior specific to their problem, that is acceptable for the end user. But I am not sure that is possible with the current DiffEqFlux interface.

I think I have communicated my needs, and I hand the discussion back to the experts on this code base unless I am asked for more. I trust you will come up with something good. Thanks.

@oxinabox
Member

However, I also find the approach "if you are using DiffEqFlux on a differential equation for which you cannot guarantee convergence over all of the parameter space, just write your own custom loop" a little hostile.

Sorry, I did not mean to come across as hostile.
Writing a custom loop is going to be a first-class way to interact with Flux in some future version.
It's almost there already.
It isn't going to be some expert thing, but basically (IMO) something you do in the first few weeks of using Flux.
See #666 for full details.


On further thought, I don't think this needs a custom loop,
just a custom loss:

If your original loss function was plain_loss(x, y), the logic you described can be done like this:

function smart_loss(x, y)
    loss = plain_loss(x, y)
    if loss == -Inf     # everything is perfect, stop now
        Flux.stop()     # throws a StopException, which Flux.train! catches and treats as a termination signal
    elseif loss == Inf  # this is bad, the model needs fixing, the optimizer can't help
        throw(DomainError(loss, "infinite loss"))  # bubbles up outside train! to somewhere it can be dealt with
    elseif isnan(loss)  # the loss is just funky sometimes, ignore it
        return 0        # an untracked value, so no gradients will propagate through it and the model will not change
    else
        return loss     # a tracked value, which the optimizer can use to update state
    end
end
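For completeness, hypothetical usage with the standard loop (model, data, and the choice of ADAM are assumptions):

ps = Flux.params(model)
Flux.train!(smart_loss, ps, data, ADAM())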

@MikeInnes
Member

Potentially we could add Flux.skip() as an API, which would just move on to the next data point without applying gradients for this one. That seems to be the desired behaviour on Inf IIUC.

@freemin7

If Flux.skip() could be called from the loss function that would work too.

@ChrisRackauckas
Member Author

Yes, that's probably what you need, because even if it's not Inf you may still want to skip. The classic example with ODEs is something that has early termination. For differential equations, you would have sol.retcode != :Success && sol.retcode != :Terminated (terminated is for manual terminate!(integrator) callbacks, which for example are used when calculating steady states with a timespan of (0.0,Inf)). So in parameter estimation we usually check the retcode and throw an Inf if there's a failure, but in this case we could just Flux.skip() it.
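For concreteness, a sketch of that retcode check (the ODE right-hand side f, u0, tspan, t_obs, the solver choice, and target_data are all assumptions here):

using OrdinaryDiffEq

function ode_loss(p)
    prob = ODEProblem(f, u0, tspan, p)
    sol = solve(prob, Tsit5(), saveat = t_obs)
    if sol.retcode != :Success && sol.retcode != :Terminated
        return Inf  # or call Flux.skip() here, once that exists
    end
    return sum(abs2, Array(sol) .- target_data)
end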

@pkofod

pkofod commented Jul 31, 2019

In Optim (because this was mentioned above) we rely on users setting the objective value of "bad" inputs to Inf, but that's slightly different. There, a "bad" input is typically an input from a region in parameter space that you cannot step into (the model is not defined, the model can't be solved successfully, ...), so we just use that information to backtrack into a "nice" region and do the line search there. "Optimizers" or "trainers" are a bit different in the Flux world, because line search is rarely used. In Optim, if it's not possible to backtrack into a finite-valued region, we simply halt.
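As a toy illustration of that convention (the objective here is made up):

using Optim

function objective(x)
    x[1] <= 0 && return Inf    # region where the model is undefined
    return (log(x[1]) - 1.0)^2
end

# Starting from a finite-valued point; if a line-search step lands in the Inf
# region, the backtracking line search retreats towards the last good iterate.
optimize(objective, [5.0], BFGS())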

@oxinabox
Member

If Flux.skip() could be called from the loss function that would work too.

Yes.
Flux.skip() would work the same way Flux.stop() does.
So it would throw an exception that gets handled in the standard Flux.train! loop.
Except rather than triggering break to exit the loop,
it would trigger continue to move to the next round.
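A minimal sketch of that mechanism (SkipException here is an assumption, mirroring how Flux.Optimise.StopException is handled today):

struct SkipException <: Exception end

skip() = throw(SkipException())

# Inside the train! loop, roughly:
# try
#     gs = gradient(() -> loss(d...), ps)
#     update!(opt, ps, gs)
# catch ex
#     ex isa SkipException && continue             # move on to the next batch
#     ex isa Flux.Optimise.StopException && break   # existing behaviour of Flux.stop()
#     rethrow(ex)
# end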

@freemin7

Poke:
Has something happened about this?

@MikeInnes
Member

We haven't added Flux.skip() yet, but I think a PR for it would be accepted.

FWIW, you can also do something like isinf(loss) && return 0, or similar, in your forward pass. This is equivalent to skip with a small amount of extra work to compute gradients redundantly (but hopefully Inf is not being produced too often anyway).
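That guard might look something like this (plain_loss stands in for whatever forward pass you already have):

function guarded_loss(x, y)
    l = plain_loss(x, y)
    isinf(l) && return zero(l)  # a plain constant, so no meaningful gradient flows back and the update is a no-op
    return l
end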

Moelf mentioned this issue Jun 16, 2020
bors bot added a commit that referenced this issue Oct 9, 2020
1232: add Flux.skip() r=DhairyaLGandhi a=Moelf

per #821

@atiyo
Contributor

atiyo commented Aug 31, 2021

Can this issue be closed, now that Flux.skip has been added in #1232?

@ChrisRackauckas
Member Author

It hasn't quite been solved yet. There is now a manual way for users to change the optimizer behavior in the presence of infinities, but the optimizers are still not robust to infinite loss values like something from Optim, NLopt, IPOPT, etc.

@DhairyaLGandhi
Member

Do you think we can get some resources to add a few global optimisation routines, as well as document GalacticOptim.jl and NLOpt.jl use with Flux in Optimisers.jl?

@ChrisRackauckas
Member Author

Sure, but that's still orthogonal. It would be good for Flux optimizers to have robust handling of Infs that rejects the step, pulls back to the previous parameters, and changes the step size, or something like what other optimizers do. Or maybe just automatically apply skip inside the standard training loop when an Inf is seen.

@ToucheSir
Member

Seems like that would fit well in a higher-level library like FastAI.jl.

@darsnack
Member

darsnack commented Aug 31, 2021

Referring to Mike's comment: Flux's optimizers never see objective functions or values, and Flux doesn't have any optimization routines other than train!. So the most appropriate thing here would be to add some default callbacks to train! that skip on Inf.

GalacticOptim.jl calls packages like NLopt by invoking a complete optimization routine, but for Flux it manually steps Flux's optimizers with update!. If we want Flux to behave robustly like NLopt, then we should work on calling train! instead, with the appropriate callbacks in place, and we can make train! robust.
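One way to get that skip-on-Inf behavior with the current API is to wrap the loss rather than use a callback (plain_loss, ps, data, and the optimizer choice are placeholders):

loss_with_skip(x, y) = (l = plain_loss(x, y); isinf(l) ? Flux.skip() : l)

Flux.train!(loss_with_skip, ps, data, ADAM())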
