Replies: 12 comments 39 replies
-
I guess a possible change is that instead of the scheduler's …
-
I think the proposed approach exists more or less wholesale in Optimisers.jl already. Some points of interest: …
So why not use Optimisers.jl now and be done with it?
Finally, there is also the more philosophical question of what Optimisers.jl is trying to be. As you noted, Optax is clearly more of a "gradient transformation" library, but what about gradient-free optimizers in Optim.jl? I'm of the opinion that the design space and tradeoffs change quite a bit depending on whether we treat Optimisers.jl as a gradient transformation library, a general optimizer interface, some intersection of the two, or something else entirely.
-
GalacticOptim.jl is already a universal optimization interface. Why build two? Optimisers.jl was just to move the optimizers out of Flux, IIRC.
-
Coming to GalacticOptim hopefully soon.
GalacticOptim uses all explicit state, since implicit state cannot be compatible with non-Julia tools. Its functions have a …
-
I'm trying to find a second to work on SciMLInterface, which will just be a simple package that exports all of the problem and solution types in SciML and the …
-
As a more general note @ToucheSir, this is basically a draft of what I would PR to Optimisers.jl. I think that package has the foundation already laid, and this is just a view of the eventual final product.
-
Interesting development: PyTorch is now adding more functional optimizers for distributed training: https://github.com/pytorch/pytorch/blob/master/torch/distributed/optim/functional_adagrad.py. I haven't looked at the interface or implementation in detail, but at least superficially it bears a strong resemblance to what has been proposed here.
-
Functional seems to be the way to go in newer problem domains. We shouldn't worry about diverging from PyTorch and their interfaces etc.; this isn't Torch.jl (pun intended). Optimisers.jl has most of the optimisers now (modulo some ADAM derivatives; we have a format to copy-paste the code, basically), and the schedulers PR seems to match many of the assumptions in this proposal as well. So if you can imagine changing the behaviour of optimisers as in Optimisers.jl, using it in Flux like FluxML/Flux.jl#1481, with standard schedulers that can be part of hooks / be context aware, we should be good.

```julia
p1(l) = isnan(l) && Flux.skip()

# or
struct Losses
    n
end
ls = Losses([])
p2(l) = append!(ls.n, l)

Flux.train(..., prehook = [p1, p2])
```

is one such way to look at it. Notice also that we don't limit ourselves to implicit params in the PR I mentioned. We can be compatible of course, but this is a clearer interface with the simple loop.
-
I am debating whether or not to retain … I would be interested to hear why …
-
Maybe something like …
-
Starting a new thread so we have somewhere to discuss higher-order optimizers. Xref FluxML/Optimisers.jl#4. Good prior art might be https://github.com/nestordemeure/AdaHessianJax.
-
Now that FluxML/Flux.jl#1325 has landed, thoughts on what we can do with …
-
While working on ParameterSchedulers.jl, I ran into some complications implementing a scheduled optimizer. So, I decided now would be a good time to rethink Flux's optimizer interfaces. Interestingly, the current design is not too far off from Optimisers.jl or Optax. So the goal here is to figure out what the key differences/requirements are. Since my thoughts are somewhat sporadic, I figured it would be more helpful to present the proposed approach first. Anyone curious about my reasoning can keep reading to the bottom.
Proposed approach
Like I mention below, there is little difference in my mind between Flux's current approach and Optax. The key difference is how Optax deals with state: since the state is explicit, the calling functions get to manage it. This is what we need to change in Flux.
One starts by defining an abstract optimizer type. What this lets us do is generically define how to handle `Params` (something Optax does by assuming the use of `jax.tree_multimap`).

Here we see where the functional approach helps us by using explicit state. We can easily initialize an `IdDict` of optimizer state for each model parameter in `xs`. Later, when we get to `ScheduledOptim`, we'll see how this helps. Here, we also define the user-facing, high-level `update!` interface. `update!` is how we deal with non-`AbstractArray` structures.
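For concreteness, something along these lines (all names here are illustrative placeholders, not a final API):

```julia
using Flux: Params

abstract type AbstractOptimiser end

# Build an IdDict of per-parameter state for everything in `xs`.
init(o::AbstractOptimiser, xs::Params) = IdDict(x => init(o, x) for x in xs)

# User-facing, high-level interface: walk the parameters, apply the rule to each
# AbstractArray leaf, and write the update back in place.
function update!(o::AbstractOptimiser, xs::Params, gs, states)
    for x in xs
        gs[x] === nothing && continue
        states[x], dx = apply!(o, x, states[x], gs[x])
        x .-= dx
    end
    return states
end
```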
Next, I used `Momentum` as an example of an optimizer. The implementation is almost the same as Flux's current implementation, except that the velocity is now part of the state and not the struct. I used `NamedTuple`s for the state because they are like anonymous structs, which is essentially what state is. Also notice that `apply!` is still mutating (though the state is immutable), and an optimizer is implemented assuming `∇x` is an `AbstractArray`.
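For instance, a sketch of `Momentum` in this style (hypothetical code, not the actual Optimisers.jl implementation):

```julia
mutable struct Momentum{T} <: AbstractOptimiser
    eta::T   # learning rate
    rho::T   # momentum decay
end

# The velocity is part of the state, not the struct.
init(o::Momentum, x::AbstractArray) = (velocity = zero(x),)

function apply!(o::Momentum, x, state, ∇x::AbstractArray)
    v = @. o.rho * state.velocity + o.eta * ∇x
    ∇x .= v                      # apply! still mutates the gradient in place...
    return (velocity = v,), ∇x   # ...but the state itself is an immutable NamedTuple
end
```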
Next, we define `ScheduledOptim`. I used the anonymous function approach to address which hyperparameter to set. I don't think there's really any compelling way to do this, and the anonymous setter function seems like the most flexible. We could still define something like `lr`/`lr!` for common hyperparameters, and it would compose perfectly well with this approach.
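Concretely, `ScheduledOptim` could look something like this (a sketch; `schedule` is assumed to be callable on an iteration counter, and the counter itself lives in the explicit state):

```julia
struct ScheduledOptim{O, S, F} <: AbstractOptimiser
    optim::O            # wrapped optimizer
    schedule::S         # t -> hyperparameter value
    update_func::F      # (optim, value) -> set the chosen hyperparameter
end
ScheduledOptim(optim, schedule; update_func = (o, val) -> (o.eta = val)) =
    ScheduledOptim(optim, schedule, update_func)

init(o::ScheduledOptim, x::AbstractArray) = (t = 1, optim = init(o.optim, x))

function apply!(o::ScheduledOptim, x, state, ∇x)
    o.update_func(o.optim, o.schedule(state.t))
    optstate, ∇x = apply!(o.optim, x, state.optim, ∇x)
    return (t = state.t + 1, optim = optstate), ∇x
end
```

This is where the explicit state pays off: the schedule's step count is just another piece of state handed back to the caller.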
Unlike Optax, `ScheduledOptim` wraps another optimizer. The approach in Optax is to define `scale_by_schedule`, which can be part of an `Optax.chain`. This isn't a compelling approach to me. The way this works is that optimizers like `scale_by_adam` don't include any learning rate at all. This is unintuitive to me, since every optimizer paper will include a notion of a learning rate. So the "solution" in Optax was to pull the one parameter that they want to schedule out into its own optimizer. This basically means that none of the other hyperparameters can be scheduled.

Finally, just for completeness, I define another composition like `Flux.Optimiser` or `Optax.chain`.
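A sketch of that composition (here called `OptimiserChain` to avoid clashing with the existing `Flux.Optimiser`; details are illustrative):

```julia
struct OptimiserChain{T<:Tuple} <: AbstractOptimiser
    optims::T
end
OptimiserChain(os...) = OptimiserChain(os)

# Each rule gets its own slot in the state.
init(o::OptimiserChain, x::AbstractArray) = map(oi -> init(oi, x), o.optims)

# Thread the gradient through every rule in order, collecting the new states.
function apply!(o::OptimiserChain, x, states, ∇x)
    newstates = ()
    for (oi, si) in zip(o.optims, states)
        si, ∇x = apply!(oi, x, si, ∇x)
        newstates = (newstates..., si)
    end
    return newstates, ∇x
end
```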
Optimizer application
Currently, Flux's optimizers define a common function, `apply!(o, x, Δ)`, that updates `Δ` in-place to change the gradient step based on `x` and the state of `o`. This approach is nice for simplicity, composition, and memory efficiency. The main issue is that `apply!` is called on each model parameter every iteration of the innermost training loop. This becomes tricky when writing schedules as "just another optimizer," because the schedule must be set so that it accounts for how many times `apply!` is invoked over the course of training.

Explicit state
Currently, the signature of `apply!` is `apply!(o, x, dx)`; the state of `o` is stored internally, and it gets updated on each call to `apply!`. A more functional approach would treat `o` as a function that augments some `state` and `dx`. The signature would look like the sketch below.
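A sketch of that form, reusing the hypothetical `Momentum` rule from above (the name `apply` and the argument order are assumptions):

```julia
# apply(o, x, state, dx) -> (state', dx'), with nothing mutated
function apply(o::Momentum, x, state, dx)
    v = @. o.rho * state.velocity + o.eta * dx
    return (velocity = v,), v
end
```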
With this change, the updates to both `dx` and `state` by `dx'` and `state'` can be deferred (if `apply` is non-mutating). It releases control of the passage and evolution of state to the calling function.

Composing optimizers
If an optimizer is abstracted as a function `o(dx) -> Δdx` (i.e. the optimizer returns the change or transformation to `dx`), then Flux's `Optimiser` allows composition of a series of optimizers, `[o1, o2, o3]`, as follows.
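Roughly (with illustrative stand-in rules):

```julia
o1(dx) = clamp.(dx, -1, 1)   # e.g. clip
o2(dx) = 0.9 .* dx           # e.g. decay
o3(dx) = 0.01 .* dx          # e.g. scale by the learning rate

# Composing [o1, o2, o3] threads the gradient through each rule in turn.
chain(os...) = dx -> foldl((g, o) -> o(g), os; init = dx)

step = chain(o1, o2, o3)     # step(dx) == o3(o2(o1(dx)))
```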
This is exactly what `Optax.chain` does.

Another way to think of this is that optimizers are rules about gradient transformation, and compositions are rules about optimizer application. A key to making this possible is deferment. As seen above, deferment lets the calling function decide when to actually change the gradient, and the optimizer is just a rule that says how to change the gradient.
The way Optax guarantees this deferment is by each optimizer being a pair of functions: one for initialization and another for updating. My guess is that this is the cleanest way to approach this in Python + Jax, since they lack multiple dispatch. With multiple dispatch, I'd argue that the current `Flux.apply!` is exactly the same as an Optax `update_fn`. All we need to do is separate out the initialization. I don't even think mutability is that important here. At the end of the day, the operations will be done in-place (whether by construction like Flux or by optimization like Jax). Deferment is not guaranteed by immutability; it's guaranteed by the transformation function (`apply!`) not being called until the composition decides to.

At this point, I think we should have a brief tangent on hyperparameters before returning to the overall design.
Optimizer hyperparameters
This is the issue most relevant to ParameterSchedulers.jl. Every optimizer is a rule parameterized by a collection of hyperparameter variables that are used to transform a gradient. Naturally, this is best expressed in Julia as a struct whose fields are the hyperparameters.
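For example, a sketch:

```julia
mutable struct OptRuleX
    alpha   # e.g. the learning rate
    beta    # e.g. a decay term
end
```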
Here, `alpha` and `beta` are the hyperparameters. The issue becomes that the meaning of a hyperparameter and its access are detached. For example, if `alpha` corresponds to the learning rate (LR), and I know I am dealing with an `o::OptRuleX`, then I can get and set the LR with `o.alpha`. But when writing a function that operates on a generic optimizer, we don't know what field corresponds to the LR.

Potential solutions
Standardized field
This is a non-solution in my opinion, but it's one that I'll mention just to get past it. The main idea is that common fields like the LR have a standard name (e.g. `eta` in the current Flux optimizers). The drawbacks include, among others, generic code reaching for `o.eta` when that field doesn't exist.
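For instance (hypothetical helper), with the `OptRuleX` sketch above:

```julia
set_lr!(o, val) = (o.eta = val)      # assumes every optimizer calls it `eta`
# set_lr!(OptRuleX(0.1, 0.9), 0.01)  # errors: OptRuleX has no field `eta`
```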
Standardized interface functions
This is a reasonable solution. So, to be a "Flux optimizer," your optimizer struct should implement a standard interface function like `lr`/`lr!`. For example, we could write something like the following.
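A sketch, reusing the hypothetical types from the earlier code:

```julia
lr(o::Momentum) = o.eta
lr!(o::Momentum, val) = (o.eta = val; o)

lr(o::OptRuleX) = o.alpha
lr!(o::OptRuleX, val) = (o.alpha = val; o)
```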
This approach addresses the first two problems with the previous solution. But it still doesn't address the last issue of uncommon hyperparameters. A standard interface in Julia is only useful when it's defined in a single place for other packages to extend. This means that any time someone wants a new optimizer to compose well with current and future optimizers, they need to submit a PR to the common base to have their hyperparameter function added. So, an interface function only seems reasonable so long as the number of functions doesn't need to constantly increase.
Anonymous getter/setter functions
This approach is the one initially used by ParameterSchedulers.jl. It completely ignores the problem and leaves the solution up to downstream packages. For example, a scheduling package might accept an anonymous function as an argument that tells it how to set the optimizer parameter to the latest value. `ScheduledOptim.update_func` is a field that stores this function, and the `ScheduledOptim` uses it to set the wrapped optimizer parameter.
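In terms of the hypothetical `ScheduledOptim` sketched above, scheduling a different hyperparameter is just a different setter:

```julia
# Schedule the momentum term instead of the learning rate.
opt = ScheduledOptim(Momentum(0.01, 0.9), t -> min(0.99, 0.9 + 0.001t);
                     update_func = (o, val) -> (o.rho = val))
```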
The main advantage of this approach is that it is the most transparent to the user.

Hyperparameters are types
This is the approach used in FluxTraining.jl. The idea is that the type acts like a pseudo-reference to the hyperparameter field. For example, instead of passing `o.eta` to a scheduler, you would pass the `LearningRate` type. Then the scheduler calls `setparameter!(o, LearningRate, val)`, and the optimizer can extend `setparameter!` so that `o.eta` is updated. This suffers from similar issues to the standard interface, though it is slightly less cumbersome to extend in practice.
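A sketch of that pattern (hypothetical code; FluxTraining.jl's actual names may differ):

```julia
abstract type HyperParameter end
struct LearningRate <: HyperParameter end

setparameter!(o::Momentum, ::Type{LearningRate}, val) = (o.eta = val; o)

# A scheduler then only needs the type, never the field name:
# setparameter!(opt, LearningRate, 0.001)
```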