RFC: basic sketch of scheduling #15
Conversation
While going through PS, I felt there was some extra boilerplate needed for simple utilities, so I want to keep the API minimal. I think it's fair to say that if you want to do more than mess with the learning rate, you'll need to overload `next`.
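(For illustration only: a hedged sketch of what such an overload might look like, assuming `next(s, st)` receives the `(cursor, cursor_step)` state shown in the `apply` snippet below and that `Schedule` wraps a schedule function plus an optimiser; none of these names are final.)

```julia
# hypothetical: schedule Momentum's rho instead of the learning rate
# by overloading next for that wrapped optimiser type
function next(s::Schedule{<:Any, <:Momentum}, (cursor, cursor_step))
    Momentum(s.opt.eta, s.schedule(cursor))
end
```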
I think this approach looks good to me in general (it is what I would call the "scheduler"). AFAICT it follows a similar design to `Scheduler` from the original FluxML/Flux.jl#1506, so I am totally cool with the design proposed here. I left some more minor comments.

Just for clarity, I think there is a misconception about what ParameterSchedulers.jl does. None of the functionality in this PR is intended to be part of ParameterSchedulers.jl. You can think of ParameterSchedulers.jl as just providing standard `f`s to pass to `Schedule`.
src/schedulers.jl
```julia
function apply(s::Schedule, x, dx, st)
    schedst, optst = st
    cursor, cursor_step = schedst
    o = next(s, schedst)
    # ...
```
Instead of `next` here, I would suggest the following:

```julia
struct Schedule{F, O}
  schedule::F
  opt::O
end

# usage example 1
Schedule(f, eta -> Momentum(eta, 0.9))

# usage example 2
Schedule(f, (o, eta) -> Momentum(eta, o.rho))
```

Then in `apply`:

```julia
# ...
o = s.opt(s.schedule(cursor))
# ...
```
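(Spelled out, a minimal sketch of the whole `apply` under this suggestion; the state threading and return order here are assumptions based on the diff above, not the PR's final code.)

```julia
function apply(s::Schedule, x, dx, st)
    schedst, optst = st
    cursor, cursor_step = schedst
    # rebuild the wrapped optimiser from the scheduled value at the current cursor
    o = s.opt(s.schedule(cursor))
    # delegate the actual update to the wrapped optimiser (return order assumed)
    dx, optst = apply(o, x, dx, optst)
    return dx, ((cursor + cursor_step, cursor_step), optst)
end
```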
While I like the simplicity of the `apply`, having the function there seems verbose. You would also have to return things more carefully from the `schedule` to not get cryptic method errors.

Maybe `next` is poorly worded? We want to separate the step of generating a new optimiser (updating its fields and state) from the step of applying the scheduler.
I'm not sure what you mean by the method errors. Could you elaborate?

> We want to segregate the steps of generating a new optimiser to update the fields and state from the step of applying the scheduler.

I agree, but what I am suggesting is that instead of `next`, which restricts the step of generating a new optimizer to modifying the LR only, we use an anonymous function for this step. Verbosity can be avoided by defining:

```julia
Schedule(f, o::Momentum) = Schedule(f, eta -> Momentum(eta, o.rho))
```

It would be just as succinct as the current API in the case of the LR, and it would allow someone to schedule something other than the LR if necessary.
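(For instance, under this constructor style, scheduling Momentum's `rho` instead of the learning rate is just a different closure; the values here are purely illustrative.)

```julia
# fix the learning rate, schedule rho instead
Schedule(f, rho -> Momentum(0.01, rho))
```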
Why does it restrict it to lr?
You could give it any field to modify; this is just an example. I think it's fair to ask for a shorter syntax than overloading `next`, though.
`next` specifically updates `opt.eta`? You could of course add an argument to `next` to allow any field, but then it is a question of whether we prefer the closure way of specifying the field to update or the symbol way. I think the closure is much clearer, doesn't require the user to look up the struct field names, and is less opaque.
I think there is some confusion. Like I mention above, ParameterSchedulers.jl is not meant to provide the functionality in this PR. It provides various standard schedules. The main difference between the two appears to be in how a schedule is defined.

### Approach outlined in this PR

```julia
struct InvDecay{T}
  decay::T
end

# (start, step) -> consider step to be a function / vector of step sizes?
init(inv::InvDecay, x) = (1, 1)

(inv::InvDecay)(t) = 1 / (1 + inv.decay * t)
```

### Approach in ParameterSchedulers

```julia
struct Inv{T} <: AbstractSchedule
  decay::T
end

Base.getindex(inv::Inv, t) = 1 / (1 + inv.decay * t)
```

As you can see, there isn't really much of a difference, so I don't see the point about boilerplate. But this PR treats …

I would suggest looking at the interface requirements in https://darsnack.github.io/ParameterSchedulers.jl/dev/docs/interfaces/generic.html. It is exactly two functions and quite flexible. Implementing a new schedule requires no more boilerplate than what's being proposed here, and you unlock a lot more functionality/reuse than the approach outlined here. Take for example if my schedule function was

```julia
s = Sequence(schedules = [Loop(Lambda(f = t -> t^3 + 0.5 * t^2 - 0.1), period = 10), Exp(0.1)],
             step_sizes = [100, nepochs - 100])
```

In other words, there is no boilerplate required (you don't even need to define a new struct). And more importantly, you could use the proposed `Schedule` wrapper:

```julia
opt = Momentum()
scheduler = Schedule(s, opt)
```

PLUS you can still do

```julia
for (eta, epoch) in zip(s, 1:nepochs)
  # ...
end
```

and others. It's not impossible to do the same stuff with this PR, but it requires more boilerplate. You could reduce some of that boilerplate, but I feel it will still be more than ParameterSchedulers.jl, because it uses the optimizers interface, which isn't meant for scheduling policies.

On Slack, there was some concern over the abstract types in ParameterSchedulers.jl. I agree it is better to avoid abstract types just for the sake of defining a type hierarchy. But that's not their role in ParameterSchedulers.jl. They are actually used for dispatch to reduce boilerplate code. I can look into whether they can be eliminated. I would prefer no abstract types, but I think they are required.
> Also mentioned on Slack, how do we implement time step state consistently for a group of parameters?

There is no need to worry about this. As long as we define …
Maybe something is misunderstood. This is currently a minimal sketch. We would expect to flesh out details about the examples you suggest. You don't need to overload …
I would say that the interface you suggest is possible. Not to mention, this isn't complete, so I would not comment on its current functionality as being its final form. Why do you think we wouldn't be able to reuse code? Btw, I was pointing to the use of `startval`, `endval`, `cycle`, etc. in my previous comments; apologies, I should have been clearer.
I would typically not want syntax like this, but would replace it with a generator or an existing iterator from Base to get this behavior.
My point exactly. It seems weird to me that some structs in Optimisers.jl will implement …
Yes, I totally agree that it is all possible in this PR. I am not trying to suggest it isn't possible. If it isn't clear, I'm trying to highlight the fact that all the infrastructure is already implemented in ParameterSchedulers.jl. Why are we trying to reimplement the same functionality all over again?
Those are optional interfaces. The minimal interface requirements (if you read the page I linked) are just a single function provided by Base. The optional interfaces are meant to reduce boilerplate code across "decay" type schedules and "cyclic" type schedules. If you look at the cyclic optional interface, there is a common bit that all "cyclic" type schedules share. The optional interface allows a user to avoid having to repeat that common bit over and over for every different kind of cyclic schedule and just define the part that's unique. The purpose of the optional interfaces is to reduce boilerplate code.

Just as a note, I didn't arbitrarily decide to make optional interfaces for "decay" and "cyclic" types. Those types are based on published papers on scheduling policies. I didn't make up those definitions; they are defined in the literature. We can remove the …
I am trying to steer the conversation into finding out what building blocks we need to support to write proper scheduling routines. This isn't about PS and its choices. I don't understand why it should be that way either.
I agree 100%. I guess the earlier comments about PS confused me into thinking we were conflating the two. In which case, aside from my review comments, I think the design for applying schedules to optimizers presented here is a good one. In my view it is the correct abstraction, so we can just discuss the finer points in the comment threads. As for designing the schedules themselves (e.g. …
Thinking about this a little more, and trying to be more concrete about my thoughts:

First, I think that the schedules (i.e. policies like …) …

Second, I think that the schedules should be defined by simply "value of …" …

Third, I do not like …

```julia
struct Throttle{S, T<:Integer}
  schedule::S
  rate::T
end

(s::Throttle)(t) = s.schedule(s.rate * t)
```

Starting at a specific …

```julia
struct Inv{T}
  scale::T
  decay::T
end

(s::Inv)(t) = s.scale / (1 + s.decay * t)
```

Fourth (this ties into my previous point), the schedule policies should not be tied to optimizers only. These are generic annealing policies that can be used in a variety of contexts and domains. This is why I feel that the appropriate location for them is not in Flux or Optimisers.jl. Flux/Optimisers should be one specific user of these schedule policies (arguably the most important user from our perspective), but we shouldn't design everything as if we are the only consumers of the package. That's why I don't like relying on …
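(As a usage illustration of the `Inv` and `Throttle` structs sketched in the comment above; the numeric values are arbitrary.)

```julia
s = Inv(1.0, 0.5)       # scale / (1 + decay * t)
s(4)                    # 1.0 / (1 + 0.5 * 4) == 1/3

fast = Throttle(s, 2)   # evaluates the wrapped schedule at 2t
fast(4)                 # == s(8) == 1.0 / (1 + 0.5 * 8) == 0.2
```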
So now we should be able to write things like …
I am also working on getting things like vectors of scheduling functions working.
I think it might be reasonable to move scheduling functions from PS into here (or a separate Schedulers.jl). I'm thinking the design should allow for a scheduler to either be passed into a callback and produce the optimiser state correctly, or be linked into the update step. Having it as a separate call is not particularly useful (we allow it still). Maybe if we want to chuck out the new optimiser and schedule state, that would be a way forward, but that forces the presence of a schedule to begin with, which is a strong assumption in a training loop IMO.
Or just keep them in PS and use them here and elsewhere (that would be equivalent to having a separate Schedulers.jl, no?)
For things like …, people can still write …
Also already implemented in PS. I just want to be clear that I am not trying to force PS. I am just suggesting that a lot of completed work is being redone.
Can you explain this a bit more? I didn't quite follow what you were getting at. I'm not against a schedule being passed as a callback, but I think the …
Have you seen the darsnack/rm-optim branch of PS? It incorporates almost all of your suggestions, and I think it matches what you are implementing here w.r.t. the schedule policies. The …
> The reason I want to hook in to the …

The reason it is being done here is simple. This is where some of the infra is written, and PS is thought of as a collection of scheduling functions, so it can consume this API. If we are to use PS, and go through the trouble of transferring and so on, I would also want to ensure the code structure is such that PS simply consumes the API defined here. (I am deliberately trying to make them compatible.) I also don't like the collection of the generators; it makes things slower than they should be.
Right, and by extension, the …
I completely agree. I'm only trying to point out that you don't need all the schedules to also hook into …
I agree, that's why the updated …
Sure, I have no problem with that. Though in general, for most schedules, … Unless you meant some other functionality from …
So the implementation of Sequence is essentially this, right? Sure, scheduling functions are infinite (as long as they are functions), but Sequence is still used to have a composite scheduling policy. Re …
Sequence is just a generalised …
Not just arbitrary functions. Almost all common schedules like exponential decay, inverse decay, cosine annealing, cyclic LR, etc. are infinite iterators.
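(Just to illustrate the "infinite iterator" view with plain Base tools, nothing PS-specific; `invdecay` here is a stand-in schedule.)

```julia
# any callable schedule can be viewed as an infinite iterator over t = 1, 2, ...
invdecay(t) = 1 / (1 + 0.5t)
etas = Iterators.map(invdecay, Iterators.countfrom(1))   # never terminates on its own
collect(Iterators.take(etas, 3))                         # ≈ [0.667, 0.5, 0.4]
```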
Then …
This isn't necessarily a sequence. This could read as …
Also, sequences aren't the only composite schedule. Looping is a composite schedule. Throttling/accelerating is a composite schedule. It seems weird that sequences get the special …
Those are handled as usual, so I guess it's fine to table that
You totally could; the change only has to be handled internally. Nothing changes from the consumer's point of view, be that PS or an end user writing a custom scheduler or training loop.
Could you write an MWE? The current implementation of this or PS should not have access to state beyond that of the scheduler. But if you want to mess with a parameter, we can totally add that too. The optimiser is in its own little space in this abstraction; what you're modifying is of no concern here.
Currently, with the proposed interface, I can write the following:

```julia
s = Inv(...)
for t in 1:100
    # some code
    my_object.param = s(t)
    # some other code
end
```

If you define `s = Sequence(Inv(...) => 50, Exp(...) => 50)`, the same pattern still works:

```julia
s = Sequence(Inv(...) => 50, Exp(...) => 50)
for t in 1:100
    # some code
    my_object.param = s(t)
    # some other code
end
```
I work on spiking neural networks. Being another learning system, these networks have various hyper-parameters. I want to schedule them the same way you might schedule the learning rate in deep learning. The proposed syntax of …

Another example from my work: in adversarial ML, there are various hyper-parameters for the attack algorithm. I want to schedule these hyper-parameters. They don't belong to any optimizer, so the syntax …
I think there is a misunderstanding; I am proposing that the …

Re Sequence, you can do that if you track the state as well, which we can technically hide from user code, but I don't know if it's necessary. Rather than give it infinite schedules like Inv, you can give it a generator? I guess we can add sugar to it to make it easier to write, but I'm hoping for one of two things: either there is a regular function in the vector of schedules to finish the remainder of the training (or we just add an identity block at the end), or we break the training loop once the schedules are exhausted. I think we can handle both, with some tweaks.
I think I understand the point about spiking nets and why you might shy away from an explicit state. In those cases, if you're opting in to scheduling, we can also store the state internally, but I am not sure if it's a deal breaker.
Or you could just use an immutable version of …
Having something like …

Just to be clear, I support your proposal here for …
The branch you mention does seem to reduce code bloat. While you're at it, I would also definitely remove the need for defining …

I do think Sequence isn't the biggest deal breaker, since you could get that effect in a loop anyway. It doesn't seem to belong in PS imo.
I can do that. These are part of the optional iteration interface in Base. I don't really know how they are used in practice though.
I feel like all composite schedules belong with the basic schedules. For example, having Sequence be like …
Right, exactly. These building blocks belong here, and the consumption in PS.
But none of those blocks relate specifically to optimizers? They are just another schedule. Ergo, the schedule package makes the most sense to me.
Going back to my examples with the spiking neural network. Why should I need to use Optimisers.jl/Flux.jl AND ParameterSchedulers.jl when all I need is related to hyper-parameter schedules and fits neatly in PS? I don't need nor want anything related to optimization.
@darsnack (whose ParameterSchedulers.jl has implemented a bunch of these, and I tried to understand its different design choices): one concern that I do have is that we don't have much of a view into the running stats of a training loop in this abstraction.
We also want to generally eliminate a state as part of the scheduler, which adds an extra line to the apply function. We generally have a decent view of how to implement a bunch of different routines with this. It allows a few different kinds of callables, and there's usually little boilerplate.
It would look something like

```julia
Schedule(InvDecay(0.5f0), ADAM())
```

or

```julia
Schedule(sine(0.5f0), opt)
```
We would also want to think of standard schedules to introduce. We can customise behaviour via `apply` and `next`, which gives a clean separation of concerns.
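(A hedged sketch of how that might sit in a loop, assuming the `init`/`apply` state-threading from the diff above; the return order of `apply` and the exact constructors are assumptions, not settled API.)

```julia
# hypothetical usage, assuming apply(o, x, dx, st) -> (dx, st)
opt = Schedule(InvDecay(0.5f0), ADAM())
x   = randn(Float32, 10)
st  = init(opt, x)                  # combined (scheduler state, optimiser state)
for t in 1:100
    dx = 2 .* x                     # stand-in gradient of sum(abs2, x)
    dx, st = apply(opt, x, dx, st)  # scheduled learning rate is applied here
    x .-= dx
end
```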