Freezing layers at model construction time #1931
The functional approach that you link to is discussed a bit in FluxML/Optimisers.jl#49. The final solution will probably be something like your description: walking the model and marking certain leaf nodes as frozen, either in an auxiliary structure or by returning a wrapped model. I think we have to support some way to do this after construction, since there are many cases where you don't have access to the model's construction. Depending on how freezing lands with the functional approach, I see two ways of offering freezing at construction:
Even if we don't include this in Flux, I think it's worth documenting so that other users can benefit.
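Here is a rough sketch of the post-construction route described above (walking the model and returning a wrapped one). `Frozen` is the wrapper proposed elsewhere in this thread, and `freeze`/`should_freeze` are hypothetical names, not an existing API:

```julia
using Flux, Functors

# Wrap every sub-layer matching a predicate; everything else is walked and left as-is.
freeze(m, should_freeze) = fmap(Frozen, m; exclude = should_freeze)

m = Chain(Dense(2, 3, relu), Dense(3, 1))
m_frozen = freeze(m, x -> x isa Dense)   # every Dense ends up wrapped in a Frozen
```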
Thanks for the link, that's an interesting approach. And good point re: often not having access to model construction, I certainly agree that having a way to freeze layers post-construction is useful. I do feel that semantically, a layer being frozen should probably be thought of as an attribute of the model layer, as opposed to an attribute of e.g. the optimizer state. Though of course this is debatable, since "being frozen" really only makes sense in a training context (the whole model is "frozen" during inference, after all).
This is intriguing, but would you not still have to store some information about which layers are frozen in the model itself? E.g. by having …
Yes, if the way we do freezing is through some wrapper on the model itself. And in that case, the macro would end up just being more complicated than writing the wrapper directly. If the way we do freezing is through auxiliary information passed to the optimizer, then this macro would be a convenient way to offer freezing syntax on construction. Of course, we want to minimize the amount of "stuff" that you need to pass around, so the final ergonomics will be one thing we'll consider as we evaluate our options.
To permanently freeze something, I think it should be enough just to exclude all fields from `trainable`. So the definition could look like:

```julia
struct Frozen{T}; layer::T; end

Flux.@functor Frozen
Flux.trainable(f::Frozen) = NamedTuple()

(f::Frozen)(x) = f.layer(x)
```
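As a quick sanity check (not from the original thread, and the layer sizes are arbitrary), the `trainable` override above does keep the wrapped parameters out of `Flux.params`:

```julia
julia> using Flux

julia> m = Chain(Frozen(Dense(1, 1)), Dense(1, 1));

julia> length(Flux.params(m))   # only the second, unwrapped Dense contributes (weight + bias)
2
```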
This was what I tried initially. It actually fails the tests in my original post (see the "Examples/tests" dropdown). The reason is that non-zero gradients would still be computed for frozen layers, when ideally the gradient with respect to the frozen layer would be `nothing`:

```julia
julia> m = Frozen(Dense([2.0;;]));

julia> g = gradient(m -> sum(m([1])), m)[1]
(layers = (weight = [2.0;;], bias = Fill(1.0, 1), σ = nothing),)
```

If you use something like …

Maybe it wouldn't be an issue in practice if …
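One way to get that ideal behaviour is to intercept the call with a custom rule: differentiate through the wrapped layer with respect to the input only, and report no tangent for the wrapper itself. This is only a sketch using `ChainRulesCore`, not the `whitebox_apply` mentioned later in the thread (which isn't shown here):

```julia
using ChainRulesCore

function ChainRulesCore.rrule(config::RuleConfig{>:HasReverseMode}, f::Frozen, x)
    # Differentiate through the wrapped layer via the AD backend...
    y, inner_back = rrule_via_ad(config, f.layer, x)
    # ...but drop the tangent for the layer, keeping only the tangent for the input.
    frozen_back(Δ) = (NoTangent(), last(inner_back(Δ)))
    return y, frozen_back
end
```

With a rule along these lines, the gradient in the snippet above comes back as `nothing` rather than a populated NamedTuple.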
Oh I'm sorry, I clearly didn't read the whole thing. What's gained by demanding a simpler gradient? The same …

Placing several steps inside the frozen branch, here's what I see:
Assuming we get thunk support soonish (though having it at the top level might be more challenging), I suppose it comes down to a more philosophical discussion around how one envisions the "training loop" to work. e.g. JAX frameworks using https://github.com/deepmind/optax follow a similar model to the linked Optimisers PR: that is, freezing parameters means not propagating their gradients in the optimizer step. In contrast, the proposal here imagines gradient flow not permeating the AD boundary at all. Both are valid and I don't think it's a matter of either-or. However, we've purposefully held off from introducing these kinds of non-control flow "utility" layers in Flux for a multitude of reasons, and I'm hesitant to start now unless the need is so overwhelming (e.g. there's no way to do this in "userspace" and it constitutes a plurality of recent feedback comments about Flux) and we're unable to provide a compelling alternative in a reasonable amount of time.
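For concreteness, the optimizer-side model looks roughly like the sketch below. It assumes the `Optimisers.freeze!` API that the linked PR discussion points toward, which may not have existed when this comment was written; gradients are still computed for every parameter, but the marked branch of the optimizer state is skipped at update time:

```julia
using Flux, Optimisers

m = Chain(Dense(2, 3, relu), Dense(3, 1))
st = Optimisers.setup(Optimisers.Adam(), m)
Optimisers.freeze!(st.layers[1])          # assumed API: mark the first layer's state as frozen

x = rand(Float32, 2)
g = gradient(m -> sum(m(x)), m)[1]        # gradients still flow through every layer
st, m = Optimisers.update!(st, m, g)      # ...but the frozen layer's parameters are left unchanged
```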
@mcabbott I actually haven't thought about whether this would be faster; I imagine you are correct that it wouldn't be, with the right optimizations. In fact, the various partials are probably required to differentiate through the layer anyway. My point is more about correctness. I think that, for consistency, a model wrapped in `Frozen` should behave like any other constant callable:

```julia
julia> m = abs2; # arbitrary function

julia> gradient(m -> sum(m, [1.0]), m)[1] === nothing
true

julia> m = Frozen(Dense(2,4)); # frozen model

julia> gradient(m -> sum(m, [1.0]), m)[1] === nothing
false # true if using e.g. `whitebox_apply`
```

@ToucheSir This is a completely valid reason, of course. My argument for including a layer like this would essentially be for user convenience, as well as the fact that it's somewhat tricky to implement on your own while handling all the edge cases. For example, how should tied weights figure into this conversation? E.g. what should the behaviour be if you have both a layer …
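To make the tied-weights question concrete, here is a hypothetical setup (not from the thread) where the same matrix `W` is used both by a plain `Dense` and, transposed, by a `Dense` inside a `Frozen` wrapper; whether gradients should reach `W` through the frozen branch is exactly the ambiguity being raised:

```julia
using Flux

W = rand(Float32, 3, 2)
encoder = Dense(W)                                        # 2 => 3, trainable
decoder = Frozen(Dense(transpose(W), zeros(Float32, 2)))  # 3 => 2, frozen, but shares W
model = Chain(encoder, decoder)
```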
Tied weights are a problem for all container layers, so in that sense I don't think there's much we could do for a specific layer like `Frozen`. As for the value of the layer itself as a standard user convenience, I think it'd be good to get the opinion of one or more of @darsnack, @CarloLucibello or @lorenzoh since they've done a lot more work in the trenches with non-trivial Flux models than I have.
There have been several issues/PRs related to freezing model parameters:
Right now, the recommendation made in the documentation is to manually specify which parameters should not be trained, using some combination of `Flux.params` and `Zygote.delete!`.
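That approach looks roughly like this (the model and the choice of layer to freeze are just for illustration):

```julia
using Flux

m = Chain(Dense(2, 3, relu), Dense(3, 1))
ps = Flux.params(m)
delete!(ps, m[1].weight)   # exclude the first layer's parameters from training
delete!(ps, m[1].bias)

gs = gradient(() -> sum(m(rand(Float32, 2))), ps)   # gradients are tracked only for the remaining parameters
```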
While this works, it is somewhat inflexible in several respects:

- `Params` would no longer be used at all (would one need to e.g. `fmap` over a model and somehow mark specific layers as frozen before passing to `gradient`?)

For these reasons, I often find myself defining a `Frozen` layer (similar to #1001) which looks something like this:
A frozen layer `l::Frozen` wraps a functor `f` and has two properties:

1. `l(x) = f(x)` is differentiable with respect to `x` (as opposed to e.g. `l(x) = dropgrad(f(x))`, which would treat `f(x)` as constant).
2. `f` is treated as a constant functor: gradients of `l(x)` with respect to parameters internal to `f` return zero.

Below is some test code to illustrate how this layer should behave:
Examples/tests
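The dropdown's contents aren't reproduced above; a minimal sketch of tests matching the two properties, assuming a `Frozen` whose gradients behave as described (i.e. with some gradient-dropping rule rather than only a `trainable` override), could be:

```julia
using Flux, Test

f = Dense([2.0;;])   # 1 => 1 layer with weight 2.0
l = Frozen(f)

# Property 1: l(x) = f(x) is still differentiable with respect to the input x.
@test gradient(x -> sum(l(x)), [1.0])[1] == [2.0]

# Property 2: f is treated as a constant functor, so the gradient w.r.t. l is nothing.
@test gradient(m -> sum(m([1.0])), l)[1] === nothing
```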
If there is interest in including a layer like `Frozen` in Flux, I would be happy to make a PR. Of course, if there is an easy way to do what I'm describing which I have overlooked, please do let me know and I'll close this issue.