Fix some issues with Zeros #1374
Conversation
Add reshape implementation for Zeros
Love seeing the green CI!
src/functor.jl (outdated)
```
@@ -78,6 +78,7 @@ gpu(x) = use_cuda[] ? fmap(CUDA.cu, x) : x
adapt_storage(T::Type{<:Real}, xs::AbstractArray{<:Real}) = convert.(T, xs)

paramtype(T::Type{<:Real}, m) = fmap(x -> adapt(T, x), m)
adapt(::Any, x::Zeros) = x # So that we for example don't accidentally end up with layers having a scalar for bias
```
Is the `Any` correct here?
Also, does this actually materialise? You would want to use `adapt_structure` here. Thanks to Tim for the pointer!
Yes, will change to `adapt_structure` following Tim's advice!

`::Any` instead of just `T` or `x` is basically just because VS Code marks unused variables. Should I change it to `T`, or did you mean that there is a more constrained type to use here?
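Below is a minimal sketch of the change being discussed, assuming Adapt.jl's `adapt_structure` hook is the intended entry point (not necessarily the PR's final code):

```julia
using Adapt
using Flux: Zeros

# Keep the no-bias marker as-is so e.g. f32/f64/gpu never materialise it into a scalar or array.
Adapt.adapt_structure(to, x::Zeros) = x
```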
test/utils.jl (outdated)
```
@@ -129,6 +129,12 @@ end
  @test eltype(f32(f64(m))[1].W) == Float32
end

@testset "Zeros" begin
  m = Dense(randn(2,3), Zeros())
  @test f64(m).b === m.b
```
Is this due to the `Varargs` case being caught in the method above?
Not sure I follow. It is a test that `Zeros` does not materialize from `f64`. I'll add `=== Zeros()` at the end for now to make the intention clear.
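For reference, a hedged sketch of how the clarified test could look, using the identifiers from the diff above and assuming `Zeros()` compares `===` to itself because it is an immutable marker:

```julia
using Flux, Test
using Flux: Zeros

@testset "Zeros" begin
  m = Dense(randn(2, 3), Zeros())
  @test f64(m).b === m.b      # precision conversion must not materialise the bias
  @test f64(m).b === Zeros()  # make the intention explicit
end
```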
```
op = bias(ip)
@test sum(op) ≈ 0.f0
gs = gradient(() -> sum(bias(ip)), Flux.params(bias))
@test gs[bias.bias] === nothing
```
Need to test this with CUDA as well
I guess you are referring to the test that the gradient w.r.t. Zeros is nothing, right? Will add it right away!
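A hedged sketch of what the CUDA variant could look like, reusing the `Dense`-with-`Zeros` form shown elsewhere in this thread (assumes a working CUDA device; field names are the PR-era `W`/`b`):

```julia
using Flux, CUDA, Test

m  = Dense(ones(Float32, 3, 3), Flux.Zeros()) |> gpu
ip = CUDA.ones(Float32, 3, 1)

@test sum(m(ip)) ≈ 9.0f0   # no bias is added
gs = gradient(() -> sum(m(ip)), Flux.params(m))
@test gs[m.b] === nothing  # the Zeros bias should get no gradient on the GPU either
```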
Btw, I had a look at how FillArrays.jl does reshape. Quite a few methods, but nothing insurmountable, and maybe not all of them are needed just for Zeros. Was there ever a discussion about using (or recommending users to use) FillArrays.Zeros instead of rolling our own version?
Another thought: should we e.g. do the following?

```julia
adapt_structure(::Type{<:AbstractArray{T}}, x::Flux.Zeros) where T = adapt_structure(T, x)
adapt_structure(::Type{T}, x::Flux.Zeros{T}) where T = x
adapt_structure(T, x::Flux.Zeros) = Flux.Zeros(T, size(x)...)
```

The drawback is that there will then be some inconsistency between layers created through e.g. …
The reason we have our own …
Yeah, thought it had something to do with type-piracy-ish stuff. I'm almost done, but I'm trying to squeeze in some testing with optimizers as well and running into some issues with …
Hmmm, somehow it is only Conv layers which get no gradient for Zeros; all other ways they are involved seem to result in gradients being created:

```julia
julia> ll = Dense(ones(Float32, 1,1), Flux.Zeros())
Dense(1, 1)

julia> gs = gradient(() -> sum(ll(ones(Float32, 1,1))), params(ll))
Grads(...)

julia> gs[ll.b]
0-dimensional FillArrays.Fill{Float32,0,Tuple{}} = 1.0

julia> ll.b
0-dimensional Flux.Zeros{Bool,0}:
 0
```

Edit: It's because of this adjoint:

```julia
@adjoint reshape(xs::Zeros{T}, dims...) where T =
  reshape(xs, dims...), _ -> nothing
```

which is only used with conv layers. I guess we need adjoints for +, - and * too, right?
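To make the point above concrete, here is a hedged sketch (not the PR's actual code) of the kind of extra adjoint a Dense layer would need, so that the broadcasted `+` with a Zeros bias never produces a gradient for it; it assumes the broadcast output has the same shape as `a`:

```julia
using Flux, Zygote
using Flux: Zeros
using Zygote: @adjoint
import Base.Broadcast: broadcasted

# The pullback returns one entry per argument (the `+` function itself, `a`, `b`);
# only `a` gets a gradient back, the Zeros marker gets `nothing`.
@adjoint broadcasted(::typeof(+), a::AbstractArray, b::Zeros) =
  a .+ b, Δ -> (nothing, Δ, nothing)
```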
Ugh, avoiding ambiguity while still catching all permutations of Zeros turned out to be a fair bit more painful than I thought. I could not come up with a simpler way than more or less implementing every permutation. I still need to implement the non-broadcasted ops and test the …
We should move this discussion elsewhere, but I think it would be quite natural for FillArrays.Zeros and FillArrays.Ones to have zero gradients since they are meant to be constants. We should really consider switching to FillArrays types given the amount of extra bloat having our own Zeros produces.
Another thought is: might it be simpler not to do this at all? Just store …
@CarloLucibello I guess Zygote could do the adjoints for it without it being piracy. I can't wrap my head around whether it will cause problems to someone somewhere if there are never gradients for FillArrays. I can think of cases where there is inconvenience, but I'm not sure there are any real showstoppers. If new AD means Zygote is reaching EOL then perhaps it can be done as an experiment to see what complaints come in :)

@mcabbott I was thinking along similar lines. It might be easier to just make something which only tries to solve the bias on/off case. Another option is to have some … I'm not sure if it is desirable to have a turned-off bias in params or not. Otherwise the same thing can perhaps be done more explicitly with Zeros as well. I remember being a bit surprised to see it there, but have in the back of my head that there was a good reason for it. If it needs to be in params then I think Zygote will return gradients for false too (see below).

About the adjoint overloading in the last commit: the core issue afaik is that Zygote has defined adjoints for broadcasting with numerical arrays (and scalars), so even if the primal op does not use a variable the gradient for it will be created. Could this be fixed in Zygote with reasonable effort? A hack like checking if the result is identical to one of the arguments is one way ofc...

Also, after getting some sleep on it, there is a chance I can get rid of the adjoints and instead specialize on …
My foggy recollection was that the original reason for Zeros was to avoid needing a special case in one or two …

Have only skimmed the issues on loading/saving & params, but it sounds like the goal is to ignore absent biases, but …
No disagreement on my side there. This endeavour turned out to be quite the rabbit hole :) One option is that I just open another PR where I try to do the …
Definitely shouldn't change the behaviour of FillArrays types, and we should keep the forward passes untouched since this is orthogonal to how those are supposed to function. Seems like this needs the correct adapt rules (check out …)
We had considered the …
@DhairyaLGandhi I will look into the CUDA.Adaptor. I could not find it when browsing the code on GitHub, but I will search once I'm back in front of the computer. My thinking right now is that the specialized …

I looked through the issue for the param loading case: #1277. It seems that at least the issue with Zeros turning into Arrays is solved by this PR. I will look at tests for …

One obvious failure case is if someone tries to add a bias to a non-bias layer with …
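A hedged sketch of the param-loading round trip referred to above (#1277), under the assumption that the `Zeros` bias is simply skipped rather than overwritten with an array (field names are the PR-era `W`/`b`):

```julia
using Flux, Test

m1 = Dense(randn(Float32, 2, 3), Flux.Zeros())
m2 = Dense(randn(Float32, 2, 3), Flux.Zeros())

Flux.loadparams!(m1, Flux.params(m2))  # copy parameters from m2 into m1

@test m1.W == m2.W
@test m1.b === Flux.Zeros()  # the no-bias marker is left untouched, not turned into an Array
```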
@DhairyaLGandhi I looked up CUDA.Adaptor in the code, but I could still not determine if it had any other purpose than moving stuff to the GPU, which is what we want to avoid altogether here, right? Is this sufficient to tell if it is handled correctly?

```julia
julia> cu(Flux.Zeros())
0-dimensional Flux.Zeros{Bool,0}:
 0

julia> cu(Flux.Zeros(2))
2-element Flux.Zeros{Bool,1}:
 0
 0
```
Adding fixes and tests for …

There are two ways one can view the act of disabling bias:

…

I think the current approach with … For …
Ok, I think functionality-wise this PR is now ready as far as I can tell. I will start drafting another version where Zeros is not an AbstractArray and submit that as a separate PR.
1379: Fix some issues with Zeros option 2 r=CarloLucibello a=DrChainsaw

Fixes #1332, #1277

This is take two on fixing the above issues and is intended to be mutually exclusive to #1374. In this version Zeros is no longer an AbstractArray and is thus more like a there-is-no-such-parameter marker instead of a this-parameter-is-an-immutable-array-of-zeros type.

I made the change so that the advertised way of disabling bias is through `bias=false` rather than `bias=Zeros()`. `Zeros` itself is a lot less capable here compared to the previous attempt, but that also keeps the number of moving parts down, i.e. no need to test that both the 0-dim version and the full-size version work the same. All in all there are fewer specializations compared to #1374.

I think that a rename could definitely be in order. Better names I can think of are `bias=Off`, `bias=None` or, as suggested by @mcabbott, `bias=False()`. The best argument against renaming I can think of is to be a little bit less breaking.

### PR Checklist

- [X] Tests are added
- [X] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).

Co-authored-by: DrChainsaw <Christian.kyril.skarby@gmail.com>
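A hedged usage sketch of the API advertised in #1379, assuming the keyword landed as described there:

```julia
using Flux

m = Dense(3, 2; bias = false)   # layer constructed without a trainable bias
y = m(randn(Float32, 3, 5))     # forward pass works as usual, just with no bias added
```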
Redundant after #1379
Fixes #1332, #1277

Two things were causing trouble:

- `adapt` (used to change the type of parameters) turning `Zeros()` into `T`s
- `reshape` (used in conv layers) turning `Zeros` into `Base.ReshapedArray`s

Both are addressed by specializations.
I'm not sure what the right way to overload reshape is. Leaving the type out for the second argument leads to ambiguity, so there are still more than a handful of ways to reshape Zeros which lead to issue 2). Any packages which do it that I could look at?
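For illustration, a hedged sketch of the kind of reshape specialization described above (not the PR's exact method list), keeping the result a `Zeros` instead of a `Base.ReshapedArray`:

```julia
using Flux: Zeros

# Reshaping the no-bias marker just yields a marker with the new size.
# (The actual PR needs several more signatures to dodge method ambiguities.)
Base.reshape(x::Zeros, dims::Base.Dims) = Zeros(eltype(x), dims...)
```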
I was too lazy to come up with a nice way to check dimensions of parameters when creating a layer.
Furthermore, a number of adjoints are defined to ensure that the gradient of any `Zeros` is `nothing`. Also fixes a few issues with `loadparams!` and `destructure` in the presence of `Zeros`.

### PR Checklist

- Final review from `@dhairyagandhi96` (for API changes).