Fix some issues with Zeros #1374
Conversation
Add reshape implementation for Zeros
Love seeing the green CI!
src/functor.jl (outdated)
```
@@ -78,6 +78,7 @@ gpu(x) = use_cuda[] ? fmap(CUDA.cu, x) : x
adapt_storage(T::Type{<:Real}, xs::AbstractArray{<:Real}) = convert.(T, xs)

paramtype(T::Type{<:Real}, m) = fmap(x -> adapt(T, x), m)
adapt(::Any, x::Zeros) = x # So that we for example don't accidentally end up with layers having a scalar for bias
```
Is the `Any` correct here?
Also, does this actually materialise? You would want to use `adapt_structure` here. Thanks to Tim for the pointer!
Yes, will change to `adapt_structure` following Tim's advice!

`::Any` instead of just `T` or `x` is basically just because VS Code marks unused variables. Should I change it to `T`, or did you mean that there is a more constrained type to use here?
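Below is a minimal sketch of the change being discussed, assuming Adapt.jl's `adapt_structure` hook is the intended entry point (not necessarily the PR's final code):

```julia
using Adapt
using Flux: Zeros

# Keep the no-bias marker as-is so e.g. f32/f64/gpu never materialise it into a scalar or array.
Adapt.adapt_structure(to, x::Zeros) = x
```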
test/utils.jl (outdated)
```
@@ -129,6 +129,12 @@ end
  @test eltype(f32(f64(m))[1].W) == Float32
end

@testset "Zeros" begin
  m = Dense(randn(2,3), Zeros())
  @test f64(m).b === m.b
```
Is this due to the `Varargs` case being caught in the method above?
Not sure I follow. It is a test that `Zeros` does not materialize from `f64`. I'll add `=== Zeros()` at the end for now to make the intention clear.
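For reference, a hedged sketch of how the clarified test could look, using the identifiers from the diff above and assuming `Zeros()` compares `===` to itself because it is an immutable marker:

```julia
using Flux, Test
using Flux: Zeros

@testset "Zeros" begin
  m = Dense(randn(2, 3), Zeros())
  @test f64(m).b === m.b      # precision conversion must not materialise the bias
  @test f64(m).b === Zeros()  # make the intention explicit
end
```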
```
op = bias(ip)
@test sum(op) ≈ 0.f0
gs = gradient(() -> sum(bias(ip)), Flux.params(bias))
@test gs[bias.bias] === nothing
```
Need to test this with CUDA as well
I guess you are referring to the test that the gradient w.r.t. Zeros is nothing, right? Will add it right away!
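A hedged sketch of what the CUDA variant could look like, reusing the `Dense`-with-`Zeros` form shown elsewhere in this thread (assumes a working CUDA device; field names are the PR-era `W`/`b`):

```julia
using Flux, CUDA, Test

m  = Dense(ones(Float32, 3, 3), Flux.Zeros()) |> gpu
ip = CUDA.ones(Float32, 3, 1)

@test sum(m(ip)) ≈ 9.0f0   # no bias is added
gs = gradient(() -> sum(m(ip)), Flux.params(m))
@test gs[m.b] === nothing  # the Zeros bias should get no gradient on the GPU either
```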
Btw, I had a look at how FillArrays.jl does reshape. Quite a few methods, but nothing insurmountable, and maybe not all of them are needed just for Zeros. Was there ever a discussion about using (or recommending users to use) FillArrays.Zeros instead of rolling our own version?
Another thought: should we e.g. do the following?

```julia
adapt_structure(::Type{<:AbstractArray{T}}, x::Flux.Zeros) where T = adapt_structure(T, x)
adapt_structure(::Type{T}, x::Flux.Zeros{T}) where T = x
adapt_structure(T, x::Flux.Zeros) = Flux.Zeros(T, size(x)...)
```

The drawback is that there will then be some inconsistency between layers created through e.g. …
The reason we have our own …
Yeah, thought it had something to do with type-piracy-ish stuff. I'm almost done, but I'm trying to squeeze in some testing with optimizers as well and running into some issues with …
Hmmm, somehow it is only Conv layers which get no gradient for Zeros; all other ways they are involved seem to result in gradients being created:

```julia
julia> ll = Dense(ones(Float32, 1,1), Flux.Zeros())
Dense(1, 1)

julia> gs = gradient(() -> sum(ll(ones(Float32, 1,1))), params(ll))
Grads(...)

julia> gs[ll.b]
0-dimensional FillArrays.Fill{Float32,0,Tuple{}} = 1.0

julia> ll.b
0-dimensional Flux.Zeros{Bool,0}:
 0
```

Edit: It's because of this adjoint:

```julia
@adjoint reshape(xs::Zeros{T}, dims...) where T =
  reshape(xs, dims...), _ -> nothing
```

which is only used with conv layers. I guess we need adjoints for +, - and * too, right?
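To make the point above concrete, here is a hedged sketch (not the PR's actual code) of the kind of extra adjoint a Dense layer would need, so that the broadcasted `+` with a Zeros bias never produces a gradient for it; it assumes the broadcast output has the same shape as `a`:

```julia
using Flux, Zygote
using Flux: Zeros
using Zygote: @adjoint
import Base.Broadcast: broadcasted

# The pullback returns one entry per argument (the `+` function itself, `a`, `b`);
# only `a` gets a gradient back, the Zeros marker gets `nothing`.
@adjoint broadcasted(::typeof(+), a::AbstractArray, b::Zeros) =
  a .+ b, Δ -> (nothing, Δ, nothing)
```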
Ugh, avoiding ambiguity while still catching all permutations of Zeros turned out to be a fair bit more painful than I thought. I could not come up with a simpler way than more or less implementing every permutation. I still need to implement the non-broadcasted ops and test the …
We should move this discussion elsewhere, but I think it would be quite natural for FillArrays.Zeros and FillArrays.Ones to have zero gradients since they are meant to be constants. We should really consider switching to FillArrays types given the amount of extra bloat having our own Zeros produces.
Another thought is: might it be simpler not to do this at all? Just store …
@CarloLucibello I guess Zygote could do the adjoints for it without it being piracy. I can't wrap my head around whether it will cause problems to someone somewhere if there are never gradients for FillArrays. I can think of cases where there is inconvenience, but I'm not sure there are any real showstoppers. If new AD means Zygote is reaching EOL then perhaps it can be done as an experiment to see what complaints come in :)

@mcabbott I was thinking along similar lines. It might be easier to just make something which only tries to solve the bias on/off case. Another option is to have some … I'm not sure if it is desirable to have a turned-off bias in params or not. Otherwise the same thing can perhaps be done more explicitly with Zeros as well. I remember being a bit surprised to see it there, but have in the back of my head that there was a good reason for it. If it needs to be in params then I think Zygote will return gradients for false too (see below).

About the adjoint overloading in the last commit: the core issue afaik is that Zygote has defined adjoints for broadcasting with numerical arrays (and scalars), so even if the primal op does not use a variable the gradient for it will be created. Could this be fixed in Zygote with reasonable effort? A hack like checking if the result is identical to one of the arguments is one way ofc...

Also, after getting some sleep on it, there is a chance I can get rid of the adjoints and instead specialize on …
My foggy recollection was that the original reason for Zeros was to avoid needing a special case in one or two …

Have only skimmed the issues on loading/saving & params, but it sounds like the goal is to ignore absent biases, but …
No disagreement on my side there. This endeavour turned out to be quite the rabbit hole :) One option is that I just open another PR where I try to do the …
Definitely shouldn't change the behaviour of FillArrays types, and we should keep the forward passes untouched since this is orthogonal to how those are supposed to function. Seems like this needs the correct adapt rules (check out …)
We had considered the …
@DhairyaLGandhi I will look into the CUDA.Adaptor. I could not find it when browsing the code on GitHub, but I will search once I'm back in front of the computer. My thinking right now is that the specialized …

I looked through the issue for the param loading case: #1277. It seems that at least the issue with Zeros turning into Arrays is solved by this PR. I will look at tests for …

One obvious failure case is if someone tries to add a bias to a non-bias layer with …
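A hedged sketch of the param-loading round trip referred to above (#1277), under the assumption that the `Zeros` bias is simply skipped rather than overwritten with an array (field names are the PR-era `W`/`b`):

```julia
using Flux, Test

m1 = Dense(randn(Float32, 2, 3), Flux.Zeros())
m2 = Dense(randn(Float32, 2, 3), Flux.Zeros())

Flux.loadparams!(m1, Flux.params(m2))  # copy parameters from m2 into m1

@test m1.W == m2.W
@test m1.b === Flux.Zeros()  # the no-bias marker is left untouched, not turned into an Array
```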
@DhairyaLGandhi I looked up CUDA.Adaptor in the code, but I could still not determine if it had any other purpose than moving stuff to the GPU, which is what we want to avoid altogether here, right? Is this sufficient to tell if it is handled correctly?

```julia
julia> cu(Flux.Zeros())
0-dimensional Flux.Zeros{Bool,0}:
 0

julia> cu(Flux.Zeros(2))
2-element Flux.Zeros{Bool,1}:
 0
 0
```
Adding fixes and tests for …

There are two ways one can view the act of disabling bias:

…

I think the current approach with … For …
Ok, I think functionality-wise this PR is now ready as far as I can tell. I will start drafting another version where Zeros is not an AbstractArray and submit that as a separate PR.
1379: Fix some issues with Zeros option 2 r=CarloLucibello a=DrChainsaw

Fixes #1332, #1277

This is take two on fixing the above issues and is intended to be mutually exclusive to #1374. In this version Zeros is no longer an AbstractArray and is thus more like a there-is-no-such-parameter marker instead of a this-parameter-is-an-immutable-array-of-zeros type.

I made the change so that the advertised way of disabling bias is through `bias=false` rather than `bias=Zeros()`. `Zeros` itself is a lot less capable here compared to the previous attempt, but that also keeps the number of moving parts down, i.e. no need to test that both the 0-dim version and the full-size version work the same. All in all there are fewer specializations compared to #1374.

I think that a rename could definitely be in order. Better names I can think of are `bias=Off`, `bias=None` or, as suggested by @mcabbott, `bias=False()`. The best argument against renaming I can think of is to be a little bit less breaking.

### PR Checklist

- [X] Tests are added
- [X] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).

Co-authored-by: DrChainsaw <Christian.kyril.skarby@gmail.com>
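A hedged usage sketch of the API advertised in #1379, assuming the keyword landed as described there:

```julia
using Flux

m = Dense(3, 2; bias = false)   # layer constructed without a trainable bias
y = m(randn(Float32, 3, 5))     # forward pass works as usual, just with no bias added
```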
Redundant after #1379
Fixes #1332, #1277

Two things were causing trouble:

- `adapt` (used to change the type of parameters) turning `Zeros()` into `T`s
- `reshape` (used in conv layers) turning `Zeros` into `Base.ReshapedArray`s

Both are addressed by specializations.
I'm not sure what the right way to overload reshape is. Leaving the type out for the second argument leads to ambiguity, so there are still more than a handful of ways to reshape Zeros which lead to issue 2). Any packages which do it that I could look at?
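For illustration, a hedged sketch of the kind of reshape specialization described above (not the PR's exact method list), keeping the result a `Zeros` instead of a `Base.ReshapedArray`:

```julia
using Flux: Zeros

# Reshaping the no-bias marker just yields a marker with the new size.
# (The actual PR needs several more signatures to dodge method ambiguities.)
Base.reshape(x::Zeros, dims::Base.Dims) = Zeros(eltype(x), dims...)
```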
I was too lazy to come up with a nice way to check dimensions of parameters when creating a layer.
Furthermore, a number of adjoints are defined to ensure that the gradient of any `Zeros` is `nothing`. Also fixes a few issues with `loadparams!` and `destructure` in the presence of `Zeros`.

### PR Checklist

- Final review from `@dhairyagandhi96` (for API changes).