
Gradient definitions for cpu & gpu #1704

Merged
merged 8 commits into from
Sep 6, 2021

Conversation

mcabbott (Member)

Closes #1695. Closes FluxML/Zygote.jl#1005

Not well tested locally, so we will see what CI thinks.

I'm not very confident this uses fmap correctly. Even thornier test cases would be welcome.
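
For orientation, here is a minimal sketch (not this PR's exact code) of how cpu and gpu can route array leaves through the _cpu_array / _gpu_array helpers that appear in the diff excerpts below. The helper names come from the PR's diff; the fmap-based traversal and the names cpu_sketch / gpu_sketch are assumptions for illustration only.

using CUDA, Adapt
using Functors: fmap   # structural traversal over nested models

# Hypothetical leaf movers; the PR's real definitions live in src/functor.jl.
_cpu_array(x::AbstractArray) = Adapt.adapt(Array, x)   # GPU array -> Array; an Array stays itself
_gpu_array(x::AbstractArray) = CUDA.cu(x)              # Array -> CuArray (plus special cases below)

# cpu/gpu walk nested structs and apply the helper to every array leaf.
cpu_sketch(m) = fmap(x -> x isa AbstractArray ? _cpu_array(x) : x, m)
gpu_sketch(m) = CUDA.functional() ? fmap(x -> x isa AbstractArray ? _gpu_array(x) : x, m) : m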

mcabbott closed this Aug 31, 2021
mcabbott reopened this Aug 31, 2021

DhairyaLGandhi (Member) left a comment

This is something that was changed in a breaking manner in CUDA, so it should be addressed there.


# While `cu` moves Arrays to the GPU, we also want to move some structured arrays
# https://github.com/FluxML/Zygote.jl/issues/1005
_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum

Member

This is changing the behaviour of CUDA.cu and will eagerly materialise.

Member Author

No, CUDA.cu is deliberately untouched.

Member

gpu doesn't mirror cu anymore

Member Author

Correct, that's why "changing the behaviour of CUDA.cu" is incorrect.

Member

The expected behavior is that it should.

Member Author

So we want new behaviour, and not to pirate CUDA, and the same behaviour? I hope it's clear that you cannot have all three. This PR makes a choice. (Not the choice you claimed, though.)

Member

The point is that there is no new behaviour. There is a bug in the cu conversions that needs to be identified. @maleadt would be able to say.

Collaborator

I'm not. The only bug reports that may have been related were inconclusive (the Flux issue just showing a Zygote error) or caused by other issues (RecursiveArrayTools missing an essential constructor).

src/functor.jl Outdated
_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum
function _gpu_array(x::Zygote.OneElement) # gradient of getindex
    y = CUDA.zeros(eltype(x), size(x))
    CUDA.@allowscalar y[x.ind...] = x.val
    return y
end

Member

Seems like it may be too easy to trigger scalar operations silently now. Reporting that they are happening is better for performance and proper use of CUDA.

Member Author

No, this is precisely to solve the issue linked.

Member

No, scalar indexing on the GPU is reported as an error precisely because it doesn't fit the way a GPU is expected to work. This is bypassing that assumption.

Member Author

This is doing roughly the same thing as https://github.com/FluxML/Zygote.jl/pull/880/files#diff-1040a6fce812f91964b258de2a02fb69296c75a9701c3326df2671ab9ea7e5f0R284 . I don't see loud objections that that should have been left an error because that's how "GPU is expected to work".

Member

I knew OneElement offered limited benefit over generalised accumulation, but now I'm wondering whether it can silently fail to report scalar indexing.

Member Author

> if it can silently not report scalar indexing

Yes, that's what @allowscalar does. Just like vcat does.
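
As a side note, a tiny illustration of what CUDA.@allowscalar permits; the array and values here are just an example, not code from the PR:

using CUDA
a = CUDA.zeros(3)
CUDA.allowscalar(false)           # scalar indexing on a CuArray now throws an error
# a[1] = 1f0                      # would error: scalar setindex! is disallowed
CUDA.@allowscalar a[1] = 1f0      # explicitly allowed for this one expression, as in the OneElement method above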

src/functor.jl Outdated

function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
y = _cpu_array(x)
if x === y

Member

This wouldn't be necessary if we didn't override CUDA's behavior. If we want a way to work with structured matrices, the official route is to make a case for them not working and address it in Adapt.

Member Author

No idea what that means. What CUDA behaviour is being overridden?

Flux.cpu should be a no-op on CPU arrays, including in the gradient.

Member

I mean, Adapt gets us this check for cpu and gpu cases for free.

Member Author

I mean, if you have another approach, please write it up somewhere. Make a better PR. Snarky comments don't actually move anything forwards.

The tests here ought to pass (well, all but the last pair, possibly). I'm not at all attached to the mechanism for doing so.

Contributor

Just a question: is the assumption that x !== y means y is a GPU array too strong here?

mcabbott (Member Author) commented Aug 31, 2021

> This is something that was changed in a breaking manner in CUDA

What does "this" refer to here? The PR's first message claims to fix two issues. One of them involves cpu which, as far as I can tell, never worked. The other involves getindex, which also isn't a CUDA change. (And there's a parallel issue with sum, Fill, since you don't like getindex, obviously.) So it is incorrect to blame this on a CUDA change.

It's possible there were changes to cu which this PR also works around, by doing things in functions Flux owns instead of waiting for ones CUDA owns. It's possible those could be solved in other ways. Feel free to link other PRs once you've made them.

src/functor.jl Outdated
y = _cpu_array(x)
if x === y
# Trivial use: cpu(x::Array) shouldn't push its gradient to GPU
return y, dy -> (Zygote.ChainRules.NoTangent(), dy)

johnnychen94 (Contributor) commented Aug 31, 2021

Another question: what's the advantage of this if-branch over

return y, dy -> (Zygote.ChainRules.NoTangent(), adapt(basetype(x), dy))

with basetype defined as in JuliaLang/julia#35543

basetype(T::Type) = Base.typename(T).wrapper
basetype(x::T) where T = basetype(T)

Although there is some overhead in basetype, I still feel it's more generic than the x === y branch here. In other words, I feel we need more information from adapt, not just the output data.

Member Author

This would catch Array <-> CuArray, but miss a few other cases this PR wants.

First, it will be confused by wrappers like Adjoint, which adapt knows to look inside of:

julia> x = rand(3)';

julia> basetype(x)
Adjoint

julia> adapt(Array, x) === x
true

Second, adapt doesn't move the structured arrays produced (for efficiency) by some Zygote gradients. The gradient of sum (on the CPU) is a Fill, but if the forward pass went (x::CuArray) |> cpu |> sum, the reverse pass needs to regard this as a CPU array and make a CuArray. But adapt doesn't do that:

julia> adapt(CuArray, Fill(1,2,3))
2×3 Fill{Int64}, with entries equal to 1

which is why there needs to be another function involved, to which we can attach this gradient behaviour, without piracy.

There might be a more clever way to allow for trivial use like cpu(::Array) than the x === y check, but I haven't seen it yet. I believe the branch should be cheap, especially when the types do differ, which is the case we really care about.
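
To make "attach this gradient behaviour" concrete, here is a rough sketch in the spirit of the diff excerpts above; the x === y branch matches the excerpt shown earlier, while the else branch is my reading of the intent and not necessarily the PR's exact code:

function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
    y = _cpu_array(x)
    if x === y
        # Trivial use: cpu(x::Array) shouldn't push its gradient to the GPU
        return y, dy -> (Zygote.ChainRules.NoTangent(), dy)
    else
        # x lived on the GPU, so the cotangent is moved back there,
        # materialising Fill / OneElement via _gpu_array along the way
        return y, dy -> (Zygote.ChainRules.NoTangent(), _gpu_array(dy))
    end
end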

Member

Structured arrays and wrappers are the domain of Adapt. Also, the ones we are interested in for Zygote work with CUDA.

julia> CUDA.zeros(3) + Fill(3,3)
3-element CuArray{Float32, 1}:
 3.0
 3.0
 3.0

julia> gradient((x,y) -> sum(cpu(x) + y), CUDA.zeros(3), Fill(3,3))
(3-element Fill{Float32}: entries equal to 1.0, 3-element Fill{Float32}: entries equal to 1.0)

julia> gradient((x,y) -> sum(cpu(x) + y), zeros(3), Fill(3,3))
(3-element Fill{Float64}: entries equal to 1.0, 3-element Fill{Float64}: entries equal to 1.0)

julia> gradient((x,y) -> sum(gpu(x) + y), zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])

julia> gradient((x,y) -> sum(gpu(x) + y), CUDA.zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])

so it does seem like the issue is elsewhere: either adapt needs to be given more information, or there is a missing adapt rule somewhere.

DhairyaLGandhi (Member) commented Aug 31, 2021

The issue may be that adapt(::Array, ::Fill) allocates, which mixes types if it interacts with GPU arrays.

julia> CUDA.Adapt.adapt(Array, Fill(3,3))
3-element Vector{Int64}:
 3
 3
 3

This would need to be addressed in Adapt.

Member Author

It's true that some operations between CuArrays and FillArrays do work, as they are overloaded. But many don't, like *, or worse, conv. That's why Zygote at present does not make a Fill for the gradient of sum(::CuArray); this PR copies that precedent.

mcabbott (Member Author) commented Aug 31, 2021

Oh, it's bizarre that adapt(Array, Fill(3,3)) makes a Vector, but adapt(CuArray, Fill(3,3)) does nothing.

Note that at present, this will only be triggered if you call cpu(::Fill): what should be a pass-through isn't quite one. This is true of the tagged version of Flux too; I'd call that a bug:

julia> cpu(Fill(3,3))
3-element Vector{Int64}:
 3
 3
 3

It isn't triggered in a gradient like x::Array |> cpu |> sum, which is a case currently tested. Easy to fix, though.

DhairyaLGandhi (Member) commented Aug 31, 2021

Maybe adapt should make its Fill behaviour with CuArrays and Arrays more explicit.

> like *,

This should be made to work in FillArrays / Adapt.

Member Author

There is a PR for * in FillArrays, but it gets pretty hairy. And there are many other operations, like I said.

You can try to persuade Adapt to depend on FillArrays, or the reverse, but I think you will have a hard time.

Anyway I think not changing sum(::CuArray) is the conservative position here. This PR tries to make sure that cpu follows that.

Member

We aren't changing the sum method here.

CarloLucibello (Member)

I think we should go with this

mcabbott (Member Author) commented Sep 6, 2021

Great, let's do this.

If someone later figures out a different implementation, perhaps involving more Adapt, that's fine too. At least we will have tests which check that it actually works.

bors r+

bors bot (Contributor) commented Sep 6, 2021

Build succeeded.


Successfully merging this pull request may close these issues:

Problems with a mixed CPU/GPU model
Scalar indexing error from GPU matmul against Zygote.OneElement