
Gradient definitions for cpu & gpu #1704

Merged
merged 8 commits into from
Sep 6, 2021

Conversation

mcabbott (Member)

Closes #1695. Closes FluxML/Zygote.jl#1005

Not well tested locally, so we will see what CI thinks.

I'm not very confident this uses fmap correctly. Even thornier test cases would be welcome.
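
For orientation, here is a minimal sketch (not this PR's exact code) of how cpu and gpu can route array leaves through the _cpu_array / _gpu_array helpers that appear in the diff excerpts below. The helper names come from the PR's diff; the fmap-based traversal and the names cpu_sketch / gpu_sketch are assumptions for illustration only.

using CUDA, Adapt
using Functors: fmap   # structural traversal over nested models

# Hypothetical leaf movers; the PR's real definitions live in src/functor.jl.
_cpu_array(x::AbstractArray) = Adapt.adapt(Array, x)   # GPU array -> Array; an Array stays itself
_gpu_array(x::AbstractArray) = CUDA.cu(x)              # Array -> CuArray (plus special cases below)

# cpu/gpu walk nested structs and apply the helper to every array leaf.
cpu_sketch(m) = fmap(x -> x isa AbstractArray ? _cpu_array(x) : x, m)
gpu_sketch(m) = CUDA.functional() ? fmap(x -> x isa AbstractArray ? _gpu_array(x) : x, m) : m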

mcabbott closed this Aug 31, 2021
mcabbott reopened this Aug 31, 2021

DhairyaLGandhi (Member) left a comment

This is something that was changed in a breaking manner in CUDA, so it should be addressed there.


# While `cu` moves Arrays to the GPU, we also want to move some structured arrays
# https://github.com/FluxML/Zygote.jl/issues/1005
_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum

Member

This is changing the behaviour of CUDA.cu and will eagerly materialise.

Member Author

No, CUDA.cu is deliberately untouched.

Member

gpu doesn't mirror cu anymore

Member Author

Correct, that's why "changing the behaviour of CUDA.cu" is incorrect.

Member

The expected behavior is that it should.

Member Author

So we want new behaviour, and not to pirate CUDA, and the same behaviour? I hope it's clear that you cannot have all three. This PR makes a choice. (Not the choice you claimed, though.)

Member

The point is that there is no new behaviour. There is a bug in the cu conversions that needs to be identified. @maleadt would be able to say.

Collaborator

I'm not. The only bug reports that may have been related were inconclusive (the Flux issue just showing a Zygote error) or caused by other issues (RecursiveArrayTools missing an essential constructor).

src/functor.jl Outdated
_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum
function _gpu_array(x::Zygote.OneElement) # gradient of getindex
    y = CUDA.zeros(eltype(x), size(x))
    CUDA.@allowscalar y[x.ind...] = x.val
    return y
end

Member

Seems like it may be too easy to trigger scalar operations silently now. Reporting that they are happening is better for performance and proper use of CUDA.

Member Author

No, this is precisely to solve the issue linked.

Member

No, scalar indexing on the GPU is reported as an error precisely because it doesn't fit the way a GPU is expected to work. This is bypassing that assumption.

Member Author

This is doing roughly the same thing as https://github.com/FluxML/Zygote.jl/pull/880/files#diff-1040a6fce812f91964b258de2a02fb69296c75a9701c3326df2671ab9ea7e5f0R284 . I don't see loud objections that that should have been left an error because that's how "GPU is expected to work".

Member

I knew OneElement offered limited benefit over generalised accumulation, but now I'm wondering whether it can silently fail to report scalar indexing.

Member Author

> if it can silently not report scalar indexing

Yes, that's what @allowscalar does. Just like vcat does.
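
As a side note, a tiny illustration of what CUDA.@allowscalar permits; the array and values here are just an example, not code from the PR:

using CUDA
a = CUDA.zeros(3)
CUDA.allowscalar(false)           # scalar indexing on a CuArray now throws an error
# a[1] = 1f0                      # would error: scalar setindex! is disallowed
CUDA.@allowscalar a[1] = 1f0      # explicitly allowed for this one expression, as in the OneElement method above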

src/functor.jl Outdated

function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
y = _cpu_array(x)
if x === y

Member

This wouldn't be necessary if we didn't override CUDA's behavior. If we want a way to work with structured matrices, the official route is to make a case for them not working and address it in Adapt.

Member Author

No idea what that means. What CUDA behaviour is being overridden?

Flux.cpu should be a no-op on CPU arrays, including in the gradient.

Member

I mean, Adapt gets us this check for cpu and gpu cases for free.

Member Author

I mean, if you have another approach, please write it up somewhere. Make a better PR. Snarky comments don't actually move anything forwards.

The tests here ought to pass (well, all but the last pair, possibly). I'm not at all attached to the mechanism for doing so.

Contributor

Just a question: is the assumption that x !== y means y is a GPU array too strong here?

mcabbott (Member Author) commented Aug 31, 2021

> This is something that was changed in a breaking manner in CUDA

What does "this" refer to here? The PR's first message claims to fix two issues. One of them involves cpu which, as far as I can tell, never worked. The other involves getindex, which also isn't a CUDA change. (And there's a parallel issue with sum, Fill, since you don't like getindex, obviously.) So it is incorrect to blame this on a CUDA change.

It's possible there were changes to cu which this PR also works around, by doing things in functions Flux owns instead of waiting for ones CUDA owns. It's possible those could be solved in other ways. Feel free to link other PRs once you've made them.

src/functor.jl Outdated
y = _cpu_array(x)
if x === y
# Trivial use: cpu(x::Array) shouldn't push its gradient to GPU
return y, dy -> (Zygote.ChainRules.NoTangent(), dy)

johnnychen94 (Contributor) commented Aug 31, 2021

Another question: what's the advantage of this if-branch over

return y, dy -> (Zygote.ChainRules.NoTangent(), adapt(basetype(x), dy))

with basetype defined as in JuliaLang/julia#35543

basetype(T::Type) = Base.typename(T).wrapper
basetype(x::T) where T = basetype(T)

Although there is some overhead in basetype, I still feel it's more generic than the x === y branch here. In other words, I feel we need more information from adapt, not just the output data.

Member Author

This would catch Array <-> CuArray, but miss a few other cases this PR wants.

First, it will be confused by wrappers like Adjoint, which adapt knows to look inside of:

julia> x = rand(3)';

julia> basetype(x)
Adjoint

julia> adapt(Array, x) === x
true

Second, adapt doesn't move the structured arrays produced (for efficiency) by some Zygote gradients. The gradient of sum (on the CPU) is a Fill, but if the forward pass went (x::CuArray) |> cpu |> sum, the reverse pass needs to regard this as a CPU array and make a CuArray. But adapt doesn't do that:

julia> adapt(CuArray, Fill(1,2,3))
2×3 Fill{Int64}, with entries equal to 1

which is why there needs to be another function involved, to which we can attach this gradient behaviour, without piracy.

There might be a more clever way to allow for trivial use like cpu(::Array) than the x === y check, but I haven't seen it yet. I believe the branch should be cheap, especially when the types do differ, which is the case we really care about.
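
To make "attach this gradient behaviour" concrete, here is a rough sketch in the spirit of the diff excerpts above; the x === y branch matches the excerpt shown earlier, while the else branch is my reading of the intent and not necessarily the PR's exact code:

function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
    y = _cpu_array(x)
    if x === y
        # Trivial use: cpu(x::Array) shouldn't push its gradient to the GPU
        return y, dy -> (Zygote.ChainRules.NoTangent(), dy)
    else
        # x lived on the GPU, so the cotangent is moved back there,
        # materialising Fill / OneElement via _gpu_array along the way
        return y, dy -> (Zygote.ChainRules.NoTangent(), _gpu_array(dy))
    end
end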

Member

Structured arrays and wrappers are the domain of Adapt. Also, the ones we are interested in for Zygote work with CUDA.

julia> CUDA.zeros(3) + Fill(3,3)
3-element CuArray{Float32, 1}:
 3.0
 3.0
 3.0

julia> gradient((x,y) -> sum(cpu(x) + y), CUDA.zeros(3), Fill(3,3))
(3-element Fill{Float32}: entries equal to 1.0, 3-element Fill{Float32}: entries equal to 1.0)

julia> gradient((x,y) -> sum(cpu(x) + y), zeros(3), Fill(3,3))
(3-element Fill{Float64}: entries equal to 1.0, 3-element Fill{Float64}: entries equal to 1.0)

julia> gradient((x,y) -> sum(gpu(x) + y), zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])

julia> gradient((x,y) -> sum(gpu(x) + y), CUDA.zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])

so it does seem like the issue is elsewhere: either adapt needs to be given more information, or there is a missing adapt rule somewhere.

DhairyaLGandhi (Member) commented Aug 31, 2021

The issue may be that adapt(::Array, ::Fill) allocates, which mixes types if it interacts with GPU arrays.

julia> CUDA.Adapt.adapt(Array, Fill(3,3))
3-element Vector{Int64}:
 3
 3
 3

This would need to be addressed in Adapt.

Member Author

It's true that some operations between CuArrays and FillArrays do work, as they are overloaded. But many don't, like *, or worse, conv. That's why Zygote at present does not make a Fill for the gradient of sum(::CuArray); this PR copies that precedent.

mcabbott (Member Author) commented Aug 31, 2021

Oh, it's bizarre that adapt(Array, Fill(3,3)) makes a Vector, but adapt(CuArray, Fill(3,3)) does nothing.

Note that at present, this will only be triggered if you call cpu(::Fill): what should be a pass-through isn't quite one. This is true of the tagged version of Flux too; I'd call that a bug:

julia> cpu(Fill(3,3))
3-element Vector{Int64}:
 3
 3
 3

It isn't triggered in a gradient like x::Array |> cpu |> sum, which is a case currently tested. Easy to fix, though.

DhairyaLGandhi (Member) commented Aug 31, 2021

Maybe adapt should make its Fill behaviour with CuArrays and Arrays more explicit.

> like *,

This should be made to work in FillArrays / Adapt.

Member Author

There is a PR for * in FillArrays, but it gets pretty hairy. And there are many other operations, like I said.

You can try to persuade Adapt to depend on FillArrays, or the reverse, but I think you will have a hard time.

Anyway I think not changing sum(::CuArray) is the conservative position here. This PR tries to make sure that cpu follows that.

Member

We aren't changing the sum method here.

CarloLucibello (Member)

I think we should go with this

mcabbott (Member Author) commented Sep 6, 2021

Great, let's do this.

If someone later figures out a different implementation, perhaps involving more Adapt, that's fine too. At least we will have tests which check that it actually works.

bors r+

bors bot (Contributor) commented Sep 6, 2021

Build succeeded.


Successfully merging this pull request may close these issues:

Problems with a mixed CPU/GPU model
Scalar indexing error from GPU matmul against Zygote.OneElement