Gradient definitions for cpu & gpu #1704
Conversation
This is something that is changed in a breaking manner in CUDA, so should be addressed there.
# While `cu` moves Arrays to the GPU, we also want to move some structured arrays
# https://github.com/FluxML/Zygote.jl/issues/1005
_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum
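For context, the reason a method like this is needed: Zygote's CPU gradient of `sum` is a lazy `FillArrays.Fill` rather than a dense Array, so anything moving gradients to the GPU has to handle that type. A quick check (a sketch, assuming Zygote and FillArrays are installed, on the Zygote versions this PR targets):

using Zygote, FillArrays

dx, = gradient(sum, ones(3))
dx isa Fill   # true: the gradient of sum is lazy, not a dense Array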
This is changing the behaviour of `CUDA.cu` and will eagerly materialise.
No, `CUDA.cu` is deliberately left untouched.
`gpu` doesn't mirror `cu` anymore.
Correct, that's why "changing the behaviour of CUDA.cu" is incorrect.
The expected behavior is that it should.
So we want new behaviour, and not to pirate CUDA, and the same behaviour? I hope it's clear that you cannot have all three. This PR makes a choice. (Not the choice you claimed, though.)
The point is that there is no new behaviour. There is a bug that needs to be identified with the `cu` conversions. @maleadt would be able to say.
I'm not. The only bug reports that may have been related were inconclusive (the Flux issue just showing a Zygote error) or caused by other issues (RecursiveArrayTools missing an essential constructor).
src/functor.jl (Outdated)
_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum
function _gpu_array(x::Zygote.OneElement) # gradient of getindex
    y = CUDA.zeros(eltype(x), size(x))
    CUDA.@allowscalar y[x.ind...] = x.val
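For context, `Zygote.OneElement` is the lazy gradient Zygote produces for scalar `getindex`: it stores one value plus its index instead of allocating a dense array. A small illustration (a sketch, assuming a Zygote version that defines `OneElement`):

using Zygote

dx, = gradient(x -> x[2], ones(4))
dx isa Zygote.OneElement   # true: behaves like [0.0, 1.0, 0.0, 0.0] without storing it densely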
Seems like it may be too easy to trigger scalar operations silently now. Reporting that they are happening is better for performance and proper use of CUDA.
No, this is precisely to solve the issue linked.
No, events of scalar indexing on GPU are reported as errors precisely because they don't work with the way a GPU is expected to work. This is bypassing that assumption.
This is doing roughly the same thing as https://github.com/FluxML/Zygote.jl/pull/880/files#diff-1040a6fce812f91964b258de2a02fb69296c75a9701c3326df2671ab9ea7e5f0R284 . I don't see loud objections that that should have been left an error because that's how "GPU is expected to work".
I knew `OneElement` was of limited benefit over generalised `accum`, but now I'm wondering if it can silently not report scalar indexing.
"if it can silently not report scalar indexing"

Yes, that's what `@allowscalar` does. Just like `vcat` does.
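To illustrate the trade-off being discussed (a minimal sketch, assuming a working CUDA.jl setup): scalar indexing into a CuArray is normally reported as an error, and `@allowscalar` opts one specific, known-cheap operation out of that check.

using CUDA

CUDA.allowscalar(false)            # make sure stray scalar indexing throws (the default outside interactive use)
y = CUDA.zeros(Float32, 4)
# y[2] = 1f0                       # would error: scalar setindex! is disallowed
CUDA.@allowscalar y[2] = 1f0       # a single deliberate write, as in the helper above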
src/functor.jl (Outdated)
function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
    y = _cpu_array(x)
    if x === y
Wouldn't be necessary if we don't override CUDA's behavior. If we want a way to work with structured matrices, making a case for them not working and addressing in Adapt is the official way.
No idea what that means. What CUDA behaviour is being overridden? `Flux.cpu` should be a no-op on CPU arrays, including in the gradient.
I mean, Adapt gets us this check for cpu and gpu cases for free.
I mean, if you have another approach, please write it up somewhere. Make a better PR. Snarky comments don't actually move anything forwards.
The tests here ought to pass (well, all but the last pair, possibly). The mechanism to do so I'm not at all attached to.
Just a question: is the assumption that `x !== y` means `y` is a GPU array too strong here?
What does "this" refer to here? The PR's first message claims to fix two issues. One of them involves … It's possible there were changes to …
src/functor.jl (Outdated)
    y = _cpu_array(x)
    if x === y
        # Trivial use: cpu(x::Array) shouldn't push its gradient to GPU
        return y, dy -> (Zygote.ChainRules.NoTangent(), dy)
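A self-contained sketch of the pattern under discussion, with simplified stand-ins for the PR's helpers (the real definitions differ): the pullback only moves the cotangent when the forward pass actually moved the data.

using ChainRulesCore, Adapt, CUDA

_cpu_array(x::AbstractArray) = adapt(Array, x)   # simplified stand-in
_gpu_array(x::AbstractArray) = cu(x)             # simplified stand-in

function ChainRulesCore.rrule(::typeof(_cpu_array), x::AbstractArray)
    y = _cpu_array(x)
    if x === y
        # cpu(x::Array) was a no-op, so its gradient stays on the CPU
        return y, dy -> (NoTangent(), dy)
    else
        # x came from the GPU, so send the cotangent back there
        return y, dy -> (NoTangent(), _gpu_array(dy))
    end
end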
Another question: what's the advantage of this if-branch over

return y, dy -> (Zygote.ChainRules.NoTangent(), adapt(basetype(x), dy))

with `basetype` defined as in JuliaLang/julia#35543:

basetype(T::Type) = Base.typename(T).wrapper
basetype(x::T) where T = basetype(T)

Although there are some overheads here in `basetype`, I still feel it's more generic than the `x === y` if-branch here. Or in other words, I feel we need more information from `adapt`, not just the output data.
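For what it's worth, here is what the proposed `basetype` would return in practice (a quick check; the CuArray line assumes CUDA.jl is loaded):

using CUDA

basetype(T::Type) = Base.typename(T).wrapper
basetype(x::T) where T = basetype(T)

basetype(rand(3))          # Array
basetype(CUDA.zeros(3))    # CuArray
basetype(rand(3)')         # Adjoint -- the wrapper, not the underlying storage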
This would catch Array <-> CuArray, but miss a few other cases this PR wants.
First, it will be confused by wrappers like Adjoint, which adapt knows to look inside of:
julia> x = rand(3)';
julia> basetype(x)
Adjoint
julia> adapt(Array, x) === x
true
Second, `adapt` doesn't move structured arrays produced (for efficiency) by some Zygote gradients. The gradient of sum (on the CPU) is a `Fill`, but if the forward pass went `(x::CuArray) |> cpu |> sum`, the reverse will need to regard this as being a CPU array & make a CuArray. But adapt doesn't do that:
julia> adapt(CuArray, Fill(1,2,3))
2×3 Fill{Int64}, with entries equal to 1
which is why there needs to be another function involved, to which we can attach this gradient behaviour, without piracy.
There might be a more clever way to allow for trivial use like `cpu(::Array)` instead of `x === y`, but I didn't see it yet. I believe the branch should be cheap, especially when the types do differ, which is the case we really care about.
Structured arrays and wrappers are the domain of Adapt. Also, the ones we are interested in for Zygote work with CUDA.
julia> CUDA.zeros(3) + Fill(3,3)
3-element CuArray{Float32, 1}:
3.0
3.0
3.0
julia> gradient((x,y) -> sum(cpu(x) + y), CUDA.zeros(3), Fill(3,3))
(3-element Fill{Float32}: entries equal to 1.0, 3-element Fill{Float32}: entries equal to 1.0)
julia> gradient((x,y) -> sum(cpu(x) + y), zeros(3), Fill(3,3))
(3-element Fill{Float64}: entries equal to 1.0, 3-element Fill{Float64}: entries equal to 1.0)
julia> gradient((x,y) -> sum(gpu(x) + y), zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])
julia> gradient((x,y) -> sum(gpu(x) + y), CUDA.zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])
so it does seem like the issue is in fact elsewhere, and `adapt` needs to be given more information, or perhaps there is a missing `adapt` rule somewhere.
The issue may be that `adapt(::Array, ::Fill)` allocates, which mixes types if it interacts with GPU arrays.
julia> CUDA.Adapt.adapt(Array, Fill(3,3))
3-element Vector{Int64}:
3
3
3
This would need to be addressed in Adapt.
It's true that some operations between CuArrays and FillArrays do work, as they are overloaded. But many don't, like `*`, or worse, `conv`. That's why Zygote at present does not make a Fill for the gradient of `sum(::CuArray)`; this PR copies that precedent.
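Checking the precedent referred to here (a sketch, assuming Zygote and CUDA are available, on the versions discussed in this thread): the gradient of `sum` over a CuArray comes back as a dense GPU array, not a Fill.

using Zygote, CUDA, FillArrays

dx, = gradient(sum, CUDA.zeros(3))
dx isa CuArray   # true: dense GPU array
dx isa Fill      # false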
Oh, it's bizarre that `adapt(Array, Fill(3,3))` makes a Vector, but `adapt(CuArray, Fill(3,3))` does nothing.
Note that at present, this will only be triggered if you call `cpu(::Fill)`: what should be a pass-through isn't quite. This is true of the tagged version of Flux, too; I'd call that a bug:
julia> cpu(Fill(3,3))
3-element Vector{Int64}:
3
3
3
It isn't triggered in the gradient like `x::Array |> cpu |> sum`, which is a case currently tested. Easy to fix though.
Maybe adapt should make its `Fill` behaviour with Cu/Arrays more explicit.

"like `*`,"

This should be made to work in FillArrays / Adapt.
There is a PR for `*` in FillArrays, but it gets pretty hairy. And there are many other operations, like I said.
You can try to persuade Adapt to depend on FillArrays, or the reverse, but I think you will have a hard time.
Anyway I think not changing `sum(::CuArray)` is the conservative position here. This PR tries to make sure that `cpu` follows that.
We aren't changing the `sum` method here.
I think we should go with this.
Great, let's do this. If someone later figures out a different implementation, perhaps involving more Adapt, that's fine too. At least we will have tests which check that it actually works.
bors r+
Build succeeded: |
Closes #1695. Closes FluxML/Zygote.jl#1005
Not well tested locally, so we will see what CI thinks.
I'm not very confident this uses `fmap` correctly. Even thornier test cases would be welcome.
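For readers following along, a rough sketch of how fmap-based device movement hangs together (simplified, hypothetical names; the real Flux definitions in src/functor.jl differ in detail and add the gradient rules this PR is about):

using Functors, Adapt, CUDA

my_cpu(m) = fmap(x -> adapt(Array, x), m)   # walk a nested model, moving each leaf array to the CPU
my_gpu(m) = fmap(cu, m)                     # walk a nested model, moving each leaf array to the GPU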