Gradient definitions for cpu & gpu #1704
````diff
@@ -86,7 +86,20 @@ julia> typeof(m_cpu.W)
 Matrix{Float32}
 ```
 """
-cpu(m) = fmap(x -> adapt(Array, x), m)
+cpu(x) = fmap(_cpu_array, x; exclude = _isbitsarray)
+
+_cpu_array(x::AbstractArray) = adapt(Array, x)
+
+function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
+  y = _cpu_array(x)
+  if x === y
+    # Trivial use: cpu(x::Array) shouldn't push its gradient to GPU
+    return y, dy -> (Zygote.ChainRules.NoTangent(), dy)
````
Another question: what's the advantage of this if-branch over `return y, dy -> (Zygote.ChainRules.NoTangent(), adapt(basetype(x), dy))` with

```julia
basetype(T::Type) = Base.typename(T).wrapper
basetype(x::T) where T = basetype(T)
```

Although there are some overheads here in …

This would catch Array <-> CuArray, but miss a few other cases this PR wants. First, it will be confused by wrappers like Adjoint, which adapt knows to look inside of:
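A hypothetical sketch of that wrapper point, not taken from the thread, assuming CUDA.jl, Adapt.jl and LinearAlgebra are loaded:

```julia
using CUDA, Adapt, LinearAlgebra

basetype(T::Type) = Base.typename(T).wrapper
basetype(x::T) where T = basetype(T)

x = cu(rand(Float32, 3, 4))'   # an Adjoint wrapping a CuMatrix

basetype(x)            # Adjoint: only the outermost wrapper is seen
adapt(Array, x)        # an Adjoint of a Matrix: Adapt recurses into the wrapper
adapt(basetype(x), x)  # presumably a no-op, as Adapt has no storage rule for Adjoint
```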
Second, …

which is why there needs to be another function involved, to which we can attach this gradient behaviour, without piracy. There might be a more clever way to allow for trivial use like …

Structured arrays and wrappers are the domain of Adapt. Also, the ones we are interested in for Zygote work with CUDA:

```julia
julia> CUDA.zeros(3) + Fill(3,3)
3-element CuArray{Float32, 1}:
 3.0
 3.0
 3.0

julia> gradient((x,y) -> sum(cpu(x) + y), CUDA.zeros(3), Fill(3,3))
(3-element Fill{Float32}: entries equal to 1.0, 3-element Fill{Float32}: entries equal to 1.0)

julia> gradient((x,y) -> sum(cpu(x) + y), zeros(3), Fill(3,3))
(3-element Fill{Float64}: entries equal to 1.0, 3-element Fill{Float64}: entries equal to 1.0)

julia> gradient((x,y) -> sum(gpu(x) + y), zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])

julia> gradient((x,y) -> sum(gpu(x) + y), CUDA.zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])
```

so it does seem like, if something does in fact look like …, the issue is elsewhere, and …

The issue may be that

```julia
julia> CUDA.Adapt.adapt(Array, Fill(3,3))
3-element Vector{Int64}:
 3
 3
 3
```

This would need to be addressed in Adapt.

It's true that some operations between CuArrays and FillArrays do work, as they are overloaded. But many don't, like …

Oh, it's bizarre that … Note that at present, this will only be triggered if you call … It isn't triggered in the gradient like …

Maybe adapt should make its … This should be made to work in FillArrays/Adapt.

There is a PR for … You can try to persuade Adapt to depend on FillArrays, or the reverse, but I think you will have a hard time. Anyway I think not changing …

We aren't changing the …
````diff
+  else
+    # Allows both cpu(x::CuArray) and cpu(x::Adjoint{T,CuArray}):
+    return y, dy -> (Zygote.ChainRules.NoTangent(), _gpu_array(dy))
+  end
+end
 
 _isbitsarray(::AbstractArray{<:Number}) = true
 _isbitsarray(::AbstractArray{T}) where T = isbitstype(T)
````
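To make the intent of the rule above concrete, a sketch of the behaviour it aims for, assuming a functional CUDA device (illustrative usage, not part of the diff):

```julia
using Flux, CUDA, Zygote

x = rand(Float32, 3)
gradient(x -> sum(abs2, cpu(x)), x)[1] isa Array          # trivial branch: gradient stays on the CPU

if CUDA.functional()
    xg = cu(x)
    gradient(x -> sum(abs2, cpu(x)), xg)[1] isa CuArray   # else branch: gradient is pushed back to the GPU
end
```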
````diff
@@ -99,8 +112,7 @@ Moves `m` to the current GPU device, if available. It is a no-op otherwise.
 See the [CUDA.jl docs](https://juliagpu.github.io/CUDA.jl/stable/usage/multigpu/)
 to help identify the current device.
 
-This works for functions and
-any struct with [`@functor`](@ref) defined.
+This works for functions, and any struct marked with [`@functor`](@ref).
 
 ```julia-repl
 julia> m = Dense(1,2)
````
````diff
@@ -116,7 +128,27 @@ julia> typeof(m_gpu.W) # notice the type of the array changed to a CuArray
 CuArray{Float32, 2}
 ```
 """
-gpu(x) = use_cuda[] ? fmap(CUDA.cu, x; exclude = _isbitsarray) : x
+gpu(x) = use_cuda[] ? fmap(_gpu_array, x; exclude = _isbitsarray) : x
+
+_gpu_array(x::AbstractArray) = CUDA.cu(x)
+
+# While `cu` moves Arrays to the GPU, we also want to move some structured arrays
+# https://github.com/FluxML/Zygote.jl/issues/1005
+_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum
````
This is changing the behaviour of `CUDA.cu` …

No, …

Correct, that's why "changing the behaviour of CUDA.cu" is incorrect.

The expected behavior is that it should. …

So we want new behaviour, and not to pirate CUDA, and the same behaviour? I hope it's clear that you cannot have all three. This PR makes a choice. (Not the choice you claimed, though.)

The point is that there is no new behaviour. There is a bug that needs to be identified with the …

I'm not. The only bug reports that may have been related were inconclusive (the Flux issue just showing a Zygote error) or caused by other issues (RecursiveArrayTools missing an essential constructor).
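As background for the `# gradient of sum` note in the diff above, a small illustrative sketch (not from the PR) of where such a `Fill` cotangent comes from:

```julia
using Zygote, FillArrays

# Zygote's pullback for `sum` returns a lazy Fill rather than a dense array
# (as in the REPL examples earlier in the thread), so a Fill can arrive as
# `dy` in the rules above and needs a GPU counterpart.
dy = gradient(sum, rand(Float32, 3))[1]
dy isa Fill      # true: three entries, all equal to 1.0, never materialised
```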
````diff
+
+function _gpu_array(x::Zygote.OneElement) # gradient of getindex
+  y = CUDA.zeros(eltype(x), size(x))
+  CUDA.@allowscalar y[x.ind...] = x.val
````
Seems like it may be too easy to trigger scalar operations silently now. Reporting that they are happening is better for performance and proper use of CUDA.

No, this is precisely to solve the issue linked.

No, events of scalar indexing on GPU are reported as errors precisely because they don't work with the way a GPU is expected to work. This is bypassing that assumption.

This is doing roughly the same thing as https://github.com/FluxML/Zygote.jl/pull/880/files#diff-1040a6fce812f91964b258de2a02fb69296c75a9701c3326df2671ab9ea7e5f0R284. I don't see loud objections that that should have been left an error because that's how "GPU is expected to work".

I knew …

Yes, that's what …
````diff
+  y
+end
+
+function Zygote.ChainRules.rrule(::typeof(_gpu_array), x::AbstractArray)
+  y = _gpu_array(x)
+  if x === y # trivial case, e.g. gpu(x::Adjoint{T,CuArray})
+    return y, dy -> (Zygote.ChainRules.NoTangent(), dy)
+  else
+    return y, dy -> (Zygote.ChainRules.NoTangent(), _cpu_array(dy))
+  end
+end
 
 # Precision
````
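A hedged sketch of the indexing case the `Zygote.OneElement` method above handles, assuming a functional GPU (not part of the diff):

```julia
using Flux, CUDA, Zygote

if CUDA.functional()
    xg = cu(rand(Float32, 3))
    # The pullback of getindex yields a one-hot Zygote.OneElement on the host;
    # the rrule for _cpu_array then routes it through _gpu_array, which writes
    # the single nonzero entry into a CuArray under @allowscalar.
    g = gradient(x -> cpu(x)[2], xg)[1]
    g isa CuArray    # expected to hold with this PR's definitions
end
```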
Wouldn't be necessary if we don't override CUDA's behavior. If we want a way to work with structured matrices, making a case for them not working and addressing it in Adapt is the official way.

No idea what that means. What CUDA behaviour is being overridden? `Flux.cpu` should be a no-op on CPU arrays, including in the gradient.

I mean, Adapt gets us this check for the cpu and gpu cases for free.

I mean, if you have another approach, please write it up somewhere. Make a better PR. Snarky comments don't actually move anything forwards. The tests here ought to pass (well, all but the last pair, possibly); the mechanism to do so I'm not at all attached to.

Just a question: is the assumption that `x !== y` means `y` is a GPU array too strong here?
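A hypothetical illustration of how that identity check can be leaky in the companion `_cpu_array` rule, building on the `adapt(Array, Fill(3,3))` behaviour shown earlier in the thread:

```julia
using Adapt, FillArrays

f = Fill(3, 3)
y = Adapt.adapt(Array, f)   # densifies to a Vector{Int}, as noted above
y === f                     # false, although `f` was never a GPU array,
                            # so the pullback of _cpu_array would route its
                            # gradient through _gpu_array and onto the GPU
```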