Gradient definitions for cpu & gpu #1704

Merged (8 commits, Sep 6, 2021). Changes from 5 commits.
40 changes: 36 additions & 4 deletions src/functor.jl
@@ -86,7 +86,20 @@ julia> typeof(m_cpu.W)
Matrix{Float32}
```
"""
cpu(m) = fmap(x -> adapt(Array, x), m)
cpu(x) = fmap(_cpu_array, x; exclude = _isbitsarray)

_cpu_array(x::AbstractArray) = adapt(Array, x)

function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
y = _cpu_array(x)
if x === y

Member:

This wouldn't be necessary if we didn't override CUDA's behavior. If we want a way to work with structured matrices, the official route is to make a case for them not working and address it in Adapt.

Member Author:

No idea what that means. What CUDA behaviour is being overridden?

Flux.cpu should be a no-op on CPU arrays, including in the gradient.
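
A minimal REPL sketch of that expectation (illustrative only; assumes the definitions in this diff, with Zygote and FillArrays loaded):

```julia-repl
julia> using Flux, Zygote, FillArrays

julia> x = rand(Float32, 3);

julia> cpu(x) === x                     # forward pass: no copy, no move
true

julia> gradient(v -> sum(cpu(v)), x)[1] isa Fill   # reverse pass: gradient stays on the CPU
true
```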

Member:

I mean, Adapt gets us this check for cpu and gpu cases for free.

Member Author:

I mean, if you have another approach, please write it up somewhere. Make a better PR. Snarky comments don't actually move anything forwards.

The tests here ought to pass (well, all but the last pair, possibly). I'm not at all attached to the mechanism used to do so.

Contributor:

Just a question: is the assumption that x !== y implies y is a GPU array too strong here?

# Trivial use: cpu(x::Array) shouldn't push its gradient to GPU
return y, dy -> (Zygote.ChainRules.NoTangent(), dy)

Contributor (@johnnychen94, Aug 31, 2021):

Another question: what's the advantage of this if-branch over

return y, dy -> (Zygote.ChainRules.NoTangent(), adapt(basetype(x), dy))

with basetype defined as in JuliaLang/julia#35543

basetype(T::Type) = Base.typename(T).wrapper
basetype(x::T) where T = basetype(T)

Although there is some overhead in basetype, I still feel it's more generic than the x === y if-branch here. In other words, I feel we need more information from adapt, not just the output data.

Member Author:

This would catch Array <-> CuArray, but miss a few other cases this PR wants to handle.

First, it will be confused by wrappers like Adjoint, which adapt knows to look inside of:

julia> x = rand(3)';

julia> basetype(x)
Adjoint

julia> adapt(Array, x) === x
true

Second, adapt doesn't move structured arrays produced (for efficiency) by some Zygote gradients. The gradient of sum (on the CPU) is a Fill, but if the forward pass went (x::CuArray) |> cpu |> sum, then the reverse pass needs to treat that Fill as a CPU array and make a CuArray from it. But adapt doesn't do that:

julia> adapt(CuArray, Fill(1,2,3))
2×3 Fill{Int64}, with entries equal to 1

which is why there needs to be another function involved, to which we can attach this gradient behaviour, without piracy.

There might be a cleverer way to allow for trivial use like cpu(::Array) than checking x === y, but I haven't seen it yet. I believe the branch should be cheap, especially when the types do differ, which is the case we really care about.
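
Consolidated from the hunks above and below, the mechanism under discussion is roughly this (a sketch restating this diff's own code, not additional behaviour):

```julia
# A thin function Flux owns, so the gradient rule can live here without pirating `adapt`:
_cpu_array(x::AbstractArray) = adapt(Array, x)

function Zygote.ChainRules.rrule(::typeof(_cpu_array), x::AbstractArray)
  y = _cpu_array(x)
  if x === y
    # cpu(::Array) was a no-op, so its gradient must not be pushed to the GPU
    return y, dy -> (Zygote.ChainRules.NoTangent(), dy)
  else
    # cpu(::CuArray), or a wrapper such as an Adjoint of one: send the gradient back to the GPU
    return y, dy -> (Zygote.ChainRules.NoTangent(), _gpu_array(dy))
  end
end
```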

Member:

Structured arrays and wrappers are the domain of Adapt. Also, the ones we are interested in for Zygote work with CUDA.

julia> CUDA.zeros(3) + Fill(3,3)
3-element CuArray{Float32, 1}:
 3.0
 3.0
 3.0

julia> gradient((x,y) -> sum(cpu(x) + y), CUDA.zeros(3), Fill(3,3))
(3-element Fill{Float32}: entries equal to 1.0, 3-element Fill{Float32}: entries equal to 1.0)

julia> gradient((x,y) -> sum(cpu(x) + y), zeros(3), Fill(3,3))
(3-element Fill{Float64}: entries equal to 1.0, 3-element Fill{Float64}: entries equal to 1.0)

julia> gradient((x,y) -> sum(gpu(x) + y), zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])

julia> gradient((x,y) -> sum(gpu(x) + y), CUDA.zeros(3), Fill(3,3))
(Float32[1.0, 1.0, 1.0], Float32[1.0, 1.0, 1.0])

so if something does in fact fail, it seems like the issue is elsewhere, and adapt needs to be given more information, or perhaps an adapt rule is missing somewhere.

Member (@DhairyaLGandhi, Aug 31, 2021):

The issue may be that adapt(Array, ::Fill) allocates, which mixes types if it interacts with GPU arrays.

julia> CUDA.Adapt.adapt(Array, Fill(3,3))
3-element Vector{Int64}:
 3
 3
 3

This would need to be addressed in Adapt.

Member Author:

It's true that some operations between CuArrays and FillArrays do work, as they are overloaded. But many don't, like *, or worse, conv. That's why Zygote at present does not make a Fill for the gradient of sum(::CuArray); this PR copies that precedent.

Member Author (@mcabbott, Aug 31, 2021):

Oh, it's bizarre that adapt(Array, Fill(3,3)) makes a Vector, but adapt(CuArray, Fill(3,3)) does nothing.

Note that at present, this is only triggered if you call cpu(::Fill) directly; what should be a pass-through isn't quite one. This is true of the tagged version of Flux, too; I'd call that a bug:

julia> cpu(Fill(3,3))
3-element Vector{Int64}:
 3
 3
 3

It isn't triggered in the gradient of something like x::Array |> cpu |> sum, which is a case currently tested. Easy to fix, though.

Member (@DhairyaLGandhi, Aug 31, 2021):

Maybe adapt should make its Fill behaviour with Cu/Arrays more explicit.

Regarding "like *,": this should be made to work in FillArrays / Adapt.

Member Author:

There is a PR for * in FillArrays, but it gets pretty hairy. And there are many other operations, like I said.

You can try to persuade Adapt to depend on FillArrays, or the reverse, but I think you will have a hard time.

Anyway I think not changing sum(::CuArray) is the conservative position here. This PR tries to make sure that cpu follows that.

Member:

We aren't changing the sum method here.

else
# Allows both cpu(x::CuArray) and cpu(x::Adjoint{T,CuArray}):
return y, dy -> (Zygote.ChainRules.NoTangent(), _gpu_array(dy))
end
end

_isbitsarray(::AbstractArray{<:Number}) = true
_isbitsarray(::AbstractArray{T}) where T = isbitstype(T)
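
A quick illustration of what this `exclude` predicate selects (a sketch; only the two methods shown above are assumed):

```julia-repl
julia> _isbitsarray(rand(Float32, 2, 2))   # numeric arrays count as leaves to move
true

julia> _isbitsarray(fill("a", 2))          # String is not a bits type, so this is not selected
false
```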
@@ -99,8 +112,7 @@ Moves `m` to the current GPU device, if available. It is a no-op otherwise.
See the [CUDA.jl docs](https://juliagpu.github.io/CUDA.jl/stable/usage/multigpu/)
to help identify the current device.

This works for functions and
any struct with [`@functor`](@ref) defined.
This works for functions, and any struct marked with [`@functor`](@ref).

```julia-repl
julia> m = Dense(1,2)
@@ -116,7 +128,27 @@ julia> typeof(m_gpu.W) # notice the type of the array changed to a CuArray
CuArray{Float32, 2}
```
"""
gpu(x) = use_cuda[] ? fmap(CUDA.cu, x; exclude = _isbitsarray) : x
gpu(x) = use_cuda[] ? fmap(_gpu_array, x; exclude = _isbitsarray) : x

_gpu_array(x::AbstractArray) = CUDA.cu(x)

# While `cu` moves Arrays to the GPU, we also want to move some structured arrays
# https://github.com/FluxML/Zygote.jl/issues/1005
_gpu_array(x::Zygote.FillArrays.AbstractFill) = CUDA.fill(first(x), size(x)) # gradient of sum

Member:

This is changing the behaviour of CUDA.cu and will eagerly materialise.

Member Author:

No, CUDA.cu is deliberately untouched.

Member:

gpu doesn't mirror cu anymore

Member Author:

Correct, that's why "changing the behaviour of CUDA.cu" is incorrect.

Member:

The expected behavior is that it should.

Member Author:

So we want new behaviour, and not to pirate CUDA, and the same behaviour? I hope it's clear that you cannot have all three. This PR makes a choice. (Not the choice you claimed, though.)

Member:

The point is that there is no new behaviour. There is a bug that needs to be identified with the cu conversions. @maleadt would be able to say.

Collaborator:

I'm not. The only bug reports that may have been related were inconclusive (the Flux issue just showing a Zygote error) or caused by other issues (RecursiveArrayTools missing an essential constructor).
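
For concreteness, here is what the `AbstractFill` method above does with the gradient of `sum` (a sketch; assumes CUDA.jl with a working device, plus FillArrays.jl):

```julia-repl
julia> using CUDA, FillArrays

julia> g = Fill(1f0, 2, 3);                 # the CPU-side gradient Zygote's sum rule produces

julia> CUDA.fill(first(g), size(g)) isa CuArray{Float32}   # materialised densely on the GPU
true
```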

function _gpu_array(x::Zygote.OneElement) # gradient of getindex
y = CUDA.zeros(eltype(x), size(x))
CUDA.@allowscalar y[x.ind...] = x.val

Member:

Seems like it may be too easy to trigger scalar operations silently now. Reporting that they are happening is better for performance and proper use of CUDA.

Member Author:

No, this is precisely to solve the issue linked.

Member:

No, events of scalar indexing on GPU are reported as errors precisely because they don't work with the way a GPU is expected to work. This is bypassing that assumption.

Member Author:

This is doing roughly the same thing as https://github.com/FluxML/Zygote.jl/pull/880/files#diff-1040a6fce812f91964b258de2a02fb69296c75a9701c3326df2671ab9ea7e5f0R284 . I don't see loud objections that that should have been left an error because that's how "GPU is expected to work".

Member:

I knew OneElement was of limited benefit over generalised accum, but now I'm wondering whether it can silently avoid reporting scalar indexing.

Member Author:

Regarding "if it can silently not report scalar indexing": yes, that's what @allowscalar does. Just like vcat does.

y
end
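
And the `OneElement` path made concrete (a sketch; `Zygote.OneElement` is an internal type, constructed here the same way as in the tests below):

```julia-repl
julia> using Zygote, CUDA

julia> g = Zygote.OneElement(1f0, (2, 3), axes(ones(4, 5)));   # gradient of x[2, 3] for a 4×5 matrix x

julia> y = CUDA.zeros(eltype(g), size(g));

julia> CUDA.@allowscalar y[g.ind...] = g.val;   # one scalar write, mirroring _gpu_array above

julia> Array(y)[2, 3]
1.0f0
```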

function Zygote.ChainRules.rrule(::typeof(_gpu_array), x::AbstractArray)
y = _gpu_array(x)
if x === y # trivial case, e.g. gpu(x::Adjoint{T,CuArray})
return y, dy -> (Zygote.ChainRules.NoTangent(), dy)
else
return y, dy -> (Zygote.ChainRules.NoTangent(), _cpu_array(dy))
end
end

# Precision

41 changes: 41 additions & 0 deletions test/cuda/cuda.jl
@@ -84,3 +84,44 @@ end
@test gpu((;a=[SimpleBits(1)])).a isa CuVector{SimpleBits}
end
end

@testset "gpu(cpu(x)) inside gradient" begin
a = randn(Float32, 4, 4)
ca = cu(a)

# Trivial functions
@test gradient(x -> sum(abs, gpu(x)), a)[1] isa Matrix
@test gradient(x -> sum(gpu(x)), a)[1] isa Matrix
@test gradient(x -> sum(gpu(x)), a')[1] isa Matrix
@test gradient(x -> sum(abs, cpu(x)), ca)[1] isa CuArray
@test gradient(x -> sum(cpu(x)), ca)[1] isa CuArray # This involves FillArray
@test gradient(x -> sum(cpu(x)), ca')[1] isa CuArray

# Even more trivial: no movement
@test gradient(x -> sum(abs, cpu(x)), a)[1] isa Matrix
@test gradient(x -> sum(abs, cpu(x)), a')[1] isa Matrix
@test gradient(x -> sum(cpu(x)), a)[1] isa FillArrays.Fill
@test gradient(x -> sum(abs, gpu(x)), ca)[1] isa CuArray
@test_skip gradient(x -> sum(abs, gpu(x)), ca')[1] isa CuArray

# More complicated, Array * CuArray is an error
g0 = gradient(x -> sum(abs, (a * (a * x))), a)[1]
@test g0 ≈ gradient(x -> sum(abs, cpu(ca * gpu(a * x))), a)[1]
@test cu(g0) ≈ gradient(x -> sum(abs, gpu(a * cpu(ca * x))), ca)[1]

# Scalar indexing of an array, needs OneElement to transfer to GPU
# https://github.com/FluxML/Zygote.jl/issues/1005
@test gradient(x -> cpu(2 .* gpu(x))[1], Float32[1,2,3]) == ([2,0,0],)
@test gradient(x -> cpu(gpu(x) * gpu(x))[1,2], Float32[1 2 3; 4 5 6; 7 8 9]) == ([2 6 8; 0 2 0; 0 3 0],)

# Explicit pieces. It's not entirely clear that it's desirable to move these if they appear alone,
# but it's necessary to move them if they appear in gradient of cpu(::CuArray), as in the
# examples above. Those must not break, but a re-design could perhaps change these:
g1 = Zygote.OneElement(1, (2,3), axes(ones(4,5)))
@test gpu(g1) isa CuArray
@test gpu(g1) ≈ cu(Matrix(g1))

g2 = Fill(1f0,2)
@test gpu(g2) isa CuArray
@test gpu(g2) ≈ cu(Vector(g2))
end