Problems with a mixed CPU/GPU model #1695

Closed
omalled opened this issue Aug 20, 2021 · 18 comments · Fixed by #1704

@omalled

omalled commented Aug 20, 2021

I have a model that needs to run partly on the CPU and partly on the GPU. The part that needs to run on the CPU simply cannot run on the GPU. The part that needs to run on the GPU is far too slow when run on the CPU. In this case, it seems to be worth paying the cost of moving data between the CPU and GPU during the model runs. However, I'm having trouble getting the training process to work and the problem seems to be in the backward pass. Here's the closest thing to an MWE that I could come up with:

using Flux
gpu_part = Chain(Conv((5, 5), 1=>6, relu),#actual part that runs on GPU is much more expensive and is very slow on CPU
              MaxPool((2, 2)),
              Conv((5, 5), 6=>16, relu),
              MaxPool((2, 2)),
              flatten,
              Dense(1296, 120, relu),
              Dense(120, 84, relu),
              Dense(84, 1)) |> gpu
theta = params(gpu_part)
cpu_part(x) = sum(x .^ 2)#actual part that runs on CPU is much more expensive and cannot run on GPU
loss(x) = cpu_part(cpu(gpu_part(gpu(x))))#explicitly moves data to the GPU and back
data = [(randn(Float32, 51, 51, 1, 1),) for i = 1:1]
loss(data[1]...)
Flux.train!(loss, theta, data, ADAM())

This code runs fine on a machine with no GPU. On a machine with a GPU, it gives the following error:

ERROR: LoadError: ArgumentError: cannot take the CPU address of a CUDA.CuArray{Float32, 2}
Stacktrace:
  [1] unsafe_convert(#unused#::Type{Ptr{Float32}}, x::CUDA.CuArray{Float32, 2})
    @ CUDA ~/.julia/packages/CUDA/mVgLI/src/array.jl:262
  [2] gemm!(transA::Char, transB::Char, alpha::Float32, A::Matrix{Float32}, B::CUDA.CuArray{Float32, 2}, beta::Float32, C::CUDA.CuArray{Float32, 2})
    @ LinearAlgebra.BLAS /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/blas.jl:1458
  [3] gemm_wrapper!(C::CUDA.CuArray{Float32, 2}, tA::Char, tB::Char, A::Matrix{Float32}, B::CUDA.CuArray{Float32, 2}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:671
  [4] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:392 [inlined]
  [5] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:464 [inlined]
  [6] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
  [7] *
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:153 [inlined]
  [8] #1525
    @ ~/.julia/packages/ChainRules/SoDZe/src/rulesets/Base/arraymath.jl:31 [inlined]
  [9] Thunk
    @ ~/.julia/packages/ChainRulesCore/qbmEe/src/differentials/thunks.jl:98 [inlined]
 [10] unthunk
    @ ~/.julia/packages/ChainRulesCore/qbmEe/src/differentials/thunks.jl:99 [inlined]
 [11] unthunk
    @ ~/.julia/packages/ChainRulesCore/qbmEe/src/differentials/thunks.jl:120 [inlined]
 [12] wrap_chainrules_output
    @ ~/.julia/packages/Zygote/zowrf/src/compiler/chainrules.jl:41 [inlined]
 [13] map
    @ ./tuple.jl:215 [inlined]
 [14] wrap_chainrules_output
    @ ~/.julia/packages/Zygote/zowrf/src/compiler/chainrules.jl:42 [inlined]
 [15] ZBack
    @ ~/.julia/packages/Zygote/zowrf/src/compiler/chainrules.jl:77 [inlined]
 [16] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:147 [inlined]
 [17] (::typeof(∂(λ)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [18] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [19] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [20] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [21] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [22] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [23] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [24] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [25] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [26] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [27] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [28] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [29] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [30] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [31] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [32] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [33] (::typeof(∂(applychain)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [34] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:38 [inlined]
 [35] (::typeof(∂(λ)))(Δ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [36] Pullback
    @ ~/blah/gpu_cpu_ad_error/ex.jl:12 [inlined]
 [37] (::typeof(∂(loss)))(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [38] #188
    @ ~/.julia/packages/Zygote/zowrf/src/lib/lib.jl:194 [inlined]
 [39] #1689#back
    @ ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59 [inlined]
 [40] Pullback
    @ ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:102 [inlined]
 [41] (::typeof(∂(λ)))(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface2.jl:0
 [42] (::Zygote.var"#69#70"{Zygote.Params, typeof(∂(λ)), Zygote.Context})(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface.jl:255
 [43] gradient(f::Function, args::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/zowrf/src/compiler/interface.jl:59
 [44] macro expansion
    @ ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:101 [inlined]
 [45] macro expansion
    @ ~/.julia/packages/Juno/n6wyj/src/progress.jl:134 [inlined]
 [46] train!(loss::Function, ps::Zygote.Params, data::Vector{Tuple{Array{Float32, 4}}}, opt::ADAM; cb::Flux.Optimise.var"#40#46")
    @ Flux.Optimise ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:99
 [47] train!(loss::Function, ps::Zygote.Params, data::Vector{Tuple{Array{Float32, 4}}}, opt::ADAM)
    @ Flux.Optimise ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:97
 [48] top-level scope
    @ ~/blah/gpu_cpu_ad_error/ex.jl:15
 [49] include(fname::String)
    @ Base.MainInclude ./client.jl:444
 [50] top-level scope
    @ REPL[1]:1
in expression starting at /home/omalled/blah/gpu_cpu_ad_error/ex.jl:15

Can a combined GPU/CPU workflow like this work? The error I was getting with my original code was actually different from this one, but both errors are caused by doing some operation on a mixture of CPU and GPU arrays that isn't supported. Thanks for your help!

@ToucheSir
Member

ToucheSir commented Aug 20, 2021

This is not a direct solution to your problem, but CUDA.jl just released support for unified memory: https://juliagpu.org/post/2021-08-13-cuda_3.4/#arrays_with_unified_memory. Slightly modifying the example above:

using Flux, CUDA

# Shamelessly adapted from https://github.com/FluxML/Flux.jl/blob/v0.12.6/src/functor.jl#L71
unified_gpu(x) = fmap(x -> cu(x, unified=true), x; exclude = Flux._isbitsarray)

gpu_part = Chain(Conv((5, 5), 1=>6, relu),#actual part that runs on GPU is much more expensive and is very slow on CPU
              MaxPool((2, 2)),
              Conv((5, 5), 6=>16, relu),
              MaxPool((2, 2)),
              flatten,
              Dense(1296, 120, relu),
              Dense(120, 84, relu),
              Dense(84, 1)) |> unified_gpu
theta = params(gpu_part)

cpu_part(x) = sum(x .^ 2) # x will be a `CuArray`, but should be more amenable to scalar indexing.
loss(x) = cpu_part(gpu_part(x)) # no explicit movement
data = randn(Float32, 51, 51, 1, 1) |> unified_gpu
loss(data)
gradient(() -> loss(data), theta)

Edit: SciML/DiffEqFlux.jl#571 may be relevant here.

@omalled
Author

omalled commented Aug 21, 2021

Thanks, @ToucheSir! This is cool and good to know. Unfortunately, I still get a slightly different but similar error in my original code when using the unified_gpu trick. I wonder if part of what is making it work in this simplified example is that the whole computation is done on the GPU when using unified_gpu. Also, good to know about the issue in DiffEqFlux. Thanks again!

@ToucheSir
Member

Yeah, the error is to be expected because it's coming from dispatch. You have to remove the explicit conversions, as I did, for the unified-array code path to work. As for how to make this specific example work with explicit data movement and no unified memory, I'll have to defer to others here.

@mcabbott
Member

mcabbott commented Aug 22, 2021

loss(x) = cpu_part(cpu(gpu_part(gpu(x))))

Interesting that this fails. I think cpu and gpu should get straightforward dense arrays here, and they don't seem to move their gradients. When trying to fix FluxML/Zygote.jl#1005 by making it materialise a CPU-only struct to a dense CuArray, I think I assumed this surely worked and could be hooked into, but maybe there's nothing?

Here's the crudest possible way to add gradients definitions which move data the opposite way, which appears to fix your example above:

julia> using CUDA, Zygote, Flux

julia> CUDA.allowscalar(false)

julia> a = rand(Float32, 4, 4); ca = cu(rand(4, 4));

julia> gradient(x -> sum(abs, cpu(ca * gpu(a * x))), a)
ERROR: ArgumentError: cannot take the CPU address of a CuArray{Float32, 2}

julia> gradient(x -> sum(cpu(ca * gpu(a * x))), a)  # FillArray from sum
ERROR: Scalar indexing is disallowed.

julia> Zygote.@adjoint gpu(x) = gpu(x), dx -> (cpu(x),)

julia> Zygote.@adjoint cpu(x) = cpu(x), dx -> (gpu(x),)

julia> gradient(x -> sum(abs, cpu(ca * gpu(a * x))), a)
(Float32[1.6273962 0.50988764 1.2790675 2.2093036; 0.37633207 0.113470785 0.3003524 0.53487664; 1.6206458 0.507219 1.2899909 2.1595078; 2.4757724 0.81120884 1.8731096 3.2701323],)

julia> gradient(x -> sum(cpu(ca * gpu(a * x))), a)
(Float32[1.6273962 0.50988764 1.2790675 2.2093036; 0.37633207 0.113470785 0.3003524 0.53487664; 1.6206458 0.507219 1.2899909 2.1595078; 2.4757724 0.81120884 1.8731096 3.2701323],)

I say "crude" because the actual code for gpu etc. uses fmap from Functors.jl, and one should at least think about what happens if you apply this to a whole model inside a gradient, not just an array. And what happens if you needlessly apply gpu to something which is already there, or you call cu instead, or go the other way with collect or Array. Ideally these would all work, I think.

Flux.jl/src/functor.jl, lines 66 to 71 in 87a0065:

cpu(m) = fmap(x -> adapt(Array, x), m)
_isbitsarray(::AbstractArray{<:Number}) = true
_isbitsarray(::AbstractArray{T}) where T = isbitstype(T)
_isbitsarray(x) = false
gpu(x) = use_cuda[] ? fmap(CUDA.cu, x; exclude = _isbitsarray) : x

@omalled
Author

omalled commented Aug 23, 2021

Thanks, @mcabbott! This fixed the example and the original code. I think the adjoints need a minor tweak though:

Zygote.@adjoint gpu(x) = gpu(x), dx -> (cpu(dx),)
Zygote.@adjoint cpu(x) = cpu(x), dx -> (gpu(dx),)

That is, the pullback should move dx rather than x.

@CarloLucibello
Member

It is quite surprising we don't have the cpu and gpu adjoints. @omalled would you file a PR?
Also, it would be better to use rrule instead of @adjoint.
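For illustration, an array-only sketch of what such rrule definitions might look like (hypothetical and untested; a full version would also have to handle whole models walked by fmap):

using Flux, ChainRulesCore

# Hypothetical sketch: rrule counterparts of the @adjoint definitions above.
# Only plain arrays are covered; Flux.cpu/Flux.gpu also recurse through whole
# models via fmap, which a real implementation would need to account for.
function ChainRulesCore.rrule(::typeof(Flux.gpu), x::AbstractArray)
    gpu_pullback(dx) = (NoTangent(), Flux.cpu(unthunk(dx)))
    return Flux.gpu(x), gpu_pullback
end

function ChainRulesCore.rrule(::typeof(Flux.cpu), x::AbstractArray)
    cpu_pullback(dx) = (NoTangent(), Flux.gpu(unthunk(dx)))
    return Flux.cpu(x), cpu_pullback
end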

@DhairyaLGandhi
Member

FWIW, it's one of the central missions of differentiable programming not to have adjoints over every operation. That's also how the likes of JAX and Enzyme approach things, and something Zygote was built for.

@ToucheSir
Member

After playing around a bit, cu may be the main culprit here:

julia> gradient(x -> sum(cu(x)), rand(Float32, 10))
ERROR: this intrinsic must be compiled to be called
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0 [inlined]
  [2] _pullback(::Zygote.Context, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:9
  [3] _pullback
    @ ./atomics.jl:358 [inlined]
  [4] _pullback(ctx::Zygote.Context, f::typeof(getindex), args::Base.Threads.Atomic{Int64})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
  [5] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/lib/utils/threading.jl:46 [inlined]
  [6] _pullback(::Zygote.Context, ::CUDA.APIUtils.var"##get!#4", ::Nothing, ::typeof(get!), ::CUDA.var"#162#163", ::CUDA.APIUtils.LazyInitialized{Vector{Bool}})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
  [7] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/lib/utils/threading.jl:46 [inlined]
  [8] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:69 [inlined]
  [9] _pullback(ctx::Zygote.Context, f::typeof(CUDA.stream_ordered), args::CuDevice)
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [10] macro expansion
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:222 [inlined]
 [11] macro expansion
    @ ./timing.jl:287 [inlined]
 [12] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:220 [inlined]
 [13] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:217 [inlined]
 [14] _pullback(::Zygote.Context, ::CUDA.var"#_alloc##kw", ::NamedTuple{(:stream,), Tuple{Nothing}}, ::typeof(CUDA._alloc), ::Type{CUDA.Mem.DeviceBuffer}, ::Int64)
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [15] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:207 [inlined]
 [16] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:203 [inlined]
 [17] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:44 [inlined]
 [18] _pullback(::Zygote.Context, ::Type{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::UndefInitializer, ::Tuple{Int64})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [19] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:270 [inlined]
 [20] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:429 [inlined]
 [21] _pullback(::Zygote.Context, ::typeof(Adapt.adapt_storage), ::CUDA.CuArrayAdaptor{CUDA.Mem.DeviceBuffer}, ::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [22] _pullback
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:42 [inlined]
 [23] _pullback
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:40 [inlined]
 [24] _pullback (repeats 2 times)
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:439 [inlined]
 [25] _pullback
    @ ./REPL[6]:1 [inlined]
 [26] _pullback(ctx::Zygote.Context, f::var"#5#6", args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [27] _pullback(f::Function, args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface.jl:34
 [28] pullback(f::Function, args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface.jl:40
 [29] gradient(f::Function, args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface.jl:75
 [30] top-level scope
    @ REPL[6]:1
 [31] top-level scope
    @ ~/.julia/packages/CUDA/VGl9W/src/initialization.jl:66

I think the most surgical change we could make here is adding an rrule for Adapt.adapt_storage so that AD doesn't try going into the CuArray constructor.
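One possible shape for such a rule, as a rough sketch (untested; CUDA.CuArrayAdaptor is an internal CUDA.jl type, taken from the stack trace above):

using Adapt, CUDA, ChainRulesCore

# Rough sketch: keep AD from tracing into the CuArray constructor by giving
# adapt_storage its own rule. The gradient of a device transfer is simply
# the reverse transfer.
function ChainRulesCore.rrule(::typeof(Adapt.adapt_storage),
                              to::CUDA.CuArrayAdaptor, x::AbstractArray)
    y = Adapt.adapt_storage(to, x)
    adapt_pullback(dy) = (NoTangent(), NoTangent(), adapt(Array, unthunk(dy)))
    return y, adapt_pullback
end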

@mcabbott
Member

Since cpu/gpu act on whole Functors trees, but pass Adapt the arrays, this might be the right level.

But Adapt doesn't seem to convert things which aren't dense arrays. For example, one of my examples here FluxML/Zygote.jl#1005 (comment) makes a FillArray on the CPU, which ideally would (I think) be adapted back to a CuArray, since operations mixing Fill and CuArray tend to go wrong. But:

julia> adapt(CuArray, ones(1,3))
1×3 CuArray{Float64, 2}:
 1.0  1.0  1.0

julia> adapt(CuArray, 1:3)
1:3

julia> adapt(CuArray, Fill(1,2,3))
2×3 Fill{Int64}, with entries equal to 1

So perhaps there needs to be one more step, via a function which Flux (or Zygote) owns, which is roughly adapt but treats these things differently? It could just be a method of gpu(::AbstractArray).

Attaching a gradient to that would also avoid piracy issues with Adapt.
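For concreteness, a sketch of what such an array-level method and its gradient could look like (the name _gpu_array and the details are made up here, not existing Flux API):

using Zygote, CUDA

# Hypothetical array-level method that Flux could own: unlike adapt(CuArray, x),
# constructing a CuArray directly also densifies structured arrays such as
# Fill or ranges.
_gpu_array(x::AbstractArray) = CUDA.functional() ? CuArray(x) : x

# Attaching the gradient to a Flux-owned function avoids pirating Adapt;
# again, the gradient of a device transfer is the reverse transfer.
Zygote.@adjoint _gpu_array(x::AbstractArray) = _gpu_array(x), dx -> (Array(dx),)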

@DhairyaLGandhi
Member

DhairyaLGandhi commented Aug 30, 2021

SciML/DiffEqFlux.jl#608 (comment) found that some behaviour in cu has in fact changed, so it would need to be fixed there. The Fill behaviour has always been like that, IIRC.

@ToucheSir
Member

I can confirm cu differentiates successfully with CUDA 3.3.6. Left a comment on the other thread about what the culprit might be.

@mcabbott
Member

We need both directions though. Changes to cu may have changed how gpu behaves, but won't have changed how cpu behaves.

@DhairyaLGandhi
Member

We need both directions though

Using the same Adapt framework, both cpu and gpu would manage to check for trivial returns and conversions.

@DhairyaLGandhi
Member

I think @maleadt would be best placed to say what happened here and to advise on the correct fix: #1695 (comment)

@maleadt
Collaborator

maleadt commented Sep 6, 2021

SciML/DiffEqFlux.jl#608 (comment) found that some behaviour in cu has in fact changed, so it would need to be fixed there.

What exactly is CUDA.jl doing wrong? The linked DiffEqFlux issue showed that there were issues with RecursiveArrayTools, not with the GPU stack.

@ToucheSir
Member

I believe the culprit was the switch in JuliaGPU/CUDA.jl@748e49d from convert (which appears to be AD-friendly) to calling the CuArray constructor directly (which is not). Tags say it was released in 3.4.

@maleadt
Collaborator

maleadt commented Sep 6, 2021

Or was it the additional typevar (also added in that PR) that was a problem? FluxML/Zygote.jl#1062

@ToucheSir
Member

ToucheSir commented Sep 6, 2021

That too! I'd argue #1062 manifests because https://github.com/FluxML/Zygote.jl/blob/master/src/lib/broadcast.jl#L281-L283 isn't being hit anymore, though (just confirmed using Cthulhu). Otherwise it may have gone silently unnoticed for a little while longer.

@bors bors bot closed this as completed in 2468a06 Sep 6, 2021