Problems with a mixed CPU/GPU model #1695
This is not a direct solution to your problem, but CUDA.jl just released support for unified memory: https://juliagpu.org/post/2021-08-13-cuda_3.4/#arrays_with_unified_memory. Slightly modifying the example above:

```julia
using Flux, CUDA

# Shamelessly adapted from https://github.com/FluxML/Flux.jl/blob/v0.12.6/src/functor.jl#L71
unified_gpu(x) = fmap(x -> cu(x, unified=true), x; exclude = Flux._isbitsarray)

gpu_part = Chain(Conv((5, 5), 1=>6, relu),  # the actual part that runs on the GPU is much more expensive and is very slow on the CPU
                 MaxPool((2, 2)),
                 Conv((5, 5), 6=>16, relu),
                 MaxPool((2, 2)),
                 flatten,
                 Dense(1296, 120, relu),
                 Dense(120, 84, relu),
                 Dense(84, 1)) |> unified_gpu

theta = params(gpu_part)
cpu_part(x) = sum(x .^ 2)  # x will be a `CuArray`, but should be more amenable to scalar indexing
loss(x) = cpu_part(gpu_part(x))  # no explicit movement
data = randn(Float32, 51, 51, 1, 1) |> unified_gpu
loss(data)
gradient(() -> loss(data), theta)
```

Edit: SciML/DiffEqFlux.jl#571 may be relevant here.
Thanks, @ToucheSir! This is cool and good to know. Unfortunately, I still get a slightly different but similar error in my original code when using the unified-memory approach.
Yeah, the error is to be expected because it's coming from dispatch. You have to remove the explicit conversions, as I did, for the unified array code path to work. As for how to make this specific example work with explicit data movement and no unified memory, I'll have to defer to others here.
Interesting that this fails. Here's the crudest possible way to add gradient definitions which move data the opposite way, which appears to fix your example above:

I say "crude" because the actual code for `gpu`/`cpu` does a bit more (see src/functor.jl, lines 66 to 71 at 87a0065).
Thanks, @mcabbott! This fixed the example and the original code. I think the adjoints need a minor tweak though:

```julia
Zygote.@adjoint gpu(x) = gpu(x), dx -> (cpu(dx),)
Zygote.@adjoint cpu(x) = cpu(x), dx -> (gpu(dx),)
```

That is, the pullback should move the gradient in the direction opposite to the forward pass.
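To illustrate the idea, here is a hypothetical usage sketch (not code from this thread, and it assumes a CUDA-capable machine): with adjoints like these in scope, a gradient can cross an explicit device boundary in both directions.

```julia
using Flux, CUDA, Zygote

# Hypothetical sketch: `m` runs on the GPU, the squared-sum reduction runs on
# the CPU, and the adjoints above carry the gradient back across each
# explicit `gpu`/`cpu` call in the pullback.
m = Dense(4, 2) |> gpu
mixed_loss(x) = sum(abs2, cpu(m(gpu(x))))   # explicit movement in the forward pass
g = gradient(() -> mixed_loss(rand(Float32, 4)), params(m))
```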
It is quite surprising we don't have the cpu and gpu adjoints. @omalled, would you file a PR?
FWIW, it's one of the central missions of ∂P (differentiable programming) not to have adjoints over every operation. That's also how the likes of JAX and Enzyme approach things, and something Zygote was built for.
After playing around a bit:

```julia
julia> gradient(x -> sum(cu(x)), rand(Float32, 10))
ERROR: this intrinsic must be compiled to be called
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0 [inlined]
  [2] _pullback(::Zygote.Context, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:9
  [3] _pullback
    @ ./atomics.jl:358 [inlined]
  [4] _pullback(ctx::Zygote.Context, f::typeof(getindex), args::Base.Threads.Atomic{Int64})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
  [5] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/lib/utils/threading.jl:46 [inlined]
  [6] _pullback(::Zygote.Context, ::CUDA.APIUtils.var"##get!#4", ::Nothing, ::typeof(get!), ::CUDA.var"#162#163", ::CUDA.APIUtils.LazyInitialized{Vector{Bool}})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
  [7] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/lib/utils/threading.jl:46 [inlined]
  [8] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:69 [inlined]
  [9] _pullback(ctx::Zygote.Context, f::typeof(CUDA.stream_ordered), args::CuDevice)
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [10] macro expansion
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:222 [inlined]
 [11] macro expansion
    @ ./timing.jl:287 [inlined]
 [12] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:220 [inlined]
 [13] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:217 [inlined]
 [14] _pullback(::Zygote.Context, ::CUDA.var"#_alloc##kw", ::NamedTuple{(:stream,), Tuple{Nothing}}, ::typeof(CUDA._alloc), ::Type{CUDA.Mem.DeviceBuffer}, ::Int64)
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [15] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:207 [inlined]
 [16] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/pool.jl:203 [inlined]
 [17] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:44 [inlined]
 [18] _pullback(::Zygote.Context, ::Type{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::UndefInitializer, ::Tuple{Int64})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [19] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:270 [inlined]
 [20] _pullback
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:429 [inlined]
 [21] _pullback(::Zygote.Context, ::typeof(Adapt.adapt_storage), ::CUDA.CuArrayAdaptor{CUDA.Mem.DeviceBuffer}, ::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [22] _pullback
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:42 [inlined]
 [23] _pullback
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:40 [inlined]
 [24] _pullback (repeats 2 times)
    @ ~/.julia/packages/CUDA/VGl9W/src/array.jl:439 [inlined]
 [25] _pullback
    @ ./REPL[6]:1 [inlined]
 [26] _pullback(ctx::Zygote.Context, f::var"#5#6", args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface2.jl:0
 [27] _pullback(f::Function, args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface.jl:34
 [28] pullback(f::Function, args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface.jl:40
 [29] gradient(f::Function, args::Vector{Float32})
    @ Zygote ~/.julia/packages/Zygote/l3aNG/src/compiler/interface.jl:75
 [30] top-level scope
    @ REPL[6]:1
 [31] top-level scope
    @ ~/.julia/packages/CUDA/VGl9W/src/initialization.jl:66
```

I think the most surgical change we could make here is adding an …
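One way such a surgical fix could look is a custom rule for `cu` itself. This is a hypothetical sketch only, not the actual change discussed in the thread; it assumes ChainRulesCore and a CUDA-capable machine. The point is that reverse-mode AD then treats the device transfer as opaque instead of tracing into CUDA.jl's allocation internals.

```julia
using ChainRulesCore, CUDA

# Hypothetical sketch: make `cu` opaque to AD, with a pullback that
# moves the cotangent back to a host Array.
function ChainRulesCore.rrule(::typeof(cu), x::AbstractArray)
    cu_pullback(dx) = (NoTangent(), Array(unthunk(dx)))
    return cu(x), cu_pullback
end
```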
Since cpu/gpu act on whole Functors trees but pass Adapt the arrays, this might be the right level. But Adapt doesn't seem to convert things which aren't dense arrays. For example, one of my examples in FluxML/Zygote.jl#1005 (comment) makes a FillArray on the CPU, which ideally would (I think) be adapted back to a CuArray, since operations mixing Fill and CuArray tend to go wrong.

So perhaps there needs to be one more step, via a function which Flux (or Zygote) owns. Attaching a gradient to that would also avoid piracy issues with Adapt.
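Such a step might look roughly like the following. This is a hypothetical sketch with made-up names, not an actual Flux or Zygote API: a conversion function Flux would own, with extra methods for wrapper types that Adapt skips.

```julia
using Adapt, CUDA, FillArrays

# Hypothetical names: a Flux-owned "move to GPU" step that also densifies
# structured wrappers (like Fill) which Adapt leaves on the CPU.
_gpu_array(x::AbstractArray) = adapt(CuArray, x)
_gpu_array(x::Fill) = CUDA.fill(FillArrays.getindex_value(x), size(x))
```

A Zygote adjoint could then be attached to `_gpu_array` itself, rather than pirating methods owned by Adapt.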
SciML/DiffEqFlux.jl#608 (comment) found that some behaviour in …

I can confirm …

We need both directions though. Changes to …

Using the same Adapt framework, both of …
I think @maleadt would be best placed to say what happened here and advise on the correct fix: #1695 (comment)
What exactly is CUDA.jl doing wrong? The linked DiffEqFlux issue showed that there were issues with RecursiveArrayTools, not with the GPU stack.
I believe the switch in JuliaGPU/CUDA.jl@748e49d from …
Or was it the additional typevar (also added in that PR) that was a problem? FluxML/Zygote.jl#1062
That too! I'd argue #1062 manifests because https://github.com/FluxML/Zygote.jl/blob/master/src/lib/broadcast.jl#L281-L283 isn't being hit anymore, however (just confirmed using Cthulhu). Otherwise it may have gone silently unnoticed for a little while longer.
I have a model that needs to run partly on the CPU and partly on the GPU. The part that needs to run on the CPU simply cannot run on the GPU; the part that needs to run on the GPU is far too slow when run on the CPU. In this case, it seems worth paying the cost of moving data between the CPU and GPU during the model runs. However, I'm having trouble getting the training process to work, and the problem seems to be in the backward pass. Here's the closest thing to an MWE that I could come up with:

This code runs fine on a machine with no GPU. On a machine with a GPU, it gives the following error:

Can a combined GPU/CPU workflow like this work? The error I was getting with my original code was actually different from this one, but both are caused by performing some operation on a mixture of CPU and GPU arrays that isn't supported. Thanks for your help!
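(The MWE and the error output did not survive in this transcript. As illustration only, here is a hypothetical sketch of the pattern being described, assuming a CUDA-capable machine; this is not the author's original code:)

```julia
using Flux, CUDA

# Hypothetical sketch of the mixed pattern: an expensive GPU part feeding a
# CPU-only function, with explicit data movement in between.
gpu_part = Chain(Dense(10, 5, relu)) |> gpu
cpu_part(x) = sum(x .^ 2)                    # stands in for CPU-only work
loss(x) = cpu_part(cpu(gpu_part(gpu(x))))    # explicit gpu/cpu movement
gradient(() -> loss(rand(Float32, 10)), params(gpu_part))  # the backward pass is where mixed-device errors surface
```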