Add tests for GPU distributed DatasetRestoring #694
simone-silvestri wants to merge 15 commits into main
Remove unnecessary blank line in ocean_simulation.jl
…ss/test-dataset-restoring
Actually, the error I was finding, which was a CUDA illegal memory access, is just connected to the fact that we had the following error:
ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
[1] throw_complex_domainerror(f::Symbol, x::Float64)
@ Base.Math ./math.jl:33
[2] sqrt
@ ./math.jl:686 [inlined]
[3] sqrt(x::Int64)
@ Base.Math ./math.jl:1578
[4] top-level scope
@ REPL[1]:1
whoa. I wonder how many times I have seen this. wow.
I was mind-blown as well...
I tried writing up an MWE, but I get NaNs... Maybe it's how these NaNs are propagated that generates the CUDA illegal memory access?
julia> using KernelAbstractions, CUDA
julia> @kernel function negative_sqrt!(a)
           i = @index(Global, Linear)
           @inbounds a[i] = sqrt(-1)
       end
julia> a = zeros(5);
julia> loop! = negative_sqrt!(KernelAbstractions.CPU(), 5, 5)
KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(cpu_negative_sqrt!)}(CPU(false), cpu_negative_sqrt!)
julia> loop!(a)
ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
[1] throw_complex_domainerror(f::Symbol, x::Float64)
@ Base.Math ./math.jl:33
[2] sqrt
@ ./math.jl:627 [inlined]
[3] sqrt(x::Int64)
@ Base.Math ./math.jl:1546
[4] macro expansion
@ ~/.julia/packages/KernelAbstractions/X5fk1/src/macros.jl:314 [inlined]
[5] cpu_negative_sqrt!(__ctx__::KernelAbstractions.CompilerMetadata{…}, a::Vector{…})
@ Main ./none:0
[6] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:145
[7] __run(obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck, static_threads::Bool)
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:112
[8] #_#20
@ ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:46 [inlined]
[9] (::KernelAbstractions.Kernel{…})(args::Vector{…})
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:39
[10] top-level scope
@ REPL[8]:1
Some type information was truncated. Use `show(err)` to see complete types.
julia> a = CuArray(a);
julia> loop! = negative_sqrt!(CUDA.CUDABackend(), 5, 5)
KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(gpu_negative_sqrt!)}(CUDABackend(false, false), gpu_negative_sqrt!)
julia> loop!(a)
julia> a
5-element CuArray{Float64, 1, CUDA.DeviceMemory}:
NaN
NaN
NaN
NaN
NaN
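The CPU/GPU difference in the MWE above can also be illustrated without a GPU. The following is a minimal sketch in plain Julia, under stated assumptions: `nan_sqrt` is a hypothetical helper (it is not part of Base, CUDA.jl, KernelAbstractions, or Oceananigans) that returns NaN for negative input, mimicking what the CUDA run above printed, while Base's `sqrt` throws a `DomainError` on the CPU.

```julia
# Hypothetical helper, not from any package: returns NaN for negative input
# instead of throwing, which mimics the result of the CUDA run above.
nan_sqrt(x::AbstractFloat) = x < zero(x) ? oftype(x, NaN) : sqrt(x)
nan_sqrt(x::Real) = nan_sqrt(float(x))

# On the CPU, Base.sqrt throws a DomainError for a negative real argument.
cpu_threw = try
    sqrt(-1)
    false
catch err
    err isa DomainError
end

# The helper silently fills the array with NaNs, like the CuArray above.
a = zeros(5)
a .= nan_sqrt(-1)

println("Base.sqrt threw DomainError: ", cpu_threw)
println("NaNs written by nan_sqrt: ", count(isnan, a))
```

A check such as `any(isnan, a)` after the kernel launch is one possible way to surface the problem on the GPU side, where the run shown above raised no exception.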
I am having some issues with `DatasetRestoring` on multiple GPUs. The `update_model_field_time_series!` function crashes (deterministically) with a CUDA illegal memory access connected to Oceananigans' `cpu_interpolating_time_indices`.

I am trying to debug this as it is halting the OMIP progress. In the meantime, I am adding a multi-GPU dataset restoring test here to see if I can reproduce the error.
I think the tests need a bit of an overhaul.