Add tests for GPU distributed DatasetRestoring #694

Open
simone-silvestri wants to merge 15 commits into main from ss/test-dataset-restoring
Conversation

@simone-silvestri
Collaborator

I am having some issues with DatasetRestoring on multiple GPUs. The update_model_field_time_series! function crashes (deterministically) with a CUDA illegal memory access connected to Oceananigans' cpu_interpolating_time_indices.

I am trying to debug this because it is halting OMIP progress. In the meantime, I am adding a multi-GPU DatasetRestoring test here to see if I can reproduce the error.

I think the tests need a bit of an overhaul.

@simone-silvestri
Collaborator Author

Actually, the error I was seeing, a CUDA illegal memory access, is just connected to the fact that we had a

ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(f::Symbol, x::Float64)
   @ Base.Math ./math.jl:33
 [2] sqrt
   @ ./math.jl:686 [inlined]
 [3] sqrt(x::Int64)
   @ Base.Math ./math.jl:1578
 [4] top-level scope
   @ REPL[1]:1

error. Apparently, this error is not shown on the GPU, but it corrupts the GPU memory, which then eventually spits out the CUDA illegal memory access.

@glwagner
Member

glwagner commented Dec 4, 2025

whoa. I wonder how many times I have seen this. wow.

@navidcy
Member

navidcy commented Dec 5, 2025

I was mind-blown as well...

@simone-silvestri
Collaborator Author

I tried writing up an MWE, but I get NaNs... Maybe it's how these NaNs are propagated that generates the CUDA illegal memory access?

julia> using KernelAbstractions, CUDA

julia> @kernel function negative_sqrt!(a)
          i = @index(Global, Linear)
          @inbounds a[i] = sqrt(-1)
       end

julia> a = zeros(5);

julia> loop! = negative_sqrt!(KernelAbstractions.CPU(), 5, 5)
KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(cpu_negative_sqrt!)}(CPU(false), cpu_negative_sqrt!)

julia> loop!(a)
ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
  [1] throw_complex_domainerror(f::Symbol, x::Float64)
    @ Base.Math ./math.jl:33
  [2] sqrt
    @ ./math.jl:627 [inlined]
  [3] sqrt(x::Int64)
    @ Base.Math ./math.jl:1546
  [4] macro expansion
    @ ~/.julia/packages/KernelAbstractions/X5fk1/src/macros.jl:314 [inlined]
  [5] cpu_negative_sqrt!(__ctx__::KernelAbstractions.CompilerMetadata{…}, a::Vector{…})
    @ Main ./none:0
  [6] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:145
  [7] __run(obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck, static_threads::Bool)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:112
  [8] #_#20
    @ ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:46 [inlined]
  [9] (::KernelAbstractions.Kernel{…})(args::Vector{…})
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:39
 [10] top-level scope
    @ REPL[8]:1
Some type information was truncated. Use `show(err)` to see complete types.

julia> a = CuArray(a);

julia> loop! = negative_sqrt!(CUDA.CUDABackend(), 5, 5)
KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(gpu_negative_sqrt!)}(CUDABackend(false, false), gpu_negative_sqrt!)

julia> loop!(a)

julia> a
5-element CuArray{Float64, 1, CUDA.DeviceMemory}:
 NaN
 NaN
 NaN
 NaN
 NaN
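
For anyone without a GPU at hand, here is a rough CPU-only sketch of the contrast in the session above: Base's `sqrt` throws a `DomainError` for a negative real, while the GPU kernel silently wrote NaN. The helpers `nan_sqrt` and `has_nans` are hypothetical names made up for this illustration; they are not part of CUDA.jl, KernelAbstractions, or Oceananigans, and `nan_sqrt` only mimics the NaN result rather than calling any GPU code. It does not show how (or whether) the NaNs lead to the illegal memory access.

```julia
# CPU-only sketch; `nan_sqrt` and `has_nans` are hypothetical helper names.

# Mimic the GPU result from the MWE: return NaN for a negative input
# instead of throwing a DomainError as Base.sqrt does on the CPU.
nan_sqrt(x::AbstractFloat) = x < zero(x) ? oftype(x, NaN) : sqrt(x)

# Report whether any entry of an array is NaN.
has_nans(a) = any(isnan, a)

a = zeros(5)
a .= nan_sqrt.(fill(-1.0, 5))   # silently fills `a` with NaN, no error raised

# Base.sqrt raises a DomainError for the same input on the CPU.
threw = try
    sqrt(-1.0)
    false
catch err
    err isa DomainError
end

println("DomainError on CPU: ", threw)
println("NaNs present: ", has_nans(a))
```

A check like `has_nans` right after a suspicious kernel launch might help narrow down where the NaNs first appear, assuming the array type supports `any(isnan, a)`.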

@navidcy navidcy added the tests Helpful for getting a good night's sleep label Jan 3, 2026

3 participants