Generating dataset on Summit: GPU last error detected #4438

Change72 · 2023-11-22T18:57:49Z

Hi,

I am working on Summit and followed the WarpX instructions for Summit (OLCF).

I successfully compiled WarpX. However, when I run the job script for V100 GPUs (16GB), it fails with: "amrex::Abort::1::GPU last error detected in file /ccs/home/chang/src/warpx/build_summit/_deps/fetchedamrex-src/Src/Base/AMReX_GpuLaunchFunctsG.H line 1132: invalid device function !!!
SIGABRT"

I think it might be caused by the CUDA version, as in quokka #21. I tried "module load cuda/11.7.1" as well as other CUDA versions, but the error persists.

ax3l · 2023-11-28T18:45:36Z

Hi @Change72,

Thank you for your report.

Can you please share your detailed input + submission script, so we can reproduce it in our envs?

Is the same warpx.profile loaded and active when you submit your job?

Note that we use cuda/11.3.1, which should not be affected by the issue you linked (11.6).

Another small unrelated detail: your job seems to use 12 GPUs (2 nodes?) but your setup only has 8 grid blocks, so 4 GPUs will be unused right now.
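To illustrate the point about unused GPUs, here is a minimal sketch of how the number of grid blocks follows from the domain decomposition. It assumes a simplified model in which each dimension is cut into boxes of at most max_grid_size cells (ignoring blocking_factor and per-direction settings). The function names and the cell counts are hypothetical and are not taken from the input file in this issue.

```python
import math

def count_grid_blocks(n_cell, max_grid_size):
    """Number of boxes when each dimension is chopped into pieces of at
    most max_grid_size cells (simplified model, not AMReX's exact logic)."""
    return math.prod(math.ceil(n / max_grid_size) for n in n_cell)

def unused_gpus(n_blocks, n_gpus):
    """GPUs left idle if every active GPU needs at least one block."""
    return max(0, n_gpus - n_blocks)

# Hypothetical domain: 256^3 cells with max_grid_size = 128 gives 2*2*2 boxes.
blocks = count_grid_blocks((256, 256, 256), 128)
print(blocks, unused_gpus(blocks, 12))
```

With these assumed values the domain splits into 8 blocks, so 4 of 12 GPUs would have no work. Lowering max_grid_size or requesting fewer GPUs would close that gap.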

Change72 · 2023-11-28T20:04:31Z

Hi @ax3l

Thank you for your help. I created a public GitHub repo to share my input and submission script.

Also, I first used cuda/11.3.1 and got the same error. I then looked for solutions online and tried cuda/11.7.1, which did not work either.

As for the input file, I have tested it on our own server and it can generate a small dataset smoothly.

ax3l · 2023-11-29T02:36:35Z

Hi @Change72,

Thank you for the details. I compiled your example and can reproduce the issue. Digging further...

berceanu · 2023-12-03T00:14:57Z

@ax3l, I can reproduce the error on our local cluster at ELI-NP:
amrex::Abort::12::GPU last error detected in file /data/storage/berceanu/src/warpx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuLaunchFunctsG.H line 1132: invalid device function !!!
This is with CUDA 11.4.4 on V100. I checked various submission scripts from the laser_acceleration examples.

Here is the spack env file, for reproducibility. PIConGPU seems to work with the same env file btw, suggesting that this is probably not a problem with the dependencies.

On Karolina, with CUDA 11.7.0 on A100, WarpX seems to work fine (same commit, using this spack env).

ax3l · 2023-12-04T23:14:46Z

I wonder if it compiles for the wrong SM (a newer one than Volta) or if a dependency was built with support for the wrong SM. Looks like a local issue on a few clusters, not universal.
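As background for the wrong-SM hypothesis: "invalid device function" is the error the CUDA runtime typically reports when a kernel has no code image usable on the current GPU. The sketch below encodes the usual compatibility rule as I understand it: a binary image for sm_XY runs on devices with the same major version and an equal or higher minor version, while embedded PTX for compute_XY can be JIT-compiled for any device with compute capability XY or higher. The helper names are hypothetical, and only plain numeric architecture names are handled.

```python
def parse_arch(arch):
    """Turn 'sm_70' or 'compute_80' into (kind, major, minor)."""
    kind, digits = arch.split("_")
    return kind, int(digits[:-1]), int(digits[-1])

def can_run(embedded_archs, device_cc):
    """True if at least one embedded image is usable on the device."""
    dev_major, dev_minor = device_cc
    for arch in embedded_archs:
        kind, major, minor = parse_arch(arch)
        # Binary (SASS) images: same major version, minor not newer than device.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # PTX images: can be JIT-compiled for any equal or newer device.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False

# A binary built only for A100 (sm_80), launched on a V100 (compute capability 7.0)
print(can_run(["sm_80"], (7, 0)))
# A binary built for V100 (sm_70), launched on a V100
print(can_run(["sm_70"], (7, 0)))
```

Under this rule, a dependency or executable compiled only for a newer architecture than Volta would fail on V100 in exactly this way.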

berceanu · 2023-12-04T23:16:33Z

"Looks like a local issue on a few clusters"

Do all those clusters have V100's?

ax3l · 2023-12-04T23:35:14Z

This issue is referring to Summit - what cluster are you referring to?

WeiqunZhang · 2023-12-05T00:13:22Z

@Change72 Do you have those Backtrace.* files?

Change72 · 2023-12-05T00:24:42Z

@WeiqunZhang Yes. I just updated my GitHub repo.

Change72 · 2023-12-05T00:27:16Z

@berceanu Summit uses V100 GPUs.
@ax3l It seems the problem is specific to V100. Summit and @berceanu's local cluster both use V100.
berceanu said "On Karolina, with CUDA 11.7.0 on A100", so A100 is fine.

WeiqunZhang · 2023-12-05T00:36:31Z

It died in the FiniteDifferenceSolver::EvolveB function in WarpX/Source/FieldSolver/FiniteDifferenceSolver/EvolveB.cpp. Could you try to move that function to the end of that file? If that makes EvolveB work, you will need to make a similar change in EvolveE.cpp.

Change72 · 2023-12-06T01:12:41Z

@WeiqunZhang I moved both functions and the error message changed. My changes are in my WarpX GitHub repo.

The latest output is in a new branch: 20231205

ax3l · 2023-12-07T17:40:44Z

Looks like this is still the same error.

On Summit, I ran:

cuobjdump bin/warpx.3d.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.GENQEDTABLES

and likewise for libs we build in lib/. The SM is correctly set at sm_70 for V100 GPUs.
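For anyone repeating this check on many binaries and libraries, the architecture lines in a cuobjdump dump can be collected with a small script. This is a sketch only: it assumes the dump contains lines of the form "arch = sm_70", and the sample text below is illustrative rather than the actual output from this build. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "arch = sm_70" in a cuobjdump dump.
ARCH_LINE = re.compile(r"^[ \t]*arch[ \t]*=[ \t]*(sm_\d+)[ \t]*$", re.MULTILINE)

def embedded_archs(cuobjdump_text):
    """Count how many embedded device images target each SM architecture."""
    return Counter(ARCH_LINE.findall(cuobjdump_text))

# Illustrative sample, abbreviated; not copied from the Summit build.
SAMPLE = """\
Fatbin elf code:
================
arch = sm_70
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
host = linux
compile_size = 64bit
"""

archs = embedded_archs(SAMPLE)
print(sorted(archs))
```

If the resulting set contains anything other than sm_70 on a V100 system, that file is worth a closer look.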

GCC 9.3.0 is used with NVCC 11.3.1, which are documented as compatible: https://gist.github.com/ax3l/9489132

ax3l · 2023-12-07T17:44:18Z

Suspecting a compiler bug, I will try to switch to cuda/11.7.1 next...

berceanu · 2023-12-13T07:29:35Z

Any luck with this?

ax3l · 2023-12-17T20:37:43Z

Tests with the latest development (9af087d), using the CUDA backend and with PSATD turned on at compile time (everything else at default):

Summit (OLCF w/ V100 on ppc64le; RHEL 8.2) test with GCC 9.3.0 & NVCC 11.7.99 (CUDA 11.7.1):

No problem.

-> Could be an NVCC issue. We know that NVCC <11.7 is pretty buggy for C++17 code.
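A quick way to guard against an older compiler is to check the release reported by nvcc before building. The sketch below parses the "release X.Y" line from the text printed by nvcc --version and compares it against 11.7. The function names are hypothetical, and the 11.7 threshold reflects the observation in this thread, not an official requirement.

```python
import re

def nvcc_release(version_text):
    """Extract (major, minor) from the 'release X.Y' part of nvcc --version."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))

def meets_minimum(version_text, minimum=(11, 7)):
    """True if the reported NVCC release is at least the given minimum."""
    return nvcc_release(version_text) >= minimum

# The release line as printed by the NVCC 11.7.99 used in the test above
line = "Cuda compilation tools, release 11.7, V11.7.99"
print(nvcc_release(line), meets_minimum(line))
```

In practice the text would come from running nvcc --version in the same environment (modules loaded) that is used for the build.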

ax3l · 2023-12-17T20:40:11Z

ping @berceanu is there a CUDA 11.7 or newer on Karolina you can document?
-> ✔️ #4477

ping @lucafedeli88 is there a CUDA 11.7 or newer on Leonardo you can document?

ping @AlexanderSinn is there a CUDA 11.7 or newer on Juwels you can document?

ping @Change72 fix for Summit is in #4538 :)

Change72 · 2023-12-18T22:31:34Z

Thanks @ax3l. It works on Summit now, even though this is the last day of Summit. I will move to Frontier later. If possible, could you try it on Frontier as well? I need some time for my Frontier application.

Change72 · 2023-12-18T22:38:15Z

Another thing I am a little puzzled about: I tried CUDA 11.7.1 before, but it did not work. Were there any other changes besides the switch to 11.7.1?

ax3l added the "question" and "machine / system" labels on Nov 28, 2023.

ax3l assigned ax3l and WeiqunZhang on Dec 5, 2023.

ax3l mentioned this issue on Dec 17, 2023 in "Doc: CUDA 11.7+" #4538 (merged).

ax3l closed this as completed in #4538 on Dec 17, 2023.
