Generating dataset on Summit: GPU last error detected #4438

Closed
Change72 opened this issue Nov 22, 2023 · 19 comments · Fixed by #4538
Labels: machine / system (Machine or system-specific issue), question (Further information is requested)

Comments

@Change72

Hi,

I am working on Summit and followed the Summit (OLCF) instructions.

I successfully compiled WarpX. However, when I try to use the submission script for V100 GPUs (16GB), it shows "amrex::Abort::1::GPU last error detected in file /ccs/home/chang/src/warpx/build_summit/_deps/fetchedamrex-src/Src/Base/AMReX_GpuLaunchFunctsG.H line 1132: invalid device function !!!
SIGABRT"

I think it might be caused by the CUDA version, as in quokka #21. I tried "module load cuda/11.7.1" as well as other CUDA versions, but the error persists.

Screenshot from 2023-11-22 13-56-54
Screenshot from 2023-11-22 13-56-34

@ax3l
Member

ax3l commented Nov 28, 2023

Hi @Change72,

Thank you for your report.

Can you please share your detailed input + submission script, so we can reproduce it in our envs?

Is the same warpx.profile loaded and active when you submit your job?

Note that we use cuda/11.3.1, which should not be affected by the issue you linked (11.6).

Another small unrelated detail: your job seems to use 12 GPUs (2 nodes?) but your setup only has 8 grid blocks, so 4 GPUs will be unused right now.
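The GPU/grid-block mismatch mentioned above can be estimated with a quick calculation. The sketch below is a simplified model, assuming one MPI rank per GPU and a domain that is chopped only by a per-dimension maximum grid size. The cell counts and grid size are hypothetical values picked to give 8 blocks, not the values from the reporter's input file, and the actual AMReX decomposition (blocking factor, further grid refinement) may differ.

```python
from math import ceil, prod


def count_grid_blocks(n_cell, max_grid_size):
    """Estimate the number of blocks when each dimension is chopped
    into pieces of at most max_grid_size cells (simplified model)."""
    return prod(ceil(n / max_grid_size) for n in n_cell)


def idle_gpus(n_gpus, n_blocks):
    """GPUs left without work, assuming each busy GPU needs at least one block."""
    return max(0, n_gpus - n_blocks)


# Hypothetical numbers: 256^3 cells, at most 128 cells per block side.
blocks = count_grid_blocks((256, 256, 256), 128)
print(blocks, idle_gpus(12, blocks))
```

Under these assumptions, either requesting fewer GPUs or lowering the maximum grid size would keep every GPU busy.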

@ax3l added the "question" (Further information is requested) and "machine / system" (Machine or system-specific issue) labels on Nov 28, 2023
@Change72
Author

Hi @ax3l

Thank you for your help. I created a public GitHub repo to show my input and submission script.

Also, I first used cuda/11.3.1 and got the same error. Then I looked for solutions online and tried cuda/11.7.1, which did not work either.

As for the input file, I have tested it on our own server, where it generates a small dataset without problems.

@ax3l
Member

ax3l commented Nov 29, 2023

Hi @Change72,

Thank you for the details. I compiled your example and can reproduce the issue. Digging further...

@berceanu
Contributor

berceanu commented Dec 3, 2023

@ax3l, I can reproduce the error on our local cluster at ELI-NP, with CUDA 11.4.4 on V100:
amrex::Abort::12::GPU last error detected in file /data/storage/berceanu/src/warpx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuLaunchFunctsG.H line 1132: invalid device function !!!
I checked various submission scripts from the laser_acceleration examples.

Here is the spack env file, for reproducibility. PIConGPU seems to work with the same env file btw, suggesting that this is probably not a problem with the dependencies.

On Karolina, with CUDA 11.7.0 on A100, WarpX seems to work fine (same commit and using this spack env).

@ax3l
Member

ax3l commented Dec 4, 2023

I wonder if it compiles for the wrong SM (a newer one than Volta) or if a dependency was built with support for the wrong SM. Looks like a local issue on a few clusters, not universal.
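For context, CUDA reports "invalid device function" when a kernel does not exist or was not compiled for the device's architecture, which is why a wrong SM target is a natural first suspect. The sketch below models the general CUDA compatibility rule as I understand it: precompiled machine code runs only on devices of the same major architecture with an equal or higher minor revision, while embedded PTX can be JIT-compiled for any equal or newer device. The function name `can_run` and its arguments are hypothetical and not part of WarpX, AMReX, or the CUDA toolkit.

```python
def can_run(device_cc, sass_archs=(), ptx_archs=()):
    """Return True if a binary has usable code for a device.

    device_cc  -- (major, minor) compute capability of the GPU
    sass_archs -- (major, minor) targets with precompiled machine code
    ptx_archs  -- (major, minor) targets with embedded PTX
    """
    major, minor = device_cc
    # Precompiled code only runs within the same major architecture,
    # on a device whose minor revision is at least the target's.
    for s_major, s_minor in sass_archs:
        if s_major == major and s_minor <= minor:
            return True
    # PTX can be JIT-compiled by the driver for any equal or newer device.
    return any(tuple(p) <= (major, minor) for p in ptx_archs)


V100 = (7, 0)
print(can_run(V100, sass_archs=[(8, 0)], ptx_archs=[(8, 0)]))  # built for Ampere only
print(can_run(V100, sass_archs=[(7, 0)]))                      # built for Volta
```

A binary built only for a newer SM than Volta would fail the first check, matching the hypothesis in this comment.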

@berceanu
Contributor

berceanu commented Dec 4, 2023

Looks like a local issue on a few clusters

Do all those clusters have V100's?

@ax3l
Member

ax3l commented Dec 4, 2023

This issue is referring to Summit - what cluster are you referring to?

@WeiqunZhang
Member

@Change72 Do you have those Backtrace.* files?

@Change72
Author

Change72 commented Dec 5, 2023

@WeiqunZhang Yes. I just updated my GitHub repo.

@Change72
Author

Change72 commented Dec 5, 2023

@berceanu Summit uses V100s.
@ax3l It seems the problem is specific to the V100. Summit and @berceanu's local cluster both use V100s.
berceanu said "On Karolina, with CUDA 11.7.0 on A100", so the A100 is fine.

@WeiqunZhang
Member

It died in the FiniteDifferenceSolver::EvolveB function in WarpX/Source/FieldSolver/FiniteDifferenceSolver/EvolveB.cpp. Could you try to move that function to the end of that file? If that makes EvolveB work, you will need to make a similar change in EvolveE.cpp.

@Change72
Author

Change72 commented Dec 6, 2023

@WeiqunZhang I moved both functions and the error message changed. My changes are in my WarpX GitHub repo.

The latest output is in a new branch: 20231205

@ax3l
Member

ax3l commented Dec 7, 2023

Looks like this is still the same error.

On Summit, I ran:

cuobjdump bin/warpx.3d.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.GENQEDTABLES

and likewise for libs we build in lib/. The SM is correctly set at sm_70 for V100 GPUs.
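To repeat this check across several binaries and libraries, the architecture entries can be pulled out of the text that cuobjdump prints. The sketch below assumes the dump contains lines of the form `arch = sm_XX`; the sample text is an illustrative excerpt, not the actual output from the Summit build, and `embedded_archs` is a hypothetical helper name.

```python
import re


def embedded_archs(cuobjdump_text):
    """Collect the distinct 'arch = sm_XX' entries from cuobjdump output."""
    found = re.findall(r"^arch = (sm_\d+)[ \t]*$", cuobjdump_text, re.MULTILINE)
    return sorted(set(found))


# Illustrative excerpt only; not the real output from this issue.
sample = """\
Fatbin elf code:
================
arch = sm_70
code version = [1,7]
host = linux
compile_size = 64bit
"""
print(embedded_archs(sample))
```

In practice the text would come from running cuobjdump on each binary, for example via `subprocess.run(..., capture_output=True, text=True)`.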

GCC 9.3.0 is used with NVCC 11.3.1, which are documented as compatible: https://gist.github.com/ax3l/9489132

@ax3l
Member

ax3l commented Dec 7, 2023

Suspecting a compiler bug, I will try to switch to cuda/11.7.1 next...

@berceanu
Contributor

Any luck with this?

@ax3l
Member

ax3l commented Dec 17, 2023

Tests with latest development 9af087d using the CUDA backend and turning PSATD on for compile (everything else on default):

Summit (OLCF w/ V100 on ppc64le; RHEL 8.2) test with GCC 9.3.0 & NVCC 11.7.99 (CUDA 11.7.1):

No problem.

-> Could be an NVCC issue. We know that NVCC <11.7 is pretty buggy for C++17 code.
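A build or job script could guard against older compilers by parsing the release number from `nvcc --version`. This is a minimal sketch: the 11.7 threshold comes from the observation in this thread rather than from NVIDIA documentation, the version strings are illustrative, and the helper names are hypothetical.

```python
import re

MIN_NVCC = (11, 7)  # threshold taken from this thread, not from NVIDIA docs


def nvcc_release(version_text):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc version text")
    return int(match.group(1)), int(match.group(2))


def new_enough(version_text, minimum=MIN_NVCC):
    """True if the reported NVCC release is at least `minimum`."""
    return nvcc_release(version_text) >= minimum


# Illustrative version strings in the format nvcc prints.
print(new_enough("Cuda compilation tools, release 11.3, V11.3.109"))
print(new_enough("Cuda compilation tools, release 11.7, V11.7.99"))
```

Comparing tuples rather than strings avoids mis-ordering releases such as 11.10 and 11.7.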

@ax3l
Member

ax3l commented Dec 17, 2023

ping @berceanu is there a CUDA 11.7 or newer on Karolina you can document?
-> ✔️ #4477

ping @lucafedeli88 is there a CUDA 11.7 or newer on Leonardo you can document?

ping @AlexanderSinn is there a CUDA 11.7 or newer on Juwels you can document?

ping @Change72 fix for Summit is in #4538 :)

@Change72
Author

Thanks @ax3l! It works on Summit now, even though this is the last day of Summit. I will move to Frontier later. If possible, could you try it on Frontier as well? I need some time for my Frontier application.

@Change72
Author

Another thing I am a little puzzled about: I tried CUDA 11.7.1 before, but it did not work. Were there any other changes besides the switch to 11.7.1?
