Generating dataset on Summit: GPU last error detected #4438
Comments
Hi @Change72, Thank you for your report. Can you please share your detailed input + submission script, so we can reproduce it in our envs? Is the same warpx.profile loaded and active when you submit your job? note that we use Another small unrelated detail: your job seems to use 12 GPUs (2 nodes?) but your setup only has 8 grid blocks, so 4 GPUs will be unused right now. |
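The GPU versus grid-block mismatch noted above can be estimated before submitting a job. Below is a minimal sketch, assuming the domain is simply chopped into boxes of at most `max_grid_size` cells per dimension. This simplifies how AMReX decomposes the domain and ignores the blocking factor and load balancing. The cell counts are hypothetical and were chosen only so that they give 8 blocks, as in this report; they are not taken from the reported input file.

```python
import math

def count_grid_blocks(n_cell, max_grid_size):
    """Estimate the number of grid blocks: ceil(n / max_grid_size) per dimension."""
    return math.prod(math.ceil(n / max_grid_size) for n in n_cell)

def idle_gpus(n_blocks, n_gpus):
    """GPUs left without any block, assuming each GPU needs at least one block."""
    return max(0, n_gpus - n_blocks)

# Hypothetical values for illustration only.
n_cell = (256, 256, 256)
max_grid_size = 128

blocks = count_grid_blocks(n_cell, max_grid_size)  # 2 * 2 * 2 blocks
print(blocks, idle_gpus(blocks, n_gpus=12))        # prints "8 4"
```

If the idle count is non-zero, either a smaller `max_grid_size` or fewer GPUs would keep all devices busy.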
Hi @ax3l Thank you for your help. I created a public GitHub repo to show my input and submission script. Also, I first used cuda/11.3.1 and got the same error. I then looked for solutions online and tried cuda/11.7.1, which did not work either. As for the input file, I have tested it on our own server, where it generates a small dataset without problems. |
Hi @Change72, Thank you for the details. I compiled your example and can reproduce the issue. Digging further... |
@ax3l, I can reproduce the error on our local cluster at ELI-NP Here is the On Karolina, with |
I wonder if it compiles for the wrong SM (a newer one than Volta) or if a dependency was built with support for the wrong SM. Looks like a local issue on a few clusters, not universal. |
Do all those clusters have V100's? |
This issue is referring to Summit - what cluster are you referring to? |
@Change72 Do you have those Backtrace.* files? |
@WeiqunZhang Yes. I just updated my GitHub repo. |
It died in the FiniteDifferenceSolver::EvolveB function in WarpX/Source/FieldSolver/FiniteDifferenceSolver/EvolveB.cpp. Could you try to move that function to the end of that file? If that makes EvolveB work, you will need to make a similar change in EvolveE.cpp. |
@WeiqunZhang I moved both functions and the error message changed. My changes are in my WarpX GitHub repo. The latest output is in a new branch: 20231205 |
Looks like this is still the same error. On Summit, I ran: cuobjdump bin/warpx.3d.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.GENQEDTABLES and likewise for libs we build in GCC 9.3.0 is used with NVCC 11.3.1, which are documented as compatible: https://gist.github.com/ax3l/9489132 |
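To check the wrong-SM hypothesis, the `cuobjdump` output can be scanned for the architectures embedded in the binary. Below is a minimal sketch, assuming the output names each embedded architecture with an `sm_XX` token. The helper names are hypothetical. V100 GPUs are Volta, i.e. `sm_70`. A missing `sm_70` entry is only a hint, since embedded PTX may still be JIT-compiled for the device.

```python
import re

VOLTA_SM = 70  # V100 GPUs have compute capability 7.0 (sm_70)

def embedded_sm_archs(cuobjdump_text):
    """Return the sorted, de-duplicated SM numbers that appear as sm_XX tokens."""
    return sorted({int(m) for m in re.findall(r"\bsm_(\d+)\b", cuobjdump_text)})

def has_code_for(cuobjdump_text, sm=VOLTA_SM):
    """True if the text mentions device code for the requested SM."""
    return sm in embedded_sm_archs(cuobjdump_text)
```

The text passed in would be the captured standard output of the `cuobjdump` call on the WarpX executable, or on one of the dependency libraries.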
Suspecting a compiler bug, I will try to switch to |
Any luck with this? |
Test on the latest Summit environment (OLCF w/ V100 on ppc64le; RHEL 8.2) with GCC 9.3.0 & NVCC 11.7.99 (CUDA 11.7.1): no problem. -> Could be an NVCC issue. We know that NVCC <11.7 is pretty buggy for C++17 code. |
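A build script could guard against older compilers by checking the NVCC release up front. Below is a minimal sketch, assuming the `release X.Y` token that `nvcc --version` prints. The function names are hypothetical, and the 11.7 threshold comes from the comment above rather than from any official compatibility list.

```python
import re

def parse_nvcc_release(version_text):
    """Extract (major, minor) from text containing 'release X.Y'."""
    m = re.search(r"release (\d+)\.(\d+)", version_text)
    if m is None:
        raise ValueError("no 'release X.Y' token found in the given text")
    return int(m.group(1)), int(m.group(2))

def is_at_least(version_text, minimum=(11, 7)):
    """True if the reported NVCC release is at or above the minimum."""
    return parse_nvcc_release(version_text) >= minimum
```

Comparing tuples keeps the check correct for two-digit minor versions, which a plain string comparison would get wrong.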
ping @berceanu is there a CUDA 11.7 or newer on Karolina you can document? ping @lucafedeli88 is there a CUDA 11.7 or newer on Leonardo you can document? ping @AlexanderSinn is there a CUDA 11.7 or newer on Juwels you can document? |
Thanks @ax3l. It works on Summit now, even though this is Summit's last day. I will move to Frontier later. If possible, could you try it on Frontier as well? I need some time for my Frontier application. |
Another thing I am a little puzzled about: I tried CUDA 11.7.1 before, but it did not work. Were there any other changes besides 11.7.1? |
Hi,
I am working on Summit and followed the Summit (OLCF) instructions.
I successfully compiled WarpX. However, when I try to use the script for V100 GPUs (16GB), it shows "amrex::Abort::1::GPU last error detected in file /ccs/home/chang/src/warpx/build_summit/_deps/fetchedamrex-src/Src/Base/AMReX_GpuLaunchFunctsG.H line 1132: invalid device function !!!
SIGABRT"
I think it might be caused by the CUDA version (see quokka #21). I tried "module load cuda/11.7.1" as well as other CUDA versions, but the error persists.