Stack frame hunt #3870
Comments
cc @steindev |
As investigated by @psychocoderHPC, it may be due to PML internals and unrelated to the AOFDTD implementation; we misattributed it because we forgot that FDTD and PML use the same kernel template. To be further investigated. Edit: indeed, it was the PML functor that was used, not the normal FDTD one |
Commenting out this break, which is optional there (the function works either way), doesn't reduce the stack frame value for the kernel, but seems to largely reduce the register use there. Replacing it with |
After some more investigation, the effect also depends on the CUDA version used. E.g. with CUDA 11.0 and CUDA 11.4, different kernels have non-zero stack frames for the same setup. |
Some more places with stack frames
|
With the current dev I observed stack frames in kernelMoveAndMark with the SPEC benchmark if we use the particle shape
|
Here is some more information on why it is important to remove all stack frame usages: https://stackoverflow.com/a/7816434 |
fix: one part of ComputationalRadiationPhysics#3870 By using a break within a for loop we triggered stack frame usage in the GPU kernel.
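To illustrate the kind of change the commit message describes, here is a minimal sketch, not the actual PIConGPU code. The function names `firstAboveWithBreak` and `firstAboveNoBreak` and the `HOST_DEVICE` macro are hypothetical. Both variants return the same result; the second one has a fixed trip count and no early exit, which gives the compiler an easier loop to unroll. Whether this actually removes a stack frame is an assumption that has to be checked per kernel and per CUDA version, as noted above.

```cpp
#include <cstddef>

// Compiles as plain C++ and as CUDA; the macro name is an assumption.
#ifdef __CUDACC__
#    define HOST_DEVICE __host__ __device__
#else
#    define HOST_DEVICE
#endif

// Variant with early exit: index of the first value above threshold, or N if none.
template<std::size_t N>
HOST_DEVICE std::size_t firstAboveWithBreak(const float (&values)[N], float threshold)
{
    std::size_t found = N;
    for(std::size_t i = 0; i < N; ++i)
    {
        if(values[i] > threshold)
        {
            found = i;
            break; // data-dependent exit from the loop
        }
    }
    return found;
}

// Variant without break: every iteration runs, a select keeps the first hit.
template<std::size_t N>
HOST_DEVICE std::size_t firstAboveNoBreak(const float (&values)[N], float threshold)
{
    std::size_t found = N;
    for(std::size_t i = 0; i < N; ++i)
    {
        bool const isFirstHit = (found == N) && (values[i] > threshold);
        found = isFirstHit ? i : found;
    }
    return found;
}
```

The trade-off is that the second variant always does N iterations, so it only makes sense for small, compile-time-known N.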
@sbastrakov @psychocoderHPC Any progress or plans for progress here? |
There are still some kernels (e.g. boundary algorithms) using stack frames that we should fix. There is no fixed plan for when they will be fixed. |
@psychocoderHPC could you write here the commands to get the stack frames and registers information? Both for me as I've forgotten, and to document if someone else will need it. |
|
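For reference, a generic way to get this information with plain `nvcc` is sketched below. This is not necessarily the procedure used in this thread, and the file name `stack_demo.cu`, the helper `pickFromLocalArray` and the kernel `demoKernel` are hypothetical. The `-Xptxas -v` option (or `--resource-usage`) makes ptxas print, per kernel, the stack frame size, spill stores/loads and register count. The sketch uses a dynamically indexed local array, which the compiler may place in local memory and report as a stack frame.

```cpp
// Save as e.g. stack_demo.cu (hypothetical name) and compile with:
//   nvcc -Xptxas -v -c stack_demo.cu
// or
//   nvcc --resource-usage -c stack_demo.cu
// ptxas then prints lines of the form
//   ptxas info    : Function properties for <mangled kernel name>
//       N bytes stack frame, N bytes spill stores, N bytes spill loads
//   ptxas info    : Used N registers, ...
#ifdef __CUDACC__
#    define HOST_DEVICE __host__ __device__
#else
#    define HOST_DEVICE
#endif

// A local array indexed with a runtime value is a typical candidate
// for ending up in local memory (reported as a stack frame).
HOST_DEVICE inline float pickFromLocalArray(float seed, unsigned int dynamicIndex)
{
    float scratch[32];
    for(unsigned int i = 0u; i < 32u; ++i)
        scratch[i] = seed * static_cast<float>(i);
    return scratch[dynamicIndex % 32u];
}

#ifdef __CUDACC__
// Only compiled by nvcc; this is the kernel whose resources ptxas reports.
__global__ void demoKernel(float* out, unsigned int const* indices)
{
    unsigned int const tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = pickFromLocalArray(1.5f, indices[tid]);
}
#endif
```

The actual numbers depend on the target architecture and the CUDA version, so the output should be compared only between builds with identical settings.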
While working on #3860, we had a discussion with @psychocoderHPC and checked the stack frames produced when using 8th order (4 neighbors) FDTD and the corresponding incident field. Besides the usual suspects (RNG init, png output), there were `336 bytes stack frame` in the FDTD kernel and `240 bytes stack frame` for the incident field kernel, both with `0 bytes spill stores, 0 bytes spill loads`. After looking a little into the implementation, we found that the constructor for `AOFDTDWeights` is actually not `constexpr`, and the `operator[]` also has a suspicious check which may also prevent it from being `constexpr`. So altogether it is not clear what happens with these weights inside the FDTD kernel: are they recalculated each time, stored in registers (or worse), or some combination of those?