Stack frame hunt #3870

Open

sbastrakov opened this issue Oct 15, 2021 · 11 comments
Labels
backend: cuda CUDA backend component: core in PIConGPU (core application)

Comments

@sbastrakov
Member

While working on #3860, @psychocoderHPC and I discussed and checked the stack frames produced when using the 8th-order (4 neighbors) FDTD solver and the corresponding incident field. Besides the usual suspects (RNG init, png output), there was a 336-byte stack frame in the FDTD kernel and a 240-byte stack frame in the incident field kernel, both with 0 bytes spill stores and 0 bytes spill loads. After looking a little into the implementation, we found that the constructor of AOFDTDWeights is actually not constexpr, and operator[] has a suspicious check which may also make it non-constexpr. So altogether it is not clear what happens with these weights inside the FDTD kernel: are they recalculated each time, stored in registers (or worse), or some combination of those?
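For illustration, a minimal sketch of the pattern in question (hypothetical names and a placeholder formula, not the actual AOFDTDWeights code): if the whole chain of constructor and operator[] is constexpr, a static_assert can verify that the weights are computable at compile time.

```cpp
// Hypothetical stand-in for AOFDTDWeights, just to show the constexpr chain.
template<unsigned T_numWeights>
struct Weights
{
    float w[T_numWeights];

    // If this constructor is not constexpr, the table may be rebuilt at
    // runtime per thread instead of being folded at compile time.
    constexpr Weights() : w{}
    {
        for(unsigned i = 0u; i < T_numWeights; ++i)
            w[i] = 1.0f / float(i + 1u); // placeholder formula
    }

    // A runtime-only check in here (e.g. an assert that is not usable in a
    // constant expression) can likewise break compile-time evaluation.
    constexpr float operator[](unsigned idx) const
    {
        return w[idx];
    }
};

// Compile-time smoke test: fails to compile if anything above is not constexpr.
constexpr Weights<4u> weights{};
static_assert(weights[3] > 0.0f, "weights must be computable at compile time");
```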

@sbastrakov sbastrakov added component: core in PIConGPU (core application) backend: cuda CUDA backend labels Oct 15, 2021
@sbastrakov sbastrakov added this to the 0.7.0 / 1.0.0: Next Stable milestone Oct 15, 2021
@sbastrakov
Member Author

cc @steindev

@sbastrakov
Member Author

sbastrakov commented Oct 15, 2021

As investigated by @psychocoderHPC, it may be due to PML internals and unrelated to the AOFDTD implementation; we misattributed it because we forgot that FDTD and PML share the same kernel template. To be investigated further.

Edit: indeed it was the PML functor used, not the normal FDTD one

@sbastrakov
Member Author

sbastrakov commented Oct 15, 2021

Commenting out this break, which is optional there (the function works either way), doesn't reduce the stack frame size of the kernel, but seems to largely reduce the register use. Replacing it with a return makes matters worse in that regard, and replacing the range-based for loop with a C-style one doesn't change anything.
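For context, a minimal stand-in for the loop shape being varied here (hypothetical code, not the actual PML functor): a range-based for over a small fixed-size array with an optional early break.

```cpp
// Hypothetical sketch of the pattern under discussion.
__device__ float firstAbove(float const (&values)[8], float threshold)
{
    float result = 0.0f;
    for(float v : values)
    {
        if(v > threshold)
        {
            result = v;
            break; // optional: without it the last match wins instead of the first
        }
    }
    return result;
}
```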

@sbastrakov sbastrakov changed the title Check constexpr and stack frame in AOFDTD Check constexpr and stack frame in PML Oct 15, 2021
@sbastrakov
Member Author

After some more investigation, the effect also depends on the CUDA version used. E.g. with CUDA 11.0 and CUDA 11.4, different kernels show non-zero stack frames for the same setup.

@psychocoderHPC
Member

Some more places with stack frames

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:99                     auto crossedBoundary = pmacc::DataSpace<simDim>::create(0);
        .loc    116 99 44, function_name $L__info_string842, inlined_at 113 74 29

///home/rwidera/workspace/picongpu/include/pmacc/../pmacc/dimensions/DataSpace.hpp:140                 tmp[i] = value;
        .loc    117 140 17, function_name $L__info_string602, inlined_at 116 99 44
        mov.u32         %r354, 0;
        st.local.u32    [%rd2], %r354;
        st.local.u32    [%rd2+4], %r354;
        st.local.u32    [%rd2+8], %r354;
$L__tmp9619:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:102                         if(offsetToTotalOrigin[d] < m_parameters.beginInternalCellsTotalAllBoundaries[d])
        .loc    116 102 53, function_name $L__info_string842, inlined_at 113 74 29
        setp.lt.s32     %p5, %r15, %r91;
        @%p5 bra        $L__BB33_7;
        bra.uni         $L__BB33_4;

$L__BB33_7:
        .loc    116 0 53
        mov.u32         %r354, -1;
$L__tmp9620:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:103                             crossedBoundary[d] = -1;
        .loc    116 103 29, function_name $L__info_string842, inlined_at 113 74 29
        st.local.u32    [%rd2], %r354;
        bra.uni         $L__BB33_8;
$L__tmp9621:

$L__BB33_4:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:104                         else if(offsetToTotalOrigin[d] >= m_parameters.endInternalCellsTotalAllBoundaries[d])
        .loc    116 104 59, function_name $L__info_string842, inlined_at 113 74 29
        setp.lt.s32     %p6, %r15, %r94;
        @%p6 bra        $L__BB33_6;
        bra.uni         $L__BB33_5;
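The st.local stores above point at the usual cause: crossedBoundary is indexed with the runtime loop variable d, and an array indexed dynamically cannot live in registers (registers are not addressable), so it is placed in thread-local memory and shows up as stack frame. A minimal sketch of the effect with hypothetical names; whether the compiler actually promotes the unrolled variant to registers depends on compiler version and flags:

```cpp
// Runtime trip count: arr[d] with unknown d forces arr into local memory.
__device__ int dynamicIndexing(int const* offset, int const* begin, int dim)
{
    int arr[3] = {0, 0, 0};
    for(int d = 0; d < dim; ++d) // dim is not known at compile time
        if(offset[d] < begin[d])
            arr[d] = -1;
    return arr[0] + arr[1] + arr[2];
}

// Compile-time trip count: after full unrolling every index is a constant,
// so the compiler can usually keep arr entirely in registers.
__device__ int staticIndexing(int const* offset, int const* begin)
{
    int arr[3] = {0, 0, 0};
#pragma unroll
    for(int d = 0; d < 3; ++d)
        if(offset[d] < begin[d])
            arr[d] = -1;
    return arr[0] + arr[1] + arr[2];
}
```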

@sbastrakov sbastrakov changed the title Check constexpr and stack frame in PML Stack frame hunt Oct 19, 2021
@psychocoderHPC
Member

With the current dev, I observed stack frames in kernelMoveAndMark with the SPEC benchmark if we use the particle shape PQS:

ptxas info    : Compiling entry function '_ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_' for 'sm_70'
ptxas info    : Function properties for _ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_
    160 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 27664 bytes smem, 512 bytes cmem[0], 16 bytes cmem[2]

@psychocoderHPC
Member

Here is some more information on why it is important to remove all stack frame usage: https://stackoverflow.com/a/7816434
It is not only about performance: stack frames also require additional global memory at runtime. By default, PIConGPU keeps only 300 MiB of device memory free. If we execute a kernel that uses stack frames, the result can be an out-of-memory error at runtime.
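As a rough worked example (assuming an A100-class GPU with 108 SMs and 2048 resident threads per SM; the real numbers depend on the device): local memory is reserved for the maximum number of concurrently resident threads, so a 160-byte stack frame costs about 160 B × 108 × 2048 ≈ 34 MiB of device memory, a sizeable chunk of the ~300 MiB PIConGPU leaves free.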

psychocoderHPC added a commit to psychocoderHPC/picongpu that referenced this issue Oct 20, 2021
fix: one part of ComputationalRadiationPhysics#3870

By using a break within a for loop we triggered stack frame usage
in the GPU kernel.
@steindev
Member

@sbastrakov @psychocoderHPC Any progress or plans for progress here?

@psychocoderHPC
Member

There are still some kernels (e.g. the boundary algorithms) using stack frames that we should fix. There is no fixed plan for when this will happen.

@sbastrakov
Member Author

@psychocoderHPC could you post here the commands to get the stack frame and register information? Both for me, as I've forgotten, and as documentation in case someone else needs it.

@psychocoderHPC
Member

psychocoderHPC commented Nov 26, 2021

@psychocoderHPC could you post here the commands to get the stack frame and register information? Both for me, as I've forgotten, and as documentation in case someone else needs it.

pic-build -f -c "-Dalpaka_CUDA_SHOW_REGISTER=ON -Dalpaka_CUDA_KEEP_FILES=ON -Dalpaka_CUDA_SHOW_CODELINES=ON" 2>&1 | tee reg.txt 
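The per-kernel resource lines then end up in reg.txt; a quick way to pull out the offenders (plain shell, nothing PIConGPU-specific) is e.g. `grep -B 2 'stack frame' reg.txt`, which prints each "bytes stack frame" line together with the mangled kernel name above it. For a standalone .cu file, `nvcc -Xptxas=-v` (or `nvcc --resource-usage`) prints the same information.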
