Stack frame hunt #3870

Open

sbastrakov opened this issue Oct 15, 2021 · 11 comments
Labels
backend: cuda CUDA backend component: core in PIConGPU (core application)

Comments

@sbastrakov
Member

While working on #3860, @psychocoderHPC and I discussed and checked the stack frames produced when using the 8th-order (4 neighbors) FDTD solver and the corresponding incident field. Besides the usual suspects (RNG init, png output), there was a 336-byte stack frame in the FDTD kernel and a 240-byte stack frame in the incident field kernel, both with 0 bytes spill stores and 0 bytes spill loads. After looking a little into the implementation, we found that the constructor of AOFDTDWeights is actually not constexpr, and operator[] has a suspicious check which may also make it non-constexpr. So altogether it is not clear what happens with these weights inside the FDTD kernel: are they recalculated each time, stored in registers (or worse), or some combination of those?
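For illustration, a minimal sketch of the pattern in question (hypothetical names and a placeholder formula, not the actual AOFDTDWeights code): if the whole chain of constructor and operator[] is constexpr, a static_assert can verify that the weights are computable at compile time.

```cpp
// Hypothetical stand-in for AOFDTDWeights, just to show the constexpr chain.
template<unsigned T_numWeights>
struct Weights
{
    float w[T_numWeights];

    // If this constructor is not constexpr, the table may be rebuilt at
    // runtime per thread instead of being folded at compile time.
    constexpr Weights() : w{}
    {
        for(unsigned i = 0u; i < T_numWeights; ++i)
            w[i] = 1.0f / float(i + 1u); // placeholder formula
    }

    // A runtime-only check in here (e.g. an assert that is not usable in a
    // constant expression) can likewise break compile-time evaluation.
    constexpr float operator[](unsigned idx) const
    {
        return w[idx];
    }
};

// Compile-time smoke test: fails to compile if anything above is not constexpr.
constexpr Weights<4u> weights{};
static_assert(weights[3] > 0.0f, "weights must be computable at compile time");
```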

@sbastrakov sbastrakov added component: core in PIConGPU (core application) backend: cuda CUDA backend labels Oct 15, 2021
@sbastrakov sbastrakov added this to the 0.7.0 / 1.0.0: Next Stable milestone Oct 15, 2021
@sbastrakov
Member Author

cc @steindev

@sbastrakov
Member Author

sbastrakov commented Oct 15, 2021

As investigated by @psychocoderHPC, it may be due to PML internals and unrelated to the AOFDTD implementation; we misattributed it because we forgot that FDTD and PML share the same kernel template. To be investigated further.

Edit: indeed it was the PML functor used, not the normal FDTD one

@sbastrakov
Member Author

sbastrakov commented Oct 15, 2021

Commenting out this break, which is optional there (the function works either way), doesn't reduce the stack frame size of the kernel, but seems to largely reduce the register use. Replacing it with a return makes matters worse in that regard, and replacing the range-based for loop with a C-style one doesn't change anything.
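For context, a minimal stand-in for the loop shape being varied here (hypothetical code, not the actual PML functor): a range-based for over a small fixed-size array with an optional early break.

```cpp
// Hypothetical sketch of the pattern under discussion.
__device__ float firstAbove(float const (&values)[8], float threshold)
{
    float result = 0.0f;
    for(float v : values)
    {
        if(v > threshold)
        {
            result = v;
            break; // optional: without it the last match wins instead of the first
        }
    }
    return result;
}
```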

@sbastrakov sbastrakov changed the title Check constexpr and stack frame in AOFDTD Check constexpr and stack frame in PML Oct 15, 2021
@sbastrakov
Member Author

After some more investigation, the effect also depends on the CUDA version used. E.g. with CUDA 11.0 and CUDA 11.4, different kernels show non-zero stack frames for the same setup.

@psychocoderHPC
Member

Some more places with stack frames

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:99                     auto crossedBoundary = pmacc::DataSpace<simDim>::create(0);
        .loc    116 99 44, function_name $L__info_string842, inlined_at 113 74 29

///home/rwidera/workspace/picongpu/include/pmacc/../pmacc/dimensions/DataSpace.hpp:140                 tmp[i] = value;
        .loc    117 140 17, function_name $L__info_string602, inlined_at 116 99 44
        mov.u32         %r354, 0;
        st.local.u32    [%rd2], %r354;
        st.local.u32    [%rd2+4], %r354;
        st.local.u32    [%rd2+8], %r354;
$L__tmp9619:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:102                         if(offsetToTotalOrigin[d] < m_parameters.beginInternalCellsTotalAllBoundaries[d])
        .loc    116 102 53, function_name $L__info_string842, inlined_at 113 74 29
        setp.lt.s32     %p5, %r15, %r91;
        @%p5 bra        $L__BB33_7;
        bra.uni         $L__BB33_4;

$L__BB33_7:
        .loc    116 0 53
        mov.u32         %r354, -1;
$L__tmp9620:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:103                             crossedBoundary[d] = -1;
        .loc    116 103 29, function_name $L__info_string842, inlined_at 113 74 29
        st.local.u32    [%rd2], %r354;
        bra.uni         $L__BB33_8;
$L__tmp9621:

$L__BB33_4:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:104                         else if(offsetToTotalOrigin[d] >= m_parameters.endInternalCellsTotalAllBoundaries[d])
        .loc    116 104 59, function_name $L__info_string842, inlined_at 113 74 29
        setp.lt.s32     %p6, %r15, %r94;
        @%p6 bra        $L__BB33_6;
        bra.uni         $L__BB33_5;
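The st.local stores above point at the usual cause: crossedBoundary is indexed with the runtime loop variable d, and an array indexed dynamically cannot live in registers (registers are not addressable), so it is placed in thread-local memory and shows up as stack frame. A minimal sketch of the effect with hypothetical names; whether the compiler actually promotes the unrolled variant to registers depends on compiler version and flags:

```cpp
// Runtime trip count: arr[d] with unknown d forces arr into local memory.
__device__ int dynamicIndexing(int const* offset, int const* begin, int dim)
{
    int arr[3] = {0, 0, 0};
    for(int d = 0; d < dim; ++d) // dim is not known at compile time
        if(offset[d] < begin[d])
            arr[d] = -1;
    return arr[0] + arr[1] + arr[2];
}

// Compile-time trip count: after full unrolling every index is a constant,
// so the compiler can usually keep arr entirely in registers.
__device__ int staticIndexing(int const* offset, int const* begin)
{
    int arr[3] = {0, 0, 0};
#pragma unroll
    for(int d = 0; d < 3; ++d)
        if(offset[d] < begin[d])
            arr[d] = -1;
    return arr[0] + arr[1] + arr[2];
}
```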

@sbastrakov sbastrakov changed the title Check constexpr and stack frame in PML Stack frame hunt Oct 19, 2021
@psychocoderHPC
Member

With the current dev, I observed stack frames in kernelMoveAndMark with the SPEC benchmark if we use the particle shape PQS:

ptxas info    : Compiling entry function '_ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_' for 'sm_70'
ptxas info    : Function properties for _ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_
    160 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 27664 bytes smem, 512 bytes cmem[0], 16 bytes cmem[2]

@psychocoderHPC
Member

Here is some more information on why it is important to remove all stack frame usage: https://stackoverflow.com/a/7816434
It is not only about performance: stack frames also require additional global memory at runtime. By default, PIConGPU keeps only 300 MiB of device memory free. If we execute a kernel that uses stack frames, the result can be an out-of-memory error at runtime.
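As a rough worked example (assuming an A100-class GPU with 108 SMs and 2048 resident threads per SM; the real numbers depend on the device): local memory is reserved for the maximum number of concurrently resident threads, so a 160-byte stack frame costs about 160 B × 108 × 2048 ≈ 34 MiB of device memory, a sizeable chunk of the ~300 MiB PIConGPU leaves free.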

psychocoderHPC added a commit to psychocoderHPC/picongpu that referenced this issue Oct 20, 2021
fix: one part of ComputationalRadiationPhysics#3870

By using a break within a for loop we triggered stack frame usage
in the GPU kernel.
@steindev
Member

@sbastrakov @psychocoderHPC Any progress or plans for progress here?

@psychocoderHPC
Member

There are still some kernels (e.g. the boundary algorithms) using stack frames that we should fix. There is no fixed plan for when this will happen.

@sbastrakov
Member Author

@psychocoderHPC could you post here the commands to get the stack frame and register information? Both for me, as I've forgotten, and as documentation in case someone else needs it.

@psychocoderHPC
Member

psychocoderHPC commented Nov 26, 2021

@psychocoderHPC could you post here the commands to get the stack frame and register information? Both for me, as I've forgotten, and as documentation in case someone else needs it.

pic-build -f -c "-Dalpaka_CUDA_SHOW_REGISTER=ON -Dalpaka_CUDA_KEEP_FILES=ON -Dalpaka_CUDA_SHOW_CODELINES=ON" 2>&1 | tee reg.txt 
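The per-kernel resource lines then end up in reg.txt; a quick way to pull out the offenders (plain shell, nothing PIConGPU-specific) is e.g. `grep -B 2 'stack frame' reg.txt`, which prints each "bytes stack frame" line together with the mangled kernel name above it. For a standalone .cu file, `nvcc -Xptxas=-v` (or `nvcc --resource-usage`) prints the same information.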
