
Unroll interpolation #3859

Merged

Conversation

bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Oct 8, 2021

This PR places a few #pragma unrolls to request that the compiler unroll the 3 nested loops of the trilinear interpolation and the 1 loop over the simulation dimensions. This change mostly affects the MoveAndMark kernel.
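For illustration only, here is a minimal CUDA-style sketch of such a loop nest. The function and variable names (`trilinear`, `field`, `wx`/`wy`/`wz`) are made up; the actual changes are in AssignedTrilinearInterpolation.hpp and FieldToParticleInterpolation.hpp, which work on compile-time begin/end bounds:

```cpp
// Hypothetical sketch, not the PIConGPU code: requesting unrolling of the
// three nested interpolation loops so the compiler can inline and fold away
// the cursor abstraction. FieldCursor is assumed to provide operator()(x, y, z).
template<typename FieldCursor>
__device__ float trilinear(
    FieldCursor const& field,
    float const (&wx)[2],
    float const (&wy)[2],
    float const (&wz)[2])
{
    float result = 0.0f;
#pragma unroll
    for(int z = 0; z < 2; ++z)
#pragma unroll
        for(int y = 0; y < 2; ++y)
#pragma unroll
            for(int x = 0; x < 2; ++x)
                result += wz[z] * wy[y] * wx[x] * field(x, y, z);
    return result;
}
```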

The SPEC benchmark was run on an NVIDIA GeForce RTX 2060.

Without unrolling:

	dev
		calculation  simulation time: 16sec 796msec = 16.796 sec
	llama (P: SoA, F: AoS)
		calculation  simulation time: 16sec 510msec = 16.510 sec
	llama (P: SoA, F: AoSSplit)
		calculation  simulation time: 26sec 350msec = 26.350 sec

With unrolling:

	dev
		calculation  simulation time: 16sec 464msec = 16.464 sec
	llama (P: SoA, F: AoS) #pragma unroll in AssignedTrilinearInterpolation.hpp
		calculation  simulation time: 17sec 793msec = 17.793 sec
	llama (P: SoA, F: AoSSplit) #pragma unroll in AssignedTrilinearInterpolation.hpp
		calculation  simulation time: 29sec 148msec = 29.148 sec
	llama (P: SoA, F: AoS) #pragma unroll in AssignedTrilinearInterpolation.hpp and FieldToParticleInterpolation.hpp
		calculation  simulation time: 16sec 708msec = 16.708 sec
	llama (P: SoA, F: AoSSplit) #pragma unroll in AssignedTrilinearInterpolation.hpp and FieldToParticleInterpolation.hpp
		calculation  simulation time: 16sec 601msec = 16.601 sec

The unrolling has little to no effect on the current dev branch (although CUDA now requires fewer registers and is a few ms faster), but the effect is VERY noticeable when more LLAMA magic is involved. See especially the llama (P: SoA, F: AoSSplit) run. The reason for the big improvement is that the cuSTL Cursors backed by LLAMA are more complex than the BufferCursor. This complexity can mostly be folded away when enough code is inlined and unrolled together. Also note that applying the unrolling only to the AssignedTrilinearInterpolation worsens the results.

I am still hesitant to propose this change, as it could make the register count explode if the trilinear interpolation accesses a larger range of values. However, with LLAMA such a change is absolutely needed.

@bussmann bussmann marked this pull request as ready for review October 10, 2021 18:21
@bussmann
Member

I seem to be daft. I keep looking at the numbers and see no big improvement by unrolling and often an increase in simulation time with Llama.

@psychocoderHPC
Member

@bernhardmgruber please do not run benchmarks on the RTX. This GPU has too high fluctuations and is not our target. A benchmark on a V100 or A100 is required.

I find the register footprint reduction interesting, and we should check if we see a reduction with TSC too. The default in SPEC is PQS.

@psychocoderHPC
Member

@bernhardmgruber Is the #pragma unroll keyword known by host compilers too? I would say gcc and clang do not know the keyword.
If this keyword is not known by all compilers, we need to use #define PMACC_UNROLL _Pragma("unroll"), and the definition should best be ported to alpaka soon so that we reduce compiler handling in PMacc and PIConGPU.

@psychocoderHPC psychocoderHPC added the refactoring code change to improve performance or to unify a concept but does not change public API label Oct 11, 2021
@bernhardmgruber bernhardmgruber marked this pull request as draft October 11, 2021 14:59
@bernhardmgruber
Contributor Author

I seem to be daft. I keep looking at the numbers and see no big improvement by unrolling and often an increase in simulation time with Llama.

Same data, different representation:

|                             | no #pragma unroll | with #pragma unroll |
| --------------------------- | ----------------- | ------------------- |
| dev                         | 16.796 sec        | 16.464 sec          |
| llama (P: SoA, F: AoS)      | 16.510 sec        | 16.708 sec          |
| llama (P: SoA, F: AoSSplit) | 26.350 sec        | 16.601 sec          |

The first column would suggest that the second LLAMA mapping is significantly slower. But it isn't. All 3 runs use the exact same memory layout, just produced by more elaborate abstractions. The #pragma unroll helps the compiler in this case to optimize better. This measurement provides an important insight: when benchmarking, we also need to pay attention to how much the generated code is impacted by the additional complexity of the used abstractions.

@bernhardmgruber
Contributor Author

@bernhardmgruber please do not run benchmarks on the RTX. This GPU has too high fluctuations and is not our target. A benchmark on a V100 or A100 is required.

I know. But this is the best GPU I can access quickly. On the dev system, I benchmark with 3 other people using the system at the same time, so the results are probably shaky as well. And on hemera I usually have to wait a day to get a 1h slot on the V100. It's just very hard to benchmark there. I will eventually run all benchmarks on the data center GPUs, but for quick things my local one is much easier.

I find the register footprint reduction interesting, and we should check if we see a reduction with TSC too. The default in SPEC is PQS.

I was surprised too! But this might also just be pure luck. I guess if the region for PQS is larger than 4x4x4 we can easily explode in registers again. So the #pragma unroll is very dangerous. But it shows that a slow LLAMA program is not necessarily a memory layout issue.

@bernhardmgruber
Contributor Author

@bernhardmgruber Is the #pragma unroll keyword known by host compilers too? I would say gcc and clang do not know the keyword. If this keyword is not known by all compilers, we need to use #define PMACC_UNROLL _Pragma("unroll"), and the definition should best be ported to alpaka soon so that we reduce compiler handling in PMacc and PIConGPU.

For gcc we need #pragma GCC unroll n, so yes, it is currently not portable.

@bussmann
Member

Yes, code complexity can slow things down, this is no surprise. Especially when optimizing code, compilers seem to be dumb. But again, I see little difference in the numbers. We are talking about a few percent here. The question is what unrolling does on a CPU.

P.S.: What queues are you using on hemera that you have to wait so long? It would be great to always test on reliable hardware from the beginning, as your RTX values can give you a wrong impression.

@psychocoderHPC
Member

psychocoderHPC commented Oct 19, 2021

This PR solves an issue I observed during other tests: for an unknown reason, we use stack frames in the SPEC example (CUDA 11.4) with the shape PQS: #3870 (comment)

Applying this PR removes the stack frames:

ptxas info    : Compiling entry function '_ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_' for 'sm_70'
ptxas info    : Function properties for _ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 27664 bytes smem, 512 bytes cmem[0], 16 bytes cmem[2]

@bernhardmgruber
Contributor Author

If this keyword is not known by all compilers, we need to use #define PMACC_UNROLL _Pragma("unroll"), and the definition should best be ported to alpaka soon so that we reduce compiler handling in PMacc and PIConGPU.

I will add a PMACC_UNROLL macro.
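A minimal sketch of what such a macro could look like, assuming the usual compiler detection macros and mirroring the behavior discussed in the review comments below (pragma for nvcc/clang, silent no-op for g++/MSVC, warning for unknown compilers); the actual PMacc implementation may differ:

```cpp
// Hypothetical sketch, not the exact PMacc code: compilers that understand
// "#pragma unroll" (nvcc, clang) get the pragma, g++ and MSVC deliberately
// get a no-op, and a compiler we have not thought about yet gets a warning.
// Note: this simple form ignores the requested trip count "var".
#if defined(__CUDACC__) || defined(__clang__)
#    define PMACC_UNROLL(var) _Pragma("unroll")
#elif defined(__GNUC__) || defined(_MSC_VER)
#    define PMACC_UNROLL(var) // intentionally empty, no portable spelling used here
#else
#    define PMACC_UNROLL(var)
#    warning PMACC_UNROLL is not implemented for your compiler
#endif
```

A use site would then read `PMACC_UNROLL(8) for(int i = 0; i < 8; ++i) ...`, keeping the call sites identical across compilers.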

@psychocoderHPC
Member

SPEC on V100 (developer system):

	• dev
./bin/picongpu -d 1 1 1 -g 256 256 256 -s 100 -p 5 --periodic 1 1 1
...
100 % =      100 | time elapsed:            25sec 735msec | avg time per step: 272msec
calculation  simulation time: 25sec 735msec = 25.735 sec

	• dev + cherry-pick this PR
./bin/picongpu -d 1 1 1 -g 256 256 256 -s 100 -p 5 --periodic 1 1 1
...
100 % =      100 | time elapsed:            21sec 411msec | avg time per step: 233msec
calculation  simulation time: 21sec 411msec = 21.411 sec

@bernhardmgruber bernhardmgruber marked this pull request as ready for review October 19, 2021 20:46
# define PMACC_UNROLL(var)
# else
# define PMACC_UNROLL(var)
# warning PMACC_UNROLL is not implemented for your compiler
Member


We should not throw a warning; the typical user does not know what this means, and this is only about performance.
Maybe we should activate it only if someone is compiling in debug mode.

Contributor Author


This warning is displayed if we did not yet think about a particular compiler. It is not triggered when there is no implementation. So e.g. for MSVC and g++, there is no implementation for this pragma, but you will not get a warning.

Member


OK, that's fine. I thought you would remove the GCC part fully, but how it is implemented now is IMO fine.

Contributor Author


OK great! :)

Contributor Author


I filed a bug report with GCC, asking to have the feature implemented: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102855

@bernhardmgruber
Contributor Author

bernhardmgruber commented Oct 20, 2021

Now we have a weird failure about unrolling a loop with 0 iterations:

/builds/hzdr/crp/picongpu/include/pmacc/../picongpu/algorithms/FieldToParticleInterpolation.hpp:87:59: note: in instantiation of function template specialization 'picongpu::AssignedTrilinearInterpolation::interpolate<picongpu::particles::shapes::Counter::ChargeAssignmentOnSupport, 1, 0, pmacc::cursor::Cursor<pmacc::cursor::FunctorAccessor<pmacc::algorithm::functor::GetComponent<double>, pmacc::math::Vector<double, 3, pmacc::math::StandardAccessor, pmacc::math::StandardNavigator, pmacc::math::detail::Vector_components<double, 3>>>, pmacc::cursor::CursorNavigator, pmacc::cursor::CT::BufferCursor<pmacc::math::Vector<double, 3, pmacc::math::StandardAccessor, pmacc::math::StandardNavigator, pmacc::math::detail::Vector_components<double, 3>>, pmacc::math::CT::Int<192, 1536, 2147483647>>>>' requested here
                result[i] = InterpolationMethod::template interpolate<AssignmentFunction, begin, end>(

Essentially, interpolate is instantiated with begin = 1 and end = 0. Looking at AssignedTrilinearInterpolation, that case returned 0 before. My question is, however, whether this is a valid use case. If yes, I would need some additional code in the interpolator.

See CI run here: https://gitlab.com/hzdr/crp/picongpu/-/jobs/1697136284
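If such an empty range turned out to be legitimate, one conceivable guard, sketched here with made-up names and assuming C++17's if constexpr (this is not the actual interpolator and not the fix that was eventually chosen), would be to skip the loop at compile time so no zero-iteration loop carries an unroll annotation:

```cpp
// Hypothetical sketch only: bail out at compile time when the compile-time
// range is empty (T_begin > T_end, e.g. begin = 1, end = 0), so the unrolled
// loop body is never even instantiated for that case.
template<int T_begin, int T_end, typename F>
__device__ float accumulate(F&& f)
{
    if constexpr(T_begin > T_end)
        return 0.0f; // empty support region contributes nothing
    else
    {
        float result = 0.0f;
#pragma unroll
        for(int i = T_begin; i <= T_end; ++i)
            result += f(i); // f stands in for assignment function * field value
        return result;
    }
}
```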

@psychocoderHPC
Member

The bug you saw will be fixed with #3880

@psychocoderHPC
Member

@bernhardmgruber please rebase against the dev branch; #3880 is now merged.

vastly improves performance in SPEC benchmark
@psychocoderHPC psychocoderHPC merged commit f180805 into ComputationalRadiationPhysics:dev Oct 22, 2021
@bernhardmgruber bernhardmgruber deleted the unroll branch October 25, 2021 09:37