
Unroll interpolation #3859

Merged

Conversation

bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Oct 8, 2021

This PR places a few #pragma unrolls to request that the compiler unroll the 3 nested loops of the trilinear interpolation and the 1 loop over the simulation dimensions. This change mostly affects the MoveAndMark kernel.
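For illustration only, here is a minimal CUDA-style sketch of such a loop nest. The function and variable names (`trilinear`, `field`, `wx`/`wy`/`wz`) are made up; the actual changes are in AssignedTrilinearInterpolation.hpp and FieldToParticleInterpolation.hpp, which work on compile-time begin/end bounds:

```cpp
// Hypothetical sketch, not the PIConGPU code: requesting unrolling of the
// three nested interpolation loops so the compiler can inline and fold away
// the cursor abstraction. FieldCursor is assumed to provide operator()(x, y, z).
template<typename FieldCursor>
__device__ float trilinear(
    FieldCursor const& field,
    float const (&wx)[2],
    float const (&wy)[2],
    float const (&wz)[2])
{
    float result = 0.0f;
#pragma unroll
    for(int z = 0; z < 2; ++z)
#pragma unroll
        for(int y = 0; y < 2; ++y)
#pragma unroll
            for(int x = 0; x < 2; ++x)
                result += wz[z] * wy[y] * wx[x] * field(x, y, z);
    return result;
}
```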

The SPEC benchmark was run on an NVIDIA GeForce RTX 2060.

Without unrolling:

	dev
		calculation  simulation time: 16sec 796msec = 16.796 sec
	llama (P: SoA, F: AoS)
		calculation  simulation time: 16sec 510msec = 16.510 sec
	llama (P: SoA, F: AoSSplit)
		calculation  simulation time: 26sec 350msec = 26.350 sec

With unrolling:

	dev
		calculation  simulation time: 16sec 464msec = 16.464 sec
	llama (P: SoA, F: AoS) #pragma unroll in AssignedTrilinearInterpolation.hpp
		calculation  simulation time: 17sec 793msec = 17.793 sec
	llama (P: SoA, F: AoSSplit) #pragma unroll in AssignedTrilinearInterpolation.hpp
		calculation  simulation time: 29sec 148msec = 29.148 sec
	llama (P: SoA, F: AoS) #pragma unroll in AssignedTrilinearInterpolation.hpp and FieldToParticleInterpolation.hpp
		calculation  simulation time: 16sec 708msec = 16.708 sec
	llama (P: SoA, F: AoSSplit) #pragma unroll in AssignedTrilinearInterpolation.hpp and FieldToParticleInterpolation.hpp
		calculation  simulation time: 16sec 601msec = 16.601 sec

The unrolling has little to no effect on the current dev branch (although CUDA now requires fewer registers and is a few ms faster), but the effect is VERY noticeable when more LLAMA magic is involved. See especially the llama (P: SoA, F: AoSSplit) run. The reason for the big improvement is that the cuSTL Cursors backed by LLAMA are more complex than the BufferCursor. This complexity can mostly be folded away when enough code is inlined and unrolled together. Also note that applying the unrolling only to the AssignedTrilinearInterpolation worsens the results.

I am still hesitant to propose this change, as it could make the register count explode if the trilinear interpolation accesses a larger range of values. However, with LLAMA such a change is absolutely needed.

@bussmann bussmann marked this pull request as ready for review October 10, 2021 18:21
@bussmann
Member

I seem to be daft. I keep looking at the numbers and see no big improvement by unrolling and often an increase in simulation time with Llama.

@psychocoderHPC
Member

@bernhardmgruber please do not run benchmarks on the RTX. This GPU has too high fluctuations and is not our target. A benchmark on a V100 or A100 is required.

I find the register footprint reduction interesting, and we should check if we see a reduction with TSC too. The default in SPEC is PQS.

@psychocoderHPC
Member

@bernhardmgruber Is the #pragma unroll keyword known by host compilers too? I would say gcc and clang do not know the keyword.
If this keyword is not known by all compilers, we need to use #define PMACC_UNROLL _Pragma("unroll"), and the definition should best be ported to alpaka soon so that we reduce compiler handling in PMacc and PIConGPU.

@psychocoderHPC psychocoderHPC added the refactoring code change to improve performance or to unify a concept but does not change public API label Oct 11, 2021
@bernhardmgruber bernhardmgruber marked this pull request as draft October 11, 2021 14:59
@bernhardmgruber
Contributor Author

I seem to be daft. I keep looking at the numbers and see no big improvement by unrolling and often an increase in simulation time with Llama.

Same data, different representation:

|                             | no #pragma unroll | with #pragma unroll |
| --------------------------- | ----------------- | ------------------- |
| dev                         | 16.796 sec        | 16.464 sec          |
| llama (P: SoA, F: AoS)      | 16.510 sec        | 16.708 sec          |
| llama (P: SoA, F: AoSSplit) | 26.350 sec        | 16.601 sec          |

The first column would suggest that the second LLAMA mapping is significantly slower. But it isn't. All 3 runs use the exact same memory layout, just produced by more elaborate abstractions. The #pragma unroll helps the compiler in this case to optimize better. This measurement provides an important insight: when benchmarking, we also need to pay attention to how much the generated code is impacted by the additional complexity of the used abstractions.

@bernhardmgruber
Contributor Author

@bernhardmgruber please do not run benchmarks on the RTX. This GPU has too high fluctuations and is not our target. A benchmark on a V100 or A100 is required.

I know. But this is the best GPU I can access quickly. On the dev system, I benchmark with 3 other people using the system at the same time, so the results are probably shaky as well. And on hemera I usually have to wait a day to get a 1h slot on the V100. It's just very hard to benchmark there. I will eventually run all benchmarks on the data center GPUs, but for quick things my local one is much easier.

I find the register footprint reduction interesting, and we should check if we see a reduction with TSC too. The default in SPEC is PQS.

I was surprised too! But this might also just be pure luck. I guess if the region for PQS is larger than 4x4x4 we can easily explode in registers again. So the #pragma unroll is very dangerous. But it shows that a slow LLAMA program is not necessarily a memory layout issue.

@bernhardmgruber
Contributor Author

@bernhardmgruber Is the #pragma unroll keyword known by host compilers too? I would say gcc and clang do not know the keyword. If this keyword is not known by all compilers, we need to use #define PMACC_UNROLL _Pragma("unroll"), and the definition should best be ported to alpaka soon so that we reduce compiler handling in PMacc and PIConGPU.

For gcc we need #pragma GCC unroll n, so yes, it is currently not portable.

@bussmann
Member

Yes, code complexity can slow things down, this is no surprise. Especially when optimizing code, compilers seem to be dumb. But again, I see little difference in the numbers. We are talking about a few percent here. The question is what unrolling does on a CPU.

P.S.: What queues are you using on hemera that you have to wait so long? It would be great to always test on reliable hardware from the beginning, as your RTX values can give you a wrong impression.

@psychocoderHPC
Member

psychocoderHPC commented Oct 19, 2021

This PR solves an issue I observed during other tests: for an unknown reason, we use stack frames in the SPEC example (CUDA 11.4) with the shape PQS: #3870 (comment)

Applying this PR removes the stack frames:

ptxas info    : Compiling entry function '_ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_' for 'sm_70'
ptxas info    : Function properties for _ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_3VecIT0_T1_EET2_DpT3_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 27664 bytes smem, 512 bytes cmem[0], 16 bytes cmem[2]

@bernhardmgruber
Contributor Author

If this keyword is not known by all compilers, we need to use #define PMACC_UNROLL _Pragma("unroll"), and the definition should best be ported to alpaka soon so that we reduce compiler handling in PMacc and PIConGPU.

I will add a PMACC_UNROLL macro.
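A minimal sketch of what such a macro could look like, assuming the usual compiler detection macros and mirroring the behavior discussed in the review comments below (pragma for nvcc/clang, silent no-op for g++/MSVC, warning for unknown compilers); the actual PMacc implementation may differ:

```cpp
// Hypothetical sketch, not the exact PMacc code: compilers that understand
// "#pragma unroll" (nvcc, clang) get the pragma, g++ and MSVC deliberately
// get a no-op, and a compiler we have not thought about yet gets a warning.
// Note: this simple form ignores the requested trip count "var".
#if defined(__CUDACC__) || defined(__clang__)
#    define PMACC_UNROLL(var) _Pragma("unroll")
#elif defined(__GNUC__) || defined(_MSC_VER)
#    define PMACC_UNROLL(var) // intentionally empty, no portable spelling used here
#else
#    define PMACC_UNROLL(var)
#    warning PMACC_UNROLL is not implemented for your compiler
#endif
```

A use site would then read `PMACC_UNROLL(8) for(int i = 0; i < 8; ++i) ...`, keeping the call sites identical across compilers.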

@psychocoderHPC
Member

SPEC on V100 (developer system):

	• dev
./bin/picongpu -d 1 1 1 -g 256 256 256 -s 100 -p 5 --periodic 1 1 1
...
100 % =      100 | time elapsed:            25sec 735msec | avg time per step: 272msec
calculation  simulation time: 25sec 735msec = 25.735 sec

	• dev + cherry-pick this PR
./bin/picongpu -d 1 1 1 -g 256 256 256 -s 100 -p 5 --periodic 1 1 1
...
100 % =      100 | time elapsed:            21sec 411msec | avg time per step: 233msec
calculation  simulation time: 21sec 411msec = 21.411 sec

@bernhardmgruber bernhardmgruber marked this pull request as ready for review October 19, 2021 20:46
# define PMACC_UNROLL(var)
# else
# define PMACC_UNROLL(var)
# warning PMACC_UNROLL is not implemented for your compiler
Member


We should not throw a warning; the typical user does not know what this means, and this is only about performance.
Maybe we should activate it only if someone is compiling in debug mode.

Contributor Author


This warning is displayed if we did not yet think about a particular compiler. It is not triggered when there is no implementation. So e.g. for MSVC and g++, there is no implementation for this pragma, but you will not get a warning.

Member


OK, that's fine. I thought you would remove the GCC part fully, but how it is implemented now is IMO fine.

Contributor Author


OK great! :)

Contributor Author


I filed a bug report with GCC, asking to have the feature implemented: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102855

@bernhardmgruber
Contributor Author

bernhardmgruber commented Oct 20, 2021

Now we have a weird failure about unrolling a loop with 0 iterations:

/builds/hzdr/crp/picongpu/include/pmacc/../picongpu/algorithms/FieldToParticleInterpolation.hpp:87:59: note: in instantiation of function template specialization 'picongpu::AssignedTrilinearInterpolation::interpolate<picongpu::particles::shapes::Counter::ChargeAssignmentOnSupport, 1, 0, pmacc::cursor::Cursor<pmacc::cursor::FunctorAccessor<pmacc::algorithm::functor::GetComponent<double>, pmacc::math::Vector<double, 3, pmacc::math::StandardAccessor, pmacc::math::StandardNavigator, pmacc::math::detail::Vector_components<double, 3>>>, pmacc::cursor::CursorNavigator, pmacc::cursor::CT::BufferCursor<pmacc::math::Vector<double, 3, pmacc::math::StandardAccessor, pmacc::math::StandardNavigator, pmacc::math::detail::Vector_components<double, 3>>, pmacc::math::CT::Int<192, 1536, 2147483647>>>>' requested here
                result[i] = InterpolationMethod::template interpolate<AssignmentFunction, begin, end>(

Essentially, interpolate is instantiated with begin = 1 and end = 0. Looking at AssignedTrilinearInterpolation, that case returned 0 before. My question is, however, whether this is a valid use case. If yes, I would need some additional code in the interpolator.

See CI run here: https://gitlab.com/hzdr/crp/picongpu/-/jobs/1697136284
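If such an empty range turned out to be legitimate, one conceivable guard, sketched here with made-up names and assuming C++17's if constexpr (this is not the actual interpolator and not the fix that was eventually chosen), would be to skip the loop at compile time so no zero-iteration loop carries an unroll annotation:

```cpp
// Hypothetical sketch only: bail out at compile time when the compile-time
// range is empty (T_begin > T_end, e.g. begin = 1, end = 0), so the unrolled
// loop body is never even instantiated for that case.
template<int T_begin, int T_end, typename F>
__device__ float accumulate(F&& f)
{
    if constexpr(T_begin > T_end)
        return 0.0f; // empty support region contributes nothing
    else
    {
        float result = 0.0f;
#pragma unroll
        for(int i = T_begin; i <= T_end; ++i)
            result += f(i); // f stands in for assignment function * field value
        return result;
    }
}
```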

@psychocoderHPC
Member

The bug you saw will be fixed with #3880

@psychocoderHPC
Member

@bernhardmgruber please rebase against the dev branch; #3880 is now merged.

vastly improves performance in SPEC benchmark
@psychocoderHPC psychocoderHPC merged commit f180805 into ComputationalRadiationPhysics:dev Oct 22, 2021
@bernhardmgruber bernhardmgruber deleted the unroll branch October 25, 2021 09:37