-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unroll interpolation #3859
Unroll interpolation #3859
Conversation
I seem to be daft. I keep looking at the numbers and see no big improvement by unrolling and often an increase in simulation time with Llama. |
@bernhardmgruber please do not run benchmarks on the rtx. This GPU has to high fluctuations and is not our target. A benchmark on V100 or A100 is required. I find the register fpotprint reduce interesring and we schould check if we see a reduction with TSC too.Default in spec is PQS. |
@bernhardmgruber Is the |
Same data, different representation:
The first column would suggest that the second LLAMA mapping is significantly slower. But it isn't. All 3 runs use the exact same memory layout but produced by more elaborate abstractions. The #pragma unroll helps the compiler in this case to optimize better. This measurement provides an important insight: When benchmarking we also need to pay attention to how much the generated code is impacted by the additional complexity of the used abstractions. |
I know. But this is the best GPU I can access fast. On the dev system, I benchmark with 3 other people using the system at the same time, so the results are probably shaky as well. And on hemera I usually have to wait a day to get a 1h slot on the V100. It's just very hard to benchmark there. I will eventually run all benchmarks on the data center GPUs, but for quick things my local one is much easier.
I was surprised too! But this might also just be pure luck. I guess if the region for |
For gcc we need |
Yes, code complexity can slow things down, this is no surprise. Especially when optimizing code compilers seem to be dumb. But again, I see little differences in the numbers. We are talking few percent here. The question is what unrolling does on a CPU. P.S.: What queues are you using on hemera that you have to wait so long? It would be great to always test on a reliable hardware from the beginning, as your RTX values can give you a wrong impression. |
794f6d1
to
8ce7d7d
Compare
This PR solves an issue I observed during other tests, for an unknown reason we use stack frames in the SPEC example (CUDA 11.4) with the shape PQS: #3870 (comment) Applying this PR is removing the stack frames:
|
I will add a |
SPEC on V100 (developer system):
|
8ce7d7d
to
a53fde5
Compare
# define PMACC_UNROLL(var) | ||
# else | ||
# define PMACC_UNROLL(var) | ||
# warning PMACC_UNROLL is not implemented for your compiler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not throw a warning, the typical user does not know what this means and this is only about performance.
Maybe we should activate it only if someone is compiling in debug mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning is displayed if we did not yet think about a particular compiler. It is not triggered when there is no implementation. So e.g. for MSVC and g++, there is no implementation for this pragma, but you will not get a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK that's fine I thought you will remove the GCC part fully but how it is implemented now it's IMO fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK great! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed a bugreport to GCC, asking to have the feature implemented: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102855
7c764b5
to
dd193df
Compare
Now we have a weird failure about unrolling a loop with 0 iterations: /builds/hzdr/crp/picongpu/include/pmacc/../picongpu/algorithms/FieldToParticleInterpolation.hpp:87:59: note: in instantiation of function template specialization 'picongpu::AssignedTrilinearInterpolation::interpolate<picongpu::particles::shapes::Counter::ChargeAssignmentOnSupport, 1, 0, pmacc::cursor::Cursor<pmacc::cursor::FunctorAccessor<pmacc::algorithm::functor::GetComponent<double>, pmacc::math::Vector<double, 3, pmacc::math::StandardAccessor, pmacc::math::StandardNavigator, pmacc::math::detail::Vector_components<double, 3>>>, pmacc::cursor::CursorNavigator, pmacc::cursor::CT::BufferCursor<pmacc::math::Vector<double, 3, pmacc::math::StandardAccessor, pmacc::math::StandardNavigator, pmacc::math::detail::Vector_components<double, 3>>, pmacc::math::CT::Int<192, 1536, 2147483647>>>>' requested here
result[i] = InterpolationMethod::template interpolate<AssignmentFunction, begin, end>( Essentially, See CI run here: https://gitlab.com/hzdr/crp/picongpu/-/jobs/1697136284 |
The bug you saw will be fixed with #3880 |
@bernhardmgruber please rebase against the dev branch #3880 is now merged. |
vastly improves performance in SPEC benchmark
dd193df
to
74d24df
Compare
This PR places a few
#pragma unroll
s to request the compiler to unroll the 3 loops across for the trilinear interpolation and the 1 loop over the simulation dimensions. This change mostly affects theMoveAndMark
kernel.SPEC benchmark is run on NVIDIA GeForce RTX 2060.
Without unrolling:
With unrolling:
While the unrolling has little to no effect on the current
dev
branch (although CUDA now requires less registers and is a few ms faster). The effect is VERY noticable when more LLAMA magic is involved. See especially thellama (P: SoA, F: AoSSplit)
run. The reason for the big improvement is that the cuSTL Cursors backed by LLAMA are more complex than the BufferCursor. This comlexity can mostly be folded away when enough code is inlined and unrolled together. Also note that applying the unrolling only to theAssignedTrilinearInterpolation
has a worsening effect.I am too hesitant yet to propose this change as it could make the register count explode if the trilinear interpolation accesses a larger range of values. However, with LLAMA such a change is absolutely needed.