Some Very Slow CPU Plugins #2802
Comments
Both slow plugins are written in the lockstep parallelism schema, otherwise they would not produce correct results. In both cases the reason why they are slow is IMO the global reduce within one device.
Yep, might be. Interesting though, the global reduces such as "count" (0D reduce) are still ok. Just the phase-space (2D map-reduce) and energy histogram (1D map-reduce) are really slow.
That's weird, because the blocks of those are reduced just once at the end, and most time is spent doing the map to bins inside a collaborating block (which of course is also atomic). Unless "between" meant "within a block" in your comment. Both plugins are not too different from what we do in the current deposition. So just to be sure, what is slow in Alpaka 0.3.4: an atomic within a collaborating block (aka "shared mem") or an atomic across blocks (aka "global mem")?
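For readers following along: the two-level map-reduce pattern described here (bin into a block-local histogram first, then fold each block's result into the global one with a single atomic per bin) can be sketched as below. This is an illustrative standalone sketch, not PIConGPU's actual plugin code; `histogram` and all its parameters are made-up names.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative two-level map-reduce histogram (not PIConGPU's API).
// Each "block" first accumulates into a private, block-local histogram
// (cheap, no cross-block traffic), then performs one atomic add per bin
// into the global histogram -- so the global reduce happens only once.
std::vector<long> histogram(const std::vector<double>& data,
                            int numBins, double lo, double hi,
                            int numBlocks)
{
    std::vector<std::atomic<long>> global(numBins);
    for (auto& b : global) b.store(0);

    #pragma omp parallel for
    for (int block = 0; block < numBlocks; ++block)
    {
        std::vector<long> local(numBins, 0); // plays the role of "shared mem"
        for (std::size_t i = block; i < data.size();
             i += static_cast<std::size_t>(numBlocks))
        {
            int bin = static_cast<int>((data[i] - lo) / (hi - lo) * numBins);
            if (bin >= 0 && bin < numBins)
                ++local[bin]; // block-local update, no global atomic here
        }
        for (int b = 0; b < numBins; ++b) // single global reduce at the end
            if (local[b] != 0)
                global[b].fetch_add(local[b], std::memory_order_relaxed);
    }

    return std::vector<long>(global.begin(), global.end());
}
```

If atomics across blocks are slow in a backend, the final fold is the only place this pattern pays that cost, which is why the per-bin map inside a block dominating the runtime is surprising.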
Across blocks within a grid.
Yes, that makes sense. I just wanted to put in brackets what we are doing in those two plugins currently on GPU.
After about 9 hours with no initial output, I can't say for sure whether it's not also deadlocking (with high CPU load) in some cases for those plugins, or whether the barriers are just extraordinarily slow currently. Nevertheless, it does not seem that (multiple) energy histograms and phase space can be used in CPU production yet due to this.
cc @sbastrakov @tdd11235813 this is the Alpaka 0.3.4 OpenMP 2.0 (blocks) map-reduce performance issue we currently face in PIConGPU. Help wanted! @psychocoderHPC just recently implemented proper atomics in https://github.com/ComputationalRadiationPhysics/alpaka/pull/664 :) We could update PIConGPU.
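For context on why "proper atomics" matter here: an atomic update across blocks that is emulated with a lock or critical section serializes every single update, while a compare-exchange (CAS) loop only retries under actual contention. A minimal sketch of the CAS-based variant follows; this is illustrative only and is not Alpaka's implementation (`atomicAddCas` is a made-up name).

```cpp
#include <atomic>

// Illustrative CAS-loop atomic add for double
// (std::atomic<double>::fetch_add only exists since C++20).
// This is the kind of primitive a backend must provide so that
// atomics across blocks don't degrade into a global lock.
double atomicAddCas(std::atomic<double>& target, double value)
{
    double expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(
        expected, expected + value, std::memory_order_relaxed))
    {
        // On failure, 'expected' was refreshed with the current value; retry.
    }
    return expected; // value before the add, matching CUDA atomicAdd semantics
}
```

Uncontended, each call is essentially one load plus one CAS; only concurrent writers to the same address loop, which is exactly the behavior a per-bin global reduce relies on.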
@psychocoderHPC I updated my local alpaka on PIConGPU 0.4.1 to https://github.com/ComputationalRadiationPhysics/alpaka/pull/698 and it improved the situation significantly. (Using GCC 7.3.0 on Hemera with the OpenMP 2 blocks backend, with OpenMP 4.5 available.) Thanks a ton, this is awesome! ✨ 💖
This issue will receive a two-fold fix:
- for the current stable PIConGPU
- for the upcoming next stable PIConGPU 0.5/1.0 release and current PIConGPU
Some of our regularly used plugins seem to be really, really slow with our `omp2b` backend. A few that stand out, especially with the very low particle number in my setup:

OK, although just one core per device seems to do the work (maybe I observed on a node with only few particles):

(probably the same routines used as prep for HDF5/ADIOS output?)

(with `--adios.compression blosc:threshold=2048,shuffle=bit,lvl=1,threads=20,compressor=zstd`) OK:
Environment
- `defq` (24 devices)
- `defq_picongpu.profile` & `defq.tpl`
- `8x8x4`