Some Very Slow CPU Plugins #2802

Open · ax3l opened this issue Nov 13, 2018 · 8 comments

Labels: affects latest release, backend: omp2b, bug, component: plugin, refactoring

Comments

ax3l (Member) commented Nov 13, 2018

Some of our regularly used plugins seem to be really, really slow with our omp2b backend.

A few that stand out, especially with the very low particle number in my setup:

  • energy histogram (1D map-reduce)
  • phase space (2D map-reduce)

OK, although just one core per device seems to do the work (maybe I observed this on a node with only few particles):

  • CountParticles: count macro particles of a species
    (probably the same routines used as prep for HDF5/ADIOS output?)
  • ADIOS (1 file per device default; --adios.compression blosc:threshold=2048,shuffle=bit,lvl=1,threads=20,compressor=zstd)

OK:

  • PNG (40% size, ~100 KByte per image)
  • sumEnergy (fields & particles)
  • HDF5 (as usual a bit slow)

Environment

  • Hemera (HZDR) CPUs in defq (24 devices)
  • PIConGPU 0.4.1
  • default defq_picongpu.profile & defq.tpl
  • default supercell 8x8x4
  • box is 1536x2880x1536; cubic cells (dx=dy=dz)
  • 3 ion species and 1 electron species (total of ~1.e9 particles; simulation is field dominated)
ax3l added the component: plugin and backend: omp2b labels Nov 13, 2018
ax3l added the affects latest release label Nov 13, 2018
ax3l added the performance and refactoring labels and removed the performance label Nov 13, 2018
psychocoderHPC (Member) commented Nov 13, 2018

Both slow plugins are written in the lockstep parallelism scheme, otherwise they would not produce correct results. In both cases the reason why they are slow is, IMO, the global reduce within one device.
It could be faster if we update to the latest alpaka, because I implemented native OpenMP atomics there. In alpaka 0.3.x, atomics between alpaka blocks are OpenMP barriers, which are very slow.
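
For intuition, here is a minimal sketch (not alpaka's actual code; the function names are made up) of why the implementation strategy matters so much on an OpenMP backend: emulating a grid-wide atomic with a critical section serializes every update through one lock, while a native OpenMP atomic compiles to a hardware read-modify-write.

```cpp
#include <omp.h>

// Hypothetical sketch of two ways to implement a "grid-wide" atomic add
// on an OpenMP backend. Names are illustrative, not alpaka's API.

// Slow variant (roughly the alpaka 0.3.x behavior described above):
// every update funnels through a single named critical section, so all
// threads of the grid serialize on one lock.
inline void atomicAddCritical(double* addr, double value)
{
#pragma omp critical(emulatedAtomic)
    {
        *addr += value;
    }
}

// Fast variant (what native OpenMP atomics enable): compiles down to a
// hardware fetch-add, so threads only contend on the cache line.
inline void atomicAddNative(double* addr, double value)
{
#pragma omp atomic
    *addr += value;
}
```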

ax3l (Member, Author) commented Nov 13, 2018

Yep, might be. Interestingly though, the global reduces such as "count" (a 0D reduce) are still OK. Just the phase space (2D map-reduce) and energy histogram (1D map-reduce) are really slow.

Atomics between alpaka blocks are [...] very slow

That's weird, because the blocks of those are reduced just once in the end, and most time is spent doing the map to bins inside a collaborating block (which of course is also atomic). Unless "between" meant "within a block" in your comment.

Both plugins are not too different from what we do in current deposition.

So just to be sure, what is slow in Alpaka 0.3.4: an atomic within a collaborating block (aka "shared mem") or an atomic across blocks (aka "global mem")?
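
For reference, here is a hypothetical CUDA-style sketch of the pattern just described (NUM_BINS and binOf are placeholders, not the plugins' actual code): the map phase bins particles with shared-memory atomics inside a collaborating block, and the reduce phase merges each block's partial histogram with one global-memory atomic per bin.

```cpp
// Hypothetical sketch of the 1D map-reduce (energy histogram) pattern:
// shared-memory atomics inside a block, one grid-wide reduce at the end.
// NUM_BINS and binOf() are placeholders, not PIConGPU code.

constexpr int NUM_BINS = 1024;

__device__ int binOf(float energy)
{
    // placeholder binning: clamp a linear mapping into [0, NUM_BINS)
    int bin = static_cast<int>(energy);
    return bin < 0 ? 0 : (bin >= NUM_BINS ? NUM_BINS - 1 : bin);
}

__global__ void energyHistogram(float const* energies, int n, unsigned int* globalHist)
{
    __shared__ unsigned int localHist[NUM_BINS];

    // zero the block-local histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        localHist[i] = 0u;
    __syncthreads();

    // map: shared-memory atomics within the collaborating block
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[binOf(energies[i])], 1u);
    __syncthreads();

    // reduce: one global-memory atomic per bin and block, "just once in the end"
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&globalHist[i], localHist[i]);
}
```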

psychocoderHPC (Member) commented Nov 13, 2018

Across blocks within a grid.
But please do not associate global memory with across-block atomics: you can also use atomics that are only safe between threads of one alpaka block on global memory. This means alpaka atomics are independent of the memory hierarchy.
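
To illustrate with code (a sketch against the newer alpaka API, where atomics take an explicit hierarchy tag; the exact names differ in alpaka 0.3.x, so treat them as approximate): the hierarchy parameter selects the scope the atomic is safe for, independently of which memory space the pointer targets.

```cpp
#include <alpaka/alpaka.hpp>

// Sketch assuming the post-0.3.x alpaka API (atomicAdd with a hierarchy
// tag); treat the exact names as approximate. The hierarchy parameter
// selects the synchronization scope independently of the memory space
// `counter` lives in.
template<typename TAcc>
ALPAKA_FN_ACC void update(TAcc const& acc, int* counter)
{
    // safe between all threads of all blocks in the grid
    // (what the slow plugins need for their final reduce)
    alpaka::atomicAdd(acc, counter, 1, alpaka::hierarchy::Blocks{});

    // only safe between threads of a single alpaka block, even if
    // `counter` points into global memory: cheaper, weaker guarantee
    alpaka::atomicAdd(acc, counter, 1, alpaka::hierarchy::Threads{});
}
```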

ax3l (Member, Author) commented Nov 13, 2018

Yes, that makes sense. I just wanted to put in brackets what we are doing in those two plugins currently on GPU.

ax3l (Member, Author) commented Nov 14, 2018

After about 9 hours with no initial output, I can't say for sure whether it's not also deadlocking (with high CPU load) in some cases for those plugins, or whether the barriers are just extraordinarily slow currently. Either way, it does not seem that (multiple) energy histograms and phase space can be used in CPU production runs yet.

ax3l added the bug label Nov 14, 2018
ax3l (Member, Author) commented Nov 14, 2018

cc @sbastrakov @tdd11235813 this is the alpaka 0.3.4 OpenMP 2.0 (blocks) map-reduce performance issue we currently face in PIConGPU.

Help wanted, @psychocoderHPC just recently implemented proper atomics in https://github.com/ComputationalRadiationPhysics/alpaka/pull/664 :)

We could update PIConGPU dev to alpaka dev and adopt all upcoming alpaka 0.4.0 API changes.

ax3l (Member, Author) commented Nov 14, 2018

@psychocoderHPC I updated my local alpaka on PIConGPU 0.4.1 to https://github.com/ComputationalRadiationPhysics/alpaka/pull/698 and it improved the situation significantly. (Using GCC 7.3.0 on Hemera with the OpenMP 2 blocks backend, though OpenMP 4.5 is available.)

Thanks a ton, this is awesome! ✨ 💖

ax3l mentioned this issue Nov 18, 2018
ax3l (Member, Author) commented Nov 18, 2018

This issue will receive a two-fold fix: for the current stable PIConGPU 0.4.* release line, the 0.4.2 release will ship Alpaka 0.3.5 to improve performance: #2810

For the upcoming next stable PIConGPU 0.5/1.0 release and the current PIConGPU dev, #2807 will update to alpaka dev (the upcoming 0.4.0) to fix performance.
