
kernel launchers and SIMD vectorization #71

@valassi

Description


I finally found some time to pursue some earlier tests on an idea I had from the beginning: trying to implement SIMD vectorization in the C++ code at the same time as SIMT/SPMD parallelisation on the GPU in CUDA.

The idea is always the same: event-level parallelism, with execution in lockstep (all events go through exactly the same sequence of computations).

I pushed a few initial tests in https://github.com/valassi/madgraph4gpu/tree/klas, I will create a WIP PR about that. @roiser , @oliviermattelaer , @hageboeck , I would especially be interested to have some feedback from you :-)

Implementing SIMD in the C++ code is closely linked to the API of the kernel launchers (and of the methods the kernels call internally) on the GPU. In my previous eemumu_AV implementation, the signatures of some C++ methods were modified with respect to the CUDA signatures by adding nevt (a number of events), or instead ievt (an index over events), but some lines of code (e.g. loops on ievt=1..nevt) were commented out, as they were just reminders of possible future changes.

The main idea behind all the changes I did is simple: bring the event loop more and more towards the inside. Eventually, the event loop must be the innermost loop. This is because you eventually want to perform every single floating point addition or multiplication in parallel over several events. In practice, one concrete consequence of this is that I had to invert the order of the helicity loop: so far, there was an outer event loop, with an inner loop over helicities within each event, while now there is an outer helicity loop, with an inner loop over events for each helicity.

One limitation of the present code (acceptable in a simple eemumu calculation) is that there is no loop over nprocesses, because nprocesses=1. This was already assumed, but I have now made it much more explicit, removing all dead code and adding FIXME warnings.

So far, I have got to this point:

  • I introduced fptype_v and cxtype_v vector types, with width neppV=neppM (the same parameter used for the AOSOA memory layouts on the GPU, though the optimal choice of the parameter on the CPU and on the GPU may differ: on the former it is the SIMD width that counts, on the latter it is the cacheline that matters).
  • I have used cxtype_v arguments in the ixxx/oxxx function signatures, which also accept an iepgV parameter (index over event pages). Then I do the loop over the neppV events in each page within ixxx/oxxx.
  • For the moment I am just doing simple tests with autovectorization. I have added -march=native to the build options and am inspecting the objdump output to see whether there are vectorized loops (e.g. instructions using "ymm" registers). I am using Sebastien Ponce's excellent manual as a guide.
  • I observe that, with the last commit where I add cxtype_v arguments in ixxx/oxxx, the number of symbols with "ymm" increases significantly. I take this as a simple proof of concept that there is "a bit more" autovectorization with the new code, i.e. that the changes go in the right direction.
  • Note that all the changes I did this summer to implement AOSOA (which were mainly targeted at achieving coalesced memory access on the GPU, with optimal retrieval from cachelines) already allow working on this. Some further changes to the memory layouts are however needed to implement RRRRIIII complex storage, as described below.

A lot is still missing:

  • Of course, I see no performance gain yet. I think that too much of the code is still scalar and unvectorized to see any difference.
  • In addition to the ixxx/oxxx functions, what must be vectorized are the FVV functions. Unlike ixxx/oxxx, where only the outputs are cxtype_v vectors, here the individual inputs should also become cxtype_v vectors.
  • One of the main changes that is needed, in any case, is to reimplement cxtype_v from being a cxtype[4] to being a complex(fptype[4],fptype[4]), i.e. going from the RIRIRIRI to the RRRRIIII format. This has two implications. One, some change in memory layouts will also happen on the GPU (BUT remember that momenta, weights and random numbers are all real, not complex, so no change is needed there). Two, most importantly, a coding of all operations (as already done for cuComplex) is needed, e.g. to determine how to sum or multiply two cxtype_v vectors: this is tedious, but should be only ~300 lines of code in a header.
  • Autovectorization is rather unreliable. I would start by trying that out, but then we can try vector compiler extensions, libraries like Vc or VecCore, or (in the worst case) intrinsics.

These changes may result in significant changes in the current interfaces, but I think they would normally lead to a better interface and structure also on the GPU. I'll continue in the next few days...
