Bartosz Kostrzewa edited this page Feb 9, 2016 · 21 revisions

Info about the SPI on BG/Q

The header files and examples can be found in /bgsys/drivers/ppcfloor/spi/.

A fairly complete documentation of the SPI code is available here: IBM MU SPI Doxygen documentation.

I now have a test code, running on 32 nodes, which performs a boundary exchange using the SPI with the DirectPut method. It can be found in the GitHub repository https://github.com/urbach/Qspi. It combines MPI with the SPI and already uses the tmLQCD MPI variables. It is based on example code from IBM. How do we deal with the permissions and copyrights?

The runjob command is as follows

runjob --ranks-per-node 1 --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --np 32 --cwd ${WORK}/${NAME} --exe ${HOME}/bglhome/head/c99/Qspi/DirectPut

I have now also included a small test of overlapping computation and communication, and it seems to work quite well. A good fraction of the communication can be hidden when there is enough to compute, see DirectPut.c. (To switch between the variants, one needs to comment the loop for(int m =...) in or out.)

  • only communications: 79061 cycles
  • only computation: 153625 cycles
  • comp + comm, no hide: 232212 cycles
  • comp + comm, hide: 179236 cycles

Removing the check that everything was sent (which we don't need in tmLQCD, I guess) improves the speed to 176634 cycles.
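To put a number on "a good fraction", the timings above can be turned into a hidden fraction of the communication time. The following is a minimal sketch, not part of DirectPut.c; the function names are made up for illustration, and it assumes that the communication-only measurement is a fair estimate of the communication cost inside the combined runs.

```c
#include <assert.h>

/* Cycles saved by overlapping: run without hiding minus run with hiding. */
long saved_cycles(long no_hide, long hide) {
  return no_hide - hide;
}

/* Fraction of the communication-only time that the overlap removes.
   Assumes comm_only is representative of the communication cost
   in the combined runs. */
double hidden_fraction(long comm_only, long no_hide, long hide) {
  assert(comm_only > 0); /* guard against division by zero */
  return (double)saved_cycles(no_hide, hide) / (double)comm_only;
}
```

With the numbers above, saved_cycles(232212, 179236) gives 52976 cycles, which is about 67% of the 79061 communication cycles. Without the send check (176634 cycles) the saving is 55578 cycles, about 70%.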

Dependence on Package size

Increasing the package size from 2048 bytes to 4096, 8192, 16384 and 32768 bytes reduces the time to 165700, 160231, 157499 and 156118 cycles, respectively. The total message size is 32768 bytes right now, so with only one package per message we almost completely hide the communication!
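The trend is easier to see as the number of packages per message against the cycles that remain visible on top of the computation. This is a small sketch with hypothetical helper names; it assumes that the compute-only time of 153625 cycles from the first measurement still applies to these runs.

```c
#include <assert.h>

/* Number of packages needed to send one message (rounded up). */
int packages_per_message(int message_bytes, int package_bytes) {
  assert(package_bytes > 0); /* guard against division by zero */
  return (message_bytes + package_bytes - 1) / package_bytes;
}

/* Cycles not hidden behind computation: total time minus
   compute-only time (assumed comparable between runs). */
long exposed_cycles(long total, long comp_only) {
  return total - comp_only;
}
```

For the 32768 byte message, package sizes of 4096, 8192, 16384 and 32768 bytes correspond to 8, 4, 2 and 1 packages, and the exposed communication shrinks to 12075, 6606, 3874 and 2493 cycles, respectively.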

Combining with OpenMP

When I use OpenMP for the computation loop with 64 threads, I get 109176 cycles when hiding communication, compared to 187953 cycles when not hiding.
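The structure behind such a measurement can be sketched as follows: start the exchange, do the work that needs no remote data in an OpenMP-parallel loop, and only then wait for the halo. This is an illustration and not the code in DirectPut.c. The functions start_exchange and wait_exchange are hypothetical stand-ins (here a plain memcpy and a no-op, so the sketch runs anywhere) for what would presumably be the SPI descriptor injection and the poll on the reception counter. The array sizes are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_BULK 1024 /* sites needing no halo data (illustrative size) */
#define N_HALO 64   /* boundary sites received from a neighbour */

/* Hypothetical stand-in for injecting the send descriptors. A real
   implementation would return at once and let the hardware move the
   data; here the copy simply completes immediately. */
static void start_exchange(double *recv, const double *send, size_t n) {
  memcpy(recv, send, n * sizeof(double));
}

/* Hypothetical stand-in for waiting until all halo data has arrived. */
static void wait_exchange(void) { }

/* Sum of squares over bulk and halo, with the bulk work placed
   between the start of the exchange and the wait. */
double overlapped_norm(const double *bulk, const double *send_halo,
                       double *recv_halo) {
  assert(bulk && send_halo && recv_halo);
  double sum = 0.0;

  start_exchange(recv_halo, send_halo, N_HALO);

  /* Bulk computation, shared among the OpenMP threads. */
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < N_BULK; ++i)
    sum += bulk[i] * bulk[i];

  /* The halo is only touched after the exchange has finished. */
  wait_exchange();
  for (int i = 0; i < N_HALO; ++i)
    sum += recv_halo[i] * recv_halo[i];

  return sum;
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs serially, which gives the same result.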

Current Questions

  • which subgroup ID should one choose for the BATs? Currently it is 0, but in the so-called SPI "docu" (/bgsys/drivers/ppcfloor/spi/doc/html/) it is written: "The MU SPI application should be coded to look for free base address table entries in all of the subgroups associated with the process."

  • how to properly run MPI and SPI together? When I use MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &g_mpi_prov); the SPI does not manage to get the resources it needs, even though I use environment variables to reserve them, see above. Adding --envs "PAMID_ASYNC_PROGRESS=1" doesn't change this. Maybe MPI_THREAD_MULTIPLE is not needed? With MPI_THREAD_SERIALIZED it works fine. In general, using MPI_Init seems to be slightly faster than MPI_Init_thread.

    • MPI_Init_thread will always be a bit slower because the MPI implementation provides "per-object" locks (i.e., locks for the MPI structure for each thread). For MPI_THREAD_SERIALIZED, this is often only true in theory and the implementation is essentially equivalent to MPI_THREAD_FUNNELED or even the implementation without threading. MPI_THREAD_MULTIPLE is needed only when interleaving communication and computation with MPI (because this requires PAMID_ASYNC_PROGRESS=1, which in turn requires MPI_THREAD_MULTIPLE). I'm currently looking into how this can be achieved in practice, in order to compare with the performance of SPI. If the overhead is small, this can be used for general interleaving for any geometry, saving us the headache of generalizing SPI.

      • I've looked into this more and have come to the conclusion that our MPI implementation does overlap communication and computation, but the overhead is very large. I'm guessing this is because MPI seems to require quite a lot of CPU power and multiple MPI threads get prioritized over work threads in order to do the comms, thereby reducing overall performance. All these tests were done with MPI_THREAD_SERIALIZED. What I said above about PAMID_ASYNC_PROGRESS is not necessarily true (but I'm not sure I fully understand...)
    • we could consider using MPI_Put from the MPI-2 standard!?

      • No, this doesn't seem to work as required because of the way it's structured. Also, in Lugano some people working on other codes reported that MPI_Put is really slow. Maybe not worth trying?

  • dynamic or static routing? Dynamic seems to be slightly faster in my test programme.

  • what's the optimal package size? The best is probably to use one package per direction.

  • persistent descriptors: the inverter doesn't converge with SPI if the lines

    muDescriptors[j].Message_Length = msize;
    muDescriptors[j].Pa_Payload = sendBufPAddr;
    MUSPI_SetRecPutOffset (&muDescriptors[j], roffsets[j]);

are re-set in every call of Hopping_Matrix. Currently I don't understand why. A better understanding would be useful.

  • should we do the send loop in SPI with 8 threads?

  • how to do this properly for the 32-bit version of halfspinor as well?