Blue Gene Q Running and Performance
I obtain the best performance when using SPI together with overlapping communication and computation. Please make sure you use

```
runjob --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --ranks-per-node 1
```

so that there are enough resources for SPI and MPI together. Moreover, use `OMP_NUM_THREADS=64`.
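For concreteness, a complete launch on a 512-node partition might look as follows. This is only a sketch: the executable name `hmc_tm` and the input file are placeholders for your own setup, and `OMP_NUM_THREADS` is passed to the ranks via `--envs`:

```bash
# Sketch of a full runjob invocation (one rank per node, 64 threads each).
# Executable and input file are placeholders, not fixed by this page.
runjob --np 512 --ranks-per-node 1 \
       --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" \
       --envs "MUSPI_NUMBATIDS=2" --envs "OMP_NUM_THREADS=64" \
       --mapping EABCDT : ./hmc_tm -f hmc.input
```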
I see a strong dependence on the mapping option. For a 512-node partition with a 48^3x96 global volume it works best if I choose `--mapping EABCDT`, because the torus then reads 2x4x4x4x4 (E first), and folding the E dimension of extent 2 into A yields the desired process grid: 2x4x4x4x4 = 8x4x4x4.
There is an environment variable called `LOADL_BG_SHAPE` which gives the partition dimensions in the form AxBxCxDxE. From this variable one can deduce the correct mapping for runjob: E=2 always, and a midplane has AxBxCxD=4x4x4x4. For any combination of midplanes one needs to properly match the input parameters `NrXProcs`, `NrYProcs` and `NrZProcs`, and the resulting `NrTProcs`, to the 5-dimensional torus of the machine, as sketched below.
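As a minimal consistency check, assuming one rank per node on a single midplane (4x4x4x4x2 = 512 nodes), `NrTProcs` is simply what remains once the X, Y and Z process numbers are fixed:

```bash
# Hypothetical sanity check for the process grid on one midplane.
RANKS=512                          # one rank per node on 512 nodes
NrXProcs=4; NrYProcs=4; NrZProcs=4
NrTProcs=$(( RANKS / (NrXProcs * NrYProcs * NrZProcs) ))
echo "process grid (TxXxYxZ): ${NrTProcs}x${NrXProcs}x${NrYProcs}x${NrZProcs}"  # 8x4x4x4
```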
(See page 112 of the BG redbook for how to convert a core dump into a proper file.)
It seems that from one midplane upwards (512 nodes), `LOADL_BG_SHAPE` gives the topology in units of midplanes, so for instance for a 1024-node partition something like 1x2x1x1, which translates into 4x8x4x4x2 in the AxBxCxDxE format. Here is an example bash script that sets the mapping corresponding to the value of `LOADL_BG_SHAPE` for a 1024-node partition and a 48^3x96 lattice with MPI process grid 16x4x4x4:
```bash
echo "loadl shape is ${LOADL_BG_SHAPE}"
# Default mapping; overridden below according to the partition shape.
export MP=EABCDT
case ${LOADL_BG_SHAPE} in
  2x1x1x1 )
    MP=EABCDT
    ;;
  1x2x1x1 )
    MP=EBACDT
    ;;
  1x1x2x1 )
    MP=ECABDT
    ;;
  1x1x1x2 )
    MP=EDABCT
    ;;
esac
echo "mapping is ${MP}"
```
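The selected mapping can then be handed to runjob together with the SPI environment variables given earlier; the executable and input file below remain placeholders:

```bash
# Hypothetical 1024-node launch using the mapping chosen by the script.
runjob --np 1024 --ranks-per-node 1 --mapping ${MP} \
       --envs "OMP_NUM_THREADS=64" : ./hmc_tm -f hmc.input
```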
The latest performance numbers are summarised in the following plot. Performance is shown as a function of the local lattice extent, which is identical in all directions.
- plain C is the original C version of the code
- QPX is the QPX version with communication switched off
- QPX+MPI is the QPX version with MPI communication
- QPX+SPI is the QPX version with SPI communication
- QPX+SPI+EABCDT is the QPX version with SPI communication and `--mapping EABCDT`
The best performance, 24% of peak, is obtained with a 12^4 local lattice. All performance numbers are in double precision.
Interestingly, at least with the clover term, the solver scales better than ideally when going from one midplane (512 nodes) to one rack (1024 nodes): the solve time drops from about 3.25 s to 1.34 s, a factor of roughly 2.4 for doubling the node count. (The quoted flop rates understate the real performance, because they do not include the computation of the clover term.)
midplane:

```
CG: iter: 534 eps_sq: 1.0000e-22 t/s: 3.2464e+00
CG: flopcount (for e/o tmWilson only): t/s: 3.2464e+00 mflops_local: 11276.7 mflops: 5773675.5
Time for cloverdetratio1 monomial heatbath: 3.645461e+00 s
```

rack:

```
CG: iter: 534 eps_sq: 1.0000e-22 t/s: 1.3440e+00
CG: flopcount (for e/o tmWilson only): t/s: 1.3440e+00 mflops_local: 13619.4 mflops: 13946272.4
Time for cloverdetratio1 monomial heatbath: 1.544599e+00 s
```