Description
We've had two cases where binding happened incorrectly. First, with OpenMPI 5, simply because we were only setting the OMPI_MCA env var OMPI_MCA_rmaps_base_mapping_policy and not its PRTE equivalent. Second, with LPC3D on @laraPPr's system (possibly because it picked up some externally set environment), see #306 (comment).
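To illustrate the first case, here is a minimal sketch of setting the mapping policy for both OpenMPI generations at once. The helper name `binding_env_vars` is hypothetical, and the PRTE variable name `PRTE_MCA_rmaps_default_mapping_policy` is an assumption that should be checked against the PRRTE documentation before relying on it.

```python
def binding_env_vars(cores_per_rank):
    """Return mapping-policy environment variables for OpenMPI 4 and 5.

    Hypothetical helper for illustration only.
    """
    policy = f"slot:PE={cores_per_rank}"
    return {
        # Variable named in this issue (OpenMPI 4.x style).
        "OMPI_MCA_rmaps_base_mapping_policy": policy,
        # ASSUMPTION: name of the PRTE equivalent used by OpenMPI 5.x.
        "PRTE_MCA_rmaps_default_mapping_policy": policy,
    }


if __name__ == "__main__":
    for name, value in binding_env_vars(32).items():
        print(f"export {name}={value}")
```

Setting both variables should be harmless, since each launcher is expected to read only the variables with its own prefix.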
Maybe we should see if the mixin class can be used to make sure that mpirun is run with --report-bindings. One caveat is that this is OpenMPI specific, so we may need to do this conditional on the MPI library used. Then, we can do a sanity check to see if the binding matches what we expect. Note that implementing the sanity check won't be easy. For example, on a 128-core system, with 4 ranks per node and 32 cores per rank, we set:
$ echo $OMPI_MCA_rmaps_base_mapping_policy
slot:PE=32
and this should lead to:
[tcn281.local.snellius.surf.nl:3866125] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]], socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]], socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]], socket 0[core 16[hwt 0]], socket 0[core 17[hwt 0]], socket 0[core 18[hwt 0]], socket 0[core 19[hwt 0]], socket 0[core 20[hwt 0]], socket 0[core 21[hwt 0]], socket 0[core 22[hwt 0]], socket 0[core 23[hwt 0]], socket 0[core 24[hwt 0]], socket 0[core 25[hwt 0]], socket 0[core 26[hwt 0]], socket 0[core 27[hwt 0]], socket 0[core 28[hwt 0]], socket 0[core 29[hwt 0]], socket 0[core 30[hwt 0]], socket 0[core 31[hwt 0]]: [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[tcn281.local.snellius.surf.nl:3866125] MCW rank 1 bound to socket 0[core 32[hwt 0]], socket 0[core 33[hwt 0]], socket 0[core 34[hwt 0]], socket 0[core 35[hwt 0]], socket 0[core 36[hwt 0]], socket 0[core 37[hwt 0]], socket 0[core 38[hwt 0]], socket 0[core 39[hwt 0]], socket 0[core 40[hwt 0]], socket 0[core 41[hwt 0]], socket 0[core 42[hwt 0]], socket 0[core 43[hwt 0]], socket 0[core 44[hwt 0]], socket 0[core 45[hwt 0]], socket 0[core 46[hwt 0]], socket 0[core 47[hwt 0]], socket 0[core 48[hwt 0]], socket 0[core 49[hwt 0]], socket 0[core 50[hwt 0]], socket 0[core 51[hwt 0]], socket 0[core 52[hwt 0]], socket 0[core 53[hwt 0]], socket 0[core 54[hwt 0]], socket 0[core 55[hwt 0]], socket 0[core 56[hwt 0]], socket 0[core 57[hwt 0]], socket 0[core 58[hwt 0]], socket 0[core 59[hwt 0]], socket 0[core 60[hwt 0]], socket 0[core 61[hwt 0]], socket 0[core 62[hwt 0]], socket 0[core 63[hwt 0]]: [././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[tcn281.local.snellius.surf.nl:3866125] MCW rank 2 bound to socket 1[core 64[hwt 0]], socket 1[core 65[hwt 0]], socket 1[core 66[hwt 0]], socket 1[core 67[hwt 0]], socket 1[core 68[hwt 0]], socket 1[core 69[hwt 0]], socket 1[core 70[hwt 0]], socket 1[core 71[hwt 0]], socket 1[core 72[hwt 0]], socket 1[core 73[hwt 0]], socket 1[core 74[hwt 0]], socket 1[core 75[hwt 0]], socket 1[core 76[hwt 0]], socket 1[core 77[hwt 0]], socket 1[core 78[hwt 0]], socket 1[core 79[hwt 0]], socket 1[core 80[hwt 0]], socket 1[core 81[hwt 0]], socket 1[core 82[hwt 0]], socket 1[core 83[hwt 0]], socket 1[core 84[hwt 0]], socket 1[core 85[hwt 0]], socket 1[core 86[hwt 0]], socket 1[core 87[hwt 0]], socket 1[core 88[hwt 0]], socket 1[core 89[hwt 0]], socket 1[core 90[hwt 0]], socket 1[core 91[hwt 0]], socket 1[core 92[hwt 0]], socket 1[core 93[hwt 0]], socket 1[core 94[hwt 0]], socket 1[core 95[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././.]
[tcn281.local.snellius.surf.nl:3866125] MCW rank 3 bound to socket 1[core 96[hwt 0]], socket 1[core 97[hwt 0]], socket 1[core 98[hwt 0]], socket 1[core 99[hwt 0]], socket 1[core 100[hwt 0]], socket 1[core 101[hwt 0]], socket 1[core 102[hwt 0]], socket 1[core 103[hwt 0]], socket 1[core 104[hwt 0]], socket 1[core 105[hwt 0]], socket 1[core 106[hwt 0]], socket 1[core 107[hwt 0]], socket 1[core 108[hwt 0]], socket 1[core 109[hwt 0]], socket 1[core 110[hwt 0]], socket 1[core 111[hwt 0]], socket 1[core 112[hwt 0]], socket 1[core 113[hwt 0]], socket 1[core 114[hwt 0]], socket 1[core 115[hwt 0]], socket 1[core 116[hwt 0]], socket 1[core 117[hwt 0]], socket 1[core 118[hwt 0]], socket 1[core 119[hwt 0]], socket 1[core 120[hwt 0]], socket 1[core 121[hwt 0]], socket 1[core 122[hwt 0]], socket 1[core 123[hwt 0]], socket 1[core 124[hwt 0]], socket 1[core 125[hwt 0]], socket 1[core 126[hwt 0]], socket 1[core 127[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
Now, this is a full-node allocation, and therefore simple. But if you allocate e.g. 8 cores, with 4 tasks and 2 cores per task, you could get any subset of the 128 physical cores, so checking core numbers is a no-go. The simplest approach might be to extract the binding pattern and do some logic on it: count the number of Bs in each rank's output and check that it matches the expected number of cores per rank (32 here), then count the number of ranks and check that it matches the expected number (4 here). We could even consider a check that verifies whether any single rank spans multiple sockets. If so, that probably shouldn't be an error, but a clear warning (in the error file, or the ReFrame log) could be nice.
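A rough sketch of the pattern-based check could look like the following. It assumes the `--report-bindings` line format shown above (other OpenMPI versions may print a different format), and the function names `parse_bindings` and `check_bindings` are hypothetical, not part of ReFrame or the test suite.

```python
import re

# Matches lines in the format shown above, e.g.:
# [host:123] MCW rank 0 bound to socket 0[core 0[hwt 0]], ...: [B/B/./.][./././.]
RANK_RE = re.compile(r"MCW rank (\d+) bound to .*: ((?:\[[B./]+\])+)\s*$")


def parse_bindings(output):
    """Return {rank: [number of bound cores per socket]}."""
    bindings = {}
    for line in output.splitlines():
        match = RANK_RE.search(line)
        if match is None:
            continue  # not a binding report line
        rank = int(match.group(1))
        sockets = re.findall(r"\[([B./]+)\]", match.group(2))
        # A core counts as bound if any of its hardware threads shows a 'B'.
        bindings[rank] = [
            sum("B" in core for core in socket.split("/")) for socket in sockets
        ]
    return bindings


def check_bindings(bindings, expected_ranks, expected_cores):
    """Return (errors, warnings) as lists of human-readable messages."""
    errors, warnings = [], []
    if len(bindings) != expected_ranks:
        errors.append(f"expected {expected_ranks} ranks, found {len(bindings)}")
    for rank, per_socket in sorted(bindings.items()):
        bound = sum(per_socket)
        if bound != expected_cores:
            errors.append(
                f"rank {rank}: expected {expected_cores} bound cores, found {bound}"
            )
        # Spanning sockets is reported as a warning, not an error.
        if sum(1 for n in per_socket if n > 0) > 1:
            warnings.append(f"rank {rank} is bound across multiple sockets")
    return errors, warnings


if __name__ == "__main__":
    sample = (
        "[node:1] MCW rank 0 bound to socket 0[core 0[hwt 0]], "
        "socket 0[core 1[hwt 0]]: [B/B/./.][./././.]\n"
        "[node:1] MCW rank 1 bound to socket 0[core 3[hwt 0]], "
        "socket 1[core 4[hwt 0]]: [./././B][B/././.]\n"
    )
    errors, warnings = check_bindings(parse_bindings(sample), 2, 2)
    print("errors:", errors)
    print("warnings:", warnings)
```

Because the check only counts `B` markers and never looks at core numbers, it should work for partial-node allocations as well. For multi-node runs, `expected_ranks` would be the total number of ranks across all nodes, since the MCW rank is global.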