Description
Git bisect identifies 409638b, from @bosilca and @thananon, as the first bad commit. It changed how ob1 handles out-of-order receives, and it causes MPI_Gather() in IMB to hang for me with both the TCP and usNIC BTLs.
This is 100% reproducible for me. When I run IMB across 2 servers (with ppn=16), it hangs in Gather. I'm pretty sure the hang starts once we transition into the long protocol (i.e., after the 16K results are shown for usNIC and after the 64K results are shown for TCP):
$ mpirun --mca btl usnic,vader,self IMB-MPI1 -npmin 32 Gather
benchmarks to run Gather
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part
#---------------------------------------------------
...
#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 32
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.04 0.04 0.04
1 1000 14.78 14.83 14.80
2 1000 14.96 15.01 14.99
4 1000 15.05 15.12 15.09
8 1000 15.32 15.38 15.35
16 1000 15.65 15.71 15.68
32 1000 16.18 16.24 16.21
64 1000 18.18 18.24 18.21
128 1000 20.81 20.87 20.84
256 1000 24.71 24.80 24.76
512 1000 34.46 34.62 34.51
1024 1000 13.87 14.19 14.04
2048 1000 17.38 17.83 17.62
4096 1000 49.83 50.23 50.02
8192 1000 269.86 270.38 270.16
16384 1000 315.06 315.69 315.43
<hang>
$ mpirun --mca btl tcp,vader,self IMB-MPI1 -npmin 32 Gather
benchmarks to run Gather
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part
#---------------------------------------------------
...
#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 32
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.04 0.07 0.04
1 1000 46.73 46.90 46.82
2 1000 46.63 46.80 46.72
4 1000 46.98 47.16 47.06
8 1000 48.44 48.61 48.52
16 1000 51.35 51.57 51.46
32 1000 53.16 53.43 53.33
64 1000 55.42 55.66 55.54
128 1000 59.01 59.20 59.10
256 1000 65.72 65.96 65.83
512 1000 79.08 79.52 79.26
1024 1000 67.23 68.24 67.73
2048 1000 73.87 75.06 74.50
4096 1000 113.16 114.54 114.16
8192 1000 1018.50 1020.58 1019.66
16384 1000 1039.11 1041.27 1040.34
32768 1000 1285.78 1288.46 1287.30
65536 640 1881.22 1884.78 1883.41
<hang>
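One way to check the long-protocol theory is to compare the message sizes where the hang appears against each BTL's eager limit. A minimal sketch, assuming the standard `btl_<name>_eager_limit` MCA parameter naming applies to both BTLs (the usNIC parameter name in particular is an assumption):

```shell
# Print the eager limit for each BTL; ob1 stops using the eager path
# for messages larger than this value.
# --level 9 shows all MCA parameters, including ones hidden by default.
ompi_info --param btl tcp --level 9 | grep eager_limit
ompi_info --param btl usnic --level 9 | grep eager_limit
```

If the reported limits roughly line up with the last sizes printed above (64K for TCP, 16K for usNIC), that would support the protocol-transition theory.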
Note that I have 3 usNIC interfaces and 4 IP interfaces. Hence, receiving frags out of order is highly likely. Having multiple interfaces might be necessary to reproduce the issue...?
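To test whether multiple interfaces are needed to trigger the hang, the same runs could be restricted to a single interface per BTL. A sketch, where `eth0` and `usnic_0` are placeholder interface names and not taken from this setup:

```shell
# Restrict each BTL to one interface.
# eth0 and usnic_0 are placeholders; substitute the names on your hosts.
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 IMB-MPI1 -npmin 32 Gather
mpirun --mca btl usnic,vader,self --mca btl_usnic_if_include usnic_0 IMB-MPI1 -npmin 32 Gather
```

If the hang disappears with one interface, out-of-order fragment arrival across interfaces becomes the more likely trigger.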
Also note that this is only happening on master -- I checked the timeline: 409638b was committed to master after v3.1 branched, and was not PR'ed over.
@bosilca @thananon What additional information can I get you to help diagnose what is going wrong?