Description
Dear Open MPI team,
A few days ago, my colleague Daniel Tameling noticed severe performance issues when running the HPCC benchmark with Open MPI. After spending quite some time tracking this down, we suspect that a regression was introduced between Open MPI 2.0.1 and 2.0.2. More specifically, the Open MPI 2.0.1 release tarball seems to be fine, while the 2.0.2 release shows issues that persist through the openmpi-v2.0.x-201702170256-5fa504b nightly build.
The issue is that the newer releases seem to be severely affected by the CPU frequency set via the acpi-cpufreq driver. When the "ondemand" governor is active and allowed to scale between 1.20 GHz and 2.40 GHz, the performance difference between Open MPI 2.0.1 and 2.0.2 (and later) is almost a factor of two. Only when the "userspace" governor is used to pin the frequency to the maximum (with turbo enabled) do the two versions show similar performance.
It does not seem to depend on the PML, BTL, MTL, or even the fabric: we tested FDR, EDR, openib, mxm, ob1, yalla, and cm (on InfiniBand with Slurm), as well as cm and psm2 (on Omni-Path with PBS).
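For anyone trying to reproduce this, the governor settings described above can be inspected through the cpufreq sysfs interface. The following is just an illustrative sketch (the paths are the standard Linux acpi-cpufreq sysfs layout; the interface may be absent, e.g. inside a VM, and changing the governor would additionally require root):

```python
import os

def read_cpufreq(cpu=0):
    """Return (governor, min_khz, max_khz) for the given CPU, or None
    if the cpufreq sysfs interface is not available on this machine."""
    base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"
    try:
        with open(os.path.join(base, "scaling_governor")) as f:
            governor = f.read().strip()
        with open(os.path.join(base, "scaling_min_freq")) as f:
            min_khz = int(f.read())
        with open(os.path.join(base, "scaling_max_freq")) as f:
            max_khz = int(f.read())
    except OSError:
        return None
    return governor, min_khz, max_khz

info = read_cpufreq(0)
if info is None:
    print("cpufreq sysfs not available on this machine")
else:
    gov, lo, hi = info
    print(f"cpu0: governor={gov}, range {lo / 1e6:.2f}-{hi / 1e6:.2f} GHz")
```

In our "ondemand" runs the reported range was 1.20-2.40 GHz; in the "userspace" runs the min and max frequencies coincide.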
The following latencies were measured on two dual-socket Intel Xeon E5-2680 v4 nodes connected via InfiniBand EDR:
ompi-2.0.3-5fa504b, "ondemand" at 1.20-2.40 GHz:
$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[hsw006:37946] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:47684] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[1487330514.524672] [hsw006:37959:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 2401.00
[1487330514.522429] [hsw007:47691:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 2401.00
# OSU MPI Latency Test v5.3
# Size Latency (us)
0 2.23
1 2.23
2 2.20
4 2.19
8 2.19
16 2.27
32 2.28
64 2.37
128 3.21
256 3.37
512 3.61
1024 3.98
2048 4.76
4096 6.41
8192 10.12
16384 16.37
32768 20.39
65536 26.40
131072 37.37
262144 59.33
524288 102.96
1048576 191.14
2097152 364.29
4194304 711.79
ompi-2.0.1, "ondemand" at 1.20-2.40 GHz:
$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:37973] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:47714] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[1487330533.943496] [hsw006:37990:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 2401.00
[1487330533.948853] [hsw007:47721:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 2401.00
# OSU MPI Latency Test v5.3
# Size Latency (us)
0 1.15
1 1.14
2 1.13
4 1.10
8 1.10
16 1.13
32 1.14
64 1.17
128 1.58
256 1.65
512 1.77
1024 1.96
2048 2.35
4096 3.23
8192 5.05
16384 8.17
32768 10.21
65536 13.27
131072 18.74
262144 29.59
524288 51.32
1048576 95.44
2097152 182.00
4194304 355.76
ompi-2.0.3-5fa504b, "userspace" at 1.8 GHz:
$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:41654] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:51373] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size Latency (us)
0 1.74
1 1.78
2 1.78
4 1.78
8 1.78
16 1.83
32 1.84
64 1.93
128 2.54
256 2.70
512 2.93
1024 3.29
2048 4.04
4096 5.62
8192 9.20
16384 12.25
32768 14.89
65536 18.98
131072 26.13
262144 41.60
524288 69.95
1048576 128.21
2097152 244.13
4194304 475.81
ompi-2.0.1, "userspace" at 1.8 GHz:
$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:41690] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:51407] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size Latency (us)
0 1.27
1 1.30
2 1.30
4 1.30
8 1.30
16 1.35
32 1.35
64 1.38
128 1.86
256 1.97
512 2.14
1024 2.43
2048 2.99
4096 4.23
8192 6.81
16384 9.12
32768 11.11
65536 14.12
131072 19.51
262144 30.47
524288 52.76
1048576 95.91
2097152 182.81
4194304 356.55
ompi-2.0.3-5fa504b, "userspace" at 2.4 GHz (turbo on):
$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:45372] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:55141] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size Latency (us)
0 1.09
1 1.10
2 1.10
4 1.09
8 1.09
16 1.14
32 1.14
64 1.17
128 1.60
256 1.68
512 1.79
1024 1.98
2048 2.37
4096 3.21
8192 5.07
16384 8.20
32768 10.20
65536 13.22
131072 18.69
262144 29.67
524288 51.43
1048576 95.50
2097152 182.12
4194304 355.87
ompi-2.0.1, "userspace" at 2.4 GHz (turbo on):
$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:45403] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:55175] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size Latency (us)
0 1.10
1 1.11
2 1.09
4 1.09
8 1.08
16 1.13
32 1.13
64 1.16
128 1.57
256 1.65
512 1.76
1024 1.96
2048 2.35
4096 3.23
8192 5.04
16384 8.15
32768 10.17
65536 13.21
131072 18.71
262144 29.55
524288 51.31
1048576 95.37
2097152 182.00
4194304 355.76
The Open MPI version (a 2.0.2a pre-release) shipped in the HPC-X toolkit version 1.8.0 shows the same issues. Earlier releases (e.g., 1.10.2) also seem to be unaffected.
We are quite stumped as to what could be going on. (My gut feeling would be to blame the recent timer changes, but I really have no idea.)
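As a crude way to sanity-check the timer hypothesis on the affected nodes, one could compare the per-call cost of different clock sources at the two frequency settings. This is not Open MPI code, just a user-level proxy; whether Open MPI's internal (opal) timer actually uses clock_gettime or the TSC in these builds is exactly what we do not know:

```python
import time

def timer_overhead(clock_fn, n=200_000):
    """Estimate the average per-call cost of a timer function, in ns."""
    t0 = time.perf_counter()
    for _ in range(n):
        clock_fn()
    t1 = time.perf_counter()
    return (t1 - t0) / n * 1e9

wall = timer_overhead(time.time)
print(f"time.time():     {wall:.0f} ns/call")

if hasattr(time, "clock_gettime"):  # POSIX only
    mono = timer_overhead(lambda: time.clock_gettime(time.CLOCK_MONOTONIC))
    print(f"CLOCK_MONOTONIC: {mono:.0f} ns/call")
```

If a timer that is called on every progress-loop iteration became markedly more expensive (or frequency-dependent) between 2.0.1 and 2.0.2, that could plausibly produce the pattern above, but this is speculation on our part.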
In any case, thank you for your work on Open MPI!