Description
We are trying to upgrade from UCX 1.18.1 to 1.19.1, and the changes in #10401 (identified via git bisect) cause significant performance regressions in many of our test cases.
Running with UCX_PROTO_INFO=y shows the following:
UCX 1.18.1
[1767348583.013901] [snnhpc02n001:609154:0] +--------------------------------+--------------------------------------------------------------------+
[1767348583.013935] [snnhpc02n001:609154:0] | ucp_context_0 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from generic host memory |
[1767348583.013941] [snnhpc02n001:609154:0] +--------------------------------+------------------------------------------------------+-------------+
[1767348583.013948] [snnhpc02n001:609154:0] | 0..8248 | eager copy-in copy-out | sysv/memory |
[1767348583.013953] [snnhpc02n001:609154:0] | 8249..142022580 | multi-frag eager copy-in copy-out | sysv/memory |
[1767348583.013958] [snnhpc02n001:609154:0] | 142022581..inf | (?) rendezvous fragmented copy-in copy-out | sysv/memory |
[1767348583.013963] [snnhpc02n001:609154:0] +--------------------------------+------------------------------------------------------+-------------+
UCX 1.19.1
[1767348838.306910] [snnhpc02n001:614374:0] +--------------------------------+--------------------------------------------------------------------+
[1767348838.306944] [snnhpc02n001:614374:0] | ucp_context_0 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from generic host memory |
[1767348838.306951] [snnhpc02n001:614374:0] +--------------------------------+------------------------------------------------------+-------------+
[1767348838.306958] [snnhpc02n001:614374:0] | 0..8248 | eager copy-in copy-out | sysv/memory |
[1767348838.306963] [snnhpc02n001:614374:0] | 8249..238621 | multi-frag eager copy-in copy-out | sysv/memory |
[1767348838.306967] [snnhpc02n001:614374:0] | 238622..inf | (?) rendezvous fragmented copy-in copy-out | sysv/memory |
[1767348838.306972] [snnhpc02n001:614374:0] +--------------------------------+------------------------------------------------------+-------------+
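To compare the switch point between versions without reading the tables by eye, the rendezvous row can be pulled out of saved `UCX_PROTO_INFO=y` output. This is a minimal sketch, assuming the row layout shown above; `rndv_start` is a hypothetical helper name and not part of UCX.

```shell
# Hypothetical helper: read UCX_PROTO_INFO=y output on stdin and print the
# first message size of every row that selects a rendezvous protocol.
rndv_start() {
    awk -F'|' '/rendezvous/ {
        # Field 2 holds the size range, e.g. " 238622..inf "
        split($2, range, /\.\./)
        gsub(/ /, "", range[1])
        print range[1]
    }'
}
```

For example, `rndv_start < proto_info.log` (with `proto_info.log` being a saved copy of the output) would print one value per rendezvous row, so logs with several configurations yield several lines.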
Running UCX 1.19.1 with UCX_RNDV_THRESH=142022580 restores the UCX 1.18.1 performance, so it appears that the tuning heuristics made the wrong decision.
With UCX 1.20.0, the thresholds are still the same as with 1.19.1 and the out-of-the-box performance is somewhere between 1.18.1 and 1.19.1. Running UCX 1.20.0 with UCX_RNDV_THRESH=142022580 also restores the good UCX 1.18.1 performance.
I assume we should not set UCX_RNDV_THRESH unconditionally, because it is a system-specific tuning parameter. Please note that we bundle a compiled distribution of UCX with our software, which runs on all kinds of customer systems.
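As a stopgap, a launcher script could apply the old threshold only when nothing else has set one. This is a rough sketch, not a recommended fix: `apply_default_rndv_thresh` is a hypothetical function name, and the value 142022580 is simply what UCX 1.18.1 chose on this one machine, so it may be wrong on other systems.

```shell
# Hypothetical launcher fragment: fall back to the threshold UCX 1.18.1
# picked on our test system, but never override a value that the user or
# site administrator has already exported.
apply_default_rndv_thresh() {
    : "${UCX_RNDV_THRESH:=142022580}"   # assumption: value taken from the 1.18.1 run above
    export UCX_RNDV_THRESH
}
```

Calling `apply_default_rndv_thresh` before starting the application keeps an existing `UCX_RNDV_THRESH` intact, which at least leaves customers a way to retune per system.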
Setup and versions
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 (Ootpa)
$ uname -a
Linux snnhpc02n001 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q rdma-core
rdma-core-55mlnx37-1.55103.x86_64
$ rpm -q libibverbs
libibverbs-55mlnx37-1.55103.x86_64
$ ofed_info -s
MLNX_OFED_LINUX-5.5-1.0.3.2: