
Performance regression due to "UCP/PROTO: Consider RNDV_PERF_DIFF" #11091

@mkre


We are trying to upgrade from UCX 1.18.1 to 1.19.1 and the changes in #10401 (identified via git bisect) are causing significant performance regressions for many of our test cases.

Running with UCX_PROTO_INFO=y shows the following:

UCX 1.18.1

[1767348583.013901] [snnhpc02n001:609154:0]   +--------------------------------+--------------------------------------------------------------------+
[1767348583.013935] [snnhpc02n001:609154:0]   | ucp_context_0 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from generic host memory    |
[1767348583.013941] [snnhpc02n001:609154:0]   +--------------------------------+------------------------------------------------------+-------------+
[1767348583.013948] [snnhpc02n001:609154:0]   |                        0..8248 | eager copy-in copy-out                               | sysv/memory |
[1767348583.013953] [snnhpc02n001:609154:0]   |                8249..142022580 | multi-frag eager copy-in copy-out                    | sysv/memory |
[1767348583.013958] [snnhpc02n001:609154:0]   |                 142022581..inf | (?) rendezvous fragmented copy-in copy-out           | sysv/memory |
[1767348583.013963] [snnhpc02n001:609154:0]   +--------------------------------+------------------------------------------------------+-------------+

UCX 1.19.1

[1767348838.306910] [snnhpc02n001:614374:0]   +--------------------------------+--------------------------------------------------------------------+
[1767348838.306944] [snnhpc02n001:614374:0]   | ucp_context_0 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from generic host memory    |
[1767348838.306951] [snnhpc02n001:614374:0]   +--------------------------------+------------------------------------------------------+-------------+
[1767348838.306958] [snnhpc02n001:614374:0]   |                        0..8248 | eager copy-in copy-out                               | sysv/memory |
[1767348838.306963] [snnhpc02n001:614374:0]   |                   8249..238621 | multi-frag eager copy-in copy-out                    | sysv/memory |
[1767348838.306967] [snnhpc02n001:614374:0]   |                    238622..inf | (?) rendezvous fragmented copy-in copy-out           | sysv/memory |
[1767348838.306972] [snnhpc02n001:614374:0]   +--------------------------------+------------------------------------------------------+-------------+
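The only difference between the two tables is where the rendezvous row starts: 142022581 bytes with 1.18.1 versus 238622 bytes with 1.19.1, i.e. a switch-over point roughly 595 times smaller. For comparing builds without reading the tables by eye, here is a minimal sketch of a helper that pulls that value out of the UCX_PROTO_INFO output. The function name `rndv_start` is made up for illustration, and the parsing assumes the exact `|`-separated table layout shown above.

```shell
# Print the first message size (in bytes) at which a rendezvous protocol is
# selected, one line per matching table row read from stdin.
# Assumes the UCX_PROTO_INFO table layout: | <from>..<to> | <protocol> | <transport> |
rndv_start() {
    awk -F'|' '
        # Field 3 is the protocol description, field 2 is the size range.
        $3 ~ /rendezvous/ {
            split($2, range, "[.][.]")   # "  238622..inf " -> "  238622", "inf "
            gsub(/ /, "", range[1])      # drop the column padding
            print range[1]
        }'
}
```

Usage would look like `UCX_PROTO_INFO=y ./my_app 2>&1 | rndv_start`, where `./my_app` is a placeholder for the actual test binary. If several configurations are printed (for example intra-node and inter-node), one value is printed per rendezvous row.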

Running UCX 1.19.1 with UCX_RNDV_THRESH=142022580 restores the UCX 1.18.1 performance, so it appears that the tuning heuristic now switches to rendezvous far too early on this system.
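For reference, this is roughly how the override was applied per run, without exporting it into the surrounding shell. It is a sketch only: the wrapper name `with_rndv_thresh` is hypothetical, and the threshold value is the one taken from the 1.18.1 table on this particular machine.

```shell
# Run a command with UCX_RNDV_THRESH set for that command only.
# Usage: with_rndv_thresh <bytes> <command> [args...]
with_rndv_thresh() {
    thresh=$1
    shift
    # env limits the override to the child process; the caller's
    # environment is left unchanged.
    env UCX_RNDV_THRESH="$thresh" "$@"
}
```

For example, `with_rndv_thresh 142022580 ./my_app` (with `./my_app` standing in for the real test binary). Adding UCX_PROTO_INFO=y to the same run shows whether the printed table reflects the override.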

With UCX 1.20.0, the thresholds are still the same as with 1.19.1 and the out-of-the-box performance is somewhere between 1.18.1 and 1.19.1. Running UCX 1.20.0 with UCX_RNDV_THRESH=142022580 also restores the good UCX 1.18.1 performance.

We would rather not set UCX_RNDV_THRESH unconditionally, because it is a system-specific tuning parameter. Please note that we bundle a compiled distribution of UCX with our software, which has to run on all kinds of customer systems.

Setup and versions

$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 (Ootpa)
$ uname -a
Linux snnhpc02n001 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q rdma-core
rdma-core-55mlnx37-1.55103.x86_64
$ rpm -q libibverbs
libibverbs-55mlnx37-1.55103.x86_64
$ ofed_info -s
MLNX_OFED_LINUX-5.5-1.0.3.2:
