Add ROCm support to openmpi.py #4132
# remove plain UCC and UCX
known_dependencies = [d for d in known_dependencies if d not in ('UCX', 'UCC')]
# replace with rocm versions
known_dependencies.extend(['HIP', 'UCX-ROCm', 'UCC-ROCm'])
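For readers looking at this hunk outside the easyblock, the swap can be sketched in isolation. The function name `swap_in_rocm_deps` and the sample input list are hypothetical and used only for illustration; the filter and extend steps are the ones shown in the diff.

```python
def swap_in_rocm_deps(known_dependencies):
    """Return a copy of the list with plain UCX/UCC replaced by ROCm-aware variants."""
    # remove plain UCC and UCX
    deps = [d for d in known_dependencies if d not in ('UCX', 'UCC')]
    # append HIP and the ROCm variants, as in the diff
    deps.extend(['HIP', 'UCX-ROCm', 'UCC-ROCm'])
    return deps


# hypothetical starting list, for illustration only
example = swap_in_rocm_deps(['UCX', 'UCC', 'hwloc'])
```

Unrelated entries keep their relative order, and the ROCm names land at the end of the list.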
UCX-ROCm and UCC-ROCm are built as complete standalone installations rather
than as additional component modules layered on top of the base UCX/UCC builds.
Two problems emerged with the layered approach:
- When a plain UCX build and a UCX-ROCm component module were both present in the module environment, ucx_info reported the plain UCX configuration (without --with-rocm) because the plain UCX binary took precedence on PATH. The ROCm transport components were built, but the wrong ucx_info was being invoked, making it appear that ROCm support was absent even when it was not.
- UCC-ROCm depends on UCX. When UCC-ROCm was built as a component module on top of a ROCm-aware UCX, loading it would also pull in the plain UCX module as a listed dependency, which then shadowed the ROCm-aware UCX on PATH and in LD_LIBRARY_PATH. The result was a UCC-ROCm build backed at runtime by a UCX with no ROCm transport support.
Building UCX-ROCm and UCC-ROCm as fully independent installations avoids both conflicts. The trade-off is a larger on-disk footprint, but I think it is acceptable given that these modules are only loaded in ROCm-aware toolchains.
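The PATH precedence behind the first problem can be reproduced without any real UCX installation. This is a minimal sketch for a POSIX system: the two temporary directories stand in for the `bin` directories of a plain UCX module and a standalone UCX-ROCm module, and the `ucx_info` files are empty placeholder scripts, not real binaries.

```python
import os
import shutil
import stat
import tempfile


def make_fake_tool(directory, name):
    """Create a placeholder executable so shutil.which() can find it."""
    path = os.path.join(directory, name)
    with open(path, 'w') as handle:
        handle.write('#!/bin/sh\n')
    # mark the file as executable for the current user
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


with tempfile.TemporaryDirectory() as plain_bin, tempfile.TemporaryDirectory() as rocm_bin:
    plain_tool = make_fake_tool(plain_bin, 'ucx_info')
    rocm_tool = make_fake_tool(rocm_bin, 'ucx_info')

    # both directories on the search path, plain UCX first: its ucx_info is picked
    shadowed = shutil.which('ucx_info', path=os.pathsep.join([plain_bin, rocm_bin]))
    # only the standalone ROCm directory on the search path: no ambiguity
    standalone = shutil.which('ucx_info', path=rocm_bin)

    plain_wins = os.path.samefile(shadowed, plain_tool)
    rocm_found = os.path.samefile(standalone, rocm_tool)
```

With both directories present, whichever comes first decides which `ucx_info` runs; with a standalone installation there is only one candidate.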
@boegelbot please test @ jsc-zen3
@Thyre: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 4431346672 processed
Message to humans: this is just bookkeeping information for me,
Test report by @boegelbot
Overview of tested easyconfigs (in order)
Build succeeded for 3 out of 4 (total: 1 hour 32 mins 57 secs) (4 easyconfigs in total)
This pull request adds ROCm-specific dependencies to known_dependencies and adds sanity check commands to ensure that OpenMPI is properly linked against the ROCm libraries.
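The sanity check commands added by the PR are not shown in this conversation. As a rough sketch of what such a check could look like, the snippet below scans a tool's output for a `--with-rocm` configure flag. The helper names `mentions_rocm_flag` and `command_mentions_rocm` are hypothetical, and the sample strings are made up for illustration rather than captured from a real tool.

```python
import subprocess


def mentions_rocm_flag(output, flag='--with-rocm'):
    """Return True if any whitespace-separated token in the output starts with the flag."""
    return any(token.startswith(flag) for token in output.split())


def command_mentions_rocm(command):
    """Run a command (given as a list of arguments) and scan its output for the flag."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return mentions_rocm_flag(result.stdout + result.stderr)


# made-up configure lines for illustration, not real tool output
with_rocm = mentions_rocm_flag('configured with: --prefix=/x --with-rocm=/y')
without_rocm = mentions_rocm_flag('configured with: --prefix=/x --without-rocm')
```

On a node with the modules loaded, a call such as `command_mentions_rocm(['ucx_info', '-v'])` would be one way to use it, assuming that `ucx_info -v` prints the configure flags of the build it belongs to.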