Add ROCm support to openmpi.py #4132

Open
zerefwayne wants to merge 1 commit into easybuilders:develop from zerefwayne:openmpi-rocm

@zerefwayne
Contributor

This pull request adds ROCm-specific dependencies to known_dependencies and adds sanity check commands to verify that OpenMPI is properly linked against the ROCm libraries.

# remove plain UCC and UCX
known_dependencies = [d for d in known_dependencies if d not in ('UCX', 'UCC')]
# replace with rocm versions
known_dependencies.extend(['HIP', 'UCX-ROCm', 'UCC-ROCm'])
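
For readers without the easyblock open, here is a minimal standalone sketch of what the two statements above do to the list. The starting list and the helper name `swap_in_rocm_dependencies` are hypothetical and chosen only for illustration; they are not taken from openmpi.py.

```python
def swap_in_rocm_dependencies(known_dependencies):
    """Return a copy of the list with plain UCX/UCC replaced by ROCm variants."""
    # Drop the plain UCX and UCC entries, keeping the order of everything else.
    filtered = [d for d in known_dependencies if d not in ('UCX', 'UCC')]
    # Append HIP and the ROCm-aware builds at the end of the list.
    filtered.extend(['HIP', 'UCX-ROCm', 'UCC-ROCm'])
    return filtered


# Hypothetical starting list, for illustration only.
example = ['hwloc', 'libevent', 'PMIx', 'UCX', 'UCC']
result = swap_in_rocm_dependencies(example)
print(result)
```

Note that the ROCm entries end up at the end of the list rather than in the positions that UCX and UCC occupied, which only matters if later code depends on the order.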
Contributor Author

UCX-ROCm and UCC-ROCm are built as complete standalone installations rather
than as additional component modules layered on top of the base UCX/UCC builds.

Two problems emerged with the layered approach:

  1. When a plain UCX build and a UCX-ROCm component module were both present in the module environment, ucx_info reported the plain UCX configuration (without --with-rocm) because the plain UCX binary took precedence on PATH. The ROCm transport components were built, but the wrong ucx_info was being invoked, making it appear that ROCm support was absent even when it was not.

  2. UCC-ROCm depends on UCX. When UCC-ROCm was built as a component module on top of a ROCm-aware UCX, loading it would also pull in the plain UCX module as a listed dependency, which then shadowed the ROCm-aware UCX on PATH and in LD_LIBRARY_PATH. The result was a UCC-ROCm build backed at runtime by a UCX with no ROCm transport support.

Building UCX-ROCm and UCC-ROCm as fully independent installations avoids both conflicts. The trade-off is a larger on-disk footprint, but I think it is acceptable given that these modules are only loaded in ROCm-aware toolchains.
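
The shadowing effect described in the two points above can be reproduced without any real UCX installation. The sketch below assumes a POSIX system and uses empty stand-in executables in temporary directories. The helper names are hypothetical, and `configured_with_rocm` only scans text for the `--with-rocm` flag mentioned above; it does not call `ucx_info` itself.

```python
import os
import shutil
import stat
import tempfile


def first_on_path(command, directories):
    """Return the file a PATH-style lookup resolves for `command`."""
    return shutil.which(command, path=os.pathsep.join(directories))


def configured_with_rocm(version_output):
    """Check whether reported configure flags mention --with-rocm."""
    return any('--with-rocm' in line for line in version_output.splitlines())


def make_fake_tool(directory, name):
    """Create an executable stand-in so the lookup has something to find."""
    path = os.path.join(directory, name)
    with open(path, 'w') as handle:
        handle.write('#!/bin/sh\n')
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


with tempfile.TemporaryDirectory() as plain_bin, \
        tempfile.TemporaryDirectory() as rocm_bin:
    plain_tool = make_fake_tool(plain_bin, 'ucx_info')
    rocm_tool = make_fake_tool(rocm_bin, 'ucx_info')

    # Plain UCX earlier on PATH: the ROCm-aware tool exists but is shadowed.
    shadowed = first_on_path('ucx_info', [plain_bin, rocm_bin])
    # ROCm-aware UCX earlier on PATH: the expected tool is resolved.
    preferred = first_on_path('ucx_info', [rocm_bin, plain_bin])

    shadowed_is_plain = shadowed == plain_tool
    preferred_is_rocm = preferred == rocm_tool

print(shadowed_is_plain, preferred_is_rocm)
```

With standalone installations only one `ucx_info` is present in the environment, so the result no longer depends on the order in which modules prepend to PATH.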

Collaborator

@Thyre Thyre left a comment

LGTM

@Thyre
Collaborator

Thyre commented May 12, 2026

@boegelbot please test @ jsc-zen3
EB_ARGS="--installpath /tmp/$USER/ebpr-4132 OpenMPI-4.1.4-GCC-11.3.0.eb OpenMPI-5.0.10-llvm-compilers-21.1.8.eb OpenMPI-5.0.3-intel-compilers-2024.2.0.eb OpenMPI-4.1.6-GCC-13.2.0.eb"

@boegelbot

@Thyre: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=4132 EB_ARGS="--installpath /tmp/$USER/ebpr-4132 OpenMPI-4.1.4-GCC-11.3.0.eb OpenMPI-5.0.10-llvm-compilers-21.1.8.eb OpenMPI-5.0.3-intel-compilers-2024.2.0.eb OpenMPI-4.1.6-GCC-13.2.0.eb" EB_CONTAINER= EB_REPO=easybuild-easyblocks EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_4132 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10412

Test results coming soon (I hope)...

Details

- notification for comment with ID 4431346672 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

Build succeeded for 3 out of 4 (total: 1 hour 32 mins 57 secs) (4 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.25
See https://gist.github.com/boegelbot/a36bf09b09cfc6879502f83ce8b2679a for a full test report.

@Thyre
Collaborator

Thyre commented May 12, 2026

FAIL: opal_path_nfs is not surprising, so that's certainly unrelated to the changes here.

@Thyre Thyre added this to the next release (5.3.1?) milestone May 12, 2026
