OpenMPI/4.1.5-GCC-12.3.0 (foss/2023a) cannot communicate across nodes #18914

Closed
schiotz opened this issue Oct 4, 2023 · 5 comments
schiotz (Contributor) commented Oct 4, 2023

Hi,

We cannot get OpenMPI/4.1.5-GCC-12.3.0 (foss/2023a) to work. It works fine on a single compute node, but across two or more nodes it fails with the error

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
  • We see this with different codes including a simple mpi4py hello-world script.
  • We see the error both on our OmniPath and Infiniband nodes.
  • OpenMPI works fine with the 2020b, 2022a and 2022b toolchains, although the latter two require setting OMPI_MCA_btl='^openib,ofi' on the OmniPath nodes.
  • Running mpiexec with -mca pml_base_verbose 10 -mca mtl_base_verbose 10 does not give any extra info; the crash happens before the usual verbose output is printed.
  • A script looping over the various possible values that I know of for OMPI_MCA_btl, OMPI_MCA_mtl and OMPI_MCA_pml resulted in the same crash regardless of the values, which is consistent with the crash happening before the verbose output from selecting these is printed.
  • As mentioned above, everything works fine within a single compute node, "only" multinode jobs are affected.
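The sweep over MCA settings mentioned above could look roughly like the following sketch. The specific candidate values are my assumptions for illustration (not an exhaustive or authoritative list), and `hello.py` stands in for the mpi4py hello-world script; the sketch only prints the invocations it would try.

```shell
# Hedged sketch: print candidate mpiexec invocations sweeping OMPI_MCA_pml /
# OMPI_MCA_btl values. The value lists below are illustrative assumptions.
print_mca_sweep() {
  for pml in ob1 cm ucx; do
    for btl in 'self,vader,tcp' '^openib,ofi'; do
      echo "OMPI_MCA_pml=$pml OMPI_MCA_btl=$btl mpiexec -n 2 python hello.py"
    done
  done
}
print_mca_sweep
```

In the real diagnostic one would execute each combination (e.g. via `eval` or by exporting the variables before `mpiexec`) rather than just echoing it; as reported above, every combination produced the same ORTE failure.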

CC: @OleHolmNielsen

branfosj (Member) commented Oct 4, 2023

Is this built using the patched version in #18833?

branfosj added this to the next release (4.8.2?) milestone Oct 4, 2023
schiotz (Contributor, Author) commented Oct 4, 2023

I don't think so, can I check that somehow?

It was built last week, but I believe it was built with --from-pr from a recent PR of mine that pulled in the entire toolchain. I am not sure whether that also pulls in already-merged changes (I don't think it does, since it did not pull in a util-linux bug fix that had already been merged).

EDIT: I'll try building from that patch in my private folder, overriding the system package.

branfosj (Member) commented Oct 4, 2023

> I don't think so, can I check that somehow?

The easyconfig and patches used in the build are kept in the easybuild directory inside the software installation directory. So, after loading the OpenMPI module, look for OpenMPI-4.1.5_fix-pmix3x.patch in $EBROOTOPENMPI/easybuild/.

Checking my build shows I will need to rebuild, as that patch is not included:

$ ls $EBROOTOPENMPI/easybuild/
easybuild-OpenMPI-4.1.5-20230607.130638.log
easybuild-OpenMPI-4.1.5-20230607.130638_test_report.md
OpenMPI-4.1.1_build-with-internal-cuda-header.patch
OpenMPI-4.1.1_opal-datatype-cuda-performance.patch
OpenMPI-4.1.5-GCC-12.3.0-easybuild-devel
OpenMPI-4.1.5-GCC-12.3.0.eb
reprod
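The check above could be wrapped in a small shell helper, sketched below. The function name `check_pmix3x_patch` is mine (not part of EasyBuild), and the directory is passed as an argument so the helper can be pointed at any installation's easybuild/ directory.

```shell
# Hedged sketch: succeed if an EasyBuild metadata directory lists the
# pmix3x fix patch. check_pmix3x_patch is a hypothetical helper name.
check_pmix3x_patch() {
  ls "$1" 2>/dev/null | grep -q 'fix-pmix3x'
}

# Typical usage, after: module load OpenMPI/4.1.5-GCC-12.3.0
#   check_pmix3x_patch "$EBROOTOPENMPI/easybuild" && echo patched || echo "rebuild needed"
```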

schiotz (Contributor, Author) commented Oct 4, 2023

Thank you very much, @branfosj

The patch is not included, so we'll rebuild. I'll report back here, and close the issue if it helps.

schiotz (Contributor, Author) commented Oct 4, 2023

And I can confirm that merged PR #18833 does indeed solve this. Thank you @branfosj for pointing me in the right direction.

schiotz closed this as completed Oct 4, 2023