Description
Currently, Open MPI executables (mpicc, mpirun, etc.) inherit dependencies on BTL communication libraries, such as ugni. This leads to problems on some large-scale systems, where compute node libraries are not in the default library search paths on the head nodes (I assume they must be available, otherwise linking of applications wouldn't work). For example:
mpicc --version
mpicc: error while loading shared libraries: libugni.so.0: cannot open shared object file: No such file or directory
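To check whether a given install is affected without running the wrapper itself, something like the following should work on a Linux head node. This is only a rough sketch: the helper names `filter_missing` and `missing_libs` are made up for illustration, and it assumes `ldd` and `awk` are available and that `ldd` reports unresolved libraries in its usual `name => not found` form.

```shell
# Print the names of libraries that ldd reports as unresolved.
# Reads ldd-style output on stdin (lines such as "libfoo.so.0 => not found").
filter_missing() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical helper: list unresolved shared-library dependencies of an
# executable. Prints nothing if every dependency resolves.
missing_libs() {
    ldd "$1" 2>/dev/null | filter_missing
}

# Example use on a head node (assumes mpicc is on PATH):
#   missing_libs "$(command -v mpicc)"
```

On a system with the problem described here, I would expect the example call to list libugni.so.0 (or whichever communication library is only on the compute node search path).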
This has been a problem with Open MPI since the BTLs moved to OPAL, but is considerably more noticeable with the change to avoid building DSOs by default. #8800 proposed a fix by making components with external dependencies build as DSOs by default, but this defeats the entire reason we build without DSOs by default. Launch scalability with the old behavior was terrible because of the mass DSO loading at launch. The systems likely to run into the library dependency problem are the very ones that need the change in default behavior, and are likely to have many components with external dependencies.
The right solution is probably to move the BTLs back into the OMPI layer, but I assume @bosilca will object to that plan. A second plan, and likely the one we will have to implement, is to split OPAL into two libraries. The first is just the base portability code (with minimal MCA inclusion) that is safe to use on the front-end, and the second is the full OPAL with communication libraries. We already have a bit of this split, in that we have two different initialization routines (opal_init() and opal_init_util()). We just don't expose that split through libraries, leading to linking problems.