Description
Background information
Setting btl_ofi_disable_sep=1 and mtl_ofi_enable_sep=0 does not prevent OFI libfabric from using multiple PSM2 contexts per endpoint.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.1, 4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Compiled from official source tarball.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
n/a
Please describe the system on which you are running
- Operating system/version: CentOS 7
- Computer hardware: Intel x86_64
- Network type: Intel Omni-path
Details of the problem
Programs that ran fine under Open MPI releases < 4.1 crash when launched with local ranks exceeding 50% of the CPU cores on an otherwise empty node. Enabling PSM tracing shows the ranks initially allocating two PSM contexts per rank, then one context per rank, and finally failing to allocate a PSM context at all (with errors opening /dev/hfi1, for example).
I compiled Open MPI 4.1.2 against OFI libfabric 1.9 and 1.13; both exhibited the same behavior.
Passing -mca btl_ofi_disable_sep 1 -mca mtl_ofi_enable_sep 0 to the job also did not alter the behavior: multiple PSM contexts were still allocated per rank.
The OFI libfabric PSM2 provider's psmx2_init_lib() function contains the following:
/* turn on multi-ep feature, but don't overwrite existing setting */
setenv("PSM2_MULTI_EP", "1", 0);
So regardless of the btl_ofi_disable_sep and mtl_ofi_enable_sep settings in Open MPI, if PSM2_MULTI_EP is unset in the environment, libfabric will force it on for its PSM API calls. The client of OFI libfabric is apparently expected to declare its intentions regarding Scalable Endpoints (SEP) via PSM2_MULTI_EP before calling the libfabric API; otherwise libfabric assumes SEP is desirable. But the PSM2 documentation clearly states that when SEP is enabled, PSM context sharing is implicitly disabled and the PSM context maximum becomes a hard limit.
The presence of both the OFI BTL and the OFI MTL in Open MPI complicates matters because each has its own MCA parameter controlling the use of SEP, and the default values differ (enabled for the BTL, disabled for the MTL). When the OFI BTL and MTL are both available, both can be configured and attempt to allocate OFI endpoints with differing SEP settings.
Properly making use of SEP therefore requires stringent control over the distribution of ranks across nodes, coordinated selection of the BTL and MTL plugins, etc. Since both plugins include a parameter to control SEP, the plugin itself should set PSM2_MULTI_EP accordingly rather than letting OFI libfabric dictate the behavior. When both the OFI BTL and MTL are configured in an MPI runtime, they need to hold the SEP enable/disable setting in common to ensure a single, consistent behavior via PSM2_MULTI_EP.
I did note in a presentation on the 4.1.0 release that the OFI BTL is not typically used in P2P scenarios, but default builds of these releases against OFI libfabric do not disable the OFI BTL component, so the default runtime contains both the OFI BTL and MTL and is susceptible to these issues. Barring the removal of the OFI BTL, for example, the ideal solution seems to be a baseline OFI module for any functionality shared by the OFI BTL and MTL (such as a single SEP setting and the associated setenv of PSM2_MULTI_EP), with the BTL and MTL reengineered to use that common module.