@bwbarrett
Member

Backport of #13415 to the v5.0.x branch. This patch series fixes a couple of places where we weren't conforming to the Libfabric spec. It also adds a bugfix for EFA systems that allows the BTL and MTL to share the same fabric and domain while still using separate endpoints.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
(cherry picked from commit f65f900)
Add FI_COMPLETION flag to ensure completion entries are generated
for all data transfer operations.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
(cherry picked from commit 15fe246)
Share the domain between the MTL and BTL layers to reduce the total
number of domains created. This helps avoid hitting system resource
limits on platforms with high core counts.

Instead of having the common code allocate a single domain with the
superset of all required capabilities, we attempt to reuse an existing
fabric and domain if the providers can support MTL’s and BTL’s different
capability sets. This approach allows providers that support domain
sharing to reuse resources efficiently while still preserving
flexibility. If the providers cannot reuse the fabric and domain due to
incompatible requirements, separate domains will be created as before.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
(cherry picked from commit 69d2737)
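
For readers skimming the series, the reuse-or-fall-back policy described in the commit message above can be sketched roughly as follows. This is a simplified model, not the code from this PR: `struct shared_domain`, `acquire_domain()`, the `CAP_*` bits, and the fixed-size table are all hypothetical stand-ins, and no Libfabric calls are made. A real implementation would compare Libfabric capability bits and open actual fabric and domain objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical capability bits (stand-ins for real Libfabric caps). */
#define CAP_TAGGED (1u << 0)
#define CAP_RMA    (1u << 1)
#define CAP_ATOMIC (1u << 2)
#define CAP_MSG    (1u << 3)

/* Hypothetical record for one opened fabric + domain pair. */
struct shared_domain {
    char     provider[32];   /* provider the pair was opened on */
    uint64_t supported_caps; /* caps the provider offered at open time */
    int      refcount;       /* number of components using the pair */
    bool     in_use;
};

#define MAX_DOMAINS 8
static struct shared_domain domain_table[MAX_DOMAINS];
static int domains_opened = 0;

/*
 * Return a domain that satisfies required_caps on the given provider.
 * An existing domain is reused when it covers the request; otherwise a
 * separate one is opened, as long as provider_caps covers the request.
 * Returns NULL if the provider cannot serve the request or the table is full.
 */
struct shared_domain *acquire_domain(const char *provider,
                                     uint64_t provider_caps,
                                     uint64_t required_caps)
{
    assert(provider != NULL);

    /* 1. Try to reuse a domain already opened on the same provider. */
    for (int i = 0; i < MAX_DOMAINS; i++) {
        struct shared_domain *d = &domain_table[i];
        if (d->in_use && strcmp(d->provider, provider) == 0 &&
            (d->supported_caps & required_caps) == required_caps) {
            d->refcount++;
            return d;
        }
    }

    /* 2. Nothing compatible: fall back to a separate domain. */
    if ((provider_caps & required_caps) != required_caps) {
        return NULL; /* provider cannot serve this request at all */
    }
    for (int i = 0; i < MAX_DOMAINS; i++) {
        struct shared_domain *d = &domain_table[i];
        if (!d->in_use) {
            strncpy(d->provider, provider, sizeof(d->provider) - 1);
            d->provider[sizeof(d->provider) - 1] = '\0';
            d->supported_caps = provider_caps;
            d->refcount = 1;
            d->in_use = true;
            domains_opened++;
            return d;
        }
    }
    return NULL; /* table full */
}
```

In this model, an MTL-like caller asking for `CAP_TAGGED` and a BTL-like caller asking for `CAP_RMA` on the same provider end up with the same entry when the provider offered both capabilities at open time. A request the existing entry cannot cover gets its own entry, which mirrors the "separate domains will be created as before" fallback.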
github-actions bot added this to the v5.0.8 milestone on Oct 23, 2025