Contributor

@jiaxiyan jiaxiyan commented Sep 24, 2025

Share the domain between the MTL and BTL layers to reduce the total number of domains created. This helps avoid hitting system resource limits on platforms with high core counts.

Instead of having the common code allocate a single domain with the superset of all required capabilities, we attempt to reuse an existing fabric and domain if the providers can support MTL’s and BTL’s different capability sets. This approach allows providers that support domain sharing to reuse resources efficiently while still preserving flexibility. If the providers cannot reuse the fabric and domain due to incompatible requirements, separate domains will be created as before.
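
The reuse-or-fall-back behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `domain_cache_t`, `shared_domain_t`, `domain_acquire`, the `can_share_fn` callback, and the `CAP_*` bits are all hypothetical stand-ins, and no libfabric calls are made. The callback models the provider's decision on whether an already-open domain can serve a second capability set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capability bits for illustration; not libfabric FI_* values. */
#define CAP_TAGGED 0x1UL
#define CAP_RMA    0x2UL
#define CAP_ATOMIC 0x4UL

#define MAX_DOMAINS 8

/* Hypothetical record of one opened domain (stands in for fabric + domain). */
typedef struct {
    char provider[32];
    char name[32];
    unsigned long caps;   /* capabilities the domain was opened with */
    int refcount;         /* number of layers (MTL, BTL) using it */
    bool in_use;
} shared_domain_t;

typedef struct {
    shared_domain_t slots[MAX_DOMAINS];
    int created;          /* how many domains were actually opened */
} domain_cache_t;

/* Models the provider's answer: can 'existing' serve 'want_caps'? */
typedef bool (*can_share_fn)(const shared_domain_t *existing,
                             unsigned long want_caps);

/* Assumed policy: share when the open domain already covers the request. */
bool share_if_caps_covered(const shared_domain_t *existing,
                           unsigned long want_caps)
{
    return (existing->caps & want_caps) == want_caps;
}

/* Assumed policy: a provider that cannot share domains at all. */
bool never_share(const shared_domain_t *existing, unsigned long want_caps)
{
    (void) existing;
    (void) want_caps;
    return false;
}

/* Reuse a matching domain if the provider allows it; otherwise open a
 * separate one. Returns NULL only when the cache is full. */
shared_domain_t *domain_acquire(domain_cache_t *cache, const char *provider,
                                const char *name, unsigned long want_caps,
                                can_share_fn can_share)
{
    assert(cache != NULL && provider != NULL && name != NULL);
    assert(can_share != NULL);

    shared_domain_t *free_slot = NULL;
    for (int i = 0; i < MAX_DOMAINS; i++) {
        shared_domain_t *d = &cache->slots[i];
        if (!d->in_use) {
            if (free_slot == NULL) {
                free_slot = d;
            }
            continue;
        }
        if (strcmp(d->provider, provider) == 0 &&
            strcmp(d->name, name) == 0 && can_share(d, want_caps)) {
            d->refcount++;            /* reuse path */
            return d;
        }
    }
    if (free_slot == NULL) {
        return NULL;
    }
    /* Fallback path: create a separate domain, as before. */
    memset(free_slot, 0, sizeof(*free_slot));
    strncpy(free_slot->provider, provider, sizeof(free_slot->provider) - 1);
    strncpy(free_slot->name, name, sizeof(free_slot->name) - 1);
    free_slot->caps = want_caps;
    free_slot->refcount = 1;
    free_slot->in_use = true;
    cache->created++;
    return free_slot;
}
```

In this sketch, an MTL request followed by a BTL request for the same provider and domain name yields one shared entry when the callback agrees, and two entries when it does not, which mirrors the "share if possible, otherwise create two domains" intent.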

@jiaxiyan
Contributor Author

@bwbarrett @hppritcha Can you review this?

@jiaxiyan jiaxiyan changed the title from "ofi/btl: Reuse MTL's domain and fabric in BTL" to "ofi: Reuse MTL's domain and fabric in BTL" Oct 7, 2025
@jiaxiyan jiaxiyan changed the title from "ofi: Reuse MTL's domain and fabric in BTL" to "ofi: Share the domain among MTL and BTL" Oct 9, 2025
bwbarrett previously approved these changes Oct 9, 2025
Member

@bwbarrett bwbarrett left a comment

Please reorder the commits so that the domain sharing patch is last. The others are cleaning up bugs that are more likely to be exposed when the sharing code runs (as we've found with our efa provider cleanups).

For the domain-sharing patch, please add some detail to the commit message on why we did this rather than having the common code allocate a single domain with the superset of requirements. Namely: if providers can reuse the fabric and domain given the different capability sets, we'll do that; if they can't, we'll happily just create two domains.

Other than that, looks good to me.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add FI_COMPLETION flag to ensure completion entries are generated
for all data transfer operations.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Share the domain between the MTL and BTL layers to reduce the total
number of domains created. This helps avoid hitting system resource
limits on platforms with high core counts.

Instead of having the common code allocate a single domain with the
superset of all required capabilities, we attempt to reuse an existing
fabric and domain if the providers can support MTL’s and BTL’s different
capability sets. This approach allows providers that support domain
sharing to reuse resources efficiently while still preserving
flexibility. If the providers cannot reuse the fabric and domain due to
incompatible requirements, separate domains will be created as before.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>