ofi: Share the domain among MTL and BTL #13415
@bwbarrett @hppritcha Can you review this?
Please reorder the commits so that the domain sharing patch is last. The others are cleaning up bugs that are more likely to be exposed when the sharing code runs (as we've found with our efa provider cleanups).
For the domain-sharing patch, please add some details to the commit message on why we did this as opposed to just having the common code allocate a domain with the superset of requirements. Namely, if providers can reuse the fabric and domain given the different capability sets, we'll do that. But if providers can't reuse the fabric and domain, we'll happily just create two domains.
Other than that, looks good to me.
Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add FI_COMPLETION flag to ensure completion entries are generated for all data transfer operations.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Share the domain between the MTL and BTL layers to reduce the total number of domains created. This helps avoid hitting system resource limits on platforms with high core counts.

Instead of having the common code allocate a single domain with the superset of all required capabilities, we attempt to reuse an existing fabric and domain if the providers can support MTL’s and BTL’s different capability sets. This approach allows providers that support domain sharing to reuse resources efficiently while still preserving flexibility. If the providers cannot reuse the fabric and domain due to incompatible requirements, separate domains will be created as before.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
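For readers skimming the thread, the reuse-or-fall-back decision described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: every `sketch_` name, the capability bits, and the fixed-size table are hypothetical stand-ins, and no libfabric or Open MPI calls are made. The `provider_caps` argument stands in for whatever the provider reports it can serve on a single domain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical capability bits, for illustration only. */
enum {
    SKETCH_CAP_TAGGED = 1u << 0, /* e.g. what an MTL-like user might need */
    SKETCH_CAP_RMA    = 1u << 1, /* e.g. what a BTL-like user might need */
    SKETCH_CAP_ATOMIC = 1u << 2
};

/* Hypothetical record of an opened domain (not a libfabric type). */
typedef struct {
    const char *provider; /* caller keeps the string alive */
    const char *name;     /* caller keeps the string alive */
    uint64_t    caps;     /* capabilities this domain can serve */
    int         users;    /* how many layers currently share it */
} sketch_domain_t;

#define SKETCH_MAX_DOMAINS 8
sketch_domain_t sketch_domains[SKETCH_MAX_DOMAINS];
size_t sketch_num_domains = 0;

/*
 * Return a domain for the caller. If an already-opened domain from the
 * same provider, with the same name, can serve the needed capabilities,
 * share it. Otherwise record a separate domain, as before.
 * Returns NULL only when the illustrative table is full.
 */
sketch_domain_t *sketch_domain_acquire(const char *provider, const char *name,
                                       uint64_t needed, uint64_t provider_caps)
{
    assert(provider != NULL && name != NULL);

    for (size_t i = 0; i < sketch_num_domains; i++) {
        sketch_domain_t *d = &sketch_domains[i];
        if (strcmp(d->provider, provider) == 0 &&
            strcmp(d->name, name) == 0 &&
            (needed & ~d->caps) == 0) {
            d->users++; /* compatible: reuse the existing domain */
            return d;
        }
    }

    if (sketch_num_domains == SKETCH_MAX_DOMAINS) {
        return NULL;
    }

    /* Incompatible or first user: fall back to a separate domain. */
    sketch_domain_t *d = &sketch_domains[sketch_num_domains++];
    d->provider = provider;
    d->name = name;
    d->caps = provider_caps;
    d->users = 1;
    return d;
}
```

In this sketch, two callers asking for different capability sets end up on one domain when the first domain can serve both, and on two domains otherwise. Teardown and reference release are omitted to keep the example short.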