Skip to content

More btl uct updates that fix various issues discovered on our hardware #13094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Feb 12, 2025

Fixes one of the remaining issues I see on our hardware. With these commits it runs cleanly with RMA-MT.

Note the second commit is not really a functional change but a cleanup change.

This commit fixes a regression I introduced when changing how the endpoint flags
are set. There is no guarantee when the remote completion message comes in that
the local endpoint has been connected (only the remote endpoint). Instead of
always setting the endpoint as ready here the code now only sets that the remote
is connected and then sets the endpoint as ready only if both the local and
remote endpoints are connected. When the local endpoint is connected it
similarly checks that the remote is connected before setting the endpoint as
ready.

This was verified using RMA-MT with 128 threads and it does fix the race.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Connection endpoints are created to assist with connection connect-to-endpoint
UCT endpoints. In most situations it is created and destroyed for each thread
context but it may be used by multiple simultaneous threads. The code retains
and releases the endpoint at multiple points where needed. This is overkill.
This commit updates the code to retain once per connecting thread (or the
original creation) and then releases when the connection is complete.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
@hjelmn hjelmn merged commit 9fb2d4e into open-mpi:main Feb 12, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants