Fix various bugs in get/put fallbacks in pml/ob1 #12817

@hjelmn hjelmn commented Sep 19, 2024

These were discovered when using pml/ob1 with btl/uct on a new system. I have not been working on Open MPI in some time, so it is hard to tell exactly when these bugs were introduced, but they could be quite old. See the commit messages for more details.

Fixes #10545

…allback

Under a number of circumstances it may be necessary to abandon an RDMA get in
ob1. In some cases it falls back to put but it may fall back to using send/recv.
If that happens then we may either crash or leak RDMA fragments because they
are still attached to the send request. Debug builds will crash due to a check
on rdma_frag when they are returned. This CL fixes the flaw by releasing any
RDMA fragment when scheduling sends.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
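
A minimal sketch of the idea in this commit, assuming a much simplified model: every type and function name below (`send_request_t`, `frag_alloc`, `schedule_sends`, `request_return`) is a hypothetical stand-in and not the ob1 code. It only shows why a fragment left attached after an abandoned get must be released before the request falls back to send/recv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; not the ob1 structures. */
typedef struct rdma_frag {
    int in_use;
} rdma_frag_t;

typedef struct send_request {
    rdma_frag_t *rdma_frag;  /* attached while an RDMA get is pending */
    int scheduled_sends;
} send_request_t;

static int frags_outstanding = 0;

static rdma_frag_t *frag_alloc(void)
{
    rdma_frag_t *frag = calloc(1, sizeof(*frag));
    if (frag != NULL) {
        frag->in_use = 1;
        frags_outstanding++;
    }
    return frag;
}

static void frag_return(rdma_frag_t *frag)
{
    assert(frag->in_use);
    frag->in_use = 0;
    frags_outstanding--;
    free(frag);
}

/* Fallback to send/recv: release any fragment left over from the
 * abandoned get before scheduling sends, so it is neither leaked nor
 * still attached when the request is returned. */
static void schedule_sends(send_request_t *req)
{
    if (req->rdma_frag != NULL) {
        frag_return(req->rdma_frag);
        req->rdma_frag = NULL;
    }
    req->scheduled_sends++;
}

/* Request teardown: stands in for the debug-build check on rdma_frag. */
static void request_return(send_request_t *req)
{
    assert(req->rdma_frag == NULL);
    req->scheduled_sends = 0;
}
```

Without the release in `schedule_sends`, the assertion in `request_return` would fire in this model, and with assertions compiled out the fragment would simply leak.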

If a put or get operation fails it may later be retried by
mca_pml_ob1_process_pending_rdma, which increments retries on each new attempt.
There is a flaw in the code where the put and get failure paths also increment
this counter, so it gives up twice as fast. This commit removes the
increments on the put and get failures.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
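
The effect of the double increment can be illustrated with a small model. This is a sketch under stated assumptions: `MAX_RETRIES`, `frag_t`, `on_rdma_failure`, and `process_pending` are hypothetical names, and the way the limit is checked here is an assumption rather than the ob1 logic.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RETRIES 4  /* hypothetical retry limit */

typedef struct {
    int retries;
} frag_t;

/* Failure path. With count_in_failure_path set, it models the flaw
 * where the failure handler also bumps the counter. Returns true if
 * the fragment should be queued for another attempt. */
static bool on_rdma_failure(frag_t *frag, bool count_in_failure_path)
{
    if (count_in_failure_path) {
        frag->retries++;
    }
    return frag->retries < MAX_RETRIES;
}

/* Pending-queue processing: the one place that should count an attempt. */
static void process_pending(frag_t *frag)
{
    frag->retries++;
}

/* Number of retries actually performed when every attempt fails. */
static int retries_before_giving_up(bool count_in_failure_path)
{
    frag_t frag = { 0 };
    int performed = 0;

    while (on_rdma_failure(&frag, count_in_failure_path)) {
        process_pending(&frag);
        performed++;
    }
    return performed;
}
```

In this model, counting only in `process_pending` allows 4 retries, while counting in both places allows only 2.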

…n failure

The mca_pml_ob1_recv_request_get_frag_failed method is responsible for returning
or queueing the fragment but mca_pml_ob1_rget_completion was freeing it
unconditionally. This will lead to a double return of the fragment to the free
list and may lead to other errors if the fragment was queued for retry. This
commit fixes the issue by returning the fragment only if the get did not fail.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
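
The ownership rule described in this commit can be sketched as follows. The names (`frag_t`, `get_frag_failed`, `rget_completion`) and the `can_retry` flag are hypothetical simplifications, not the ob1 signatures; the point is only that the completion callback returns the fragment on success and otherwise leaves it to the failure handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fragment that records how often it was returned. */
typedef struct frag {
    int returns;   /* times handed back to the free list */
    bool queued;   /* true if waiting for a retry */
} frag_t;

static void frag_return(frag_t *frag)
{
    frag->returns++;
}

/* Failure handler: owns the fragment from here on. It either queues
 * the fragment for retry or returns it to the free list. */
static void get_frag_failed(frag_t *frag, bool can_retry)
{
    if (can_retry) {
        frag->queued = true;
    } else {
        frag_return(frag);
    }
}

/* Completion callback: returns the fragment only when the get
 * succeeded. On failure the handler above has already dealt with it,
 * so returning it again would be a double return. */
static void rget_completion(frag_t *frag, int status, bool can_retry)
{
    if (status != 0) {
        get_frag_failed(frag, can_retry);
        return;
    }
    frag_return(frag);
}
```

An unconditional `frag_return` at the end of `rget_completion` would return a failed fragment twice, or return one that is still sitting in the retry queue.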

@bosilca bosilca merged commit cb3890a into open-mpi:main Sep 20, 2024
20 checks passed
Successfully merging this pull request may close these issues.

Intermittent hangs, crashes and assert fails with ob1+uct