OSC/UCX: explicitly flush default endpoints #13216
Draft: MamziB wants to merge 1 commit into open-mpi:main from MamziB:mamzi/osc-ucx-finalize
There are two problems with this conditional collective, and they both lead to deadlocks.

1. `opal_common_ucx_thread_enabled` inherits the thread level from `MPI_Init_thread`, and MPI processes on the same communicator can have different levels of thread support. Thus, the true branch of this conditional can be taken by some processes but not all in the communicator, and the barrier added here will deadlock.
2. `mca_osc_ucx_component.num_modules == 1` will then be true only on some processes, leading again to the barrier not being called everywhere in the communicator.

Why is this going over all the processes in `MPI_COMM_WORLD` instead of all the processes in the communicator? The underlying issue is that a collective behavior cannot be decided by a non globally consistent decision process (the threading level of the last window).
If the issue is in `MPI_Finalize`, as indicated in the log message, why are you adding a fix in `win_free` (which is called for every window)? I would have suggested making this code unconditional and moving it into the OSC UCX `component_close`, but unfortunately the OSC components are closed using `ompi_mpi_instance_append_finalize` and I don't know what exactly the order is compared with the PML and COLL components (both needed for this PR).
Thanks for pointing this out. In this case, I need to move the barrier outside of the conditional block.

> Why is this going over all the processes in `MPI_COMM_WORLD` instead of all the processes in the communicator?

The objective is to ensure that during `MPI_Finalize`, all communication endpoints are properly flushed. To achieve this, we enforce a barrier both before and after the flush operation across all created endpoints (EPs).

It is important to note that EPs are created dynamically whenever communication occurs between two processes. Therefore, not all entries in the endpoints array are guaranteed to be valid.
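The "not all entries are valid" point can be sketched as follows. This is a hedged, self-contained model, not the actual OSC/UCX code: `ep_t` stands in for a real `ucp_ep_h`, and setting a `flushed` field stands in for the real per-endpoint flush call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a UCX endpoint handle; in the per-window
 * array, entries stay NULL until communication with that peer first
 * creates the endpoint. */
typedef struct {
    int peer;
    int flushed;
} ep_t;

/* Flush only the endpoints that were actually created, returning how
 * many were flushed. Marking the field stands in for the real flush. */
static size_t flush_valid_endpoints(ep_t **eps, size_t nprocs)
{
    size_t count = 0;
    for (size_t i = 0; i < nprocs; i++) {
        if (eps[i] == NULL) {
            continue; /* never communicated with this peer */
        }
        eps[i]->flushed = 1;
        count++;
    }
    return count;
}
```

The NULL check is the essential part: iterating the whole array of size `nprocs` while skipping never-connected peers is what lets the flush loop run safely even though endpoints are created lazily.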
Since I do not have a communicator object during `component_finalize` to enforce the barrier, I am trying to flush during the last window free.
I updated the commit; please take a look. Thank you!
Yes, I am aware of that. That should use the same worker as the default worker, since here we are running single-threaded MPI. However, it seems that was not enough, and we need to call EP flush individually on each target, too. Let me try to convert the worker flush to just use EP flush, and get back to you.
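The per-target pattern being considered here can be sketched without UCX. All names below are hypothetical; in the real code, the nonblocking per-endpoint flush would be something like `ucp_ep_flush_nbx`, and the worker would be driven with `ucp_worker_progress` until every returned request completes. The toy model makes each "request" need a few progress calls to finish.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a nonblocking flush request: it completes after
 * `remaining` rounds of worker progress. */
typedef struct {
    int remaining;
} flush_req_t;

static int req_completed(const flush_req_t *r)
{
    return r->remaining == 0;
}

/* Stand-in for ucp_worker_progress: advance every pending request. */
static void worker_progress(flush_req_t *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].remaining > 0) {
            reqs[i].remaining--;
        }
    }
}

/* Drive all per-EP flush requests to completion ("flush each target,
 * then progress until done"), returning the number of progress rounds. */
static size_t flush_all(flush_req_t *reqs, size_t n)
{
    size_t spins = 0;
    for (;;) {
        size_t done = 0;
        for (size_t i = 0; i < n; i++) {
            if (req_completed(&reqs[i])) {
                done++;
            }
        }
        if (done == n) {
            return spins;
        }
        worker_progress(reqs, n);
        spins++;
    }
}
```

The key difference from a single worker flush is that completion is tracked per target, so a slow peer cannot be silently skipped.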
I tried the below code to flush, but the finalize hang still occurs.
I should have unfolded this function completely! First, because I don't understand why we need two barriers for each `window_free`. And second, because I don't understand how this fixes the problem. Let's assume we are using the UCX PML and a collective framework based on point-to-point. How come the barrier, which enqueues very short messages into most of the EPs without a real synchronization, is not subject to the same issue (i.e., pending acks while trying to finalize)?
When I opened the PR, I was under the impression that default EPs were not being flushed during window_free. However, after further discussion and a deeper dive into the code, it appears that the worker being flushed in window_free is indeed the default worker. So in theory, we should be covered in terms of flushing before MPI_Finalize.
Given that, what @devreal suggested about converting worker flush to EP flush should have worked — but it seems that it doesn’t.
I’ll continue investigating to better understand how this patch is helping to resolve the issue.
Maybe it solves it the same way an external barrier did in the past: it introduces a delay before closing the endpoints, allowing the acks to flow back, and thus hides the core issue in most cases (without really solving it).