OSC/UCX: explicitly flush default endpoints #13216
base: main
Conversation
@janjust this should fix the Graph500 simple test bus error/hang during MPI_Finalize.
@@ -1162,6 +1165,29 @@ int ompi_osc_ucx_free(struct ompi_win_t *win) {
    return ret;
}

if (!opal_common_ucx_thread_enabled &&
There are two problems with this conditional collective, and they both lead to deadlocks.

- `opal_common_ucx_thread_enabled` inherits the thread level from `MPI_Init_thread`, and MPI processes on the same communicator can have different levels of thread support. Thus the true branch of this conditional can be taken by some processes, but not by all in the communicator, and the barrier added here will deadlock.
- Imagine a setup where some processes still have active windows while others don't. The second part of the condition, `mca_osc_ucx_component.num_modules == 1`, will then be true only on some processes, again leading to the barrier not being called everywhere in the communicator.
- Why is this going over all the processes in `MPI_COMM_WORLD` instead of all the processes in the communicator?

The issue here is that a collective behavior cannot be decided by a non-globally-consistent decision process (the threading level of the last window).
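For illustration, here is a hedged sketch of the pattern being criticized. The condition is taken from the quoted diff; the shape and placement of the barrier call are assumptions, not the exact PR code.

```c
/* Sketch of the problematic pattern: a collective (barrier) guarded by a
 * condition that each process evaluates locally and that can differ
 * between processes. If only some ranks of the communicator enter the
 * branch, the barrier never completes and the application deadlocks. */
if (!opal_common_ucx_thread_enabled &&           /* thread level: per process */
    mca_osc_ucx_component.num_modules == 1) {    /* open windows: per process */
    module->comm->c_coll->coll_barrier(module->comm,
                                       module->comm->c_coll->coll_barrier_module);
    /* ... flush endpoints ... */
}
```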
If the issue is in `MPI_Finalize`, as indicated in the log message, why are you adding a fix in `win_free` (which is called for every window)? I would have suggested making this code unconditional and moving it into the OSC UCX `component_close`, but unfortunately the OSC components are closed using `ompi_mpi_instance_append_finalize`, and I don't know what the order is compared with the PML and COLL components (both needed for this PR).
Thanks for pointing this out. In this case, I need to move the barrier outside of the conditional block.

> Why is this going over all the processes in MPI_COMM_WORLD instead of all the processes in the communicator?

The objective is to ensure that during MPI_Finalize, all communication endpoints are properly flushed. To achieve this, we enforce a barrier both before and after the flush operation across all created endpoints (EPs).

It is important to note that EPs are created dynamically whenever communication occurs between two processes. Therefore, not all entries in the endpoints array are guaranteed to be valid.
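A minimal sketch of the sequence described above (barrier, flush all created EPs, barrier), assuming illustrative names: the helper function name and the `OSC_UCX_GET_EP`-style endpoint accessor are placeholders, while `ucp_ep_flush()` and the `coll_barrier` callback are real APIs.

```c
#include <ucp/api/ucp.h>
#include "osc_ucx.h"   /* ompi_osc_ucx_module_t; accessor below is assumed */

/* Hedged sketch, not the actual PR diff. */
static int osc_ucx_flush_default_eps(ompi_osc_ucx_module_t *module)
{
    int comm_size = ompi_comm_size(module->comm);

    /* First barrier: no peer is still issuing new operations toward us. */
    module->comm->c_coll->coll_barrier(module->comm,
                                       module->comm->c_coll->coll_barrier_module);

    for (int rank = 0; rank < comm_size; rank++) {
        ucp_ep_h ep = OSC_UCX_GET_EP(module->comm, rank);  /* assumed accessor */
        if (NULL == ep) {
            continue;  /* EP was never created: no traffic with this peer */
        }
        ucs_status_t status = ucp_ep_flush(ep);  /* complete outstanding ops */
        if (UCS_OK != status) {
            return OMPI_ERROR;
        }
    }

    /* Second barrier: only after every rank has flushed is it safe to
     * start destroying endpoints. */
    module->comm->c_coll->coll_barrier(module->comm,
                                       module->comm->c_coll->coll_barrier_module);
    return OMPI_SUCCESS;
}
```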
Since I do not have a communicator object during component_finalize to enforce the Barrier, I am trying to flush during the last window free.
I updated the commit; please take a look. Thank you!
Yes, I am aware of that. That should use the same worker as the default worker, since here we are running single-threaded MPI. However, it seems that was not enough, and we need to call EP flush individually on each target, too. Let me try to convert the worker flush to an EP flush and get back to you.
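For reference, a hedged sketch of the two flush granularities being compared here; the UCP calls are real API, while the function name and surrounding variables are placeholders.

```c
#include <ucp/api/ucp.h>

/* Illustrative only: flush at worker granularity, then per endpoint. */
static ucs_status_t flush_worker_then_eps(ucp_worker_h worker,
                                          ucp_ep_h *eps, int nprocs)
{
    /* Worker-level flush: waits for completion of all outstanding
     * operations issued on any endpoint created from this worker. */
    ucs_status_t status = ucp_worker_flush(worker);
    if (UCS_OK != status) {
        return status;
    }

    /* Endpoint-level flush: completes outstanding operations toward one
     * peer at a time, so it must be called on every created EP. */
    for (int i = 0; i < nprocs; i++) {
        if (NULL == eps[i]) {
            continue;  /* no EP was ever created for this peer */
        }
        status = ucp_ep_flush(eps[i]);
        if (UCS_OK != status) {
            return status;
        }
    }
    return UCS_OK;
}
```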
I tried the code below to flush, but the finalize hang still occurs.

    int comm_size = ompi_comm_size(module->comm);
    for (int i = 0; i < comm_size; i++) {
        assert(module->ctx != NULL);
        /* flush outstanding operations targeting rank i */
        ret = opal_common_ucx_ctx_flush(module->ctx, OPAL_COMMON_UCX_SCOPE_EP, i);
        if (OPAL_SUCCESS != ret) {
            return ret;
        }
    }
I should have unfolded this function completely! First, I don't understand why we need two barriers for each window_free. And second, I don't understand how this fixes the problem. Let's assume we are using the UCX PML and a collective framework based on point-to-point. How come the barrier, which enqueues very short messages into most of the EPs without a real synchronization, is not subject to the same issue (i.e., pending acks while trying to finalize)?
When I opened the PR, I was under the impression that default EPs were not being flushed during window_free. However, after further discussion and a deeper dive into the code, it appears that the worker being flushed in window_free is indeed the default worker. So in theory, we should be covered in terms of flushing before MPI_Finalize.
Given that, what @devreal suggested about converting worker flush to EP flush should have worked — but it seems that it doesn’t.
I’ll continue investigating to better understand how this patch is helping to resolve the issue.
Maybe it solves it the same way an external barrier did in the past: it introduces a delay before closing the endpoints, allowing the acks to flow back, and thus hides the core issue in most cases (without really solving it).
Please see my comment.
Force-pushed from d2d7778 to 0886a6b.

Signed-off-by: Mamzi Bayatpour <mbayatpour@nvidia.com>

Force-pushed from 0886a6b to 224168c.
This PR fixes a hang in osc/ucx observed during MPI_Finalize with Graph500 by ensuring no outstanding operations remain on default endpoints before they are destroyed.