-
Notifications
You must be signed in to change notification settings - Fork 928
PML/UCX: blocked calls optimizations #5569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PML/UCX: blocked calls optimizations #5569
Conversation
|
@yosefe could you review? |
ompi/mca/pml/ucx/pml_ucx.c
Outdated
| if (status != UCS_INPROGRESS) { | ||
| mca_pml_ucx_set_recv_status_safe(mpi_status, status, &info); | ||
| return OMPI_SUCCESS; | ||
| while (1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we reuse MCA_COMMON_UCX_WAIT_LOOP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added macro for progress loop
ompi/mca/pml/ucx/pml_ucx.c
Outdated
| if (ucp_msg != NULL) { | ||
| mca_pml_ucx_set_recv_status_safe(mpi_status, UCS_OK, &info); | ||
| return OMPI_SUCCESS; | ||
| while (1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also update iprobe/improbe/mprobe
the non-blocking versions should trigger opal_progress 1/progress_iterations of the times [if not found a match]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't get - add loop for UCX/opal progress calling into non-blocking calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
today non-blocking probe calls opal_progress
we want it to call (g_count % progress_iterations) ? ucx_progress : opal_progress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will require additional static counter inside of routine, is it ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the counter does not even have to be thread-safe for correctness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
bot:mellanox:retest |
|
@yosefe could you look? |
ompi/mca/pml/ucx/pml_ucx.c
Outdated
| mca_pml_ucx_set_recv_status_safe(mpi_status, UCS_OK, &info); | ||
| } else { | ||
| opal_progress(); | ||
| ucx_progress_cntr = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make it unsigned and let it wrap around? then no need to reset to 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reset is used to count "progress iterations" after every completed request to avoid unpredictable calls of opal_progress.
do you think it is bad idea?
ompi/mca/pml/ucx/pml_ucx.c
Outdated
| int mca_pml_ucx_iprobe(int src, int tag, struct ompi_communicator_t* comm, | ||
| int *matched, ompi_status_public_t* mpi_status) | ||
| { | ||
| static int ucx_progress_cntr = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it's just inside this function, we rename it to "progress_count"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to progress_count
ompi/mca/pml/ucx/pml_ucx.c
Outdated
| int *matched, struct ompi_message_t **message, | ||
| ompi_status_public_t* mpi_status) | ||
| { | ||
| static int ucx_progress_cntr = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static unsigned progress count = 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
opal/mca/common/ucx/common_ucx.h
Outdated
| /* progress loop to allow call UCX/opal progress */ | ||
| /* used C99 for-statement variable initialization */ | ||
| #define MCA_COMMON_UCX_PROGRESS_LOOP(_worker) \ | ||
| for (int iter = 0;; (++iter % opal_common_ucx.progress_iterations) ? \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsigned iter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
opal/mca/common/ucx/common_ucx.h
Outdated
| ucs_status_t status; \ | ||
| /* call UCX progress */ \ | ||
| MCA_COMMON_UCX_PROGRESS_LOOP(_worker) { \ | ||
| if (UCS_INPROGRESS != (status = opal_common_ucx_request_status(_request))) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's split this to 2 lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ompi/mca/pml/ucx/pml_ucx.c
Outdated
| *matched = 1; | ||
| mca_pml_ucx_set_recv_status_safe(mpi_status, UCS_OK, &info); | ||
| ucx_progress_cntr = 0; | ||
| progress_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
ompi/mca/pml/ucx/pml_ucx.c
Outdated
| *matched = 1; | ||
| mca_pml_ucx_set_recv_status_safe(mpi_status, UCS_OK, &info); | ||
| ucx_progress_cntr = 0; | ||
| progress_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
- added UCX progress priority Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
- refactoring of opal/UCX progress calls Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
- renamed of internal variable names - used unsigned datatypes Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
42bf169 to
c201c0a
Compare
|
@hoopoepg need this one also in v4.0.x |
Signed-off-by: Sergey Oblomov sergeyo@mellanox.com