
OSC/UCX: preserve the accumulate ordering for overlapping buffers during acc-lock-less epochs and setting a proper wpool context mutex type #11178


Merged: 2 commits into open-mpi:main on Dec 9, 2022

Conversation

@MamziB (Contributor) commented Dec 7, 2022

OSC/UCX: preserve the accumulate ordering for overlapping buffers during acc-lock-less epochs and setting a proper wpool context mutex type

Signed-off-by: Mamzi Bayatpour mbayatpour@nvidia.com
Co-authored-by: Tomislav Janjusic tomislavj@nvidia.com

@devreal (Contributor) left a comment


Thanks for this work @MamziB! Just a few questions/hints in the comments.

OSC/UCX: preserve the accumulate ordering for overlapping buffers during acc-lock-less epochs

Signed-off-by: Mamzi Bayatpour  <mbayatpour@nvidia.com>
Co-authored-by: Tomislav Janjusic <tomislavj@nvidia.com>
@MamziB force-pushed the mamzi/outstanding-nb-acc branch from 5c3059c to e3c3391 on December 8, 2022 at 20:07
@MamziB (Contributor, Author) commented Dec 8, 2022

@devreal Thanks for the comments. Please take a look at the updated commit.

OSC/UCX: setting a proper wpool context mutex type. Note that this object was already being constructed using
opal_recursive_mutex_t inside opal_common_ucx_wpctx_create, but inside
opal_common_ucx_ctx_t, it was missing the proper type.

Signed-off-by: Mamzi Bayatpour  <mbayatpour@nvidia.com>
Co-authored-by: Tomislav Janjusic <tomislavj@nvidia.com>
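
As an editorial illustration of what this commit describes (the struct members, helper name, and include paths below are assumptions for the sketch, not the exact Open MPI source): the wpool context's mutex member is declared with the same recursive OPAL mutex class that the constructor already uses, so declaration and construction agree.

    /* Illustrative sketch only: names and include paths are assumptions. */
    #include "opal/class/opal_object.h"
    #include "opal/mca/threads/mutex.h"

    typedef struct sketch_common_ucx_ctx {
        opal_recursive_mutex_t mutex;   /* presumably declared as opal_mutex_t before this commit */
        /* ... other wpool context members ... */
    } sketch_common_ucx_ctx_t;

    static void sketch_wpctx_create(sketch_common_ucx_ctx_t *ctx)
    {
        /* The constructor already used the recursive class; after this commit
         * the declared member type matches it, so the recursive semantics the
         * rest of the code relies on are what the type advertises. */
        OBJ_CONSTRUCT(&ctx->mutex, opal_recursive_mutex_t);
    }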
@MamziB changed the title from "OSC/UCX: preserve the accumulate ordering for overlapping buffers during acc-lock-less epochs" to "OSC/UCX: preserve the accumulate ordering for overlapping buffers during acc-lock-less epochs and setting a proper wpool context mutex type" on Dec 8, 2022
@janjust requested a review from devreal on December 8, 2022 at 21:48

    if (!op_added) {
        /* no more space so flush */
        ret = opal_common_ucx_ctx_flush(module->ctx, OPAL_COMMON_UCX_SCOPE_WORKER, 0);
@devreal (Contributor) commented on this diff:

Is it safe to hold the ctx mutex during the flush?

@MamziB (Contributor, Author) replied:

Yes, the ctx mutex is recursive, so it should be OK; it does not hang.
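
For illustration, a rough sketch of the pattern being discussed (the function is invented and the mutex field name is an assumption; only module->ctx and the flush call mirror the snippet above):

    /* Sketch: the accumulate path holds the ctx mutex across the flush.
     * Because the mutex is recursive, the flush may acquire the same lock
     * again on the same thread without deadlocking. */
    static int sketch_acc_path(ompi_osc_ucx_module_t *module)
    {
        int ret;
        opal_mutex_lock(&module->ctx->mutex);   /* taken by the accumulate path */
        /* ... try to record the outstanding op; suppose there is no more space ... */
        ret = opal_common_ucx_ctx_flush(module->ctx, OPAL_COMMON_UCX_SCOPE_WORKER, 0);
        opal_mutex_unlock(&module->ctx->mutex);
        return ret;
    }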

@devreal (Contributor) commented Dec 9, 2022

Thanks, it's good now. Two more thoughts, though:

  1. Should we store the rank along with the base and tail, to avoid false positives and to allow us to flush the endpoint instead of the worker?
  2. Instead of resetting the array after a flush, it might be more efficient to reset that particular entry in the completion callback. That probably wouldn't even need the mutex in the callback, since we're only resetting two integer values to 0 (the window of opportunity for a race is small, and even if there is one, in the worst case we just don't recognize that entry as being empty). There is also a good chance that it would reduce the need for flushes, since entries are continually reset as operations complete (a rough sketch of this idea follows below).

Those are optimizations though and we could do them as a follow-up. I'll approve what we have now 👍
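
Purely to illustrate the second suggestion (all names below are invented for this sketch, not the PR's code): each outstanding nonblocking accumulate could carry its target rank plus the base/tail of the buffer it touches, and the completion callback would clear just that slot.

    #include <stdint.h>

    /* Hypothetical record for one outstanding nonblocking accumulate. */
    typedef struct nb_acc_slot {
        int      rank;   /* target rank, to avoid false positives across targets */
        uint64_t base;   /* start of the buffer range touched by the op          */
        uint64_t tail;   /* end of the buffer range touched by the op            */
    } nb_acc_slot_t;

    /* Hypothetical completion callback: reset only this entry.  Clearing two
     * integers is benign without the mutex; a racing reader may at worst still
     * see the slot as occupied and issue an unnecessary flush. */
    static void nb_acc_completion_cb(void *user_data)
    {
        nb_acc_slot_t *slot = (nb_acc_slot_t *)user_data;
        slot->base = 0;
        slot->tail = 0;
    }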

@janjust (Contributor) commented Dec 9, 2022

@devreal I'll open an issue to track the suggestions as a follow up - thanks for the comments.

@janjust merged commit 293cf4b into open-mpi:main on Dec 9, 2022
@janjust (Contributor) commented Dec 9, 2022

@MamziB v5.0.x backport please, when you can.

@MamziB (Contributor, Author) commented Dec 9, 2022

@devreal Thanks for the suggestions. My comments are as follows:

  1. Yes, we could do that. However, the current implementation of the osc/ucx flush mandates that all nonblocking accumulates finish before the flush completes (we flush the worker for that). Therefore, I am not sure how much we would gain by changing the flush from the worker to the target endpoint, but we should certainly investigate this suggestion's performance benefits.

  2. Actually, this is similar to how I originally envisioned the design: adding the memory info to the nonblocking request and clearing the corresponding entry from the array when the request completes. However, since accumulate already has its own lock, this patch is only relevant when we do not hold the acc lock (meaning only if the application calls MPI_Win_lock/MPI_Win_lock_all with EXCLUSIVE). Therefore, for the time being, I tried to isolate the memory-range handling from all other parts of the code (see the sketch after this comment for a rough illustration of that tracking). The benefit of the current patch is that no other type of RMA synchronization is impacted by it, and it keeps the osc/ucx accumulate design simple.
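
As a rough editorial sketch of the tracking described in this thread (names and sizes below are invented, and the actual PR logic may differ): the acc-lock-less path keeps a small table of outstanding accumulate ranges and forces a worker flush when a new op overlaps an outstanding one or when the table has no free slot, which corresponds to the "no more space so flush" branch shown earlier.

    #include <stdbool.h>
    #include <stdint.h>

    #define SKETCH_TABLE_SIZE 64            /* illustrative capacity */

    typedef struct { uint64_t base, tail; } sketch_range_t;   /* tail == 0 marks a free slot */

    static bool sketch_overlaps(const sketch_range_t *r, uint64_t base, uint64_t tail)
    {
        return r->tail != 0 && base <= r->tail && r->base <= tail;
    }

    /* Returns true if the new accumulate was recorded; returns false and sets
     * *needs_flush when the caller must flush the worker and reset the table
     * (overlap with an outstanding op, or no free slot left). */
    static bool sketch_record_acc(sketch_range_t table[SKETCH_TABLE_SIZE],
                                  uint64_t base, uint64_t tail, bool *needs_flush)
    {
        int free_slot = -1;
        *needs_flush = false;
        for (int i = 0; i < SKETCH_TABLE_SIZE; i++) {
            if (sketch_overlaps(&table[i], base, tail)) {
                *needs_flush = true;        /* preserve ordering for overlapping buffers */
                return false;
            }
            if (0 == table[i].tail && free_slot < 0) {
                free_slot = i;
            }
        }
        if (free_slot < 0) {
            *needs_flush = true;            /* table full: the !op_added case above */
            return false;
        }
        table[free_slot].base = base;
        table[free_slot].tail = tail;
        return true;
    }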

#11184
