OSC UCX: Enable software atomicity when lock is set to MPI_MODE_NOCHECK #10493
Conversation
@MamziB The title (and commit message) seem confusing and make it sound like setting MPI_MODE_NOCHECK is what enables software atomicity.
In osc/ucx, locking windows with MPI_MODE_NOCHECK would disable software-based atomicity in atomic calls such as MPI_Accumulate/MPI_Get_accumulate/etc. In other words, these calls are not atomic anymore when MPI_MODE_NOCHECK is set (if HW support is not available).
OK, I get now what the intention of this patch is. The commit message makes it sound like software-based atomicity is always enabled if MPI_MODE_NOCHECK is set. Looking at the standard, though, I wonder whether that is necessary:
Clearly, there should not be the need for the atomic lock if
Thanks @janjust! The example uses shared locks, which should be covered by the existing test in
So unless we have an exclusive lock we need to take the atomic lock. For shared locks it is true that we need to take the atomic lock, independent of whether MPI_MODE_NOCHECK is set. This patch adds another check so that if we have an exclusive lock but MPI_MODE_NOCHECK is set we still take the atomic lock, which penalizes MPI_MODE_NOCHECK with exclusive locks, unnecessarily. Sorry for being picky here but a) I'd like to understand what this patch really does (it took 2 round trips to get to the bottom of the problem) and b) I wonder if the problem really is in this line or somewhere else. Looking at the code for taking the shared/exclusive lock, I think the problem really is there:
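For orientation, here is a rough, hypothetical reconstruction of that shared/exclusive lock path, not the actual osc/ucx source: the types and acquire_remote_lock are made-up, simplified names. The point it illustrates is that under MPI_MODE_NOCHECK no remote lock is acquired and only local state is recorded.

#include <mpi.h>
#include <stdbool.h>

/* Hypothetical, simplified types -- not the actual osc/ucx declarations. */
typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } lock_type_t;
typedef struct { lock_type_t type; bool is_nocheck; } target_lock_t;
typedef struct { target_lock_t *locks; } win_state_t;

int acquire_remote_lock(win_state_t *win, int target, lock_type_t type); /* hypothetical */

int osc_ucx_lock_sketch(int lock_type, int target, int mpi_assert, win_state_t *win)
{
    target_lock_t *lock = &win->locks[target];
    lock->type = (MPI_LOCK_EXCLUSIVE == lock_type) ? LOCK_EXCLUSIVE : LOCK_SHARED;

    if (0 == (mpi_assert & MPI_MODE_NOCHECK)) {
        /* Actually take the remote shared/exclusive lock. */
        return acquire_remote_lock(win, target, lock->type);
    }
    /* MPI_MODE_NOCHECK: the user asserts no conflicting lock exists, so no
     * remote lock is taken; only the local state is flagged. */
    lock->is_nocheck = true;
    return 0;
}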
@devreal Thanks for the thorough explanation. I think I agree with you: omitting the check but setting the appropriate lock alleviates this issue in the first place.
@devreal My only concern was the combination of MPI_MODE_NOCHECK + exclusive lock. With our initial commit, we still guarantee the atomicity of the operations. Otherwise, we could just set the SHARED type in ompi/ompi/mca/osc/ucx/osc_ucx_passive_target.c as you proposed. In this test (acc_loc.c, line 98), if I replace the SHARED lock with an exclusive one, we still fail with your proposed solution. For now, I will change the commit to support atomicity only for SHARED + MPI_MODE_NOCHECK. Later we can change it if we see an issue with an application. Thanks a lot, @devreal, for all your constructive comments! I appreciate it!
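For illustration, here is a minimal, self-contained sketch (not the acc_loc.c test itself) of the pattern under discussion: every rank takes a shared lock with MPI_MODE_NOCHECK and performs an atomic fetch-and-add on rank 0's window, which only validates if the accumulate is actually atomic.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, one = 1, result = 0, *win_buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* Shared lock + MPI_MODE_NOCHECK: the implementation must still keep
     * the fetch-and-add atomic (in software if there is no HW support). */
    MPI_Win_lock(MPI_LOCK_SHARED, 0, MPI_MODE_NOCHECK, win);
    MPI_Get_accumulate(&one, 1, MPI_INT, &result, 1, MPI_INT,
                       0, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(0, win);
    MPI_Barrier(MPI_COMM_WORLD);

    if (0 == rank) {
        int final;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        final = *win_buf;                   /* read own window memory */
        MPI_Win_unlock(0, win);
        if (final != size)
            printf("validation failed: got %d, expected %d\n", final, size);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}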
@MamziB As I said in #10493 (comment), there is no need for atomicity for the combination Exclusive Lock + MPI_MODE_NOCHECK.
Force-pushed from e8e0c7f to d620afc
@MamziB There is an additional commit now ("osc/ucx: fix osc ucx component priority selection") that seems unrelated. Was that intentional?
@@ -124,6 +124,11 @@ int ompi_osc_ucx_lock(int lock_type, int target, int mpi_assert, struct ompi_win
        }
    } else {
        lock->is_nocheck = true;
        if (lock_type == MPI_LOCK_EXCLUSIVE) {
This looks good but the same problem exists in ompi_osc_ucx_lock_all
@devreal Good catch. Will add it there too. Thanks.
@devreal I took a look at the ompi_osc_ucx_lock_all implementation. It looks like we are not creating any new lock object there, so lock will be NULL inside need_acc_lock(). We should be good here: if the lock is NULL, need_acc_lock() returns true. Please let me know your thoughts.
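To make that concrete, here is a self-contained sketch of the decision logic described in this thread; the type and field names are illustrative, not the actual osc/ucx declarations.

#include <stdbool.h>
#include <stddef.h>

typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } lock_type_t;
typedef struct { lock_type_t type; bool is_nocheck; } target_lock_t;

/* Should a software (CAS-based) accumulate lock be taken for this target? */
static bool need_acc_lock_sketch(const target_lock_t *lock)
{
    /* No per-target lock object (e.g. under MPI_Win_lock_all):
     * be conservative and take the accumulate lock. */
    if (NULL == lock) {
        return true;
    }
    /* An exclusive lock already guarantees exclusive access to the target,
     * with or without MPI_MODE_NOCHECK, so the extra lock is not needed. */
    if (LOCK_EXCLUSIVE == lock->type) {
        return false;
    }
    /* Shared locks need the accumulate lock regardless of MPI_MODE_NOCHECK. */
    return true;
}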
Ahh right, I quickly glanced over it and saw this line which looked similar:
module->lock_all_is_nocheck = true;
Force-pushed from d620afc to f7f928d
Thanks @MamziB, looks good to me now 👍
@devreal Thanks for all your input.
@MamziB, please open up a v5.0 PR.
There is #10494 but that has the original change in it. Do we need to port this back to 4.1?
@devreal Ah right, yes, we just need to update it. Yes, we should open up a v4.1 equivalent. Just FYI: at this stage, with all the recent PRs, v5.0.x is in much better shape than v4.1.x. We'll get to v4.1 once we're done with v5.0.x; some changes are v5.0/main specific due to the wpool design.
In the latest round of MPICH testing, I found a data validation issue when our nonblocking accumulate is enabled. The failing test is lockcontention2.c, line 255. The reason for the data validation issue is clear: the test uses an exclusive window lock, and therefore, inside the code, we avoid acquiring the acc lock (see need_acc_lock()). My question is: is this test code actually correct, i.e., is it legal to issue back-to-back accumulates to the same target location within the same epoch?
Mamzi
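Not the lockcontention2.c code itself, but a minimal sketch of the access pattern described above: two accumulates to the same target location within one exclusive-lock epoch, where the second update must observe the effect of the first.

#include <mpi.h>

/* Both calls target displacement 0 on the same rank.  With nonblocking
 * accumulates, the GET half of the second call may read the target value
 * before the PUT half of the first one has landed. */
static void back_to_back_accumulate(MPI_Win win, int target, int val)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Accumulate(&val, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Accumulate(&val, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(target, win);
    /* After the unlock, the target location must reflect both updates. */
}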
I'm not sure why we have a problem here. The call to flush the outstanding operations should take care of completion.
@devreal Yes, we do have a flush. However, the issue here is that the target rank and the buffer are the same in both back-to-back calls to accumulate. Therefore, validation fails because both read the same target window value instead of the updated one.
Oh, now I see what you mean... Yes, that code is correct. The two updates may happen in any order but it is legal to update the same target memory location multiple times with an accumulate. That is indeed a problem for the nonblocking updates. Tracking and deconflicting operations would be one way (if operations overlap we need to wait for the previous ones to complete). Or simply waiting for any previous operation in the same access epoch to complete (saves tracking). It would be nice to be able to have multiple accumulate operations active at the same time though. However, don't we have a similar problem in the blocking implementation, albeit with a smaller window of opportunity? There we are waiting for the GET to complete, but not the PUT. So, the GET in the next accumulate call may read stale data (if the network does not order these operations).
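A simplified sketch of the get-modify-put style accumulate being discussed; the helper names (get_from_target, wait_get_completion, and so on) are hypothetical stand-ins, not the actual osc/ucx functions. It shows where the blocking path waits for the GET but not for the PUT.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the real UCX operations. */
void get_from_target(int target, uint64_t disp, void *buf, size_t len);
void wait_get_completion(int target);
void put_to_target(int target, uint64_t disp, const void *buf, size_t len);
void reduce_into(void *tmp, const void *origin, size_t len);
void acc_lock_acquire(int target);
void acc_lock_release(int target);

/* Sketch of one blocking accumulate: the GET is waited for, but the final
 * PUT is only posted.  If the network does not order it before the GET of
 * the next accumulate to the same location, that GET can read stale data. */
void accumulate_blocking_sketch(const void *origin, void *tmp, size_t len,
                                int target, uint64_t disp)
{
    acc_lock_acquire(target);
    get_from_target(target, disp, tmp, len);
    wait_get_completion(target);            /* GET completion is awaited   */
    reduce_into(tmp, origin, len);          /* apply the reduction locally */
    put_to_target(target, disp, tmp, len);  /* PUT is not waited for here  */
    acc_lock_release(target);
}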
"don't we have a similar problem in the blocking implementation" I'm thinking of the adding following code to the beginning of the nonblocking accumulate to complete the previous ones. what do you think?
|
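The proposed snippet itself is not preserved above; as a stand-in, here is a hypothetical sketch of that idea (the module fields and the progress helper are made-up names, not the actual osc/ucx API): before posting a new nonblocking accumulate, drain everything already outstanding for that target.

/* Hypothetical sketch of "complete the previous ones first". */
typedef struct {
    int *outstanding_acc_ops;   /* per-target count of pending accumulates */
} acc_module_t;

int progress_one_op(acc_module_t *module, int target);   /* hypothetical */

static int complete_previous_accs(acc_module_t *module, int target)
{
    /* Finish all nonblocking accumulates already posted to this target
     * before the new one is allowed to start. */
    while (module->outstanding_acc_ops[target] > 0) {
        int ret = progress_one_op(module, target);
        if (0 != ret) {
            return ret;
        }
        module->outstanding_acc_ops[target]--;
    }
    return 0;
}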
Yes, that's one way to solve it. I would prefer a way that preserves more of the benefits of nonblocking accumulates. Your patch would diminish its usefulness in any case where there is more than one accumulate operation. Instead, tracking ranges of operations for each target (using the target displacement and extent of each pending operation) would let us wait only when operations actually conflict.
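A self-contained sketch of that range-tracking idea (illustrative names, not osc/ucx code): record the target region touched by each pending accumulate and only wait when a new operation actually overlaps one of them.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t disp;  /* target displacement of a pending accumulate */
    uint64_t len;   /* extent of the pending update in bytes */
} pending_range_t;

static bool ranges_overlap(const pending_range_t *a, uint64_t disp, uint64_t len)
{
    return disp < a->disp + a->len && a->disp < disp + len;
}

/* Returns true if a new accumulate to [disp, disp+len) conflicts with any
 * of the n pending operations and therefore has to wait for them. */
static bool must_wait(const pending_range_t *pending, int n,
                      uint64_t disp, uint64_t len)
{
    for (int i = 0; i < n; i++) {
        if (ranges_overlap(&pending[i], disp, len)) {
            return true;
        }
    }
    return false;
}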
Also, I have to correct an earlier statement I made:
That "in any order" part was not correct. By default, updates to the same variable have to happen in program order, and the accumulate_ordering info key can be used to relax that requirement.
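For reference, the MPI standard exposes this through the accumulate_ordering window info key (default "rar,raw,war,waw"); below is a minimal sketch of relaxing it at window creation, where win_size, disp_unit, baseptr, and win are illustrative variables.

#include <mpi.h>

/* Create a window that allows the library to reorder accumulate operations.
 * win_size and disp_unit are illustrative parameters. */
static int create_relaxed_window(MPI_Aint win_size, int disp_unit,
                                 void **baseptr, MPI_Win *win)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* Default ordering is "rar,raw,war,waw"; "none" relaxes it. */
    MPI_Info_set(info, "accumulate_ordering", "none");
    int ret = MPI_Win_allocate(win_size, disp_unit, info, MPI_COMM_WORLD,
                               baseptr, win);
    MPI_Info_free(&info);
    return ret;
}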
OSC UCX: Enable software atomicity when lock is set to MPI_MODE_NOCHECK
Signed-off-by: Mamzi Bayatpour mbayatpour@nvidia.com
Co-authored-by: Tomislav Janjusic tomislavj@nvidia.com