OSC UCX: Enable software atomicity when lock is set to MPI_MODE_NOCHECK #10493

Merged: 1 commit into open-mpi:main on Jun 23, 2022

Conversation

@MamziB (Contributor) commented Jun 21, 2022

OSC UCX: Enable software atomicity when lock is set to MPI_MODE_NOCHECK

Signed-off-by: Mamzi Bayatpour mbayatpour@nvidia.com
Co-authored-by: Tomislav Janjusic tomislavj@nvidia.com

@devreal (Contributor) commented Jun 21, 2022

@MamziB The title (and commit message) seem confusing: they make it sound like setting MPI_MODE_NOCHECK leads to atomics being emulated in software. AFAICS, your patch adds support for MPI_MODE_NOCHECK and checks whether taking the atomics lock is actually necessary when atomic operations are done in software. Could you please clarify your commit message?

@MamziB (Contributor, Author) commented Jun 21, 2022

In osc/ucx, locking windows with MPI_MODE_NOCHECK disables software-based atomicity in atomic calls such as MPI_Accumulate/MPI_Get_accumulate/etc. In other words, these calls are no longer atomic when MPI_MODE_NOCHECK is set (if HW support is not available).
With this patch, even if the window lock is acquired with MPI_MODE_NOCHECK, we guarantee SW atomicity for these kinds of atomic calls.

FWIW, we saw this issue in a simple MPICH test which does:
MPI_Win_lock(..., MPI_MODE_NOCHECK, ...);
MPI_Accumulate(...);
MPI_Win_unlock(...);
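
A minimal self-contained sketch of that usage pattern (illustrative only, not the actual MPICH test): every rank accumulates into rank 0 under a shared lock taken with MPI_MODE_NOCHECK, so the MPI_Accumulate calls still have to be atomic with respect to one another.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, buf = 0, one = 1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every rank exposes one int; rank 0 is the accumulation target. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    /* Shared lock with MPI_MODE_NOCHECK: no conflicting exclusive lock is
     * ever requested, but the concurrent MPI_Accumulate calls must still
     * be applied atomically. */
    MPI_Win_lock(MPI_LOCK_SHARED, 0, MPI_MODE_NOCHECK, win);
    MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        printf("expected %d, got %d\n", nprocs, buf);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}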

@devreal (Contributor) commented Jun 22, 2022

OK, now I get what the intention of this patch is. The commit message makes it sound like software-based atomicity is always enabled if MPI_MODE_NOCHECK is set, even if HW support is usable.

Looking at the standard, though, I wonder whether that is necessary:

MPI_MODE_NOCHECK — no other process holds, or will attempt to acquire, a conflicting lock, while the caller holds the window lock. This is useful when mutual exclusion is achieved by other means, but the coherence operations that may be attached to the lock and unlock calls are still required.

Clearly, there should be no need for the atomic lock if MPI_MODE_NOCHECK is set, because there cannot be a conflicting lock on another peer (and thus no conflicting atomic operations). Can you point me to the test in MPICH?

@janjust (Contributor) commented Jun 22, 2022

@devreal acc_loc.c

Even though we don't lock the window, we still need to ensure atomicity of the operation.
Ensuring atomicity doesn't lock the window, though it does serialize operations.

@devreal (Contributor) commented Jun 22, 2022

Thanks @janjust! The example uses shared locks, which should be covered by the existing test in need_acc_lock. The test as it is today is:

!(NULL != lock && LOCK_EXCLUSIVE == lock->type);

So unless we have an exclusive lock we need to take the atomic lock. For shared locks it is true that we need to take the atomic lock, independent of whether MPI_MODE_NOCHECK is set.

This patch adds another check so that we still take the atomic lock if we have an exclusive lock but MPI_MODE_NOCHECK is set. The standard says that if you have an exclusive lock and use MPI_MODE_NOCHECK, then there cannot be conflicting atomic operations because no other process is allowed to take a conflicting lock, so there is no need to take the atomic lock.

Sorry for being picky here, but a) I'd like to understand what this patch really does (it took 2 round trips to get to the bottom of the problem), and b) I wonder whether the problem really is in this line or somewhere else. Looking at the code for taking the shared/exclusive lock, I think the problem really is there:

lock->is_nocheck = true;

Note that the type of the lock is not set there. If it were, we wouldn't need the patch proposed here (for the reasons outlined above). This patch would unnecessarily void the benefit of MPI_MODE_NOCHECK with exclusive locks.
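
To spell out the argument, here is the check as a standalone predicate (a sketch with simplified stand-in types, not the actual osc/ucx structures): the software accumulate lock can only be skipped when the caller holds an exclusive window lock, because no other process may then hold a conflicting lock, with or without MPI_MODE_NOCHECK.

#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the osc/ucx lock bookkeeping quoted in this
 * thread; the names are illustrative, not taken from the source tree. */
typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } lock_type_t;

typedef struct {
    lock_type_t type;        /* set from MPI_LOCK_SHARED / MPI_LOCK_EXCLUSIVE */
    bool        is_nocheck;  /* MPI_MODE_NOCHECK was passed to MPI_Win_lock  */
} win_lock_t;

/* Same shape as the need_acc_lock test quoted above: take the software
 * accumulate lock unless an exclusive window lock is held.  If the
 * MPI_MODE_NOCHECK code path also recorded the lock type, this test would
 * already do the right thing without an extra is_nocheck check. */
static bool need_acc_lock(const win_lock_t *lock)
{
    return !(NULL != lock && LOCK_EXCLUSIVE == lock->type);
}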

@janjust (Contributor) commented Jun 22, 2022

@devreal Thanks for the thorough explanation. I think I agree with you: omitting the extra check and instead setting the appropriate lock type avoids this issue in the first place.

@MamziB (Contributor, Author) commented Jun 22, 2022

@devreal My only concern was the combination of MPI_MODE_NOCHECK + exclusive lock. With our initial commit, we still guarantee the atomicity of the operations. Otherwise, we could just set the SHARED type in ompi/mca/osc/ucx/osc_ucx_passive_target.c as you proposed.

In this test (acc_loc.c, line 98), if I replace SHARED with EXCLUSIVE, we still fail with your proposed solution.
However, I did not find anything in the standard saying explicitly that using MPI_MODE_NOCHECK + an exclusive lock discards the atomicity of an operation such as MPI_Accumulate. Please let me know where to read this.

For now, I will change the commit to support atomicity only for SHARED + MPI_MODE_NOCHECK. Later we can change it if we see an issue with an application.

Thanks a lot, @devreal for all your constructive comments! I appreciate it!

Thanks,
Mamzi

@devreal (Contributor) commented Jun 22, 2022

@MamziB As I said in #10493 (comment), there is no need for atomicity for the combination exclusive lock + MPI_MODE_NOCHECK. If you replace the shared lock with an exclusive lock in that test case, the code is an incorrect MPI application.

@MamziB MamziB force-pushed the mamzi/atomic-acc-nocheck branch from e8e0c7f to d620afc Compare June 22, 2022 20:18
@devreal (Contributor) commented Jun 22, 2022

@MamziB There is an additional commit now ("osc/ucx: fix osc ucx component priority selection") that seems unrelated. Was that intentional?

@@ -124,6 +124,11 @@ int ompi_osc_ucx_lock(int lock_type, int target, int mpi_assert, struct ompi_win
}
} else {
lock->is_nocheck = true;
if (lock_type == MPI_LOCK_EXCLUSIVE) {
@devreal (Contributor) commented on the diff:
This looks good but the same problem exists in ompi_osc_ucx_lock_all

@MamziB (Contributor, Author) replied:

@devreal Good catch. Will add it there too. Thanks.

@MamziB (Contributor, Author) replied:

@devreal I took a look at the ompi_osc_ucx_lock_all implementation. It looks like we are not creating any new lock object there, so lock will be NULL inside need_acc_lock(). We should be good here: if the lock is NULL, need_acc_lock returns true. Please let me know your thoughts.

@devreal (Contributor) replied:

Ahh right, I quickly glanced over it and saw this line which looked similar:

module->lock_all_is_nocheck = true;
but that's the module, not the lock.

@MamziB (Contributor, Author) commented Jun 22, 2022

@devreal No, I am not sure how that got in. I created a new commit, then squashed it into the previous commit of the PR. I'm trying to figure out how to fix it.

EDIT: The commit is now fixed. Waiting for @devreal's approval.

Signed-off-by: Mamzi Bayatpour  <mbayatpour@nvidia.com>
Co-authored-by: Tomislav Janjusic <tomislavj@nvidia.com>
@MamziB MamziB force-pushed the mamzi/atomic-acc-nocheck branch from d620afc to f7f928d Compare June 22, 2022 20:59
@devreal (Contributor) left a review comment:

Thanks @MamziB, looks good to me now 👍


@janjust (Contributor) commented Jun 23, 2022

@devreal thanks for all your input.

@devreal devreal merged commit eba4e33 into open-mpi:main Jun 23, 2022
@janjust (Contributor) commented Jun 23, 2022

@MamziB, please open up the v5.0 PR.

@devreal (Contributor) commented Jun 23, 2022

There is #10494 but that has the original change in it. Do we need to port this back to 4.1?

@janjust (Contributor) commented Jun 23, 2022

@devreal, ah right, yes, we just need to update it. Yes, we should open up a v4.1 equivalent. Just FYI: at this stage, with all the recent PRs, v5.0.x is in much better shape than v4.1.x. We'll get to v4.1 once we're done with v5.0.x; some changes are v5.0/main specific due to the wpool design.

@MamziB (Contributor, Author) commented Dec 2, 2022

@devreal @janjust

In the latest round of MPICH testing, I found a data validation issue when our nonblocking accumulate is enabled. The failing test is lockcontention2.c, line 255.

The reason for the data validation issue is clear: the test uses an exclusive window lock, so inside the code we avoid acquiring the acc lock (see the ompi_osc_need_acc_lock function). Since we have enabled the nonblocking design of MPI_Accumulate, this leads to the data validation issue.

My question is:

  • Do you think this test's behavior is correct? It seems the test assumes that MPI_Accumulate is a blocking operation. If so, I think we might have to take the acc lock even if the window lock is exclusive.

Mamzi

@devreal (Contributor) commented Dec 2, 2022

I'm not sure why we have a problem here. The call to MPI_WIN_LOCK should include a flush (must complete outstanding operations). Is that not the case with the nonblocking accumulates?

@MamziB (Contributor, Author) commented Dec 2, 2022

@devreal Yes, we do have a flush. However, here the issue is that the target rank and the target buffer are the same in both back-to-back calls to accumulate. Therefore, validation fails because both read the same target window value instead of the updated one.
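
A minimal sketch of the problematic pattern (modelled on the description of lockcontention2.c above, not copied from it; a window win and a peer rank target are assumed to be set up already):

int one = 1;

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);

/* Two accumulates to the same target location in one epoch.  With
 * nonblocking accumulates the second get-modify-put may read the target
 * before the first update has landed, unless the implementation orders
 * or serializes them. */
MPI_Accumulate(&one, 1, MPI_INT, target, 0 /* disp */, 1, MPI_INT, MPI_SUM, win);
MPI_Accumulate(&one, 1, MPI_INT, target, 0 /* disp */, 1, MPI_INT, MPI_SUM, win);

MPI_Win_unlock(target, win);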

@devreal (Contributor) commented Dec 2, 2022

Oh, now I see what you mean... Yes, that code is correct. The two updates may happen in any order, but it is legal to update the same target memory location multiple times with accumulates. That is indeed a problem for the nonblocking updates. Tracking and deconflicting operations would be one way to handle it (if operations overlap, we need to wait for the previous ones to complete). Or we could simply wait for any previous operation in the same access epoch to complete (which saves the tracking). It would be nice to be able to have multiple accumulate operations in flight at the same time, though.

However, don't we have a similar problem in the blocking implementation, albeit with a smaller window of opportunity? There we wait for the GET to complete, but not the PUT. So the GET in the next accumulate call may read stale data (if the network does not order these operations).

@MamziB (Contributor, Author) commented Dec 2, 2022

@devreal

"don't we have a similar problem in the blocking implementation"
No, in one of the previous commits, I added a flush at the end of the accumulate to make sure we avoid such scenarios.

I'm thinking of adding the following code to the beginning of the nonblocking accumulate to complete the previous ones. What do you think?

if (ompi_osc_need_acc_lock(module, target) == false) {
    ret = opal_common_ucx_ctx_flush(module->ctx, OPAL_COMMON_UCX_SCOPE_EP, target);
    if (ret != OMPI_SUCCESS) {
        return ret;
    }
}

@devreal (Contributor) commented Dec 4, 2022

Yes, that's one way to solve it. I would prefer a way that preserves more of the benefits of nonblocking accumulates; your patch would diminish their usefulness in any case where there is more than one accumulate operation. Instead, tracking the ranges of operations for each target (using the mca_rcache_base_vma infrastructure) and only flushing when a conflict occurs would provide better performance in many applications (where accumulates are spread over several variables). It's a bit more effort, but I think it would be worth it.
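
As an illustration only (plain C with hypothetical names, not the mca_rcache_base_vma API), the conflict tracking could look roughly like this: remember the target ranges of pending accumulates and flush only when a new accumulate overlaps one of them.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for pending nonblocking accumulates on one
 * target; not the actual osc/ucx data structures. */
typedef struct pending_range {
    uint64_t              disp;   /* target displacement of a pending accumulate */
    size_t                len;    /* extent of that accumulate in bytes          */
    struct pending_range *next;
} pending_range_t;

/* True if [disp, disp+len) overlaps any pending range, i.e. the new
 * accumulate conflicts and we must flush before issuing it. */
static bool conflicts_with_pending(const pending_range_t *list,
                                   uint64_t disp, size_t len)
{
    for (; list != NULL; list = list->next) {
        if (disp < list->disp + list->len && list->disp < disp + len) {
            return true;
        }
    }
    return false;
}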

@devreal (Contributor) commented Dec 4, 2022

Also, I have to correct an earlier statement I made:

Yes, that code is correct. The two updates may happen in any order but it is legal to update the same target memory location multiple times with an accumulate.

That "in any order" part was not correct. By default, updates to the same variable have to happen in program order, and the "accumulate_ordering" info key can be used to lift that constraint.
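
For reference, a hedged example of relaxing that ordering at window creation with the standard "accumulate_ordering" info key (base and size stand for an existing buffer and its byte count; the default value is "rar,raw,war,waw", i.e. fully ordered per target location):

MPI_Info info;
MPI_Win  win;

MPI_Info_create(&info);
/* Tell the implementation it need not order accumulates to the same
 * target location; only do this if the application tolerates it. */
MPI_Info_set(info, "accumulate_ordering", "none");

MPI_Win_create(base, size, 1 /* disp_unit in bytes */, info, MPI_COMM_WORLD, &win);
MPI_Info_free(&info);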

@MamziB (Contributor, Author) commented Dec 7, 2022

@devreal Thanks for the comments. #11178 fixes this issue.
