rcache: fix deadlock in multi-threaded environments #1171
:bot:label:bug |
The backport was non-trivial, but the commit passes the IBM patcher test. If @bosilca could also test with his reproducer, we will be good to go. |
Test FAILed. |
I can test this right now, I'll try to do it over the weekend. |
Force-pushed: efc7f02 to a2f02bd
Test FAILed. |
Test FAILed. |
Back port errors on vader. Fix coming. |
Force-pushed: a2f02bd to bc347a3
Test FAILed. |
Test FAILed. |
This commit fixes several bugs in the registration cache code:

- Fix a programming error in the grdma invalidation function that can cause an infinite loop if more than 100 registrations are associated with a munmapped region. This happens because the mca_rcache_base_vma_find_all function returns the same 100 registrations on each call. This has been fixed by adding an iterate function to the vma tree interface.
- Always obtain the vma lock when needed. This is required because there may be other threads in the system even if opal_using_threads() is false. Additionally, since it is safe to do so (the vma lock is recursive) the vma interface has been made thread safe.
- Avoid calling free() while holding a lock. This avoids race conditions with locks held outside the Open MPI code.

Back-port of open-mpi/ompi@ab8ed17

Fixes open-mpi/ompi#1654.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
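To illustrate the first fix in the commit message, here is a minimal sketch of why a bounded "find all" lookup can spin forever and how a callback-based iterator avoids that. It is not the Open MPI code: `reg_t`, `find_all`, `iterate`, `invalidate_cb`, and `invalidate_all` are hypothetical names, and a flat array stands in for the vma tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a registration; not Open MPI's real structure. */
typedef struct reg {
    int id;
    int invalid;   /* set once the registration has been invalidated */
} reg_t;

#define FIND_ALL_MAX 100

/* Mimics a find_all-style lookup: copies at most `max` matches into `out`
 * and always starts from the beginning. Because invalidated entries stay
 * in the table, a caller looping "while the result is full" would receive
 * the same first `max` entries on every call and never make progress. */
static size_t find_all(reg_t *table, size_t n, reg_t **out, size_t max)
{
    size_t found = 0;
    for (size_t i = 0; i < n && found < max; ++i) {
        out[found++] = &table[i];
    }
    return found;
}

/* Callback type for the iterator; a non-zero return stops the walk. */
typedef int (*iter_fn)(reg_t *reg, void *ctx);

/* Iterator-style interface: visits every entry exactly once, so the
 * number of matches is not limited by a fixed-size result buffer. */
static int iterate(reg_t *table, size_t n, iter_fn fn, void *ctx)
{
    assert(fn != NULL);
    for (size_t i = 0; i < n; ++i) {
        int rc = fn(&table[i], ctx);
        if (rc != 0) {
            return rc;
        }
    }
    return 0;
}

/* Marks one registration invalid and counts it via `ctx`. */
static int invalidate_cb(reg_t *reg, void *ctx)
{
    size_t *count = ctx;
    reg->invalid = 1;
    ++*count;
    return 0;
}

/* Invalidates all entries in a single pass and returns how many it saw. */
static size_t invalidate_all(reg_t *table, size_t n)
{
    size_t count = 0;
    (void) iterate(table, n, invalidate_cb, &count);
    return count;
}
```

With 250 entries in the table, two consecutive `find_all` calls hand back the same 100 pointers, whereas `invalidate_all` reaches all 250 in one walk. The real fix additionally has to deal with locking, which this sketch leaves out.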
Force-pushed: bc347a3 to 78f4315
Test FAILed. |
False mlnx failure. |
@jladd-mlnx Do you know why mlnx jenkins fails so often with an error that it can't find ucx? |
@bosilca Did you have a chance to review / test? |
@Di0gen could you check the ucx install? |
Yes, I ran a few dozen tests and none of them deadlocked. I noticed some performance issues, but I can't guarantee they are related to our memory allocator. I will start another test campaign later tonight, so I will have some results for tomorrow's call. |
@bosilca Thanks! |
Test FAILed. |
bot:retest |
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. |
XL failure is due to a problem with an old compiler version on the machine. Disregard that failure. I'm working with the sysadmins on a fix, then I'll turn those CI tests back on. |
Per discussion on the 2016-05-24 teleconf: we'll take this. Just waiting for CI. |
Test PASSed. |
The IBM XLC failure has been explained, and @hppritcha tells me that NERSC is offline (which is why the LANL-Cray-XC test is stalled). We'll go ahead and merge. |
Test FAILed. |