This repository was archived by the owner on Sep 30, 2022. It is now read-only.

rcache: fix deadlock in multi-threaded environments #1171

Merged
jsquyres merged 1 commit into open-mpi:v2.x from v2.x_patcher_threads on May 24, 2016

Conversation

@hjelmn
Member

hjelmn commented May 19, 2016

This commit fixes several bugs in the registration cache code:

  • Fix a programming error in the grdma invalidation function that can
    cause an infinite loop if more than 100 registrations are
    associated with a munmapped region. This happens because the
    mca_rcache_base_vma_find_all function returns the same 100
    registrations on each call. This has been fixed by adding an
    iterate function to the vma tree interface (see the first sketch
    after this list).
  • Always obtain the vma lock when needed. This is required because
    there may be other threads in the system even if
    opal_using_threads() is false. Additionally, since it is safe to do
    so (the vma lock is recursive), the vma interface has been made
    thread safe.
  • Avoid calling free() while holding a lock. This avoids race
    conditions with locks held outside the Open MPI code (see the
    locking sketch below).
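
To make the first bullet concrete, below is a minimal, self-contained sketch of the failure mode and of the iterate-style fix. The toy cache, the batch size of 3 (standing in for the 100-registration limit), and every name in it are illustrative only, not the actual rcache code:

```c
/* Sketch: a bounded find_all() can make a drain loop spin forever,
 * while a callback-based iterate() visits each entry exactly once.
 * All names here are illustrative. */
#include <stdio.h>

#define MAX_BATCH 3   /* stands in for the 100-registration limit */
#define NUM_REGS  5

typedef struct { int id; int busy; } reg_t;

static reg_t cache[NUM_REGS] = { {0, 1}, {1, 1}, {2, 1}, {3, 0}, {4, 0} };

/* Like mca_rcache_base_vma_find_all: always returns the *first* matches,
 * up to the caller's buffer size. */
static int find_all(reg_t **out, int max)
{
    int n = 0;
    for (int i = 0; i < NUM_REGS && n < max; ++i) {
        out[n++] = &cache[i];
    }
    return n;
}

/* The iterate-style interface: visit every entry exactly once. */
static void iterate(int (*cb)(reg_t *, void *), void *ctx)
{
    for (int i = 0; i < NUM_REGS; ++i) {
        if (0 != cb(&cache[i], ctx)) {
            break;
        }
    }
}

static int invalidate_cb(reg_t *reg, void *ctx)
{
    (void) ctx;
    printf("invalidating registration %d%s\n", reg->id,
           reg->busy ? " (busy: removal deferred)" : "");
    return 0;
}

int main(void)
{
    reg_t *batch[MAX_BATCH];
    int n, passes = 0;

    /* Buggy pattern: busy registrations stay in the cache, so find_all()
     * keeps returning the same full batch and the loop never drains.
     * (A pass counter is added here so the demo terminates.) */
    do {
        n = find_all(batch, MAX_BATCH);
    } while (n == MAX_BATCH && ++passes < 10);
    printf("find_all loop gave up after %d passes\n", passes);

    /* Fixed pattern: a single walk over every registration. */
    iterate(invalidate_cb, NULL);
    return 0;
}
```

The iterate interface fixes the bug structurally: termination no longer depends on every returned registration actually being removable.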

Back-port of open-mpi/ompi@ab8ed17

Fixes open-mpi/ompi#1654.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
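
For the second and third bullets, here is a minimal sketch, assuming POSIX threads, of the locking discipline described above: the lock is recursive, it is taken unconditionally, and free() is deferred until the lock has been dropped. It is an illustration of the pattern, not Open MPI's actual implementation:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct item {
    struct item *next;
} item_t;

static pthread_mutex_t vma_lock;
static item_t *vma_list;   /* registrations tracked by this toy cache */

static void vma_lock_init(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* Recursive, so helpers called with the lock already held
     * may safely lock again. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&vma_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

static void vma_release_all(void)
{
    item_t *defer = NULL;

    /* Always lock, even in a "single-threaded" build: other threads can
     * still reach this code, e.g. through the memory hooks. */
    pthread_mutex_lock(&vma_lock);
    while (NULL != vma_list) {
        item_t *it = vma_list;
        vma_list = it->next;
        it->next = defer;   /* unlink under the lock ... */
        defer = it;
    }
    pthread_mutex_unlock(&vma_lock);

    /* ... but call free() only after the lock is dropped, so we cannot
     * deadlock against locks held inside the allocator itself. */
    while (NULL != defer) {
        item_t *it = defer;
        defer = defer->next;
        free(it);
    }
}

int main(void)
{
    vma_lock_init();
    vma_release_all();   /* empty list: exercises only the lock path */
    return 0;
}
```

Unlinking under the lock and freeing outside it is the usual way to satisfy both rules at once.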

@hjelmn
Member Author

hjelmn commented May 19, 2016

:bot:label:bug
:bot:label:blocker
:bot:assign: @bosilca
:bot:milestone:v2.0.0

@hjelmn
Member Author

hjelmn commented May 19, 2016

The backport was non-trivial, but the commit is passing the IBM patcher test. If @bosilca could also test with his reproducer, we will be good to go.

@lanl-ompi
Contributor

Test FAILed.

@bosilca
Member

bosilca commented May 19, 2016

I can't test this right now; I'll try to do it over the weekend.

@hjelmn hjelmn force-pushed the v2.x_patcher_threads branch from efc7f02 to a2f02bd on May 19, 2016 16:20
@lanl-ompi
Contributor

Test FAILed.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1677/ for details.

@hjelmn
Member Author

hjelmn commented May 19, 2016

Backport errors on vader. Fix coming.

@hjelmn hjelmn force-pushed the v2.x_patcher_threads branch from a2f02bd to bc347a3 on May 19, 2016 16:53
@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1678/ for details.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1679/ for details.

@hjelmn hjelmn force-pushed the v2.x_patcher_threads branch from bc347a3 to 78f4315 on May 19, 2016 17:39
@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1680/ for details.

@hjelmn
Member Author

hjelmn commented May 19, 2016

False mlnx failure.

@hjelmn
Member Author

hjelmn commented May 19, 2016

@jladd-mlnx Do you know why mlnx jenkins fails so often with an error that it can't find ucx?

@jsquyres
Member

@bosilca Did you have a chance to review / test?

@hppritcha
Member

@Di0gen could you check the ucx install?
bot:retest

@bosilca
Member

bosilca commented May 23, 2016

Yes, I ran a few tens of tests and none of them deadlocked. I noticed some performance issues, but I can't guarantee they are related to our memory allocator. I will start another test campaign later tonight, so I will have some results for tomorrow's call.

@jsquyres
Member

@bosilca Thanks!

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1694/ for details.

@hppritcha
Member

bot:retest

@ibm-ompi

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

@jjhursey
Member

XL failure is due to a problem with an old compiler version on the machine. Disregard that failure. I'm working with the sysadmins on a fix, then I'll turn those CI tests back on.

@jsquyres
Member

jsquyres commented May 24, 2016

Per discussion on the 2016-05-24 teleconf: we'll take this. Just waiting for CI.

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1700/ for details.

@jsquyres
Member

The IBM XLC failure has been explained, and @hppritcha tells me that NERSC is offline (which is why the LANL-Cray-XC test is stalled). We'll go ahead and merge.

@jsquyres jsquyres merged commit 327c394 into open-mpi:v2.x May 24, 2016
@lanl-ompi
Contributor

Test FAILed.
