Skip to content

rcache: fix deadlock in multi-threaded environments #1673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 18, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented May 17, 2016

This commit fixes several bugs in the registration cache code:

  • Fix a programming error in the grdma invalidation function that can
    cause an infinite loop if more than 100 registrations are
    associated with a munmapped region. This happens because the
    mca_rcache_base_vma_find_all function returns the same 100
    registrations on each call. This has been fixed by adding an
    iterate function to the vma tree interface.
  • Always obtain the vma lock when needed. This is required because
    there may be other threads in the system even if
    opal_using_threads() is false. Additionally, since it is safe to do
    so (the vma lock is recursive) the vma interface has been made
    thread safe.
  • Avoid calling free() while holding a lock. This avoids race
    conditions with locks held outside the Open MPI code.

Fixes #1654.

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

This commit fixes several bugs in the registration cache code:

 - Fix a programming error in the grdma invalidation function that can
   cause an infinite loop if more than 100 registrations are
   associated with a munmapped region. This happens because the
   mca_rcache_base_vma_find_all function returns the same 100
   registrations on each call. This has been fixed by adding an
   iterate function to the vma tree interface.

 - Always obtain the vma lock when needed. This is required because
   there may be other threads in the system even if
   opal_using_threads() is false. Additionally, since it is safe to do
   so (the vma lock is recursive) the vma interface has been made
   thread safe.

 - Avoid calling free() while holding a lock. This avoids race
   conditions with locks held outside the Open MPI code.

Fixes open-mpi#1654.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Copy link
Member Author

hjelmn commented May 17, 2016

@bosilca Please test.

@jsquyres
Copy link
Member

@gpaulsen Please also test. Thanks.

@bosilca
Copy link
Member

bosilca commented May 18, 2016

@hjelmn so far so good. Not a single deadlock over few tens of runs with large number of threads. I will switch everything to an optimized mode and start an overnight series of tests. I'll update the status tomorrow.

@bosilca
Copy link
Member

bosilca commented May 18, 2016

👍

@jsquyres jsquyres added the bug label May 18, 2016
@jsquyres jsquyres added this to the v2.0.0 milestone May 18, 2016
@hjelmn hjelmn merged commit 9371a6a into open-mpi:master May 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Patcher issues reported on mailing list
3 participants