
Fix: In Reactor Add A Mutex To Protect HandlerList Internal Array #1229

Merged
merged 9 commits into from
Sep 6, 2024

Conversation

dgreatwood
Contributor

Prompted by @tyler92's suggested PR#1228 "fix: data race on client/server shutdown":

Looking at HandlerList in reactor.cc, it does not attempt any multithread protection for its underlying std::array. In this experimental branch, we have added a mutex to HandlerList and required that the HandlerList members that access the array acquire that mutex.

Other related changes:

Removed HandlerList default move constructors; mutex has a deleted move constructor.

Removed HandlerList::clone(), an unused method of HandlerList that required a move constructor.

Added a non-const form of "at" in HandlerList; further explanation in the comments.

Guarded HandlerList::add, removeAll, at, and forEachHandler with the mutex.
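The changes described above can be sketched in miniature. This is an illustrative class under simplified, assumed names (the real HandlerList in reactor.cc has more members and different bookkeeping), not the actual Pistache code:

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <mutex>

struct Handler { };

class HandlerList
{
public:
    // std::mutex has deleted copy and move operations, so the
    // previously-defaulted move constructors must go as well.
    HandlerList() = default;
    HandlerList(const HandlerList&) = delete;
    HandlerList& operator=(const HandlerList&) = delete;

    void add(const std::shared_ptr<Handler>& h)
    {
        std::lock_guard<std::mutex> guard(handlers_arr_mutex_);
        handlers_[size_++] = h; // bounds checking omitted for brevity
    }

    // Non-const form of "at": it locks and unlocks the mutex, so it
    // cannot be const unless the mutex is made mutable.
    std::shared_ptr<Handler> at(size_t index)
    {
        std::lock_guard<std::mutex> guard(handlers_arr_mutex_);
        return handlers_.at(index);
    }

    void removeAll()
    {
        std::lock_guard<std::mutex> guard(handlers_arr_mutex_);
        for (auto& h : handlers_)
            h.reset();
        size_ = 0;
    }

    template <typename Func>
    void forEachHandler(Func func)
    {
        std::lock_guard<std::mutex> guard(handlers_arr_mutex_);
        for (size_t i = 0; i < size_; ++i)
            func(handlers_[i]);
    }

private:
    static constexpr size_t MaxHandlers = 255;
    std::array<std::shared_ptr<Handler>, MaxHandlers> handlers_;
    size_t size_ = 0;
    std::mutex handlers_arr_mutex_;
};
```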

-            std::shared_ptr<Handler> at(size_t index) const
-            {
+            std::shared_ptr<Handler> at(size_t index)
+            { // It's const, except that the mutex is locked and unlocked
Member

@kiplingw commented Aug 20, 2024

@dgreatwood, if you want to guarantee to the caller that the method is logically const, you could try leaving it with a const qualifier, but setting the mutex as mutable.
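The mutable-mutex suggestion can be sketched as follows; the class and member names here are illustrative, not the actual Pistache declarations:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

class Guarded
{
public:
    void add(int v)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(v);
    }

    // Stays const: locking a mutable mutex does not change the
    // logically observable state of the object.
    int at(std::size_t index) const
    {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.at(index);
    }

private:
    std::vector<int> items_;
    mutable std::mutex mutex_; // mutable => lockable from const methods
};
```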

@@ -412,15 +412,18 @@ namespace Pistache::Aio
             }

             template <typename Func>
-            void forEachHandler(Func func) const
-            {
+            void forEachHandler(Func func)
Member

Same thing here. Up to you, but you could optionally keep the const qualifier.

@kiplingw
Member

Thanks @dgreatwood. I've added a minor thought. Otherwise it looks fine to me, notwithstanding any feedback @Tachi107 might have.

@dgreatwood
Contributor Author

dgreatwood commented Aug 20, 2024 via email

@tyler92
Contributor

tyler92 commented Aug 21, 2024

@dgreatwood Thank you, I will check your branch and let you know

// is const in that the class instance is in the same state
// after the call vs. before (mutex is unlocked again before
// non-const "at" exits) - so this is OK.
return(((HandlerList *)this)->at(index));
Member

Please avoid using C-style casts. In this case I think const_cast would be more appropriate. Also, you could consider @kiplingw's suggestion; making a mutex mutable is a common pattern.
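The const_cast pattern suggested above can be sketched as follows; this is an illustrative class rather than the Pistache code, showing a const overload that delegates to the locking non-const overload:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

class List
{
public:
    int at(std::size_t index)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.at(index);
    }

    // const_cast (rather than a C-style cast) makes the intent
    // explicit: we are only casting away constness, nothing else.
    // With a mutable mutex, this second overload becomes unnecessary.
    int at(std::size_t index) const
    {
        return const_cast<List*>(this)->at(index);
    }

private:
    std::vector<int> items_{ 1, 2, 3 };
    std::mutex mutex_;
};
```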

Member

Good eye @Tachi107. I missed that.

@tyler92
Contributor

tyler92 commented Aug 21, 2024

@dgreatwood ASAN is happy with your branch. Checking it with TSAN would be more thorough, but I think the fix works for the original issue.

@tyler92
Contributor

tyler92 commented Aug 21, 2024

By the way, perhaps someday we can try to enable TSAN here

https://github.com/pistacheio/pistache/blob/master/.github/workflows/linux.yaml#L30

to verify that everything is OK.
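As a rough sketch of what that might look like, assuming the project's Meson build: Meson exposes a built-in sanitizer option, so a TSAN build could be configured along these lines (directory names here are illustrative):

```shell
# Configure a separate build directory with ThreadSanitizer enabled.
meson setup build-tsan -Db_sanitize=thread

# Build and run the test suite under TSAN.
meson compile -C build-tsan
meson test -C build-tsan
```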

@tyler92
Contributor

tyler92 commented Aug 21, 2024

There is probably a deadlock between two mutexes, handlers_arr_mutex_ and reg_unreg_mutex_. I don't have any evidence yet; I will follow up later.

@dgreatwood
Contributor Author

dgreatwood commented Aug 21, 2024 via email

Added debug logging for poller.reg_unreg_mutex_ using the
GUARD_AND_DBG_LOG macro.

Made "std::mutex handlers_arr_mutex_" mutable for easier const
function handling.
@dgreatwood
Contributor Author

dgreatwood commented Aug 22, 2024 via email

@tyler92
Contributor

tyler92 commented Aug 22, 2024

Hi @dgreatwood, thank you for the information. I have a TSAN report about the deadlock, which I think is quite clear:

WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=2069737)
  Cycle in lock order graph: M0 (0x7b8400002438) => M1 (0x7b8400002408) => M0

  Mutex M1 acquired here while holding mutex M0 in thread T4:
    #0 pthread_mutex_lock <null> (fuzz_server+0x199b8b) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12/bits/gthr-default.h:749:12 (fuzz_server+0x320130) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_mutex.h:100:17 (fuzz_server+0x31ff78) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_mutex.h:229:19 (fuzz_server+0x431d3c) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #4 Pistache::Aio::SyncImpl::HandlerList::at(unsigned long) /pistache/src/common/reactor.cc:364:17 (fuzz_server+0x6255ba) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #5 Pistache::Aio::SyncImpl::HandlerList::at(unsigned long) const /pistache/src/common/reactor.cc:377:47 (fuzz_server+0x6252e8) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #6 Pistache::Aio::SyncImpl::handleFds(std::vector<Pistache::Polling::Event, std::allocator<Pistache::Polling::Event>>) const /pistache/src/common/reactor.cc:289:27 (fuzz_server+0x625d9b) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #7 Pistache::Aio::SyncImpl::runOnce() /pistache/src/common/reactor.cc:245:25 (fuzz_server+0x620036) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #8 Pistache::Aio::SyncImpl::run() /pistache/src/common/reactor.cc:258:17 (fuzz_server+0x6203a8) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #9 Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()::operator()() const /pistache/src/common/reactor.cc:662:27 (fuzz_server+0x6415e8) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #10 void std::__invoke_impl<void, Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()>(std::__invoke_other, Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:61:14 (fuzz_server+0x6413e8) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #11 std::__invoke_result<Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()>::type std::__invoke<Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()>(Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:96:14 (fuzz_server+0x641238) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #12 void std::thread::_Invoker<std::tuple<Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()>>::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:279:13 (fuzz_server+0x641150) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #13 std::thread::_Invoker<std::tuple<Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()>>::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:286:11 (fuzz_server+0x641068) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #14 std::thread::_State_impl<std::thread::_Invoker<std::tuple<Pistache::Aio::AsyncImpl::Worker::run()::'lambda'()>>>::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:231:13 (fuzz_server+0x640cdc) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #15 <null> <null> (libstdc++.so.6+0xdc252) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2)

    Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning message

  Mutex M0 acquired here while holding mutex M1 in main thread:
    #0 pthread_mutex_lock <null> (fuzz_server+0x199b8b) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12/bits/gthr-default.h:749:12 (fuzz_server+0x320130) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_mutex.h:100:17 (fuzz_server+0x31ff78) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_mutex.h:229:19 (fuzz_server+0x431d3c) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #4 Pistache::Aio::SyncImpl::detachFromReactor(std::shared_ptr<Pistache::Aio::Handler> const&) /pistache/src/common/reactor.cc:122:41 (fuzz_server+0x61f046) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #5 Pistache::Aio::SyncImpl::detachAndRemoveAllHandlers()::'lambda'(std::shared_ptr<Pistache::Aio::Handler>)::operator()(std::shared_ptr<Pistache::Aio::Handler>) const /pistache/src/common/reactor.cc:135:17 (fuzz_server+0x622acf) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #6 void Pistache::Aio::SyncImpl::HandlerList::forEachHandler<Pistache::Aio::SyncImpl::detachAndRemoveAllHandlers()::'lambda'(std::shared_ptr<Pistache::Aio::Handler>)>(Pistache::Aio::SyncImpl::detachAndRemoveAllHandlers()::'lambda'(std::shared_ptr<Pistache::Aio::Handler>)) /pistache/src/common/reactor.cc:420:21 (fuzz_server+0x6225e2) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #7 Pistache::Aio::SyncImpl::detachAndRemoveAllHandlers() /pistache/src/common/reactor.cc:133:23 (fuzz_server+0x61f27b) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #8 Pistache::Aio::AsyncImpl::detachAndRemoveAllHandlers() /pistache/src/common/reactor.cc:520:28 (fuzz_server+0x636a3d) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #9 Pistache::Aio::Reactor::detachAndRemoveAllHandlers() /pistache/src/common/reactor.cc:732:21 (fuzz_server+0x61aac2) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #10 Pistache::Aio::Reactor::~Reactor() /pistache/src/common/reactor.cc:691:9 (fuzz_server+0x61a8cc) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #11 void std::_Destroy<Pistache::Aio::Reactor>(Pistache::Aio::Reactor*) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/stl_construct.h:151:19 (fuzz_server+0x4cca08) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #12 void std::allocator_traits<std::allocator<void>>::destroy<Pistache::Aio::Reactor>(std::allocator<void>&, Pistache::Aio::Reactor*) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/alloc_traits.h:648:4 (fuzz_server+0x4cc7f4) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #13 std::_Sp_counted_ptr_inplace<Pistache::Aio::Reactor, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:613:2 (fuzz_server+0x4cc359) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #14 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:175:2 (fuzz_server+0x22b210) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #15 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:361:4 (fuzz_server+0x22b12f) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #16 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:1071:11 (fuzz_server+0x22ae13) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #17 std::__shared_ptr<Pistache::Aio::Reactor, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:1524:31 (fuzz_server+0x4cd76c) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #18 std::shared_ptr<Pistache::Aio::Reactor>::~shared_ptr() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr.h:175:11 (fuzz_server+0x4b3c78) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #19 Pistache::Tcp::Listener::~Listener() /pistache/src/server/listener.cc:274:5 (fuzz_server+0x4b474a) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #20 Pistache::Http::Endpoint::~Endpoint() /pistache/src/server/endpoint.cc:360:41 (fuzz_server+0x4a0dd2) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #21 LLVMFuzzerTestOneInput /pistache/build/fuzzing/../../tests/fuzzers/fuzz_server.cpp:64:1 (fuzz_server+0x221d5f) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)
    #22 fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) <null> (fuzz_server+0x14f4e0) (BuildId: dfa875f2a84736d29effb9b4566398b47d5efdbb)

Just FYI: this is a fuzzing test (you can see it at the bottom of the stacks), but that doesn't matter right now.

@dgreatwood
Contributor Author

dgreatwood commented Aug 22, 2024 via email

dgreatwood and others added 3 commits August 29, 2024 17:26
…_ Instead

Reason for Reversion
====================

handlers_arr_mutex_ and poller.reg_unreg_mutex_ were getting into a
deadlock with each other.
- In Reactor::~Reactor(), HandlerList::forEachHandler (called from
  detachAndRemoveAllHandlers) would lock handlers_arr_mutex_, and then
  detachFromReactor would lock poller.reg_unreg_mutex_.
- Meanwhile, SyncImpl::runOnce() locks poller.reg_unreg_mutex_ and
  then calls SyncImpl::handleFds which then calls HandlerList::at,
  locking handlers_arr_mutex_.
Since the two locks are taken in opposite order along the two paths, a
deadlock occurs if they run simultaneously.

Further review suggests that handlers_arr_mutex_ can be removed,
provided we use poller.reg_unreg_mutex_ to guard the dangerous
accesses to the handlers_ list.

That's the strategy we've implemented on this commit.
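The lock-order inversion described above can be reproduced in miniature. The function names below are illustrative stand-ins for the two code paths, and std::scoped_lock is shown only as a general remedy had both mutexes needed to stay (the commit instead removes handlers_arr_mutex_):

```cpp
#include <mutex>

std::mutex reg_unreg_mutex_;    // "M0" in the TSAN output
std::mutex handlers_arr_mutex_; // "M1" in the TSAN output

int work_done = 0;

// runOnce() path: locks M0, then M1 (via HandlerList::at).
void runOncePath()
{
    std::lock_guard<std::mutex> m0(reg_unreg_mutex_);
    std::lock_guard<std::mutex> m1(handlers_arr_mutex_);
    ++work_done;
}

// ~Reactor() path: locks M1 (forEachHandler), then M0
// (detachFromReactor) - the opposite order. If two threads run these
// paths concurrently, each can end up holding one mutex while waiting
// for the other: a classic lock-order inversion.
void detachPath()
{
    std::lock_guard<std::mutex> m1(handlers_arr_mutex_);
    std::lock_guard<std::mutex> m0(reg_unreg_mutex_);
    ++work_done;
}

// General remedy when both mutexes must be held: std::scoped_lock
// acquires any number of mutexes deadlock-free, regardless of the
// order its arguments are listed in.
void safePath()
{
    std::scoped_lock both(reg_unreg_mutex_, handlers_arr_mutex_);
    ++work_done;
}
```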

Specifics
=========

handlers_arr_mutex_ removed

In os.h, reg_unreg_mutex_ made mutable so it can be locked in const
functions.

In reactor.cc:

In SyncImpl::addHandler we lock poller.reg_unreg_mutex_ (addHandler
calls handlers_.add, which of course modifies handlers_ and so
requires protection).

In SyncImpl::detachFromReactor, remove lock of
poller.reg_unreg_mutex_. We are now locking poller.reg_unreg_mutex_ in
detachAndRemoveAllHandlers instead, detachAndRemoveAllHandlers being
the function that calls detachFromReactor.

As above, in detachAndRemoveAllHandlers, lock
poller.reg_unreg_mutex_. reg_unreg_mutex_ must be locked here since
detachAndRemoveAllHandlers calls handlers_.forEachHandler, which
requires handlers_ to be protected.

Remove mutex locks from HandlerList functions "add", "removeAll",
"at", and "forEachHandler". Also removed the HandlerList function
"operator[]" (a synonym of "at") since it is not used and, I found,
it made it harder to browse the code.
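The revised, caller-locks strategy can be sketched as follows; this is a simplified illustration under assumed names, not the actual reactor.cc code:

```cpp
#include <memory>
#include <mutex>
#include <vector>

struct Handler { };

// HandlerList itself is now unsynchronised; callers must protect it.
struct HandlerList
{
    void add(std::shared_ptr<Handler> h)
    {
        handlers_.push_back(std::move(h));
    }

    template <typename Func>
    void forEachHandler(Func func)
    {
        for (auto& h : handlers_)
            func(h);
    }

    std::vector<std::shared_ptr<Handler>> handlers_;
};

struct SyncImpl
{
    void addHandler(std::shared_ptr<Handler> h)
    {
        // Caller takes reg_unreg_mutex_ before touching handlers_.
        std::lock_guard<std::mutex> guard(reg_unreg_mutex_);
        handlers_.add(std::move(h));
    }

    void detachAndRemoveAllHandlers()
    {
        // Lock here, not in detachFromReactor, so only one mutex is
        // ever involved and no lock-order cycle can form.
        std::lock_guard<std::mutex> guard(reg_unreg_mutex_);
        handlers_.forEachHandler(
            [](std::shared_ptr<Handler>& h) { h.reset(); });
        handlers_.handlers_.clear();
    }

    HandlerList handlers_;
    mutable std::mutex reg_unreg_mutex_; // mutable, per the os.h change
};
```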
@kiplingw kiplingw marked this pull request as draft August 30, 2024 01:14
@dgreatwood
Contributor Author

dgreatwood commented Aug 30, 2024 via email

@tyler92
Contributor

tyler92 commented Aug 30, 2024

@dgreatwood Hi! I will try to check the new commit within the next 1-2 days.

@tyler92
Contributor

tyler92 commented Aug 31, 2024

@dgreatwood now it is much better. ASAN and TSAN are much happier than before. I have a couple of warnings from TSAN that are unrelated to the current issue/fix, so I will report them as separate issues.

dgreatwood and others added 3 commits September 5, 2024 06:56
EventMethEpollEquivImpl::ctlEx now locks interest_mutex_ when
performing its add/mod/del actions.

Previously, the interest_mutex_ lock was released before these
add/mod/del actions were performed. This was confirmed to occasionally
cause a "use after free" in findEmEventInAnInterestSet (called out of
getReadyEmEvents) in another thread. It might also result in
corruption of the interest_ std::set.
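The scope change this commit message describes can be illustrated as follows; these are simplified, hypothetical names (the real code lives in EventMethEpollEquivImpl), and only the "add" action is shown:

```cpp
#include <mutex>
#include <set>

std::mutex interest_mutex_;
std::set<int> interest_;

// Old shape: the lock was released before the add/mod/del action, so a
// reader (e.g. findEmEventInAnInterestSet in another thread) could
// observe interest_ mid-mutation - a use-after-free / corruption risk.
void ctlEx_before(int fd)
{
    {
        std::lock_guard<std::mutex> guard(interest_mutex_);
        // ...validate fd against interest_...
    } // lock released here, too early
    interest_.insert(fd); // mutation races with concurrent readers
}

// New shape: the lock is held across the action, so readers can never
// see the set mid-mutation.
void ctlEx_after(int fd)
{
    std::lock_guard<std::mutex> guard(interest_mutex_);
    // ...validate fd against interest_...
    interest_.insert(fd); // performed under the lock
}
```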
Until the 1.0 release, setting the soname to version_major.version_minor
gives us more freedom to make changes without being too restricted by
ABI compatibility.

We never actually committed to ABI stability before the 1.0
release - the README states that at the bottom - but I made this decision
clearer by updating the "versioning" section.
@kiplingw kiplingw marked this pull request as ready for review September 5, 2024 18:45
@kiplingw kiplingw merged commit 7f1b84c into pistacheio:master Sep 6, 2024
125 of 132 checks passed
4 participants