Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4.3.2: Crashing use-after-free in poller_base_t::execute_timers #3645

Closed
Steve-Read-Stormshield opened this issue Aug 29, 2019 · 17 comments · Fixed by #3925
Closed

4.3.2: Crashing use-after-free in poller_base_t::execute_timers #3645

Steve-Read-Stormshield opened this issue Aug 29, 2019 · 17 comments · Fixed by #3925

Comments

@Steve-Read-Stormshield
Copy link

Issue description

While poller_base_t::execute_timers is walking the collection of timers that are expiring / have already expired, a different thread cancels the specific timer whose timer_event method is about to be called. In my environment, the heap functions including new/delete will fill blocks with 0x5A when they are freed.

The crash happens at line 103 of poller_base.cpp:

   it->second.sink->timer_event (it->second.id);

when the code attempts to dereference it->second.sink.

Environment

  • libzmq version (commit hash if unreleased): 4.3.2
  • The issue did not occur with 4.2.3.
  • OS: FreeBSD 11.2 / amd64
  • Multicore CPU

Calling language is C++ via cppzmq. I only have one thread, so there are no issues surrounding my code using libzmq objects from more than one thread. At the moment of the crash, there are three threads:

  • A libzmq I/O thread
  • My main worker thread, executing my code via boost::asio
  • A third thread, also in libzmq, that I'd guess is the reaper.

Minimal test code / Steps to reproduce the issue

I don't have a clear picture of what exactly causes this. I can't tell what kind of object was just removed from the timer multimap because the iterator points to deleted memory (see note above about deleted blocks being filled with 0x5A). My best guess is a PUB or SUB socket that's being disconnected or unbound from my code at the same moment as the TCP connection to its peer is about to be reconnected, and the poller_base_t is about to dispatch to the timer_event handler.

At certain times, my program will find that its ZeroMQ sockets are disconnected by network events at about the same time that those network events cause a notification to my program to change which endpoints it uses (unbind-then-bind on one socket, disconnect-then-connect on another).

There are also zmq IPC sockets, but they are only rarely changed while the program is running, whereas the network events mentioned above are expected to happen (even if we hope they don't), and they will cause the disconnect/connect and unbind/bind sequences.

What's the actual result? (include assertion message & call stack if applicable)

Thread 3 (process 100288):  ***Reaper?***
#0  _kevent () at _kevent.S:3
#1  0x000000080427ccc2 in __thr_kevent (kq=13, changelist=0x0, nchanges=0, eventlist=0x7fffdfffbf30, nevents=256, timeout=0x0)
    at /home/build/snsbsd/lib/libthr/thread/thr_syscalls.c:398
#2  0x00000008025fabc3 in zmq::kqueue_t::loop (this=0x805e18500) at kqueue.cpp:190
#3  0x00000008026283df in thread_routine (arg_=0x805e18540) at thread.cpp:225
#4  0x0000000804279c06 in thread_start (curthread=0x805e10200) at /home/build/snsbsd/lib/libthr/thread/thr_create.c:289
#5  0x0000000000000000 in ?? ()
Current language:  auto; currently asm

Thread 2 (process 100282):  ***My worker thread***
#0  _kevent () at _kevent.S:3
#1  0x000000080427ccc2 in __thr_kevent (kq=3, changelist=0x0, nchanges=0, eventlist=0x7fffffffd370, nevents=128, timeout=0x7fffffffe780)
    at /home/build/snsbsd/lib/libthr/thread/thr_syscalls.c:398
#2  0x0000000800ac0b27 in boost::asio::detail::kqueue_reactor::run () from /usr/Firewall/lib/libnfw++.so
#3  0x0000000800ac056d in boost::asio::detail::scheduler::do_run_one () from /usr/Firewall/lib/libnfw++.so
#4  0x0000000800ac01b2 in boost::asio::detail::scheduler::run () from /usr/Firewall/lib/libnfw++.so
#5  0x0000000800abb913 in sns::DaemonBase::run () from /usr/Firewall/lib/libnfw++.so
#6  0x00000000004097e8 in main ()

Thread 1 (process 100289):  *** Crashed thread***
#0  zmq::poller_base_t::execute_timers (this=0x805e18700) at poller_base.cpp:103
#1  0x00000008025fab58 in zmq::kqueue_t::loop (this=0x805e18700) at kqueue.cpp:177
#2  0x00000008026283df in thread_routine (arg_=0x805e18740) at thread.cpp:225
#3  0x0000000804279c06 in thread_start (curthread=0x805e10e00) at /home/build/snsbsd/lib/libthr/thread/thr_create.c:289
#4  0x0000000000000000 in ?? ()

What's the expected result?

Does not crash.

@sigiesec
Copy link
Member

sigiesec commented Dec 23, 2019

I can imagine that the changes made by fa976f8 changed the behaviour here subtly.

Could you try to revert the changes made by that commit to poller_base.cpp and see if this fixes your issue?

However, the code didn't seem to be correct before. It assumed already that _timers are not modified during execute_timers, which does not seem to be true (maybe it was true at some point of time, but at least cbb3b17 introduced a call to add_timer from within a timer_event function. I didn't check if there are other such uses. I think the code in _poller_base_t::execute_timers needs to be changed to properly handle concurrent modifications to _timers.

@rhymerjr
Copy link

Hi,

I'm using ZMQ_CLIENT and libzmq version: 4.3.2 and I have a very similar symptom!

Thread 4 "ZMQbg/IO/0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5e9e700 (LWP 12634)]
0x0000000000000000 in ?? ()
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff7bcc17e in zmq::poller_base_t::execute_timers (this=this@entry=0x555555666f40) at src/poller_base.cpp:103
#2 0x00007ffff7bb7635 in zmq::epoll_t::loop (this=0x555555666f40) at src/epoll.cpp:173
#3 0x00007ffff7bec9c8 in thread_routine (arg_=0x555555666f98) at src/thread.cpp:369
#4 0x00007ffff7c30fa3 in start_thread (arg=) at pthread_create.c:486
#5 0x00007ffff74f94cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Is there any solution for this ?

@bjovke
Copy link
Contributor

bjovke commented May 1, 2020

Hi.
Didn't comment here for a long time. :)

The issue is that the class which created the timer calls cancel_timer () and then deletes itself with delete this;. The sink is the pointer to the class and it then points to garbage.
For example zmq::stream_engine_base_t::error (error_reason_t reason_) calls unplug () which calls cancel_timer (). Then it does delete this;.
It's a little bit more complicated when one class creates two or more timers active at the same time.

cancel_timer () called from execute_timers () is safe as long as it is not called on timer for which it->second.sink->timer_event (it->second.id); is executing (should never happen on expired timers anyway). add_timer () is completely safe. But, all this is safe only with proper locking to prevent concurrent access to _timers container.

I guess the simple mutex locking on all three cancel_timer (), execute_timers () and add_timer () would be enough?
I'll make a commit in my fork and let you know when I finish so you can test it.

@bjovke
Copy link
Contributor

bjovke commented May 2, 2020

@SteveReadSNS , @rhymerjr Can you please try with the latest commit from https://github.com/bjovke/libzmq.git?
A mutex is added to synchronize access to _timers container.

@Steve-Read-Stormshield
Copy link
Author

Sorry, I've been busy with other things recently. Based on the analysis of my colleague who has also worked on this, I'd say that a mutex is not going to solve this problem in my case, because the accesses to the container are all from the same thread (ZMQ's I/O thread - in my case there's only one I/O thread).

@Steve-Read-Stormshield
Copy link
Author

That's not to say that a mutex isn't important, just that if there's only one I/O thread, it won't help in most cases, and I suspect it shouldn't help at all in any case. The reaper shouldn't be manipulating these objects while they are still have timers open in an I/O thread, and it shouldn't be possible to have one I/O thread manipulating objects that are in another I/O thread's timer queue.

@bjovke
Copy link
Contributor

bjovke commented May 4, 2020

Might be, but then again it might not.
It would be much better if we had a sample program which can reproduce the issue. In this way we're guessing what the issue might be.
Can you try it with the mutex stuff? You don't need to clone the whole https://github.com/bjovke/libzmq.git repository, just copy the two changed files and rebuild.

@bjovke
Copy link
Contributor

bjovke commented May 5, 2020

Hi.
@SteveReadSNS can you please check what is the value of it->second.id when crash happens at line it->second.sink->timer_event (it->second.id);? This would narrow down the search for the issue.

@ebourneuf
Copy link

Hi,

I 'am a SteveReadSNS collegue, and I will follow this issue.
@bjovke This issue occured a long time ago, and we do not have anymore the cores to inspect de values of it->second.id and reproducing the issue was not easy.

The most important thing, is that the iteration on the map is not safe, because its content may be changed by the called it->second.sink->timer_event() callback

@bjovke
Copy link
Contributor

bjovke commented May 8, 2020

Hi @ebourneuf . Sorry for the late reply.
The iteration of the map is safe as long as it->second.sink->timer_event() does not delete the exact timer it is triggered on.

std::multimap

References and iterators to the erased elements are invalidated. Other references and iterators are not affected.

This is also valid when new timers are inserted, none of the existing iterators are invalidated.
The problem is if multiple threads might access AND change the map at the same time. That's not safe.
This can be fixed with synchronized access (mutex).
Another issue is that some of the base classes which created two timers might delete only one and delete itself, leaving it->second.sink pointing to garbage. This is hard to figure out (at least by me) and working example to reproduce the issue would be of paramount help.

I'm wondering, since you don't have this issue any more (as I understood), why is this an issue for you now?

@ebourneuf
Copy link

I don't have this issue any more because we fixed the function with the following patch
and we would like to integrate it in the libzmq

patch-poller_base.cpp-cancel_timer.txt

@bjovke
Copy link
Contributor

bjovke commented May 8, 2020

Great. I get the idea, this is the safest way. I'll try to make a patch based on this.
But it would be nice if we can test it.
@rhymerjr Is there a chance that you can test the change as soon as I make it?

@bjovke
Copy link
Contributor

bjovke commented May 8, 2020

OK, I've made the changes at https://github.com/bjovke/libzmq.git.
How can we test if this is the right fix?
@rhymerjr ?

@rhymerjr
Copy link

rhymerjr commented May 9, 2020

Yes! A have changed the poller_base.cpp by the patch attached by ebourneuf!
I started running tests for this! But it may take a few days!
My client application has always crashed by this problem within a few days...

@bjovke
Copy link
Contributor

bjovke commented May 10, 2020

Great! Please let us know when you have some results.

@rhymerjr
Copy link

After more than a week I can say that the crash has gone!
I have started 15 copies from my application and they all still in running state!
(each application instance is under very intense stress testing)
Looks like this problem has solved by that patch!

@bjovke
Copy link
Contributor

bjovke commented May 18, 2020

Great!
Thank you very much for testing this.
I'll make a pull request.

santigimeno added a commit to nodesource/libzmq that referenced this issue Apr 8, 2021
To avoid a use-after-free crash in `execute_timers` that showed up when running
the test suite with valgrind.

Refs: zeromq#3645
santigimeno added a commit to nodesource/libzmq that referenced this issue Apr 8, 2021
To avoid a use-after-free crash in `execute_timers` that showed up when running
the test suite with valgrind.

Refs: zeromq#3645
santigimeno added a commit to nodesource/libzmq that referenced this issue Apr 8, 2021
To avoid a use-after-free crash in `execute_timers` that showed up when running
the test suite with valgrind.

Refs: zeromq#3645
PR-URL: #6
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
axelriet added a commit to axelriet/libzmq that referenced this issue Dec 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants