4.3.2: Crashing use-after-free in poller_base_t::execute_timers #3645

Steve-Read-Stormshield · 2019-08-29T09:47:48Z

Issue description

While poller_base_t::execute_timers is walking the collection of timers that are expiring / have already expired, a different thread cancels the specific timer whose timer_event method is about to be called. In my environment, the heap functions including new/delete will fill blocks with 0x5A when they are freed.

The crash happens at line 103 of poller_base.cpp:

   it->second.sink->timer_event (it->second.id);

when the code attempts to dereference it->second.sink.

Environment

libzmq version (commit hash if unreleased): 4.3.2
The issue did not occur with 4.2.3.
OS: FreeBSD 11.2 / amd64
Multicore CPU

Calling language is C++ via cppzmq. I only have one thread, so there are no issues surrounding my code using libzmq objects from more than one thread. At the moment of the crash, there are three threads:

A libzmq I/O thread
My main worker thread, executing my code via boost::asio
A third thread, also in libzmq, that I'd guess is the reaper.

Minimal test code / Steps to reproduce the issue

I don't have a clear picture of what exactly causes this. I can't tell what kind of object was just removed from the timer multimap because the iterator points to deleted memory (see note above about deleted blocks being filled with 0x5A). My best guess is a PUB or SUB socket that's being disconnected or unbound from my code at the same moment as the TCP connection to its peer is about to be reconnected, and the poller_base_t is about to dispatch to the timer_event handler.

At certain times, my program will find that its ZeroMQ sockets are disconnected by network events at about the same time that those network events cause a notification to my program to change which endpoints it uses (unbind-then-bind on one socket, disconnect-then-connect on another).

There are also zmq IPC sockets, but they are only rarely changed while the program is running, whereas the network events mentioned above are expected to happen (even if we hope they don't), and they will cause the disconnect/connect and unbind/bind sequences.

What's the actual result? (include assertion message & call stack if applicable)

Thread 3 (process 100288):  ***Reaper?***
#0  _kevent () at _kevent.S:3
#1  0x000000080427ccc2 in __thr_kevent (kq=13, changelist=0x0, nchanges=0, eventlist=0x7fffdfffbf30, nevents=256, timeout=0x0)
    at /home/build/snsbsd/lib/libthr/thread/thr_syscalls.c:398
#2  0x00000008025fabc3 in zmq::kqueue_t::loop (this=0x805e18500) at kqueue.cpp:190
#3  0x00000008026283df in thread_routine (arg_=0x805e18540) at thread.cpp:225
#4  0x0000000804279c06 in thread_start (curthread=0x805e10200) at /home/build/snsbsd/lib/libthr/thread/thr_create.c:289
#5  0x0000000000000000 in ?? ()
Current language:  auto; currently asm

Thread 2 (process 100282):  ***My worker thread***
#0  _kevent () at _kevent.S:3
#1  0x000000080427ccc2 in __thr_kevent (kq=3, changelist=0x0, nchanges=0, eventlist=0x7fffffffd370, nevents=128, timeout=0x7fffffffe780)
    at /home/build/snsbsd/lib/libthr/thread/thr_syscalls.c:398
#2  0x0000000800ac0b27 in boost::asio::detail::kqueue_reactor::run () from /usr/Firewall/lib/libnfw++.so
#3  0x0000000800ac056d in boost::asio::detail::scheduler::do_run_one () from /usr/Firewall/lib/libnfw++.so
#4  0x0000000800ac01b2 in boost::asio::detail::scheduler::run () from /usr/Firewall/lib/libnfw++.so
#5  0x0000000800abb913 in sns::DaemonBase::run () from /usr/Firewall/lib/libnfw++.so
#6  0x00000000004097e8 in main ()

Thread 1 (process 100289):  *** Crashed thread***
#0  zmq::poller_base_t::execute_timers (this=0x805e18700) at poller_base.cpp:103
#1  0x00000008025fab58 in zmq::kqueue_t::loop (this=0x805e18700) at kqueue.cpp:177
#2  0x00000008026283df in thread_routine (arg_=0x805e18740) at thread.cpp:225
#3  0x0000000804279c06 in thread_start (curthread=0x805e10e00) at /home/build/snsbsd/lib/libthr/thread/thr_create.c:289
#4  0x0000000000000000 in ?? ()

What's the expected result?

Does not crash.

The text was updated successfully, but these errors were encountered:

sigiesec · 2019-12-23T12:09:58Z

I can imagine that the changes made by fa976f8 changed the behaviour here subtly.

Could you try to revert the changes made by that commit to poller_base.cpp and see if this fixes your issue?

However, the code didn't seem to be correct before. It assumed already that _timers are not modified during execute_timers, which does not seem to be true (maybe it was true at some point of time, but at least cbb3b17 introduced a call to add_timer from within a timer_event function. I didn't check if there are other such uses. I think the code in _poller_base_t::execute_timers needs to be changed to properly handle concurrent modifications to _timers.

rhymerjr · 2020-04-28T08:34:21Z

Hi,

I'm using ZMQ_CLIENT and libzmq version: 4.3.2 and I have a very similar symptom!

Thread 4 "ZMQbg/IO/0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5e9e700 (LWP 12634)]
0x0000000000000000 in ?? ()
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff7bcc17e in zmq::poller_base_t::execute_timers (this=this@entry=0x555555666f40) at src/poller_base.cpp:103
#2 0x00007ffff7bb7635 in zmq::epoll_t::loop (this=0x555555666f40) at src/epoll.cpp:173
#3 0x00007ffff7bec9c8 in thread_routine (arg_=0x555555666f98) at src/thread.cpp:369
#4 0x00007ffff7c30fa3 in start_thread (arg=) at pthread_create.c:486
#5 0x00007ffff74f94cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Is there any solution for this ?

bjovke · 2020-05-01T23:48:44Z

Hi.
Didn't comment here for a long time. :)

The issue is that the class which created the timer calls cancel_timer () and then deletes itself with delete this;. The sink is the pointer to the class and it then points to garbage.
For example zmq::stream_engine_base_t::error (error_reason_t reason_) calls unplug () which calls cancel_timer (). Then it does delete this;.
It's a little bit more complicated when one class creates two or more timers active at the same time.

cancel_timer () called from execute_timers () is safe as long as it is not called on timer for which it->second.sink->timer_event (it->second.id); is executing (should never happen on expired timers anyway). add_timer () is completely safe. But, all this is safe only with proper locking to prevent concurrent access to _timers container.

I guess the simple mutex locking on all three cancel_timer (), execute_timers () and add_timer () would be enough?
I'll make a commit in my fork and let you know when I finish so you can test it.

bjovke · 2020-05-02T16:12:25Z

@SteveReadSNS , @rhymerjr Can you please try with the latest commit from https://github.com/bjovke/libzmq.git?
A mutex is added to synchronize access to _timers container.

Steve-Read-Stormshield · 2020-05-04T07:32:47Z

Sorry, I've been busy with other things recently. Based on the analysis of my colleague who has also worked on this, I'd say that a mutex is not going to solve this problem in my case, because the accesses to the container are all from the same thread (ZMQ's I/O thread - in my case there's only one I/O thread).

Steve-Read-Stormshield · 2020-05-04T07:38:46Z

That's not to say that a mutex isn't important, just that if there's only one I/O thread, it won't help in most cases, and I suspect it shouldn't help at all in any case. The reaper shouldn't be manipulating these objects while they are still have timers open in an I/O thread, and it shouldn't be possible to have one I/O thread manipulating objects that are in another I/O thread's timer queue.

bjovke · 2020-05-04T11:45:46Z

Might be, but then again it might not.
It would be much better if we had a sample program which can reproduce the issue. In this way we're guessing what the issue might be.
Can you try it with the mutex stuff? You don't need to clone the whole https://github.com/bjovke/libzmq.git repository, just copy the two changed files and rebuild.

bjovke · 2020-05-05T07:36:48Z

Hi.
@SteveReadSNS can you please check what is the value of it->second.id when crash happens at line it->second.sink->timer_event (it->second.id);? This would narrow down the search for the issue.

ebourneuf · 2020-05-07T14:39:51Z

Hi,

I 'am a SteveReadSNS collegue, and I will follow this issue.
@bjovke This issue occured a long time ago, and we do not have anymore the cores to inspect de values of it->second.id and reproducing the issue was not easy.

The most important thing, is that the iteration on the map is not safe, because its content may be changed by the called it->second.sink->timer_event() callback

bjovke · 2020-05-08T14:51:50Z

Hi @ebourneuf . Sorry for the late reply.
The iteration of the map is safe as long as it->second.sink->timer_event() does not delete the exact timer it is triggered on.

std::multimap

References and iterators to the erased elements are invalidated. Other references and iterators are not affected.

This is also valid when new timers are inserted, none of the existing iterators are invalidated.
The problem is if multiple threads might access AND change the map at the same time. That's not safe.
This can be fixed with synchronized access (mutex).
Another issue is that some of the base classes which created two timers might delete only one and delete itself, leaving it->second.sink pointing to garbage. This is hard to figure out (at least by me) and working example to reproduce the issue would be of paramount help.

I'm wondering, since you don't have this issue any more (as I understood), why is this an issue for you now?

ebourneuf · 2020-05-08T15:53:32Z

I don't have this issue any more because we fixed the function with the following patch
and we would like to integrate it in the libzmq

patch-poller_base.cpp-cancel_timer.txt

bjovke · 2020-05-08T20:30:38Z

Great. I get the idea, this is the safest way. I'll try to make a patch based on this.
But it would be nice if we can test it.
@rhymerjr Is there a chance that you can test the change as soon as I make it?

bjovke · 2020-05-08T23:02:59Z

OK, I've made the changes at https://github.com/bjovke/libzmq.git.
How can we test if this is the right fix?
@rhymerjr ?

rhymerjr · 2020-05-09T21:24:41Z

Yes! A have changed the poller_base.cpp by the patch attached by ebourneuf!
I started running tests for this! But it may take a few days!
My client application has always crashed by this problem within a few days...

bjovke · 2020-05-10T19:41:26Z

Great! Please let us know when you have some results.

rhymerjr · 2020-05-18T09:22:34Z

After more than a week I can say that the crash has gone!
I have started 15 copies from my application and they all still in running state!
(each application instance is under very intense stress testing)
Looks like this problem has solved by that patch!

bjovke · 2020-05-18T12:46:28Z

Great!
Thank you very much for testing this.
I'll make a pull request.

To avoid a use-after-free crash in `execute_timers` that showed up when running the test suite with valgrind. Refs: zeromq#3645

To avoid a use-after-free crash in `execute_timers` that showed up when running the test suite with valgrind. Refs: zeromq#3645 PR-URL: #6 Reviewed-by: Trevor Norris <trev.norris@gmail.com>

…out <= 0

sigiesec added Area (Runtime / Usage) Platform (freebsd) Symptom (Crash/Race/Undefined behavior) Regression labels Dec 23, 2019

bjovke mentioned this issue May 18, 2020

Problem: Usage of invalidated iterator of _timers container in zmq::p… #3925

Merged

sigiesec closed this as completed in #3925 May 20, 2020

santigimeno added a commit to nodesource/libzmq that referenced this issue Apr 8, 2021

build: changes to support v4.3.4

e83a94a

To avoid a use-after-free crash in `execute_timers` that showed up when running the test suite with valgrind. Refs: zeromq#3645

santigimeno mentioned this issue Apr 8, 2021

build: changes to support v4.3.4 nodesource/libzmq#6

Closed

santigimeno added a commit to nodesource/libzmq that referenced this issue Apr 8, 2021

build: changes to support v4.3.4

53dd4b7

To avoid a use-after-free crash in `execute_timers` that showed up when running the test suite with valgrind. Refs: zeromq#3645

axelriet added a commit to axelriet/libzmq that referenced this issue Dec 23, 2023

Follow-up to issue zeromq#3645 being fixed, plus debug assert on time…

3215578

…out <= 0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

4.3.2: Crashing use-after-free in poller_base_t::execute_timers #3645

4.3.2: Crashing use-after-free in poller_base_t::execute_timers #3645

Steve-Read-Stormshield commented Aug 29, 2019

sigiesec commented Dec 23, 2019 •

edited

Loading

rhymerjr commented Apr 28, 2020

bjovke commented May 1, 2020

bjovke commented May 2, 2020

Steve-Read-Stormshield commented May 4, 2020

Steve-Read-Stormshield commented May 4, 2020

bjovke commented May 4, 2020

bjovke commented May 5, 2020

ebourneuf commented May 7, 2020

bjovke commented May 8, 2020

ebourneuf commented May 8, 2020

bjovke commented May 8, 2020

bjovke commented May 8, 2020

rhymerjr commented May 9, 2020

bjovke commented May 10, 2020

rhymerjr commented May 18, 2020

bjovke commented May 18, 2020

4.3.2: Crashing use-after-free in poller_base_t::execute_timers #3645

4.3.2: Crashing use-after-free in poller_base_t::execute_timers #3645

Comments

Steve-Read-Stormshield commented Aug 29, 2019

Issue description

Environment

Minimal test code / Steps to reproduce the issue

What's the actual result? (include assertion message & call stack if applicable)

What's the expected result?

sigiesec commented Dec 23, 2019 • edited Loading

rhymerjr commented Apr 28, 2020

bjovke commented May 1, 2020

bjovke commented May 2, 2020

Steve-Read-Stormshield commented May 4, 2020

Steve-Read-Stormshield commented May 4, 2020

bjovke commented May 4, 2020

bjovke commented May 5, 2020

ebourneuf commented May 7, 2020

bjovke commented May 8, 2020

ebourneuf commented May 8, 2020

bjovke commented May 8, 2020

bjovke commented May 8, 2020

rhymerjr commented May 9, 2020

bjovke commented May 10, 2020

rhymerjr commented May 18, 2020

bjovke commented May 18, 2020

sigiesec commented Dec 23, 2019 •

edited

Loading