
zmq core dumps in epoll fd set manipulation on linux #2103

Closed
aggarwaa opened this issue Sep 8, 2016 · 23 comments
aggarwaa commented Sep 8, 2016

Greetings,

I have been using zmq PUSH/PULL sockets for communication between multiple nodes in a distributed system for over a year now without any problems. Typically, the PULL socket gets connected to by 40 servers that PUSH to it -- note that this is done over the Internet, with some of these socket connections being across continents.

Lately, we have started running into crashes which seem to be as a result of frequent reconnects. The stack traces are of the form:

(gdb) bt 
#0 0x00007f98d7ea9c37 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 
#1 0x00007f98d7ead028 in __GI_abort () at abort.c:89
#2 0x00007f98da244f39 in zmq::zmq_abort (
errmsg_=errmsg_@entry=0x7f98d7fefbc7 "Bad file descriptor") at err.cpp:74 
#3 0x00007f98da2448ae in zmq::epoll_t::add_fd (this=0x7f98a6c240c0,
fd_=<optimized out>, events_=events_@entry=0x7f988784f800) at epoll.cpp:67 
#4 0x00007f98da2456bc in zmq::io_object_t::add_fd (
this=this@entry=0x7f988784f800, fd_=<optimized out>) at io_object.cpp:55 
#5 0x00007f98da262814 in zmq::stream_engine_t::plug (this=0x7f988784f800,
io_thread_=0x7f98a6c201e0, session_=0x7f988784a780) 
at stream_engine.cpp:135 
#6 0x00007f98da259c0a in zmq::session_base_t::process_attach (
this=0x7f988784a780, engine_=0x7f988784f810) at session_base.cpp:363 
#7 0x00007f98da24d07a in zmq::object_t::process_command (this=0x7f988784a780,
cmd_=...) at object.cpp:89 
#8 0x00007f98da24592c in zmq::io_thread_t::in_event (this=0x7f98a6c201e0)
at io_thread.cpp:73 
#9 0x00007f98da244bae in zmq::epoll_t::loop (this=0x7f98a6c240c0)
at epoll.cpp:165 
#10 0x00007f98da2691e0 in thread_routine (arg_=0x7f98a6c24130) at thread.cpp:81
#11 0x00007f98da791184 in start_thread (arg=0x7f989f3ff700)
at pthread_create.c:312 
#12 0x00007f98d7f6d37d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

zmq epoll.cpp code snippet in question:

 52 zmq::epoll_t::handle_t zmq::epoll_t::add_fd (fd_t fd_, i_poll_events *events_)
 53 {
 54     poll_entry_t *pe = new (std::nothrow) poll_entry_t;
 55     alloc_assert (pe);
 56
 57     //  The memset is not actually needed. It's here to prevent debugging
 58     //  tools to complain about using uninitialised memory.
 59     memset (pe, 0, sizeof (poll_entry_t));
 60
 61     pe->fd = fd_;
 62     pe->ev.events = 0;
 63     pe->ev.data.ptr = pe;
 64     pe->events = events_;
 65
 66     int rc = epoll_ctl (epoll_fd, EPOLL_CTL_ADD, fd_, &pe->ev);  // <-- returned EBADF
 67     errno_assert (rc != -1);                                     // <-- assert that caused the crash
 68
 69     //  Increase the load metric of the thread.
 70     adjust_load (1);
 71
 72     return pe;
 73 }

What I think is happening: zmq was adding an fd (a socket fd specifically, since those are the only ones relevant in our case) to the epoll fd set watched by one of the I/O threads, but because the fd was bad/had already been closed, epoll_ctl() failed with EBADF, which caused zmq to assert. In other cores, zmq can be seen adding/removing/modifying the set of epoll fds -- some of those calls fail due to the same race. The asserts hit are usually in zmq::epoll_t::add_fd()/rm_fd()/set_pollin()/reset_pollin()/set_pollout()/reset_pollout().

(gdb) f 5 
#5 0x00007f98da262814 in zmq::stream_engine_t::plug (this=0x7f988784f800,
io_thread_=0x7f98a6c201e0, session_=0x7f988784a780) 
at stream_engine.cpp:135 

135 stream_engine.cpp: No such file or directory. 
(gdb) p options 
$1 = {sndhwm = 1000, rcvhwm = 1000, affinity = 0, identity_size = 0 '\000', 
  identity = '\000' <repeats 255 times>, rate = 100, recovery_ivl = 10000, 
  multicast_hops = 1, sndbuf = 0, rcvbuf = 0, type = 7, linger = -1, 
  reconnect_ivl = 100, reconnect_ivl_max = 0, backlog = 100, maxmsgsize = -1, 
  rcvtimeo = 500, sndtimeo = -1, ipv6 = false, immediate = 0, filter = false, 
  recv_identity = false, raw_sock = false, tcp_keepalive = -1, 
  tcp_keepalive_cnt = -1, tcp_keepalive_idle = -1, tcp_keepalive_intvl = -1, 
  tcp_accept_filters = std::vector of length 0, capacity 0, mechanism = 0, 
  as_server = 0, zap_domain = "", plain_username = "", plain_password = "", 
  curve_public_key = '\000' <repeats 31 times>, 
  curve_secret_key = '\000' <repeats 31 times>, 
  curve_server_key = '\000' <repeats 31 times>, socket_id = 4, 
  conflate = false} 
(gdb) p options.type 
$2 = 7 

What is type '7'?

zmq.h:#define ZMQ_PULL 7

It's a ZMQ_PULL socket. Also, as can be seen from the options, reconnect_ivl is set to the default of 100 milliseconds.

Questions:
a) Is it safe to remove the asserts in the routines listed above? If so, I can provide a patchset against master.
b) I'm going to attempt to work around this issue by increasing ZMQ_RECONNECT_IVL to 1 second on the PUSH (sender) side. Do you agree with that?
c) Is there any other way I can work around this issue?

FWIW, if my analysis is correct, even the current zmq version 4.1.5 is subject to the same race, so upgrading libzmq isn't going to help me here.

Version of zmq being used: 4.0.4
OS Version: Ubuntu Linux 14.04.5 LTS (kernel version 4.4.0-34-generic)

My applications use the C++ zmq interfaces.

Thanks,
Alok

@aggarwaa aggarwaa changed the title zmq core dumps in epoll fd set manipulation zmq core dumps in epoll fd set manipulation on linux Sep 8, 2016
bluca (Member) commented Sep 8, 2016

a) No, we use asserts intentionally to catch illegal use of the APIs and bugs. This sounds like one of those and we should solve it instead.
b) Sounds reasonable, but I haven't used push/pull that much so I defer to others.


aggarwaa commented Sep 8, 2016

I'd love to know if my use of the API is incorrect.

Receiver side code:

  zmq::context_t ctx(1);
  zmq::socket_t socket(ctx, ZMQ_PULL);
  int timeout_milliseconds = 500;
  socket.setsockopt(ZMQ_RCVTIMEO, &timeout_milliseconds, sizeof(int));
  socket.bind("tcp://127.0.0.1:6005");

Sender side:

int hwm = 1024;
int yes = 1;
int reconn_max = 2500;  // milliseconds
std::deque<zmq::socket_t> sockets_;
zmq::context_t ctx_{1};
sockets_.emplace_back(ctx_, ZMQ_PUSH);
auto& sock_ = sockets_.back();
sock_.setsockopt(ZMQ_SNDHWM, &hwm, sizeof(hwm));
sock_.setsockopt(ZMQ_IMMEDIATE, &yes, sizeof(yes));
sock_.setsockopt(ZMQ_RECONNECT_IVL_MAX, &reconn_max, sizeof(reconn_max));
sock_.connect("tcp://127.0.0.1:6005"); // IP address is of course an actual address

As I noted earlier, this code has been running just fine for over a year. Only recently did the race start occurring. But I would love to be told that my use of the API is incorrect and that this isn't an actual bug.


bluca commented Sep 8, 2016

This looks fine to me, it's very possible that it's a problem in how we use epoll.


bluca commented Sep 8, 2016

There are other polling methods supported (select, poll, kqueue); you could try building with a different one to bisect the problem a bit.


aggarwaa commented Sep 8, 2016

I don't believe the problem is with how epoll is being used -- I'm very familiar with how to use it, and zmq is using it exactly as intended.

However, it is perfectly legitimate for the epoll fd-set manipulation syscalls to fail and set errno. The zmq code aborts when it encounters a failure in any of these syscalls -- for example, when an fd being (e)polled gets closed from underneath zmq, epoll_ctl() will of course return EBADF.

It should be fine to simply ignore such an fd instead of asserting, but I don't know how much of the zmq code would need to be unwound to handle that correctly. This is where I would like to get some feedback from the zmq authors.


bluca commented Sep 8, 2016

Yes, it's the implications I'm worried about. And most importantly, what would cause an fd to be invalid before it's even added to the poll set? We do check when a socket is created.


bluca commented Sep 12, 2016

Would it be possible for you to provide a test case (like those in libzmq/tests/) that reproduces the issue?

I find that unless a way to reproduce an issue is found, it is extremely hard for us to fix.

aggarwaa commented:

Apologies for the delayed response (vacation got in the way). I'll see if I can come up with a test case that repros the problem.

quiccsmith commented:

Hi @bluca and @aggarwaa,
I am trying to verify whether this issue has been fixed. If so, in which release? We are seeing this issue with zeromq 4.2.3: add_fd() raises SIGABRT with "Bad file descriptor".

#0 0x73b52fb8 in _sigprocmask (how=2, set=&lt;optimized out&gt;, oset=0x72a0434c) at ../sysdeps/unix/sysv/linux/sigprocmask.c:57
#1 0x737878ac in SignalHandler::handler (signo=6, siginfo=0x72a04850, sig_context=&lt;optimized out&gt;) at lib/signal_handler.cc:308
#2 &lt;signal handler called&gt;
#3 0x73b52d64 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#4 0x73b567f8 in __GI_abort () at abort.c:89
#5 0x737d90fc in zmq::zmq_abort (errmsg_=errmsg_@entry=0x73c424d8 "Bad file descriptor") at src/err.cpp:87
#6 0x737d8860 in zmq::epoll_t::add_fd (this=0x0, fd_=&lt;optimized out&gt;, events_=0x72a04bcc) at src/epoll.cpp:88
#7 0x7380ab6c in zmq::tcp_connecter_t::start_connecting (this=0x70800468) at src/tcp_connecter.cpp:198
#8 0x737ee2d4 in zmq::poller_base_t::execute_timers (this=this@entry=0x1901b0) at src/poller_base.cpp:100
#9 0x737d8bc4 in zmq::epoll_t::loop (this=0x1901b0) at src/epoll.cpp:164
#10 0x7380b9c8 in thread_routine (arg=0x1901fc) at src/thread.cpp:100
#11 0x73c8a05c in start_thread (arg=0x72a06430) at pthread_create.c:314
#12 0x73bf5c50 in ?? () at ../ports/sysdeps/unix/sysv/linux/arm/clone.S:92 from /lib/libc.so.6


bluca commented Jul 24, 2017

I've never managed to reproduce it. Do you have a minimal test case that can reproduce the problem?

aggarwaa commented:

I was never able to create a tight test case that reproduces the problem. We ended up working around it by changing the code that necessitated frequent disconnect/reconnect attempts.

quiccsmith commented:

Hi @aggarwaa, can you share more details of your workaround?

aggarwaa commented:

Our crash scenario was this:

We had a serviceA on host A that would continually try to connect to a serviceB on a different host B. When serviceB/hostB was down, the reconnects would be fairly frequent and the zmq crash would ensue.

We worked around this by not having hostA reconnect to hostB while hostB was administratively down for maintenance, etc.

santupari commented:

Hi All,

We are facing the exact same problem; has there been any update on this issue?
Below is the backtrace showing the crash in add_fd() with EBADF.
#0 0x4e1accc0 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1 0x4e1b0740 in __GI_abort () at abort.c:89
#2 0x4e799f00 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x4e29cbb8 "Bad file descriptor") at src/err.cpp:87
#3 0x4e7997f8 in zmq::epoll_t::add_fd (this=0x4e2bc4c8, fd_=&lt;optimized out&gt;, events_=0x1, events_@entry=0xb5e5e6c0) at src/epoll.cpp:81
#4 0x4e79ace0 in zmq::io_object_t::add_fd (this=this@entry=0xb5e5e6c0, fd_=&lt;optimized out&gt;) at src/io_object.cpp:66
#5 0x4e7c3998 in zmq::stream_engine_t::plug (this=0xb5e5e6c0, io_thread_=0xb6aa5eb0, session_=0xb5efd608) at src/stream_engine.cpp:194
#6 0x4e7b4580 in zmq::session_base_t::process_attach (this=0xb5efd608, engine_=0xb5e5e6c8) at src/session_base.cpp:411
#7 0x4e7a5cf8 in zmq::object_t::process_command (this=0xb5efd608, cmd_=...) at src/object.cpp:100
#8 0x4e79b0dc in zmq::io_thread_t::in_event (this=0xb6aa5eb0) at src/io_thread.cpp:85
#9 0x4e799bf4 in zmq::epoll_t::loop (this=0xb5ea9500) at src/epoll.cpp:179
#10 0x4e7c9834 in thread_routine (arg_=0xb5ea954c) at src/thread.cpp:100
#11 0x4e2d60c4 in start_thread (arg=0xb6168f50) at pthread_create.c:339
#12 0x4e250de0 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /home/nasiha_hrustemovic/debug/L2X19/lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Any workaround is much appreciated.

Thanks,
Santosh


bluca commented Aug 15, 2017

Do you have a test case to reproduce the issue?


santupari commented Aug 15, 2017

Thanks for the response @bluca.
This issue occurred after several hours of testing with the test case below:

  1. A client process calls an API which sends a request and receives a response message, from multiple threads. The following steps are used to perform a transaction, to keep the ZMQ sockets thread-safe:
    a. Create a ZMQ context and socket
    b. Connect to the server
    c. Send the request to the server
    d. Wait for the response using zmq_poll()
    e. Once the response is available, close the socket and destroy the context
    f. Send the response to the caller

The issue occurs after several hours of continuous stress on the client.

Thanks,
Santosh


bluca commented Aug 15, 2017

You can share the context between threads - so you can create one at the start of your application, and destroy it when the application exits.

Also I meant some code to reproduce it - I've never been able to reproduce this issue, and nobody has managed to provide a small, self-contained test case that can reproduce it either.
Without a way to reproduce, it's extremely hard to find out what's going on and to make sure it's fixed I'm afraid.

santupari commented:

Thanks @bluca. I am with you on the reproducing part, as this issue has occurred only once in the last year of our code development.
We have a complex codebase, so I might have to write a sample program and try to reproduce it with that.
If I am able to reproduce it with the sample code, I will definitely share it on this thread.

And regarding the context, I will try modifying our code and give it a try.

Thanks,
Santosh


bluca commented Aug 15, 2017

Great, thanks!


wbb1975 commented Dec 20, 2017

I also ran into a similar issue.

Callstack:
(1): /lib64/libc.so.6(gsignal)
(2): /lib64/libc.so.6(abort)
(3): ./lib/libetsmisc.so()
(4): ./lib/libetsmisc.so(zmq::epoll_t::add_fd(int, zmq::i_poll_events*)+0xe8) [0x2aef9c99b728]
(5): ./lib/libetsmisc.so(zmq::tcp_connecter_t::start_connecting()+0x5f) [0x2aef9c9c044f]
(6): ./lib/libetsmisc.so(zmq::poller_base_t::execute_timers()+0x69) [0x2aef9c9a2849]
(7): ./lib/libetsmisc.so(zmq::epoll_t::loop()+0x3c) [0x2aef9c99b92c]
(8): ./lib/libetsmisc.so()
(9): /lib64/libpthread.so.0()
(10): /lib64/libc.so.6(clone)

Our case is very simple:
One client creates one REQ and one SUB socket, but neither peer exists. Next, the SUB socket is polled, and the program quickly crashes.
So here are a few questions:

  1. If an application opens REQ and SUB connections and there is no active peer at that time, what happens to the REQ and SUB sockets? Can polling them lead to a crash?
  2. I have found similar issues reported by many users over the years; has there been any progress on this issue?
  3. Is there any workaround for it?

Best Regards!
Bingbing


bluca commented Dec 20, 2017

Can you share a minimal, self contained code snippet that reproduces the issue?


wbb1975 commented Feb 22, 2018

Hi Bluca,

Sorry to reply so late; I have been focused on solving libzmq-related issues to ensure the stability of our application. Currently we have commented out zmq::zmq_abort (errstr) in err.hpp line #133 (the errno_assert macro). Although we still see the exception message sometimes, it never crashes again and the application runs normally.

Best Regards!
Bingbing


stale bot commented Feb 22, 2019

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.

@stale stale bot added the stale label Feb 22, 2019
@stale stale bot closed this as completed Apr 19, 2019

No branches or pull requests

6 participants