zmq core dumps in epoll fd set manipulation on linux #2103
Comments
a) No, we use asserts intentionally to catch illegal use of the APIs and bugs. This sounds like one of those and we should solve it instead. |
I'd love to know if my use of the API is incorrect. Receiver side code:
Sender side:
As I noted earlier, this code has been running just fine for over a year. Only recently did the race start occurring. But I would love to be told that my use of the API is incorrect and that this isn't an actual bug. |
This looks fine to me, it's very possible that it's a problem in how we use epoll. |
There are other polling methods supported (select, poll, kqueue); you could try to build with a different one to try and bisect it a bit. |
I don't believe the problem is with how epoll is being used -- I'm very familiar with how to use it and zmq is using it exactly as intended. However, it is perfectly legitimate for epoll fd set manipulation syscalls to fail and set an errno. The zmq code aborts processing when it encounters a failure with any of these syscalls -- for example, when an fd to (e)poll on gets closed from underneath zmq, of course the epoll*() call will return EBADF. It should be just fine to ignore such an fd instead of asserting, but I don't know how much of the zmq code needs to be unwound in order to handle it correctly. This is where I would like to get some feedback from the zmq authors. |
Yes it's the implications I'm worried about. And most importantly, what would cause an FD to be invalid before it's even added to the poll. We do check when a socket is created. |
Would it be possible for you to provide a test case (like those in libzmq/tests/ ) that reproduces the issue? I find that with issues like this, unless a way to reproduce them is found, they are extremely hard for us to fix. |
Apologies for the delayed response (vacation got in the way). I'll see if I can come up with a test case that repros the problem. |
Hi @bluca and @aggarwaa, #0 0x73b52fb8 in _sigprocmask (how=2, set=, oset=0x72a0434c) at ../sysdeps/unix/sysv/linux/sigprocmask.c:57 |
I've never managed to reproduce it. Do you have a minimal test case that can reproduce the problem? |
I was never able to create a tight test case that reproduces the problem. We ended up working around it by changing the code that necessitated frequent disconnect/reconnect attempts. |
Hi Aggarwaa, can you share more details of your workaround? |
Our crash scenario was this: We had a serviceA on a host A that would continually try to connect to another serviceB on a different host B. When serviceB/hostB was down, the reconnects would be fairly frequent and the zmq crash would ensue. We worked around this by not having hostA reconnect to hostB when hostB was administratively down for maintenance, etc. |
Hi All, We are facing the same exact problem, has there been any update on this issue? Any workaround is much appreciated. Thanks, |
Do you have a test case to reproduce the issue? |
Thanks for the response @bluca.
The issue occurs after several hours of continuous stress on the client. Thanks, |
You can share the context between threads - so you can create one at the start of your application, and destroy it when the application exits. Also I meant some code to reproduce it - I've never been able to reproduce this issue, and nobody has managed to provide a small, self-contained test case that can reproduce it either. |
Thanks @bluca. I am with you on the reproducing part, as this issue occurred only once in the last year of our code development. And regarding the context, I will try modifying our code and give it a try. Thanks, |
Great, thanks! |
I also ran into a similar issue. Callstack: Our case is very simple:
Best Regards! |
Can you share a minimal, self contained code snippet that reproduces the issue? |
Hi Bluca,
Best Regards! |
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions. |
Greetings,
I have been using zmq PUSH/PULL sockets for communication between multiple nodes in a distributed system for over a year now without any problems. Typically, the PULL socket gets connected to by 40 servers that PUSH to it -- note that this is done over the Internet, with some of these socket connections being across continents.
Lately, we have started running into crashes which seem to be the result of frequent reconnects. The stack traces are of the form:
zmq epoll.cpp code snippet in question:
What I think is happening is this: zmq was adding an fd (a sockfd specifically, since those are the only ones relevant in our case) to its epoll fd set being watched by one of the I/O threads, but because the fd was bad or had already been closed, epoll_ctl failed with EBADF, which causes zmq to assert. In other cores, it can be seen adding/removing/modifying the set of epoll fds, and some of those calls fail due to the same race. The asserts hit are usually in zmq::epoll_t::add_fd()/rm_fd()/set_pollin()/reset_pollin()/set_pollout()/reset_pollout().
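For anyone who wants to see the failure mode in isolation, the following is a small sketch that does not use zmq at all: it closes a socket fd and then tries to add it to an epoll set. It assumes Linux, the function name `add_closed_fd_errno` is made up for this illustration, and it only shows the syscall behaviour, not the race that closes the fd inside zmq.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>

// Try to register a descriptor that has already been closed.
// Returns the errno reported by the failing call, or 0 if every call
// (including epoll_ctl) succeeded.
int add_closed_fd_errno()
{
    int epfd = epoll_create1(0);
    if (epfd == -1)
        return errno;

    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        int e = errno;
        close(epfd);
        return e;
    }
    close(sv[0]);  // simulate the fd being closed before it is registered

    epoll_event ev;
    std::memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = sv[0];

    // sv[0] no longer refers to an open file, so this call fails.
    int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);
    int err = (rc == -1) ? errno : 0;

    close(sv[1]);
    close(epfd);
    return err;
}
```

Calling `add_closed_fd_errno()` from a single-threaded program returns `EBADF`, which is the errno the asserts listed above would trip on.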
What is type '7'?
It's a ZMQ_PULL socket. Also, as can be seen from the options, the reconnect_ivl is set to the default of 100 milliseconds.
Questions: a) Is it safe to remove the asserts in the said routines (above)? If so, I can provide a patchset against master.
b) I'm going to attempt to "work around" this issue by increasing the ZMQ_RECONNECT_IVL to 1 second on the PUSH side (sender side). Do you agree with that?
c) Is there any other way I can work around this issue?
FWIW, if my analysis is correct, even the current zmq version 4.1.5 is subject to the same race, so upgrading libzmq isn't going to help me here.
Version of zmq being used: 4.0.4
OS Version: Ubuntu Linux 14.04.5 LTS (kernel version 4.4.0-34-generic)
My applications use the C++ zmq interfaces.
Thanks,
Alok