Surplus of errno_assert() leading to daemon crash #2334

Closed
@lytboris

A daemon is a program designed to run forever, so every non-fatal error should be handled and the show must go on. ZMQ currently has 404 errno_assert calls, which means 404 ways to make a daemon crash with SIGABRT. Please consider this function from tcp.cpp:

void zmq::tune_tcp_socket (fd_t s_)
{
    //  Disable Nagle's algorithm. We are doing data batching on 0MQ level,
    //  so using Nagle wouldn't improve throughput in anyway, but it would
    //  hurt latency.
    int nodelay = 1;
    int rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELAY, (char*) &nodelay,
        sizeof (int));
#ifdef ZMQ_HAVE_WINDOWS
    wsa_assert (rc != SOCKET_ERROR);
#else
    errno_assert (rc == 0);
#endif

#ifdef ZMQ_HAVE_OPENVMS
    //  Disable delayed acknowledgements as they hurt latency significantly.
    int nodelack = 1;
    rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELACK, (char*) &nodelack,
        sizeof (int));
    errno_assert (rc != SOCKET_ERROR);
#endif
}

When setsockopt() returns an error, your daemon crashes. And this can happen without any bug in the application: the remote side can send a TCP reset (RST) packet that immediately invalidates the socket, and instead of reconnecting, ZMQ crashes the whole app.
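To make the alternative concrete, here is a minimal sketch of a variant that reports the failure to its caller instead of aborting. The name `tune_tcp_socket_checked` is hypothetical and not part of ZMQ. The sketch covers only the POSIX TCP_NODELAY path and uses a plain `int` descriptor in place of `fd_t`.

```cpp
#include <cerrno>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

//  Hypothetical variant of tune_tcp_socket (not part of ZMQ).
//  Returns 0 on success, or -1 on failure with errno left as set
//  by setsockopt(), so the caller can decide how to react.
int tune_tcp_socket_checked (int s_)
{
    //  Disable Nagle's algorithm, as the original function does.
    int nodelay = 1;
    const int rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELAY, &nodelay,
        sizeof (int));
    if (rc != 0)
        return -1;
    return 0;
}
```

A caller such as the connecter could then treat a -1 result like any other failed connection attempt (close the descriptor and schedule a reconnect), assuming it has such a path available.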

I was debugging my app that coredumped at this particular function:

Thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver)):
#0 0x0000000801896dcc in thr_kill () from /lib/libc.so.7
#1 0x000000080193d72b in abort () from /lib/libc.so.7
#2 0x0000000000415ac1 in zmq::zmq_abort (errmsg_=Could not find the frame base for "zmq::zmq_abort(char const*)".
) at src/err.cpp:84
#3 0x0000000000453a6e in zmq::tune_tcp_socket (s_=17) at src/tcp.cpp:60
#4 0x0000000000454524 in zmq::tcp_connecter_t::out_event (this=0x80285a600) at src/tcp_connecter.cpp:134
#5 0x0000000000416be6 in zmq::kqueue_t::loop (this=0x802051300) at src/kqueue.cpp:205
#6 0x0000000000416ce5 in zmq::kqueue_t::worker_routine (arg_=0x802051300) at src/kqueue.cpp:222
#7 0x0000000000434bd8 in thread_routine (arg_=0x802051380) at src/thread.cpp:96
#8 0x0000000801618e14 in pthread_getprio () from /lib/libthr.so.3
#9 0x0000000000000000 in ?? ()
(gdb) thread 1
[Switching to thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver))]#3 0x0000000000453a6e in zmq::tune_tcp_socket (s_=17)
at src/tcp.cpp:60
60 errno_assert (rc == 0);
(gdb) p errstr
$4 = 0x801b7b240 "Connection reset by peer"
(gdb)

Sure, I can rewrite this function to ignore a failure to disable Nagle's algorithm and delayed ACKs, but 402 errno_assert() calls will remain in the code. Am I missing something?
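For the remaining call sites, one possible direction is to separate errno values caused by the peer or the network from those that indicate a programming error, and abort only on the latter. The helper below is a sketch of that idea. Its name and the list of errno values are assumptions for illustration, not anything ZMQ provides, and the list is not exhaustive.

```cpp
#include <cerrno>

//  Hypothetical helper (not part of ZMQ): returns true for errno values
//  that usually reflect peer or network state rather than a bug in the
//  calling code. The list is illustrative, not exhaustive.
bool is_peer_or_network_error (int err_)
{
    switch (err_) {
        case ECONNRESET:
        case ECONNREFUSED:
        case ECONNABORTED:
        case ETIMEDOUT:
        case EPIPE:
        case ENETDOWN:
        case ENETUNREACH:
        case EHOSTUNREACH:
            return true;
        default:
            return false;
    }
}
```

With the backtrace above, errno was ECONNRESET, so a check like this would classify the failure as recoverable, while something like EBADF or EFAULT would still be treated as fatal.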
