zmq::encoder_base_t segmentation fault #2674

Closed
jnadelman opened this issue Aug 11, 2017 · 20 comments

Comments

@jnadelman

Got a segmentation fault in zmq::encoder_base_t at encoder.hpp:127 after 25.9 hours of operation and 11,393,496 messages. I probably will not be able to provide a minimal reproducible example. Is there a debug or logging mode I can run zmq in to provide more information? Has anything been fixed in the code base between libzmq 4.2.1 and 4.2.2 that might address this? I didn't see anything that looked related in the release notes.

Environment

  • libzmq 4.2.1
  • OS: CentOS Linux 7 (Core)
  • Kernel: Linux 3.18.4
  • Architecture: x86-64

Coredump backtrace:

118		movdqu	-16(%rsi, %rdx), %xmm0
#0  __memcpy_ssse3 () at ../sysdeps/x86_64/multiarch/memcpy-ssse3.S:118
#1  0x00007f35c64f2c90 in zmq::encoder_base_t<zmq::v2_encoder_t>::encode (this=0x7f35b008dd90, data_=0x7f35be7fb128, 
    size_=<optimized out>) at src/encoder.hpp:127
#2  0x00007f35c64e404f in zmq::stream_engine_t::out_event (this=0x7f35b0085250) at src/stream_engine.cpp:395
#3  0x00007f35c64d7d66 in zmq::session_base_t::read_activated (this=0x7f35b00856e0, pipe_=0x7f35b008dbb0)
    at src/session_base.cpp:286
#4  0x00007f35c64bd704 in zmq::io_thread_t::in_event (this=0x1344aa0) at src/io_thread.cpp:85
#5  0x00007f35c64bc18e in zmq::epoll_t::loop (this=0x13454e0) at src/epoll.cpp:188
#6  0x00007f35c64ed945 in thread_routine (arg_=0x1345560) at src/thread.cpp:100
#7  0x00007f35c6283dc5 in start_thread (arg=0x7f35be7fc700) at pthread_create.c:308
#8  0x00007f35c527873d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

@bluca
Member

bluca commented Aug 11, 2017

I don't think there are relevant changes since 4.2.1.

Any chance you could build with debug symbols, run in gdb and try to print the state? At least which reads/writes are causing the segmentation violation?

@jnadelman
Author

Yes. I'll do that.

@bluca
Member

bluca commented Aug 11, 2017

And there are no "debug" modes I'm afraid.

Also the usual:

  • make sure you are not using/creating/deleting a socket from different threads
  • make sure, if you supply buffers (e.g. zmq_msg_init_data), that they are not freed while libzmq is still using them

Going by the line number, the bad pointer is either the buffer being written into (unlikely, since it is allocated in the same class) or the one being read from.
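
The buffer-lifetime point can be sketched as follows. This is only an illustration under assumptions: release_buffer and make_owned_copy are hypothetical names, not part of libzmq, and the only libzmq facts relied on are that zmq_msg_init_data takes a free callback of the form void (void *data, void *hint) and calls it once the message no longer needs the buffer. The idea is to hand libzmq a heap copy that it releases itself, rather than a pointer into memory the application may free or reuse first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Illustration only: counts how often the callback ran.
static int g_released = 0;

// Same shape as libzmq's zmq_free_fn: void (void *data, void *hint).
void release_buffer(void *data, void * /*hint*/) {
    std::free(data);
    ++g_released;
}

// Hypothetical helper: copy the payload to the heap so its lifetime no
// longer depends on the caller's stack frame or container.
void *make_owned_copy(const std::string &payload) {
    const std::size_t size = payload.empty() ? 1 : payload.size();
    void *buf = std::malloc(size);
    if (buf != nullptr && !payload.empty())
        std::memcpy(buf, payload.data(), payload.size());
    return buf;
}

// With libzmq available, the copy would be handed over roughly like this:
//   zmq_msg_t msg;
//   void *buf = make_owned_copy(payload);
//   zmq_msg_init_data(&msg, buf, payload.size(), release_buffer, nullptr);
// After that call the application must not free or modify buf itself.
```

The callback may run on an I/O thread at some later time, so it should not touch state that the sending thread is also modifying without synchronisation.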

@jnadelman
Author

This time it crashed with SIGABRT at zmq::tcp_write after 6.9h and 3,038,554 messages. This is some of the state information from gdb:

Bad address (src/tcp.cpp:236)
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffeffff700 (LWP 4416)]
0x00007ffff68791d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007ffff68791d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff687a8c8 in __GI_abort () at abort.c:90
#2  0x00007ffff7b7f619 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7ffff69c072f "Bad address") at src/err.cpp:87
#3  0x00007ffff7bac6e0 in zmq::tcp_write (s_=<optimized out>, data_=<optimized out>, size_=<optimized out>) at src/tcp.cpp:228
#4  0x00007ffff7ba708f in zmq::stream_engine_t::out_event (this=0x7fffe806e970) at src/stream_engine.cpp:415
#5  0x00007ffff7b7f16a in zmq::epoll_t::loop (this=0x6f94e0) at src/epoll.cpp:184
#6  0x00007ffff7bb0945 in thread_routine (arg_=0x6f9560) at src/thread.cpp:100
#7  0x00007ffff7946dc5 in start_thread (arg=0x7fffeffff700) at pthread_create.c:308
#8  0x00007ffff693b73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) i r
rax            0x0	0
rbx            0x6f2bf0	7285744
rcx            0xffffffffffffffff	-1
rdx            0x6	6
rsi            0x1140	4416
rdi            0x1133	4403
rbp            0x7ffff69c072f	0x7ffff69c072f
rsp            0x7fffefffe078	0x7fffefffe078
r8             0x7fffeffff700	140737219917568
r9             0x1c	28
r10            0x8	8
r11            0x206	518
r12            0x7fffefffe230	140737219912240
r13            0x6f9578	7312760
r14            0x7fffe8073360	140737086174048
r15            0x7fffefffe234	140737219912244
rip            0x7ffff68791d7	0x7ffff68791d7 <__GI_raise+55>
eflags         0x206	[ PF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0

What else from gdb might be helpful? I'll leave it open.

@jnadelman
Author

Here is (gdb) i all

rax            0x0	0
rbx            0x6f2bf0	7285744
rcx            0xffffffffffffffff	-1
rdx            0x6	6
rsi            0x1140	4416
rdi            0x1133	4403
rbp            0x7ffff69c072f	0x7ffff69c072f
rsp            0x7fffefffe078	0x7fffefffe078
r8             0x7fffeffff700	140737219917568
r9             0x1c	28
r10            0x8	8
r11            0x206	518
r12            0x7fffefffe230	140737219912240
r13            0x6f9578	7312760
r14            0x7fffe8073360	140737086174048
r15            0x7fffefffe234	140737219912244
rip            0x7ffff68791d7	0x7ffff68791d7 <__GI_raise+55>
eflags         0x206	[ PF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
st0            0	(raw 0x00000000000000000000)
st1            0	(raw 0x00000000000000000000)
st2            0	(raw 0x00000000000000000000)
st3            0	(raw 0x00000000000000000000)
st4            0	(raw 0x00000000000000000000)
st5            0	(raw 0x00000000000000000000)
st6            0	(raw 0x00000000000000000000)
st7            0	(raw 0x00000000000000000000)
fctrl          0x37f	895
fstat          0x0	0
ftag           0xffff	65535
fiseg          0x0	0
fioff          0x0	0
foseg          0x0	0
fooff          0x0	0
fop            0x0	0
xmm0           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0, 0xff, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0xff00, 0x0, 0xff00, 0x0, 0xff, 0x0, 0x0, 0x0}, 
  v4_int32 = {0xff00, 0xff00, 0xff, 0x0}, v2_int64 = {0xff000000ff00, 0xff}, uint128 = 0x00000000000000ff0000ff000000ff00}
xmm1           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x25 <repeats 16 times>}, v8_int16 = {0x2525, 0x2525, 0x2525, 0x2525, 0x2525, 0x2525, 0x2525, 0x2525}, v4_int32 = {0x25252525, 0x25252525, 0x25252525, 
    0x25252525}, v2_int64 = {0x2525252525252525, 0x2525252525252525}, uint128 = 0x25252525252525252525252525252525}
xmm2           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm3           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x8000000000000000}, v16_int8 = {0x0 <repeats 12 times>, 0xff, 0xff, 0xff, 0xff}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xffff, 0xffff}, v4_int32 = {0x0, 0x0, 0x0, 
    0xffffffff}, v2_int64 = {0x0, 0xffffffff00000000}, uint128 = 0xffffffff000000000000000000000000}
xmm4           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0, 0x0, 0xfe, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0xa, 0x4, 0x8, 0x8, 0x10, 0x2}, v8_int16 = {0x0, 0xfffe, 0xffff, 0x0, 0x0, 0x40a, 0x808, 0x210}, 
  v4_int32 = {0xfffe0000, 0xffff, 0x40a0000, 0x2100808}, v2_int64 = {0xfffffffe0000, 0x2100808040a0000}, uint128 = 0x02100808040a00000000fffffffe0000}
xmm5           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x78, 0x0, 0x0, 0xf0, 0xff, 0x7f, 0x0, 0x0, 0xe0, 0x50, 0x1, 0xf0, 0xff, 0x7f, 0x0, 0x0}, v8_int16 = {0x78, 0xf000, 0x7fff, 0x0, 0x50e0, 0xf001, 
    0x7fff, 0x0}, v4_int32 = {0xf0000078, 0x7fff, 0xf00150e0, 0x7fff}, v2_int64 = {0x7ffff0000078, 0x7ffff00150e0}, uint128 = 0x00007ffff00150e000007ffff0000078}
xmm6           {v4_float = {0x552d, 0x0, 0x0, 0x0}, v2_double = {0x8000000000000000, 0x0}, v16_int8 = {0x93, 0x5b, 0xaa, 0x46, 0x98, 0x45, 0x45, 0x62, 0x35, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x5b93, 0x46aa, 0x4598, 0x6245, 
    0x35, 0x0, 0x0, 0x0}, v4_int32 = {0x46aa5b93, 0x62454598, 0x35, 0x0}, v2_int64 = {0x6245459846aa5b93, 0x35}, uint128 = 0x00000000000000356245459846aa5b93}
xmm7           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm8           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm9           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm10          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm11          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm12          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm13          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm14          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
xmm15          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, 
  uint128 = 0x00000000000000000000000000000000}
mxcsr          0x1fa0	[ PE IM DM ZM OM UM PM ]

@jnadelman
Author

@bluca
How do I print the reads/writes and other state variables in gdb that may be causing this issue as you requested?

@jnadelman
Author

Here is a backtrace full

(gdb) backtrace full
#0  0x00007ffff68791d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 4403
        selftid = 4416
#1  0x00007ffff687a8c8 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x6f2bf0 <stderr@@GLIBC_2.2.5>, sa_sigaction = 0x6f2bf0 <stderr@@GLIBC_2.2.5>}, sa_mask = {__val = {140737330808623, 140737219912240, 140737351983296, 140737219912240, 140737351983296, 
              0, 140737329801694, 7310120, 140737349432941, 140737219917408, 0, 0, 140737330203773, 0, 140737333168624, 140737330808623}}, sa_flags = -268437760, sa_restorer = 0x1c}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ffff7b7f619 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7ffff69c072f "Bad address") at src/err.cpp:87
No locals.
#3  0x00007ffff7bac6e0 in zmq::tcp_write (s_=<optimized out>, data_=<optimized out>, size_=<optimized out>) at src/tcp.cpp:228
        errstr = 0x7ffff69c072f "Bad address"
        nbytes = <optimized out>
#4  0x00007ffff7ba708f in zmq::stream_engine_t::out_event (this=0x7fffe806e970) at src/stream_engine.cpp:415
        nbytes = <optimized out>
#5  0x00007ffff7b7f16a in zmq::epoll_t::loop (this=0x6f94e0) at src/epoll.cpp:184
        pe = 0x7fffe8073360
        i = <optimized out>
        timeout = <optimized out>
        n = <optimized out>
        ev_buf = {{events = 4, data = {ptr = 0x7fffe8073360, fd = -402181280, u32 = 3892786016, u64 = 140737086174048}}, {events = 4, data = {ptr = 0x7fffe80180e0, fd = -402554656, u32 = 3892412640, u64 = 140737085800672}}, {events = 4, 
            data = {ptr = 0x7fffe8027640, fd = -402491840, u32 = 3892475456, u64 = 140737085863488}}, {events = 4, data = {ptr = 0x7fffe8073360, fd = -402181280, u32 = 3892786016, u64 = 140737086174048}}, {events = 4, data = {
              ptr = 0x7fffe80180e0, fd = -402554656, u32 = 3892412640, u64 = 140737085800672}}, {events = 4, data = {ptr = 0x7fffe8027640, fd = -402491840, u32 = 3892475456, u64 = 140737085863488}}, {events = 4, data = {
              ptr = 0x7fffe8073360, fd = -402181280, u32 = 3892786016, u64 = 140737086174048}}, {events = 1, data = {ptr = 0x7fffe8000c20, fd = -402650080, u32 = 3892317216, u64 = 140737085705248}}, {events = 1, data = {
              ptr = 0x7fffe8000dd0, fd = -402649648, u32 = 3892317648, u64 = 140737085705680}}, {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}} <repeats 247 times>}
#6  0x00007ffff7bb0945 in thread_routine (arg_=0x6f9560) at src/thread.cpp:100
        signal_set = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}}
        rc = <optimized out>
        self = 0x6f9560
#7  0x00007ffff7946dc5 in start_thread (arg=0x7fffeffff700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7fffeffff700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737219917568, -386163440349031993, 1, 140737219918272, 140737219917568, 2, 386128256501338567, 386180702302556615}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#8  0x00007ffff693b73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

@bluca
Member

bluca commented Aug 15, 2017

This too points to memory corruption. From the man page of send:

EFAULT An invalid user space address was specified for an argument.

It would be useful to print the value of size_ and data_ from the zmq::tcp_write frame.
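
For reference, a minimal sketch of how send can end up reporting EFAULT ("Bad address"). It does not use libzmq at all: it assumes a Linux/POSIX system, and it simulates a buffer that was released before the write by unmapping a page and then passing the stale pointer to send on a local socket pair. The function name is made up for this illustration.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Returns the errno reported by send() when its buffer points at unmapped
// memory, 0 if the send unexpectedly succeeded, or -1 if setup failed.
int errno_for_send_from_unmapped_buffer() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    const std::size_t len = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void *buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    // Simulate the application releasing the buffer too early.
    munmap(buf, len);

    errno = 0;
    const ssize_t rc = send(fds[0], buf, 64, 0);
    const int err = (rc == -1) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On Linux the kernel is expected to reject the stale pointer with EFAULT, the same error text seen in the abort above. A freed heap block often stays mapped, so in a real program the same bug can just as easily send garbage silently instead of failing.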

@jnadelman
Author

@bluca - i frame zmq::tcp_write fails. So I also tried i frame 0x00007ffff7bac6e0 but that fails the same way. Should I have used a different command to print the frame? I didn't exit GDB just yet since it may still be possible to pull something useful out of it. I did let it create the core dump but won't be able to GDB the core until I exit this session. Is there any kind of debug I should add to my code that would be helpful? Perhaps copy size and data to static var?

(gdb) i frame zmq::tcp_write
Stack frame at 0x7ffff7bac640:
 rip = 0x0; saved rip 0x7ffff687a8c8
 Outermost frame: previous frame identical to this frame (corrupt stack?)
 Arglist at 0x7fffefffe070, args: 
 Locals at 0x7fffefffe070, Previous frame's sp is 0x7fffefffe080
../../gdb/valops.c:1101: internal-error: value_fetch_lazy: Assertion `frame != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
../../gdb/valops.c:1101: internal-error: value_fetch_lazy: Assertion `frame != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) y

@bluca
Member

bluca commented Aug 15, 2017

Could you try to just go up in the backtrace? If you are stopped at the abort, up 3 should move the context to the zmq::tcp_write frame.

@jnadelman
Author

I did that (up) but the variables were optimized out. When I attempted to look at the assembler (-) to get the values the stack frame was lost. So I started to look for a way to add some debug and may have found and fixed the issue in my code where it calls send. I will update as soon as I know (this crash takes time).

@bluca
Member

bluca commented Aug 15, 2017

You can also try to build without optimisations and with extra debugs (./configure CFLAGS="-ggdb3 -O0" CXXFLAGS="-ggdb3 -O0") to avoid the optimizing-out issue.

@jnadelman
Author

Thanks a million! I'll do that next if the issue is not resolved.

@jnadelman
Author

Still crashing. Built with no optimizations and extra debug info but I still see size = <optimized out> in #4 zmq::v2_encoder_t::message_ready. For the big picture, the system is XPUB/XSUB with one (1) client and four (4) servers. One thread is exchanging about 124 small (< 1k) messages per second between the client and one of the servers. Another thread periodically exchanges a heartbeat with all four (4) servers. I was able to decrease the time to failure by decreasing the heartbeat period from 10s to 100ms.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffeffff700 (LWP 22526)]
0x00007ffff68791d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt full
#0  0x00007ffff68791d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 22513
        selftid = 22526
#1  0x00007ffff687a8c8 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7fffe80276e0, sa_sigaction = 0x7fffe80276e0}, sa_mask = {__val = {7285744, 140737219911976, 140737351983296, 140737219911976, 140737351983296, 511101108348, 140737085702176, 
              140736884388224, 140737349486946, 140737086310824, 140737086191264, 0, 140737330203773, 0, 140737333168624, 140737349679919}}, sa_flags = -268437760, sa_restorer = 0x2b}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ffff7b7f619 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7ffff7bbfb2f "check ()") at src/err.cpp:87
No locals.
#3  0x00007ffff7b878db in zmq::msg_t::size (this=0x7fffe80276e0) at src/msg.cpp:361
No locals.
#4  0x00007ffff7bb5952 in zmq::v2_encoder_t::message_ready (this=0x7fffe80275e0) at src/v2_encoder.cpp:54
        protocol_flags = @0x7fffe8027628: 0 '\000'
        size = <optimized out>
#5  0x00007ffff7ba7021 in zmq::stream_engine_t::out_event (this=0x7fffe80276c0) at src/stream_engine.cpp:393
        bufptr = 0x7fffe8030200 ""
        n = <optimized out>
        nbytes = <optimized out>
#6  0x00007ffff7b9ad66 in zmq::session_base_t::read_activated (this=0x7fffe8027b50, pipe_=0x7fffe8030060) at src/session_base.cpp:286
No locals.
#7  0x00007ffff7b80704 in zmq::io_thread_t::in_event (this=0x6f8aa0) at src/io_thread.cpp:85
        cmd = {destination = 0x7fffe8030060, type = zmq::command_t::activate_read, args = {stop = {<No data fields>}, plug = {<No data fields>}, own = {object = 0x7ffff00048d0}, attach = {engine = 0x7ffff00048d0}, bind = {
              pipe = 0x7ffff00048d0}, activate_read = {<No data fields>}, activate_write = {msgs_read = 140737219938512}, hiccup = {pipe = 0x7ffff00048d0}, pipe_term = {<No data fields>}, pipe_term_ack = {<No data fields>}, term_req = {
              object = 0x7ffff00048d0}, term = {linger = -268416816}, term_ack = {<No data fields>}, reap = {socket = 0x7ffff00048d0}, reaped = {<No data fields>}, done = {<No data fields>}}}
        rc = 0
#8  0x00007ffff7b7f18e in zmq::epoll_t::loop (this=0x6f94e0) at src/epoll.cpp:188
        pe = 0x6f7e00
        i = <optimized out>
        timeout = <optimized out>
        n = <optimized out>
        ev_buf = {{events = 1, data = {ptr = 0x6f7e00, fd = 7306752, u32 = 7306752, u64 = 7306752}}, {events = 1, data = {ptr = 0x7fffe8000dd0, fd = -402649648, u32 = 3892317648, u64 = 140737085705680}}, {events = 4, data = {
              ptr = 0x7fffe8000c20, fd = -402650080, u32 = 3892317216, u64 = 140737085705248}}, {events = 4, data = {ptr = 0x7fffe8027450, fd = -402492336, u32 = 3892474960, u64 = 140737085862992}}, {events = 4, data = {
              ptr = 0x7fffe8005ca0, fd = -402629472, u32 = 3892337824, u64 = 140737085725856}}, {events = 1, data = {ptr = 0x7fffe8000dd0, fd = -402649648, u32 = 3892317648, u64 = 140737085705680}}, {events = 0, data = {ptr = 0x0, 
              fd = 0, u32 = 0, u64 = 0}} <repeats 250 times>}
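
The thread does not say whether the data thread and the heartbeat thread ever touch the same socket. If they do, one common way to follow the "one socket, one thread" advice is to let a single owner thread do all sends and have the other threads hand it messages through a queue. The sketch below shows only that hand-off with the standard library; Outbox and run_single_owner_demo are hypothetical names, and the place where a real program would call into libzmq is marked with a comment.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical outbox: any thread may push, exactly one thread pops.
class Outbox {
public:
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        ready_.notify_one();
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_one();
    }

    // Blocks until a message is available; returns false once the outbox
    // has been closed and fully drained.
    bool pop(std::string &out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty())
            return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    bool closed_ = false;
};

// Two producers (data and heartbeat) feed one owner thread. Returns what
// the owner "sent", in the order it sent it.
std::vector<std::string> run_single_owner_demo(int data_count,
                                               int heartbeat_count) {
    Outbox outbox;
    std::vector<std::string> sent;

    std::thread owner([&] {
        std::string msg;
        while (outbox.pop(msg))
            sent.push_back(msg);  // a real program would send on the socket here
    });
    std::thread data([&] {
        for (int i = 0; i < data_count; ++i)
            outbox.push("data");
    });
    std::thread heartbeat([&] {
        for (int i = 0; i < heartbeat_count; ++i)
            outbox.push("heartbeat");
    });

    data.join();
    heartbeat.join();
    outbox.close();
    owner.join();
    return sent;
}
```

The queue copies each message into a std::string, so the producers' own buffers can be reused or freed as soon as push returns.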

@bluca
Member

bluca commented Aug 15, 2017

That's a basic sanity check on the message object, which is failing:

https://github.com/zeromq/libzmq/blob/v4.2.1/src/msg.cpp#L51

All of this still points to memory corruption. I would suggest running the application under valgrind, or compiling with GCC's address sanitizer, to check for buffer overflows and similar errors.
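
A rough sketch of why a sanity check like this tends to trip under memory corruption, assuming it compares a type tag stored in the message against a known range. Everything here is hypothetical (the struct, the tag values, the function names); it is not libzmq's code, whose real definitions are in src/msg.hpp and src/msg.cpp.

```cpp
#include <cstring>

// Hypothetical tag values chosen for the illustration.
enum : unsigned char { kTagMin = 101, kTagMax = 104 };

// Hypothetical stand-in for a message object.
struct Message {
    unsigned char payload[32];
    unsigned char type;  // must stay within [kTagMin, kTagMax]
};

// True when the tag is one this code could have written itself.
bool looks_valid(const Message &m) {
    return m.type >= kTagMin && m.type <= kTagMax;
}

// Simulates a stray write (overflow, use-after-free) landing on the message.
void scribble_over(Message &m) {
    std::memset(&m, 0, sizeof m);
}
```

A check of this kind only reports that the object was damaged by the time it was inspected. It says nothing about which code did the damage, which is why tools such as valgrind or the address sanitizer are the usual next step.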

@jimklimov
Member

Just in case, did you rule out physical memory corruption, e.g. old DRAM chips, power line (or PSU) noise, bad contacts etc.? Is the RAM ECC? Do any other processes and subsystems like the FS cache behave funny?


@jnadelman
Author

@bluca I'll see what GCC's address sanitizer and valgrind report. I've run valgrind on this code before but not recently, and there have been many changes since.

@jimklimov - this is a fairly new PC but I will try it on another.

@jnadelman
Author

@bluca

Made some changes based on valgrind that may have resolved this issue. I'll call it fixed if it survives the weekend. There was a smart pointer to a struct containing a smart pointer to a buffer in a map, and that seemed to be the issue when erasing a range of map entries.

Thanks a million for the assistance!
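
The actual code is not shown in this thread, so the following is only a guess at the general shape: a map whose values are shared pointers to a struct that in turn holds a shared pointer to a buffer. Entry, Table and erase_key_range are hypothetical names. The point of the sketch is that erasing a range of entries only destroys a buffer if nothing else holds a shared_ptr to it; code that kept a raw pointer into the buffer (for example, one handed to a send call without a copy) would be left dangling.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <memory>
#include <vector>

using Buffer = std::vector<char>;

// Hypothetical record stored in the map.
struct Entry {
    std::shared_ptr<Buffer> buffer;
};

using Table = std::map<int, std::shared_ptr<Entry>>;

// Erases all keys in [first_key, last_key) and returns how many entries
// were removed. An empty or inverted range removes nothing.
std::size_t erase_key_range(Table &table, int first_key, int last_key) {
    if (first_key >= last_key)
        return 0;
    auto first = table.lower_bound(first_key);
    auto last = table.lower_bound(last_key);
    const auto removed =
        static_cast<std::size_t>(std::distance(first, last));
    table.erase(first, last);
    return removed;
}
```

A caller that still needs a buffer after the erase can copy the shared_ptr first (for example auto in_flight = table[2]->buffer;). The buffer then stays alive until that copy goes out of scope, regardless of what happens to the map entry.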

@bluca
Member

bluca commented Aug 18, 2017

Great, fingers crossed!

@jnadelman
Author

Segmentation fault at 41.7h. Made additional changes based on valgrind and running it again. I'm closing this issue since the trouble is most likely in my code.
