Resque workers deadlocked from ddtrace after processing jobs #3015

Closed

Description

A few weeks ago we updated ddtrace in our application from version 0.53 to 1.11.1. Shortly after deploying this to production, we noticed that some of our Resque workers would hang after processing a job, leaving the worker unable to process further jobs and causing backups in the Resque queues. We identified the ddtrace upgrade as the source of this problem and reverted to version 0.53 to resolve it for the time being.

Since then I've been working to identify the root cause. Here's what I know so far:

  • On ddtrace versions 1.3.0 and later, Resque workers become stuck, apparently deadlocked on a futex, during Ruby's cleanup after the worker's threads have been marked as killed. The forked Resque process continues sending a heartbeat, however, leading Resque to believe it is simply processing a very long job. This remains the case indefinitely until the process is manually killed.
  • This issue is only reproducible when many Resque workers are running. Initial attempts to reproduce it with a single worker were unsuccessful, except for one anomalous case that was never reproduced a second time. In testing, at least 10 workers needed to be running to consistently reproduce the behavior (a rough harness is sketched after this list).
  • Most of the time the Resque workers would hang almost immediately, or within a few minutes of starting to process jobs. In some cases, however, it took multiple hours for the behavior to appear.
    • Interestingly, tests with version 1.12 almost always yielded hung workers immediately, while tests with version 1.3 typically took more time. Whether something makes the problem more frequent in 1.12, the problem manifests itself differently, or there are multiple problems in later versions is undetermined.
  • Using GDB I got a stack trace of where the threads are getting stuck (see below). The process appears to be deadlocked on a futex while Ruby runs object finalizers during cleanup, specifically while unloading a library. I assume this is ddtrace's native extension library, but I cannot conclusively determine this.
  • I bisected the problematic behavior to commit 26ac04c06f87918eae773692dd428491bf8a11a4, which appears to add CPU-time tracking for profiling.
    • I've gone through this commit's changes trying to identify what could be causing the deadlock, but I don't see anything obvious. I also tried disabling the CPU-time sampling by commenting out the relevant code, but the problem persists.
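
To give a concrete picture of the load involved, a stripped-down version of the kind of multi-worker harness I mean looks roughly like this (illustrative only: NoopJob is a hypothetical stand-in for our real jobs, it assumes a local Redis, and our actual workers are started through Resque's rake task, as the Ruby backtrace further below shows):

# Rough reproduction harness (a sketch, not our exact setup): spawn many
# forking Resque workers against a queue of trivial jobs.
require 'resque'

class NoopJob
  @queue = :default
  def self.perform; end
end

1_000.times { Resque.enqueue(NoopJob) }

worker_count = 10   # at least ~10 workers were needed to see the hang
worker_count.times do
  fork do
    worker = Resque::Worker.new(:default)
    worker.work(0.1)   # Resque forks a child process per job by default
  end
end
Process.waitall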

Threads as displayed by GDB:

(gdb) info threads
  Id   Target Id                                                     Frame
* 1    Thread 0x7ff481009f80 (LWP 29068) "bundle"                    0x00007ff481afdc5c in futex_wait (private=0, expected=2, futex_word=0x7ff44c1ff65c) at ../sysdeps/nptl/futex-internal.h:146
  2    Thread 0x7ff442a13640 (LWP 29069) "async_logger.r*" (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  3    Thread 0x7ff47a471640 (LWP 29070) "connection_poo*" (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  4    Thread 0x7ff44c1ff640 (LWP 29071) "Timeout stdlib " (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  5    Thread 0x7ff44c9a2640 (LWP 29072) "Statsd Sender" (Exiting)   futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
  6    Thread 0x7ff44455b640 (LWP 29073) "Datadog::Tracin" (Exiting) futex_wait (private=0, expected=2, futex_word=0x7ff481b25108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146

GDB backtrace of all threads:

(gdb) thread apply all where

Thread 6 (Thread 0x7f73145ff640 (LWP 5414) "Datadog::Tracin" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f73145ff640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f73145ff640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7f731cd47640 (LWP 5413) "Statsd Sender" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f731cd47640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f731cd47640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7f731c40b640 (LWP 5412) "Timeout stdlib " (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f731c40b640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f731c40b640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7f734ab9d640 (LWP 5411) "connection_poo*" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f734ab9d640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f734ab9d640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f730f26f640 (LWP 5410) "async_logger.r*" (Exiting)):
#0  futex_wait (private=0, expected=2, futex_word=0x7f7351c3d108 <_rtld_global+4296>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f7351c3d108 <_rtld_global+4296>) at ./nptl/lowlevellock.c:34
#2  0x00007f73511b55ec in __GI___nptl_deallocate_stack (pd=pd@entry=0x7f730f26f640) at ./nptl/nptl-stack.c:112
#3  0x00007f73511b57fd in __GI___nptl_free_tcb (pd=0x7f730f26f640) at ./nptl/nptl_free_tcb.c:41
#4  0x00007f73511b8ad4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:573
#5  0x00007f735124aa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f7351121f80 (LWP 5409) "bundle"):
#0  0x00007f7351c15c5c in futex_wait (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:146
#1  futex_wait_simple (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:177
#2  __thread_gscope_wait () at ../sysdeps/nptl/dl-thread_gscope_wait.c:51
#3  0x00007f7351c04d97 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:498
#4  0x00007f7351c052a2 in _dl_close_worker (force=false, map=0x7f73215aab00) at ./elf/dl-close.c:150
#5  _dl_close (_map=0x7f73215aab00) at ./elf/dl-close.c:818
#6  0x00007f7351298c28 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd558e28d0, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#7  0x00007f7351298cf3 in __GI__dl_catch_error (objname=0x7ffd558e2928, errstring=0x7ffd558e2930, mallocedp=0x7ffd558e2927, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#8  0x00007f73511b41ae in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#9  0x00007f73511b3ed8 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#10 0x00007f73214ed4c5 in library_free (library=0x7f7323f76b48) at DynamicLibrary.c:171
#11 0x00007f7351610d2a in run_final (zombie=140132487292600, objspace=0x7f7350c32000) at gc.c:4019
#12 finalize_list (objspace=objspace@entry=0x7f7350c32000, zombie=140132487292600) at gc.c:4038
#13 0x00007f7351615c10 in rb_objspace_call_finalizer (objspace=0x7f7350c32000) at gc.c:4202
#14 0x00007f73515f5058 in rb_ec_finalize (ec=0x7f7350c3b050) at eval.c:164
#15 rb_ec_cleanup (ec=<optimized out>, ex0=<optimized out>) at eval.c:256
#16 0x00007f73515f5302 in ruby_cleanup (ex=<optimized out>) at eval.c:178
#17 0x00007f73515f531f in ruby_stop (ex=<optimized out>) at eval.c:291
#18 0x00007f73516eac48 in rb_f_fork (obj=<optimized out>) at process.c:4407
#19 rb_f_fork (obj=<optimized out>) at process.c:4397
#20 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3fc40, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#21 0x00007f73517dac5a in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:349
#22 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:880
#23 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#24 0x00007f73517df3af in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=0x0, self=140132116319240, iseq=0x7f73468d7500, ec=0x7f7350c3b050) at vm.c:1316
#25 invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=0, argv=0x0, argc=0, self=140132116319240, captured=<optimized out>, ec=0x7f7350c3b050) at vm.c:1372
#26 invoke_block_from_c_bh (argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>, is_lambda=<optimized out>, force_blockarg=<optimized out>, cref=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1390
#27 vm_yield_with_cref (argc=0, argv=0x0, kw_splat=0, is_lambda=0, cref=0x0, ec=0x7f7350c3b050) at vm.c:1427
#28 vm_yield (kw_splat=0, argv=0x0, argc=0, ec=0x7f7350c3b050) at vm.c:1435
#29 rb_yield_0 (argv=0x0, argc=0) at /usr/src/ruby/vm_eval.c:1347
#30 loop_i (_=_@entry=0) at /usr/src/ruby/vm_eval.c:1446
#31 0x00007f73515f2a1e in rb_vrescue2 (b_proc=0x7f73517df170 <loop_i>, data1=0, r_proc=0x7f73517c17f0 <loop_stop>, data2=0, args=args@entry=0x7ffd558e3010) at eval.c:903
#32 0x00007f73515f2cae in rb_rescue2 (b_proc=<optimized out>, data1=<optimized out>, r_proc=<optimized out>, data2=<optimized out>) at eval.c:884
#33 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3fe40, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#34 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3fe40, cd=0x7f7346a650e0, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#35 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#36 0x00007f73517ddf4f in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2211
#37 0x00007f73517e35c9 in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=<optimized out>, self=<optimized out>, iseq=<optimized out>, ec=<optimized out>) at vm.c:1316
#38 invoke_iseq_block_from_c (me=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, self=<optimized out>, captured=<optimized out>, ec=<optimized out>) at vm.c:1372
#39 invoke_block_from_c_bh (force_blockarg=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1390
#40 vm_yield_with_cref (is_lambda=<optimized out>, cref=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1427
#41 vm_yield (kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1435
#42 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at /usr/src/ruby/vm_eval.c:1347
#43 rb_yield (val=<optimized out>) at /usr/src/ruby/vm_eval.c:1363
#44 0x00007f7351551f1c in rb_ary_each (ary=140132436390720) at array.c:2522
#45 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3ff40, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#46 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d3ff40, cd=0x7f734cc6f460, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#47 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#48 0x00007f73517ddf4f in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2211
#49 0x00007f73515f2fd6 in rb_ensure (b_proc=0x7f7351044610 <monitor_sync_body>, data1=140132436390640, e_proc=<optimized out>, data2=<optimized out>) at eval.c:993
#50 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40080, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#51 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40080, cd=0x7f734c63d490, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#52 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#53 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#54 0x00007f73517e35c9 in invoke_block (captured=<optimized out>, captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=<optimized out>, self=<optimized out>, iseq=<optimized out>, ec=<optimized out>) at vm.c:1316
#55 invoke_iseq_block_from_c (me=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, self=<optimized out>, captured=<optimized out>, ec=<optimized out>) at vm.c:1372
#56 invoke_block_from_c_bh (force_blockarg=<optimized out>, is_lambda=<optimized out>, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1390
#57 vm_yield_with_cref (is_lambda=<optimized out>, cref=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1427
#58 vm_yield (kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1435
#59 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at /usr/src/ruby/vm_eval.c:1347
#60 rb_yield (val=<optimized out>) at /usr/src/ruby/vm_eval.c:1363
#61 0x00007f7351551f1c in rb_ary_each (ary=140133177426480) at array.c:2522
#62 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40200, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#63 0x00007f73517cec29 in vm_sendish (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40200, cd=0x7f734ccd6670, block_handler=<optimized out>, method_explorer=mexp_search_method) at /usr/src/ruby/vm_callinfo.h:349
#64 0x00007f73517d87f2 in vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:759
#65 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#66 0x00007f7351656b16 in load_iseq_eval (fname=140133177451160, ec=0x7f7350c3b050) at load.c:660
#67 rb_load_internal (wrap=<optimized out>, fname=140133177451160) at load.c:721
#68 rb_f_load (argc=<optimized out>, argv=<optimized out>, _=<optimized out>) at load.c:796
#69 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40400, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3037
#70 0x00007f73517d8774 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:349
#71 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:778
#72 0x00007f73517ddf4f in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2211
#73 0x00007f7351656b16 in load_iseq_eval (fname=140133183456280, ec=0x7f7350c3b050) at load.c:660
#74 rb_load_internal (wrap=<optimized out>, fname=140133183456280) at load.c:721
#75 rb_f_load (argc=<optimized out>, argv=<optimized out>, _=<optimized out>) at load.c:796
#76 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d40480, calling=<optimized out>) at /usr/src/ruby
#77 0x00007f73517d8774 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optim
#78 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:778
#79 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#80 0x00007f7351656b16 in load_iseq_eval (fname=140133257554840, ec=0x7f7350c3b050) at load.c:660
#81 rb_load_internal (wrap=<optimized out>, fname=140133257554840) at load.c:721
#82 rb_f_load (argc=<optimized out>, argv=<optimized out>, _=<optimized out>) at load.c:796
#83 0x00007f73517c9567 in vm_call_cfunc_with_frame (ec=0x7f7350c3b050, reg_cfp=0x7f7350d407c0, calling=<optimized out>) at /usr/src/ruby
#84 0x00007f73517d8774 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optim
#85 vm_exec_core (ec=0x7f7350c3b050, initial=128) at /usr/src/ruby/insns.def:778
#86 0x00007f73517de886 in rb_vm_exec (ec=0x7f7350c3b050, mjit_enable_p=true) at vm.c:2220
#87 0x00007f73515eef71 in rb_ec_exec_node (ec=ec@entry=0x7f7350c3b050, n=n@entry=0x7f73510af980) at eval.c:280
#88 0x00007f73515f538a in ruby_run_node (n=0x7f73510af980) at eval.c:321
#89 0x000056068a1df17f in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:47

Using GDB I also got a Ruby backtrace of the main thread. The other threads had already been marked as killed by Ruby and no longer had a control frame pointer from which to read a backtrace:

from /usr/local/bin/bundle:25:in `<main>'
from /usr/local/bin/bundle:25:in `load'
from /usr/local/lib/ruby/gems/3.1.0/gems/bundler-2.3.26/libexec/bundle:36:in `<top (required)>'
from /usr/local/lib/ruby/3.1.0/bundler/friendly_errors.rb:120:in `with_friendly_errors'
from /usr/local/lib/ruby/gems/3.1.0/gems/bundler-2.3.26/libexec/bundle:48:in `block in <top (required)>'
from /usr/local/lib/ruby/3.1.0/bundler/cli.rb:25:in `start'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
from /usr/local/lib/ruby/3.1.0/bundler/cli.rb:31:in `dispatch'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
from /usr/local/lib/ruby/3.1.0/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
from /usr/local/lib/ruby/3.1.0/bundler/cli.rb:486:in `exec'
from /usr/local/lib/ruby/3.1.0/bundler/cli/exec.rb:23:in `run'
from /usr/local/lib/ruby/3.1.0/bundler/cli/exec.rb:58:in `kernel_load'
from /usr/local/lib/ruby/3.1.0/bundler/cli/exec.rb:58:in `load'
from /usr/local/bin/rake:25:in `<top (required)>'
from /usr/local/bin/rake:25:in `load'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/exe/rake:27:in `<top (required)>'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:80:in `run'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:186:in `standard_exception_handling'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:83:in `block in run'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:110:in `top_level'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:125:in `run_with_threads'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:116:in `block in top_level'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:116:in `each'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:116:in `block (2 levels) in top_level'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/application.rb:160:in `invoke_task'
from /usr/local/bundle/gems/ddtrace-1.11.1/lib/datadog/tracing/contrib/rake/instrumentation.rb:19:in `invoke'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:188:in `invoke'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:199:in `invoke_with_call_chain'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:199:in `synchronize'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:219:in `block in invoke_with_call_chain'
from /usr/local/bundle/gems/sentry-ruby-5.6.0/lib/sentry/rake.rb:26:in `execute'
from /usr/local/bundle/gems/ddtrace-1.11.1/lib/datadog/tracing/contrib/rake/instrumentation.rb:30:in `execute'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:281:in `execute'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:281:in `each'
from /usr/local/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/task.rb:281:in `block in execute'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/tasks.rb:20:in `block (2 levels) in <top (required)>'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:226:in `work'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:226:in `loop'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:229:in `block in work'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:255:in `work_one_job'
from /usr/local/bundle/bundler/gems/resque-0eac2a8a07be/lib/resque/worker.rb:865:in `perform_with_fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:27:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:8:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:27:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:8:in `fork'
from /usr/local/bundle/gems/activesupport-6.1.7.4/lib/active_support/fork_tracker.rb:8:in `fork'
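
(Aside: for anyone else poking at a stuck worker, one convenient way to dump the current thread's Ruby-level backtrace from GDB is CRuby's rb_backtrace(), which prints to the process's stderr.)

(gdb) thread 1
(gdb) call rb_backtrace()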

The one Datadog thread's full name is Datadog::Tracing::Workers::AsyncTransport per:

(gdb) p *((struct RString*) ruby_current_ec->thread_ptr->name)
$6 = {basic = {flags = 546449509, klass = 140133257733160}, as = {heap = {len = 41, ptr = 0x7f734c1ec700 "Datadog::Tracing::Workers::AsyncTransport", aux = {capa = 51, shared = 51}}, embed = {
      ary = ")\000\000\000\000\000\000\000\000\307\036Ls\177\000\000\063\000\000\000\000\000\000"}}}

I initially thought the problem was with the AsyncTransport class, but after looking at its concurrency logic I was unable to find any problems with it. Furthermore, returning to the backtraces above, the problem seems to lie elsewhere: Ruby has already marked all threads other than the main thread as killed, so it would appear they shut down cleanly.

The main thread, on the other hand, is attempting to unload a library when it becomes stuck, per the backtrace. Namely, it is finishing the fork (rb_f_fork and ruby_stop), then cleaning up with ruby_cleanup and library_free, before finally deadlocking in futex_wait.

#0  0x00007f7351c15c5c in futex_wait (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:146
#1  futex_wait_simple (private=0, expected=2, futex_word=0x7f731c40b65c) at ../sysdeps/nptl/futex-internal.h:177
#2  __thread_gscope_wait () at ../sysdeps/nptl/dl-thread_gscope_wait.c:51
#3  0x00007f7351c04d97 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:498
#4  0x00007f7351c052a2 in _dl_close_worker (force=false, map=0x7f73215aab00) at ./elf/dl-close.c:150
#5  _dl_close (_map=0x7f73215aab00) at ./elf/dl-close.c:818
#6  0x00007f7351298c28 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd558e28d0, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#7  0x00007f7351298cf3 in __GI__dl_catch_error (objname=0x7ffd558e2928, errstring=0x7ffd558e2930, mallocedp=0x7ffd558e2927, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#8  0x00007f73511b41ae in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#9  0x00007f73511b3ed8 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#10 0x00007f73214ed4c5 in library_free (library=0x7f7323f76b48) at DynamicLibrary.c:171
#11 0x00007f7351610d2a in run_final (zombie=140132487292600, objspace=0x7f7350c32000) at gc.c:4019
#12 finalize_list (objspace=objspace@entry=0x7f7350c32000, zombie=140132487292600) at gc.c:4038
#13 0x00007f7351615c10 in rb_objspace_call_finalizer (objspace=0x7f7350c32000) at gc.c:4202
#14 0x00007f73515f5058 in rb_ec_finalize (ec=0x7f7350c3b050) at eval.c:164
#15 rb_ec_cleanup (ec=<optimized out>, ex0=<optimized out>) at eval.c:256
#16 0x00007f73515f5302 in ruby_cleanup (ex=<optimized out>) at eval.c:178
#17 0x00007f73515f531f in ruby_stop (ex=<optimized out>) at eval.c:291
#18 0x00007f73516eac48 in rb_f_fork (obj=<optimized out>) at process.c:4407

This backtrace, combined with the git bisect of the ddtrace gem, leads me to believe something in the native extension prevents Ruby from completing its cleanup of the forked child, resulting in a hung Resque worker process. This only appears to happen under high load with a sufficient number of Resque workers. It's also worth noting that with 10 Resque workers, I never saw all of them become stuck; at most 5-6 became stuck over a span of ~12 hours. I assume this is related to why multiple workers were needed to replicate the problem in the first place.
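
To make the suspected code path concrete, here is a minimal Ruby sketch of the pattern I believe is involved: a forked child exits while helper threads are still winding down, and Ruby's end-of-process finalizers dlclose a dynamically loaded library. This is hypothetical and not a confirmed reproduction; the ffi gem is used here only to obtain an object whose finalizer calls dlclose, and I still cannot say which library handle is actually being closed in our workers:

# Minimal sketch (hypothetical, not confirmed to reproduce the hang) of the
# code path in the backtrace above.
require 'ffi'

module Touch
  extend FFI::Library
  ffi_lib FFI::Library::LIBC          # any dynamically loaded library will do
  attach_function :getpid, [], :int
end

pid = fork do
  # Background threads akin to "Statsd Sender", AsyncTransport, etc.
  3.times { Thread.new { loop { sleep 0.05 } } }
  Touch.getpid
  # Returning from the fork block is the rb_f_fork -> ruby_stop -> ruby_cleanup ->
  # rb_objspace_call_finalizer path seen in the trace; any finalizer that
  # dlcloses a library (library_free above) runs here, while the helper
  # threads are exiting.
end
Process.wait(pid)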

There's one other notable situation. I've inspected many of these deadlocked processes via GDB while debugging this, and they all had the same backtrace. However, in a single case, which I never saw again, the backtrace was different: the process was stuck in a getaddrinfo call. I don't know whether this is relevant, since it ultimately still ends up stuck in futex_wait, but it stood out enough that I saved a copy of the backtrace:

Attaching to process 20053
[New LWP 20058]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_kill_implementation (no_tid=0, signo=26, threadid=140422026360384) at ./nptl/pthread_kill.c:50
50 ./nptl/pthread_kill.c: No such file or directory.
(gdb) info threads
  Id   Target Id                                           Frame
* 1    Thread 0x7fb6c9bcef80 (LWP 20053) "bundle"          __pthread_kill_implementation (no_tid=0, signo=26, threadid=140422026360384) at ./nptl/pthread_kill.c:50
  2    Thread 0x7fb68cfff640 (LWP 20058) "Datadog::Tracin" futex_wait (private=0, expected=2, futex_word=0x7fb6c9df3040 <lock>) at ../sysdeps/nptl/futex-internal.h:146
(gdb)

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fb68cfff640 (LWP 20058))]
#0  futex_wait (private=0, expected=2, futex_word=0x7fb6c9df3040 <lock>) at ../sysdeps/nptl/futex-internal.h:146
146  ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7fb6c9df3040 <lock>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7fb6c9df3040 <lock>) at ./nptl/lowlevellock.c:34
#2  0x00007fb6c9d14e14 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7fb68cef9922, seen_ipv6=seen_ipv6@entry=0x7fb68cef9923, in6ai=in6ai@entry=0x7fb68cef9930, in6ailen=in6ailen@entry=0x7fb68cef9938)
    at ../sysdeps/unix/sysv/linux/check_pf.c:307
#3  0x00007fb6c9cdded1 in __GI_getaddrinfo (name=<optimized out>, service=<optimized out>, hints=<optimized out>, pai=<optimized out>) at ../sysdeps/posix/getaddrinfo.c:2289
#4  0x00007fb6c489ee3c in nogvl_getaddrinfo (arg=arg@entry=0x7fb68cef9fa0) at raddrinfo.c:190
#5  0x00007fb6ca240ba5 in rb_nogvl (func=0x7fb6c489ee20 <nogvl_getaddrinfo>, data1=0x7fb68cef9fa0, ubf=<optimized out>, data2=<optimized out>, flags=<optimized out>) at thread.c:1691
#6  0x00007fb6c48a16ae in rsock_getaddrinfo (host=140422157856960, port=<optimized out>, hints=hints@entry=0x7fb68cefa840, socktype_hack=socktype_hack@entry=1) at raddrinfo.c:562
#7  0x00007fb6c48a21d6 in rsock_addrinfo (host=<optimized out>, port=<optimized out>, family=family@entry=0, socktype=socktype@entry=1, flags=flags@entry=0) at raddrinfo.c:604
#8  0x00007fb6c489663b in init_inetsock_internal (v=v@entry=140422025292304) at ipsocket.c:63
#9  0x00007fb6ca09ffd6 in rb_ensure (b_proc=b_proc@entry=0x7fb6c48965c0 <init_inetsock_internal>, data1=data1@entry=140422025292304, e_proc=e_proc@entry=0x7fb6c4896570 <inetsock_cleanup>, 
    data2=data2@entry=140422025292304) at eval.c:993
#10 0x00007fb6c4896c7a in rsock_init_inetsock (sock=sock@entry=140421727738720, remote_host=remote_host@entry=140422157856960, remote_serv=remote_serv@entry=16253, local_host=local_host@entry=8, 
    local_serv=<optimized out>, type=type@entry=0, resolv_timeout=8, connect_timeout=8) at ipsocket.c:187
#11 0x00007fb6c489716e in tcp_init (argc=<optimized out>, argv=<optimized out>, sock=140421727738720) at tcpsocket.c:48
#12 0x00007fb6ca291161 in vm_call0_cfunc_with_frame (argv=0x7fb68cefd998, calling=<optimized out>, ec=0x7fb689a22050) at /usr/src/ruby/vm_eval.c:149
#13 vm_call0_cfunc (argv=0x7fb68cefd998, calling=<optimized out>, ec=0x7fb689a22050) at /usr/src/ruby/vm_eval.c:163
#14 vm_call0_body (ec=0x7fb689a22050, calling=<optimized out>, argv=0x7fb68cefd998) at /usr/src/ruby/vm_eval.c:209
#15 0x00007fb6ca293c64 in vm_call0_cc (kw_splat=<optimized out>, cc=<optimized out>, argv=<optimized out>, argc=4, id=3121, recv=140421727738720, ec=0x7fb689a22050) at /usr/src/ruby/vm_eval.c:86
#16 rb_call0 (ec=0x7fb689a22050, recv=140421727738720, mid=3121, argc=4, argv=<optimized out>, call_scope=<optimized out>, self=140422964243000) at /usr/src/ruby/vm_eval.c:550
#17 0x00007fb6ca294886 in rb_call (scope=CALL_FCALL, argv=0x7fb68cefd998, argc=4, mid=3121, recv=140421727738720) at /usr/src/ruby/vm_eval.c:876
#18 rb_funcallv_kw (recv=recv@entry=140421727738720, mid=mid@entry=3121, argc=argc@entry=4, argv=argv@entry=0x7fb68cefd998, kw_splat=kw_splat@entry=0) at /usr/src/ruby/vm_eval.c:1073
#19 0x00007fb6ca0a24e4 in rb_obj_call_init_kw (obj=obj@entry=140421727738720, argc=argc@entry=4, argv=argv@entry=0x7fb68cefd998, kw_splat=kw_splat@entry=0) at eval.c:1575
#20 0x00007fb6ca1582ee in rb_class_new_instance_kw (argc=argc@entry=4, argv=argv@entry=0x7fb68cefd998, klass=klass@entry=140422964243000, kw_splat=0) at object.c:2003
#21 0x00007fb6ca0d9d87 in rb_io_s_open (argc=4, argv=0x7fb68cefd998, klass=140422964243000) at io.c:7477

Our environment is as follows:

  • AWS ECS
  • Ubuntu 22.04
  • Ruby: 3.1.3 (MRI)
  • Rails: 6.1.7.4
  • Resque: 1.27.2.1
  • ddtrace: tested versions 1.2, 1.3, 1.11, and 1.12 (only 1.2.0 does not exhibit the issues described above)
  • Glibc: 2.35

Our ddtrace initialization config:

options = {analytics_enabled: false}

Datadog.configure do |c|
  c.runtime_metrics.enabled = true
  c.tracing.analytics.enabled = true
  c.profiling.enabled = true

  c.tracing.instrument :action_mailer, options
  c.tracing.instrument :action_pack, options
  c.tracing.instrument :action_view, options
  c.tracing.instrument :active_model_serializers, options
  c.tracing.instrument :active_record, options
  c.tracing.instrument :active_support, options
  c.tracing.instrument :active_job, options
  c.tracing.instrument :aws, options
  c.tracing.instrument :dalli, options
  c.tracing.instrument :delayed_job, options
  c.tracing.instrument :faraday, options
  c.tracing.instrument :rails, options

  c.tracing.instrument :racecar, options
  c.tracing.instrument :redis, options
  c.tracing.instrument :resque, options

  c.tracing.instrument :rack, options.merge(
    analytics_enabled: true,
    request_queuing: true,
  )
end

Datadog::Tracing.before_flush do |trace|
  trace.spans.each { |span| span.set_tag('custom.request_uuid', RequestStore.store[:request_uuid]) }
  trace
end

It's also worth noting that I was only able to replicate this behavior on AWS ECS. I tried reproducing it on a local Linux desktop (Arch rather than Ubuntu) and could not replicate the issue even when running 10 Resque workers. I don't know whether that makes it environment-specific or whether, since it only happens under high load, the considerably greater hardware capacity of my desktop compared to the resource-limited AWS containers prevented it from occurring.

I've also reviewed #466, which seems like a similar issue but appears to be unrelated.

Overall, I've spent nearly two weeks getting to this point, hoping to identify and solve the issue, or at least to gather enough information to open an intelligent bug report. At this point I'm not sure what else to debug and am hoping for advice on what might be causing these deadlocks and how to diagnose the problem further. Thank you for reviewing.
