Description
We are currently rolling out Datadog across our applications. However, we encountered a fairly severe issue that we believe we have tracked down to the ddtrace gem.
Observed behaviour
Resque workers eventually end up deadlocked: the worker process is alive, but the job it is currently processing never finishes.
Subjectively, this happens sooner for workers that process many jobs (their queue is always full) than for workers that have less to do.
strace tells us that the deadlocked processes are waiting on a user-space mutex. One example:
$ strace -p 8806
Process 8806 attached - interrupt to quit
futex(0x7fc970d543c0, FUTEX_WAIT_PRIVATE, 2, NULL
When we remove the ddtrace gem and its configuration from our project, the resque workers no longer deadlock.
So far we have observed this behaviour only on resque workers. We have seen no timeouts whatsoever on our web servers.
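Since strace only shows that the process is parked on a futex, a Ruby-level thread dump might help narrow down which code path is waiting. The following is a minimal sketch, not something we have run against the deadlocked workers: `dump_threads` and the choice of the TTIN signal are our own illustrative names, not part of ddtrace or resque. It also assumes the Ruby VM is still able to run signal handlers; if the main thread is stuck inside native code, the handler may never fire.

```ruby
# Diagnostic sketch: write a backtrace for every live thread in this process.
# `dump_threads` is a hypothetical helper, not a ddtrace or resque API.
def dump_threads(io = $stderr)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} status=#{thread.status.inspect}"
    # backtrace can be nil for a thread that has already finished
    (thread.backtrace || ['<no backtrace>']).each { |line| io.puts "  #{line}" }
  end
  io
end

# Install the handler early (for example in an initializer), then run
# `kill -TTIN <pid>` against a stuck worker. TTIN is an arbitrary choice.
Signal.trap('TTIN') { dump_threads } if Signal.list.key?('TTIN')
```

Because resque forks a child per job, the handler would need to be installed before the fork so that the child inherits it.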
Environment
These are the versions of a few gems in that project:
ddtrace (0.12.1)
rails (3.2.22.5)
redis (3.3.3)
resque (1.27.4)
ruby 2.2.4
These are the specs for the application where we observed this issue. I will try to reproduce it in another environment with more recent ruby/rails versions.
Edit: We have since reproduced the error in two more environments:
ddtrace (0.12.1)
ruby 2.2.4
rails (4.2.7.1)
redis (3.3.5)
resque (1.27.4)

ddtrace (0.13.0)
ruby 2.5.0
rails (5.1.6)
redis (4.0.1)
resque (1.27.4)
This is our datadog initializer:
yml_config = YAML.safe_load(File.read("#{Rails.root}/config/datadog.yml"))
                 .fetch(Rails.env)
                 .symbolize_keys
service = ->(name) { "#{yml_config[:service_name]}-#{name}" }

Datadog.configure do |c|
  c.tracer enabled: yml_config[:enabled]
  c.use :rails, service_name: yml_config[:service_name]
  c.use :graphql, service_name: service.call('graphql'),
                  schemas: [KaeuferportalSchema]
  c.use :http, service_name: service.call('external')
  c.use :redis, service_name: service.call('redis')
  # Note: The error also occurs with the line below being deleted
  c.use :resque, service_name: service.call('resque'), workers: [ApplicationJob]
end
Bisecting
We updated ddtrace from 0.12.0 to 0.12.1 when we encountered the issue, so both versions are affected for us.
Before the problems started, we already had ddtrace running on the affected application, but in version 0.11.4 and without explicitly activating any integration (so only rails plus whatever it loads automatically was active). In that state we did not have any problems.
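To bisect further, one option would be to enable the integrations one at a time. The sketch below is an assumption about how that could be wired up, not something we have deployed: the `BISECT_INTEGRATIONS` environment variable and the `integrations_to_enable` helper are hypothetical names, and the per-integration options from our initializer (service names, schemas, workers) are left out for brevity.

```ruby
# Hypothetical bisecting helper. BISECT_INTEGRATIONS is our own variable name;
# ddtrace itself does not read it.
ALL_INTEGRATIONS = %i[rails graphql http redis resque].freeze

# Returns the subset of ALL_INTEGRATIONS named in a comma-separated string.
# A nil or blank string keeps every integration enabled.
def integrations_to_enable(raw)
  return ALL_INTEGRATIONS if raw.nil? || raw.strip.empty?

  requested = raw.split(',').map { |name| name.strip.to_sym }
  ALL_INTEGRATIONS & requested
end

enabled = integrations_to_enable(ENV['BISECT_INTEGRATIONS'])

# Only touches Datadog when the gem is loaded, so the selection logic
# can be tried on its own.
if defined?(Datadog)
  Datadog.configure do |c|
    enabled.each { |name| c.use name }
  end
end
```

Starting a worker with, for example, `BISECT_INTEGRATIONS=rails,redis` would then activate only those two integrations for that run.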
Thanks for your support. Let me know if you need any further information.