Resque workers start hanging with activated ddtrace #466

Closed

Description

We are currently rolling out Datadog across our applications. However, we encountered a fairly severe issue that we believe we have traced to the ddtrace gem.

Observed behaviour

Resque workers eventually end up in a deadlocked state: the worker process is alive, but the job it is currently processing never finishes.
Subjectively, this happens sooner for workers that process many jobs (their queue is always full) than for workers with less to do.

strace shows that the deadlocked processes are waiting on a user-space mutex (futex). One example:

$ strace -p 8806
Process 8806 attached - interrupt to quit
futex(0x7fc970d543c0, FUTEX_WAIT_PRIVATE, 2, NULL
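
strace only shows the native wait, not which Ruby code led to it. As a rough sketch of how the Ruby-level view could be captured, the snippet below installs a signal handler that prints the backtrace of every live thread. The helper name `thread_dump` and the choice of `TTIN` are assumptions made for illustration, not something taken from Resque or ddtrace, and the handler may never fire if the process is blocked in native code without yielding to the interpreter.

```ruby
# Hypothetical helper: formats the backtrace of every live Ruby thread.
def thread_dump
  Thread.list.map do |thread|
    frames = thread.backtrace || [] # backtrace is nil for dead threads
    "Thread #{thread.object_id} (#{thread.status})\n  " + frames.join("\n  ")
  end.join("\n\n")
end

# Print the dump to stderr when the process receives SIGTTIN,
# e.g. via `kill -TTIN <pid>` from a shell.
# Assumption: nothing else in the process already uses TTIN.
Signal.trap('TTIN') do
  $stderr.puts thread_dump
end
```

Loaded early in a worker, this would allow comparing the thread list of a hung worker with that of a healthy one, provided the handler still runs.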

When we remove the ddtrace gem and its configuration from our project, the Resque workers no longer deadlock.

So far we have observed this behaviour only on Resque workers. Our web servers have had no timeouts at all.

Environment

These are the versions of a few gems in that project:

ddtrace (0.12.1)
rails (3.2.22.5)
redis (3.3.3)
resque (1.27.4)

ruby 2.2.4

These are the specs for the application where we observed this issue. I will try to reproduce it in another environment with more recent Ruby/Rails versions.

Edit: By now we have reproduced the error in two more environments.

ddtrace (0.12.1)
ruby 2.2.4
rails (4.2.7.1)
redis (3.3.5)
resque (1.27.4)

ddtrace (0.13.0)
ruby 2.5.0
rails (5.1.6)
redis (4.0.1)
resque (1.27.4)

This is our datadog initializer:

yml_config = YAML.safe_load(File.read("#{Rails.root}/config/datadog.yml"))
                 .fetch(Rails.env)
                 .symbolize_keys

service = ->(name) { "#{yml_config[:service_name]}-#{name}" }

Datadog.configure do |c|
  c.tracer enabled: yml_config[:enabled]
  c.use :rails, service_name: yml_config[:service_name]
  c.use :graphql, service_name: service.call('graphql'),
                  schemas: [KaeuferportalSchema]
  c.use :http, service_name: service.call('external')
  c.use :redis, service_name: service.call('redis')

  # Note: The error also occurs with the line below being deleted
  c.use :resque, service_name: service.call('resque'), workers: [ApplicationJob]
end
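
For readers without access to our `config/datadog.yml`, here is a minimal, standalone sketch of how the first lines of the initializer derive service names. The YAML values (`shop`, `shop-test`) and the helper `load_datadog_config` are hypothetical; only the key names `enabled` and `service_name` come from the initializer above. Plain Ruby is used in place of ActiveSupport's `symbolize_keys` so the snippet runs without Rails.

```ruby
require 'yaml'

# Hypothetical contents of config/datadog.yml; only the key names
# (enabled, service_name) are taken from the initializer.
EXAMPLE_YAML = <<~YAML
  production:
    enabled: true
    service_name: shop
  test:
    enabled: false
    service_name: shop-test
YAML

# Parses the YAML, picks the section for the given environment and
# converts its string keys to symbols (a stand-in for symbolize_keys).
def load_datadog_config(yaml_text, env)
  YAML.safe_load(yaml_text)
      .fetch(env)
      .each_with_object({}) { |(key, value), out| out[key.to_sym] = value }
end

yml_config = load_datadog_config(EXAMPLE_YAML, 'production')
service = ->(name) { "#{yml_config[:service_name]}-#{name}" }

puts service.call('redis') # prints "shop-redis"
```

With these example values, each integration reports under its own service name, such as `shop-redis` or `shop-resque`.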

Bisecting

When we first encountered the issue we updated ddtrace from 0.12.0 to 0.12.1, and it persisted, so both versions are affected for us.

Before the problems started, we were already running ddtrace in the affected application, but at version 0.11.4 and without explicitly activating any integration (so only Rails and whatever it loads automatically were active). In that state we had no problems.
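
To narrow this down further, the integrations could be switched on one at a time. The following is only a sketch of that idea: the helper `integrations_to_enable` and the environment variable `DD_BISECT_INTEGRATIONS` are hypothetical names invented here, not part of ddtrace. The integration names are the ones from our initializer.

```ruby
# Integration names taken from the initializer in this report.
ALL_INTEGRATIONS = %i[rails graphql http redis resque].freeze

# Hypothetical helper: reads a comma-separated list such as "rails,redis"
# and returns the known integrations to activate. A nil or empty value
# means "activate all of them"; unknown names are ignored.
def integrations_to_enable(raw)
  return ALL_INTEGRATIONS if raw.nil? || raw.strip.empty?

  requested = raw.split(',').map { |name| name.strip.to_sym }
  ALL_INTEGRATIONS & requested
end

# Assumed usage inside the Datadog.configure block of the initializer:
#   enabled = integrations_to_enable(ENV['DD_BISECT_INTEGRATIONS'])
#   c.use :redis, service_name: service.call('redis') if enabled.include?(:redis)
```

Starting workers with different values of the variable would then show whether a single integration is enough to trigger the hang.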

Thanks for your support. Let me know if you need any further information.

Metadata

Labels

bug (involves a bug), community (was opened by a community member)
