Sidekiq client-side tracing #602

Merged: 12 commits merged into DataDog:0.18-dev on Nov 26, 2018
Conversation

@dirk (Contributor) commented Nov 1, 2018

Sidekiq enqueues are normally fast, but it would be nice to trace those enqueue/push events so we can monitor enqueue latency in case they aren't fast (slow enqueues can often be a good warning sign of Redis issues).

Note that I made a number of guesses here, especially concerning the span name and omitting the service (i.e., that client-side enqueues should inherit their parent's service).
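
A minimal sketch of the kind of client middleware being proposed, assuming the block form of Datadog.tracer.trace from ddtrace 0.x; the 'sidekiq.push' span name and the job-hash keys ('wrapped', 'class') follow this discussion and Sidekiq's job payload, not the final diff:

class ClientTracer
  def call(worker_class, job, queue, redis_pool)
    Datadog.tracer.trace('sidekiq.push') do |span|
      # ActiveJob wraps the real class; fall back to the plain Sidekiq job class.
      span.resource = job['wrapped'] || job['class']
      span.set_tag('sidekiq.job.queue', queue)
      yield # hand off to the rest of the chain; Sidekiq performs the actual push
    end
  end
end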

@delner (Contributor) left a comment

Thanks for the contribution! We love to see improvements like this to our instrumentation.

Overall I think this is looking pretty good, just a few points of feedback. I think the span name seems right, but it's important that it's assigned a service name.

For that, we'll need to consider a couple of scenarios, which we'll have to factor into the behavior we want to see:

  1. The sidekiq.push span is the top-level span. This could realistically happen if, say, a Ruby script or Rake task is written which only queues jobs. It will need a default service name in such a case: perhaps we adopt the default Sidekiq service name, or perhaps something different.
  2. The sidekiq.push span is nested under another span/service (the most common case, I think). It might make sense to make it belong to the parent span's service (we do something like this with ActiveRecord), but is that the best place to categorize it? I tend to agree that might be the best solution here, but want to consider this a bit more carefully. If we do want to go in this direction, you'll probably need to use span.parent.service and do some kind of check if there is no parent (see the sketch after this list).
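
A hypothetical sketch of that fallback order (configured_service is a placeholder for whatever option name gets settled on, and 'sidekiq' stands in for the default):

def sidekiq_service(span)
  return configured_service if configured_service    # explicit user setting wins
  parent = span.parent
  return parent.service if parent && parent.service  # scenario 2: inherit the parent's service
  'sidekiq'                                          # scenario 1: default for top-level pushes
end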

lib/ddtrace/contrib/sidekiq/client_tracer.rb (resolved review thread)
require 'ddtrace/contrib/sidekiq/client_tracer'

::Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add(Sidekiq::ClientTracer)
  end
end
@delner (Contributor) commented:

👍

lib/ddtrace/contrib/sidekiq/client_tracer.rb (resolved review thread)
@delner added the integrations, community, and feature labels on Nov 2, 2018
@dirk (Contributor, Author) commented Nov 6, 2018

> For that, we'll need to consider a couple of scenarios, which we'll have to factor into the behavior we want to see:

@delner: Thanks for your kind and thoughtful review! 💖 This question about the service attribute is something I was struggling to figure out. After reading your comments and reflecting further, I think it makes the most sense for the service to be the Sidekiq service, since what this measures is mostly a) time spent talking to the Redis server backing Sidekiq, and sometimes b) JSON serialization, if you have a really big job payload. Having that time-on-the-wire attributed to the Sidekiq service (instead of the parent span's service) matches how this library currently attributes ActiveRecord SQL execution. How does all that sound?

Since there was now so much common functionality between the two tracers, I abstracted it out into a BaseTracer class. Please let me know if there's a different design pattern y'all would prefer!

module Datadog
  module Contrib
    module Sidekiq
      class BaseTracer
@delner (Contributor) commented:
Could we turn this into a module instead of a class (for composition vs. inheritance)? We could probably name it something like module Tracing so the classes instead read:

class ClientTracer
  include Tracing
end

class ServerTracer
  include Tracing
end
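
For what the shared module might contain, here is a hypothetical sketch; only the job_resource helper name is taken from the diff later in this PR, and the ActiveJob 'wrapped' key is an assumption about Sidekiq's job payload:

module Datadog
  module Contrib
    module Sidekiq
      module Tracing
        # Resource name: the wrapped class for ActiveJob-backed jobs,
        # otherwise the plain Sidekiq job class.
        def job_resource(job)
          job['wrapped'] || job['class']
        end
      end
    end
  end
end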

@delner (Contributor) commented Nov 6, 2018

Thanks for your hard work on this @dirk ! This is definitely shaping up. I really like that you extracted the common code for re-usability; that will help us maintain the code more consistently.

I think the implementation strategy for service as you suggested is reasonable. I want to check on one more thing on my end before we're 100% committed to that, but I think in all likelihood that's how we'll do it. And if for some reason that isn't what we want, we can always change it later.

In the meantime, if we can get the build fixed up and this rebased on 0.18-dev, then I think we can target this for a 0.18.0 release in the next few weeks.

@delner delner added this to the 0.18.0 milestone Nov 6, 2018
@dirk force-pushed the sidekiq-client-tracer branch 2 times, most recently from 0deff48 to 82b75ab, on November 7, 2018 01:43
@dirk changed the base branch from master to 0.18-dev on November 7, 2018 01:44
@dirk (Contributor, Author) commented Nov 7, 2018

@delner: Comments addressed and branch rebased (and base changed on GitHub)! I also noticed that CI is failing for this PR and other recently-updated PRs because a new version of concurrent-ruby came out, so I opened PR #607 to (hopefully) fix that. 🙃

@delner (Contributor) commented Nov 7, 2018

Thanks for updating this so quickly! That build issue should be fixed with concurrent-ruby 1.1.3, so if you change or add a commit SHA with a rebase/force-push, the CI build should invalidate the cache and (hopefully) pass now.

@dirk (Contributor, Author) commented Nov 9, 2018

@delner: CI's happy! 🙃

# https://github.com/mperham/sidekiq/wiki/Middleware#client-middleware
def call(worker_class, job, queue, redis_pool)
  resource = job_resource(job)
  service = sidekiq_service(resource)
  # Wrap the push in a span with this service/resource, then yield so
  # Sidekiq performs the actual enqueue.
  tracer.trace('sidekiq.push', service: service, resource: resource) { yield }
end
@delner (Contributor) commented:
Just following up on what service names should be: we should 1) avoid having the service match the resource, and 2) avoid having the server and client service names match each other.

For number 1, the service name should act as a group name for a logical deployment of Sidekiq (e.g. an application), as users will want to see application performance statistics aggregated by role first, then broken down by task (which is where resource comes into play). Users should be able to configure the client and server service names separately from one another, so to support that, we might need to add another configuration option like client_service_name or something similar.

For number 2, the client shouldn't match the server, because otherwise we'd see some conflation of statistics between the two. We should probably keep the default service name for the server side as sidekiq (so as to not break users' dependency on this existing default), and give the client service a new default, perhaps something like sidekiq-client or similar.
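
In configuration terms, that split might look like this sketch (client_service_name is the option name floated above; the defaults are the proposed ones):

Datadog.configure do |c|
  c.use :sidekiq,
        service_name: 'sidekiq',               # server-side default, unchanged
        client_service_name: 'sidekiq-client'  # new, separate client-side default
end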

@delner (Contributor) commented Nov 12, 2018:

Oh, okay, I may have misread the implementation here a bit. I see you're not copying the resource in for the service name, but mapping it to some other service name that might exist on the class in question. This starts to cross into what we've referred to as "multiplexing": mapping tracer configuration based on trace content.

Is this important to implement in this PR? (The mapping of resource --> service name on the job?) I might suggest keeping it simple with straight configuration variables from Datadog.configuration[:sidekiq] for now, since this is a bit more complicated, and we have a new prescribed first-class means of doing such mappings as part of the integration's configuration (such as what was done in #451).

If we want to simplify this part a bit, we can probably get this PR merged and open another one for the multiplexing feature. Might be the easiest way to get parts of this out more quickly.

@dirk (Contributor, Author) commented:

> [...] This starts to cross into what we've referred to as "multiplexing": mapping tracer configuration based on trace content.

@delner: Thanks again for taking the time to share all this background context; I realize I was following existing patterns without understanding that they were something y'all were refactoring.

> If we want to simplify this part a bit, we can probably get this PR merged and open another one for the multiplexing feature. Might be the easiest way to get parts of this out more quickly.

I think I achieved this in 1cdebfd. It sets up a separate configuration option for the client service name (defaulting to "sidekiq-client") and maintains the ability to define a .datadog_tracer_config worker class method for setting the service when on a Sidekiq server (so that we don't regress on that feature).
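
For reference, the preserved server-side hook looks roughly like this (a sketch; HardJob and the exact shape of the returned hash are assumptions):

class HardJob
  include Sidekiq::Worker

  # Existing per-worker override this PR keeps working on the server side.
  def self.datadog_tracer_config
    { service_name: 'hard-job-workers' }
  end

  def perform(*_args); end
end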

@dirk (Contributor, Author) commented Nov 15, 2018

@delner: Thanks for taking the time to share so much context! I've read over your comments once, and I think I need to give them a couple more read-overs to make sure I understand it all. I plan on getting back to this PR and getting those comments addressed tomorrow or this weekend. 🙃

@dirk (Contributor, Author) commented Nov 17, 2018

@delner: Comments addressed, rebased on 0.18-dev, and CI is happy! 🙂

@delner (Contributor) left a comment

This is looking very good! I'm very happy with the changes and the new functionality we're adding here.

I left a few comments that mostly concern old implementation that was refactored here, but nothing that should be a blocker. Any comments or thoughts on those are welcome; otherwise I think this one is ready.

Thank you so much for your hard work and contribution! 🎉


private

def service_from_worker_config(resource)
@delner (Contributor) commented:
Note to self (and out of scope for this PR): we should remove this. We don't want jobs defining tracer functions on themselves; this should be handled in our integration configuration (probably by leveraging the new multiplexing feature).

As this is apparently existing behavior from the previous file (before you refactored this), this isn't a blocker. We'll do this in a future PR.

def service_from_worker_config(resource)
  # Try to get the Ruby class from the resource name.
  worker_klass = begin
    Object.const_get(resource)
  rescue NameError
    nil
  end
  worker_klass.datadog_tracer_config[:service_name] if worker_klass.respond_to?(:datadog_tracer_config)
end
@delner (Contributor) commented:
This is also old code but... can you think of any scenarios where the resource (i.e. class name) would not exist in the namespace?

If failing to resolve a constant and raising an error is a very common case, then this could be incurring a significant performance cost. If that is true, we'd want to fix this (now or at some point in the future). Otherwise, if this is just an edge case and a safety of sorts, then I don't think we need to worry about it.

This isn't a blocker for this PR most likely, since it's existing behavior.
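
If constant misses did turn out to be common, one cheaper alternative would be to check before resolving (hypothetical; not part of this PR):

# Avoids raising and rescuing NameError for every unresolvable resource name.
worker_klass = Object.const_get(resource) if Object.const_defined?(resource)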

@dirk (Contributor, Author) commented:
@delner: This is adapted from the original tracer implementation, but given your comment above, this seems like as good a time as any to remove support for workers defining a .datadog_tracer_config method (and therefore remove the need for this Object.const_get). What do you think?

@delner (Contributor) commented:
Yeah, I think we'll want to remove this too, but to do so we'll have to deprecate the feature first and remove it in a future release. I've added items to our backlog to do each of these things in future PRs.

@delner merged commit 3ce6f3a into DataDog:0.18-dev on Nov 26, 2018