-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sidekiq client-side tracing #602
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! We love to see improvements like this to our instrumentation.
Overall I think this is looking pretty good, just a few points of feedback. I think the span name seems right, but it's important its assigned a service name.
For that, we'll need to consider a couple of scenarios, which we'll have to factor into the behavior we want to see:
- This
sidekiq.push
span is the top-level span. This could happen realistically if, say, a Ruby script or Rake task is written which only queues jobs. It will need a default service name in such a case. perhaps we adopt the default Sidekiq service name, or perhaps something different. - The
sidekiq.push
span is nested under another span/service (the most common case, I think.) It might make sense to make it belong to the parent span's service (we do something like this with ActiveRecord), but is that the best place to categorize it? I tend to agree that might be the best solution here, but want to consider this a bit more carefully. If we do want to go in this direction, you'll probably need to usespan.parent.service
and do some kind of check if there is no parent.
require 'ddtrace/contrib/sidekiq/client_tracer' | ||
::Sidekiq.configure_client do |config| | ||
config.client_middleware do |chain| | ||
chain.add(Sidekiq::ClientTracer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@delner: Thanks for your kind and thoughtful review! 💖 This question re. service attribute is something I was struggling to figure out. After reading your comments and further reflection, I think it makes the most sense for the service to be the Sidekiq service as what this is measuring is mostly a) time spent talking to the Redis server backing Sidekiq and sometimes b) JSON serialization if you have a really big job payload. Having that time-on-the-wire be attributed to the service (instead of the parent span's service) matches how this library currently attributes ActiveRecord SQL execution. How does all that sound? Since there was now so much common functionality between the two tracers I abstracted that out into a |
module Datadog | ||
module Contrib | ||
module Sidekiq | ||
class BaseTracer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we turn this into a module instead of a class? (For composition vs inheritance?) We could probably name this something like module Tracing
so the classes instead read:
class ClientTracer
include Tracing
end
class ServerTracer
include Tracing
end
Thanks for your hard work on this @dirk ! This is definitely shaping up. I really like that you extracted the common code for re-usability; that will help us maintain the code more consistently. I think the implementation strategy for service as you suggested is reasonable. I want to check on one more thing on my end before we're 100% committed to that, but I think in all likelihood that's how we'll do it. And if for some reason that isn't what we want, we can always change it later. In the mean time, if we can get the build fixed up and this rebased on |
0deff48
to
82b75ab
Compare
Thanks for updating this so quickly! That build issue should be fixed with |
82b75ab
to
3b159e1
Compare
@delner: CI's happy! 🙃 |
# https://github.com/mperham/sidekiq/wiki/Middleware#client-middleware | ||
def call(worker_class, job, queue, redis_pool) | ||
resource = job_resource(job) | ||
service = sidekiq_service(resource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just following up on what service names should be, we should 1) avoid having the service match the resource, and 2) the server/client service names matching each other.
For number 1, the service name should act as a group name for a logical deployment of Sidekiq (e.g. application), as users will want to see application performance statistics aggregated by role first, then broken down by task (which is where resource comes into play.) Users should be able to configure the client and service names separately from one another, so to support that, we might need to add in another configuration option like client_service_name
or something similar.
For number 2, the client shouldn't match the server because otherwise we see some conflation of statistics between the two. We should probably keep the default service name for the server side as sidekiq
(so as to not break user's dependency on this existing default), and give the client service a new default, perhaps something like sidekiq-client
or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, okay, I may have misread the implementation here a bit. I see you're not copying the resource in for the service name, but mapping it to some other service name that might exist on the class in question. This starts to cross into what we've referred to as "multiplexing"; mapping tracer configuration based on trace content.
Is this important to implement in this PR? (The mapping resource --> service name on the job?) I might suggest keeping it simple with straight configuration variables from Datadog.configuration[:sidekiq]
for now, since this is a bit more complicated, and we have a new prescribed first-class means of doing such mappings as a part of the integration's configuration (such as what was done in #451.)
If we want to simplify this part a bit, we can probably get this PR merged and open another one for the multiplexing feature. Might be the easiest way to get parts of this out more quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] This starts to cross into what we've referred to as "multiplexing"; mapping tracer configuration based on trace content.
@delner: Thanks again for taking the time to share all this background context; I realize I was following existing patterns without understanding they were something y'all were refactoring.
If we want to simplify this part a bit, we can probably get this PR merged and open another one for the multiplexing feature. Might be the easiest way to get parts of this out more quickly.
I think I achieved this in 1cdebfd. It sets up a separate configuration for the client service name (defaulted to "sidekiq-client"
) and maintains the ability to define a .datadog_tracer_config
worker class method for setting the service when on a Sidekiq server (so that we don't regress on that feature).
@delner: Thanks for taking the time to share so much context! I've read over your comments once, and I think I need to give them a couple more read-overs to make sure I understand it all. I plan on getting back to this PR and getting those comments addressed tomorrow or this weekend. 🙃 |
Original rename commit: 4f57b00
Don't yet fully understand the best value to use here, so it's prudent to just leave it out until we figure it out.
1cdebfd
to
d84dfd2
Compare
@delner: Comments addressed, rebased on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very good! I'm very happy with the changes and the new functionality we're adding here.
I left a few comments that mostly concerns old implementation that was refactored here, but nothing that should be a blocker. Any comments or thoughts on those are welcome, otherwise I think this one is ready.
Thank you so much for your hard work and contribution! 🎉
|
||
private | ||
|
||
def service_from_worker_config(resource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self (and out of scope for this PR): we should remove this. We don't want jobs defining tracer functions on themselves; this should be handled in our integration configuration (probably by leveraging the new multiplexing feature.)
As this is apparently existing behavior from the previous file (before you refactored this), this isn't a blocker. We'll do this in a future PR.
def service_from_worker_config(resource) | ||
# Try to get the Ruby class from the resource name. | ||
worker_klass = begin | ||
Object.const_get(resource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also old code but... can you think of any scenarios where the resource (i.e. class name) would not exist in the namespace?
If failing to resolve a constant and raising an error is a very common case, then this could be incurring a significant performance cost. If that is true, we'd want to fix this (now or at some point in the future.) Otherwise if this is just an edge case, and a safety of sorts, then I don't think we need to worry about this.
This isn't a blocker for this PR most likely, since it's existing behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@delner: This is adapted from the original tracer implementation, but given your comment above I think this seems like a good enough time as any to remove support for workers defining a .datadog_tracer_config
method (and therefore not need this Object.const_get
). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think we'll want to remove this too, but to do so, we'll have to deprecate the feature and remove it in a future release. I've added an item to our backlog to do each of these things in future PRs.
Sidekiq enqueues are normally fast, but it seems like it'd be nice to have tracing of those enqueue/push events and therefore be able to monitor the latency of enqueues in case they aren't fast (as these can often be a good warning sign of Redis issues).
Note that I made a number of guesses here, especially concerning the span name and omitting the service (ie. that client-side enqueues should inherit their parent's service).