-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable priority sampling #229
Conversation
a634c32
to
4656856
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the approach is fine, left a few comments but need to see this in action. Basically, the missing brick is getting the actual rates from the agent before injecting it into the priority sampler. But this could be done in another PR.
lib/ddtrace/sampler.rb
Outdated
def initialize(rate = 1.0, opts = {}) | ||
@env = opts.fetch(:env, Datadog.tracer.tags[:env]) | ||
@mutex = Mutex.new | ||
@fallback = RateSampler.new(rate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's perfectly fine to have a fallback and identify it as such, however keep in mind that the agent will report a value for this (typically with a key service:,env:
) and this should be used as the library can't really have a clue of this default "fallback" value unless it queries the agent. Eg: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/sampler.py#L90
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see! OK, I think I'll have to change this. Thanks for spotting that 👍
lib/ddtrace/sampler.rb
Outdated
end | ||
|
||
def sample(span) | ||
span.sampling_priority = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally speaking, while writing the Python impl. we acknowledged, at some point, that it was easier and made more sense API-wise to have all the sample()
family of methods return true
or false
and not modify the object in place. As in: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/sampler.py#L22 It's easier to test and the caller can make whatever decision on what to do, also, span becomes a read-only parameter, which has some advantages. This involves quite a rewriting of code at this step (because the existing sampler should be updated, not only the Priority one, but I think it's worth it). Might want to ask for @palazzem advice here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humn, I guess I'll have to wait for @palazzem, because I might be drifting too much from the Python implementation then.
I agree that the #sample
return value should be a boolean, but I'd rather have a function with side effects here than breaking encapsulation by having different code paths in the caller for each concrete implementation of the Sampler
– I think if we have a common API for samplers, we should leverage that and make them transparent to the caller.
9114773
to
05c893a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GTM, as a side note I think we might need some more high-level integration tests here, as at this stage most tests are relying on mocks. But this could be saved for the upcoming PR glueing all the priority sampling parts together.
@@ -118,11 +118,17 @@ def configure(options = {}) | |||
hostname = options.fetch(:hostname, nil) | |||
port = options.fetch(:port, nil) | |||
sampler = options.fetch(:sampler, nil) | |||
priority_sampling = options[:priority_sampling] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
lib/ddtrace/tracer.rb
Outdated
@enabled = enabled unless enabled.nil? | ||
@writer.transport.hostname = hostname unless hostname.nil? | ||
@writer.transport.port = port unless port.nil? | ||
@sampler = sampler unless sampler.nil? | ||
|
||
if priority_sampling | ||
@sampler = PrioritySampler.new(base_sampler: @sampler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, cool, this way, we can chain them.
true | ||
end | ||
|
||
def_delegators :@post_sampler, :update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so that's the entry point to udate data from JSON agent output.
6a60209
to
ff8f009
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add some documentation for that such as: http://pypi.datadoghq.com/trace/docs/#priority-sampling
lib/ddtrace/sampler.rb
Outdated
end | ||
|
||
def sample(span) | ||
span.sampling_priority = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sampling_priority
attribute must be in the Context
rather than Span
. In that way you can set it with something similar to:
span.context.sampling_priority = 2
And at the end when the Context
is finished, you just attach that metric to the root span (first element of the trace because it's a list). In Python we do: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/context.py#L148-L167
lib/ddtrace/context.rb
Outdated
@@ -133,12 +133,14 @@ def sampled? | |||
# can be re-used immediately. | |||
# | |||
# This operation is thread-safe. | |||
def get | |||
def get(priority_sampling = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I turned this into an argument because it would be tricky to get the value of priority_sampling
from the Tracer
on every instantiation of Context
objects. If you have any other ideas, let me know.
@@ -6,6 +6,7 @@ module DistributedTracing | |||
HTTP_HEADER_TRACE_ID = 'x-datadog-trace-id'.freeze | |||
HTTP_HEADER_PARENT_ID = 'x-datadog-parent-id'.freeze | |||
HTTP_HEADER_SAMPLING_PRIORITY = 'x-datadog-sampling-priority'.freeze | |||
SAMPLING_PRIORITY_KEY = '_sampling_priority_v1'.freeze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is a better place for this constant. Let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is fine; when moving to OpenTracing we'll use directly their constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with it being here, also not totally happy, but think it's OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes we can simply use OpenTracing tags later on, when our clients have a proper migration.
headers[HTTP_HEADER_SAMPLING_PRIORITY] = span.sampling_priority.to_s | ||
end | ||
env.merge! headers | ||
env.delete(HTTP_HEADER_SAMPLING_PRIORITY) unless span.sampling_priority |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ufoot I know it would be good to keep this #delete
. Any ideas how we could do this now? Perhaps this would be easier once we have the a globally accessible value for the priority_sampling
flag.
63238b8
to
85d7ba0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few questions, let's talk about this. You raise interesting issues BTW.
lib/ddtrace/context.rb
Outdated
@mutex.synchronize do | ||
return nil, nil unless check_finished_spans | ||
|
||
trace = @trace | ||
sampled = @sampled | ||
attach_sampling_priority if sampled && priority_sampling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why you need to test priority_sampling
here, at all. This "is priority sampling enabled" is useful when creating the tracer, so that we know wether start_span
should add a priority sampling arg, but then, once it's been set, it's been set, there's, I think, no point to re-test it here again. What is the use case for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
humn, I thought we had to check it here so we can avoid a scenario like this:
In my custom instrumentation, I have a code line like this:
span.context.sampling_priority = 3
But at some point, I want to disable the priority sampling altogether:
Datadog.tracer.configure(priority_sampling: false)
If we don't check for the priority sampling flag before flushing the trace, that line setting a sampling_priority
will still affect the sampling upstream, right?
I don't understand what happens outside the client, so I might be wrong. But if that's not the case, I would also prefer to remove this check. We can take a stance as "if you want to disable priority_sampling altogether you shouldn't set it."
BTW, I thought this was the scenario you were trying to prevent in that delete
call inside the HTTPPropagator.inject!
method.
@@ -6,6 +6,7 @@ module DistributedTracing | |||
HTTP_HEADER_TRACE_ID = 'x-datadog-trace-id'.freeze | |||
HTTP_HEADER_PARENT_ID = 'x-datadog-parent-id'.freeze | |||
HTTP_HEADER_SAMPLING_PRIORITY = 'x-datadog-sampling-priority'.freeze | |||
SAMPLING_PRIORITY_KEY = '_sampling_priority_v1'.freeze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with it being here, also not totally happy, but think it's OK.
end | ||
env.merge! headers | ||
env.delete(HTTP_HEADER_SAMPLING_PRIORITY) unless span.sampling_priority | ||
def self.inject!(span, context, env) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I would not add he context as an arg to this function. Because, we're going to remove it some day, for sure. The final state is something like: context has it all, trace_id, current span_id, and sampling priority. As in https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/context.py#L25
So in the current state, the Ruby impl. does not have full context support. Initially, I thought it was OK to keep the legacy "span-based" way of doing things. Which, actually, works, only you copy the information many times in all spans when once in the context should be enough. My point is: we need to choose, either keep it all in span, or move it all in context, but something in between is really uncomfortable. AFAIK in different places in Python/Ruby code there's this "it could be a span or a context, let's test at run-time" as in https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/tracer.py#L159 and I think this is really fine. But two args (span+context) will be troublesome to migrate I fear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep the compatibility that simplify our migration to OpenTracing (while not breaking a lot of stuff). In that case, the public API expects a span_context
that in our case is our Context
(this is an approximation like @ufoot said). Eventually the Span
will have its SpanContext
and our (Trace)Context
will only be an implementation detail. For instance, In Python we're providing the propagator in that way: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/propagation/http.py
Reference for OpenTracing: https://github.com/opentracing/basictracer-python/blob/master/basictracer/tracer.py#L87
I'd suggest to expect a Context
(so span.context
that will be easy to migrate later) so:
- the API is compliant with our Python implementation + OpenTracing
- no confusion so users always pass a
Context
- since it's a new functionality, no breaking changes in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
85d7ba0
to
2eeacfa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the CI issues are fixed, looks good to me.
@@ -22,7 +22,7 @@ class Span | |||
:start_time, :end_time, | |||
:span_id, :trace_id, :parent_id, | |||
:status, :sampled, | |||
:tracer, :context, :sampling_priority | |||
:tracer, :context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -8,18 +8,20 @@ def test_inject! | |||
|
|||
tracer.trace('caller') do |span| | |||
env = { 'something' => 'alien' } | |||
Datadog::HTTPPropagator.inject!(span, env) | |||
Datadog::HTTPPropagator.inject!(span.context, env) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, span.context
, perfect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes always use the context attached to the Span
2eeacfa
to
3216ea5
Compare
@ufoot @p-lambert before merging this one we should provide some documentation:
|
docs/GettingStarted.md
Outdated
* 1: The sampler automatically decided to keep the trace. | ||
* 2: The user asked the keep the trace. | ||
|
||
For now, priority sampling is disabled by default. Enabling it ensures that your sampled distributed traces will be complete. To enable the priorty sampling: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo priorty
-> priority
@@ -719,6 +719,32 @@ overhead. | |||
sampler = Datadog::RateSampler.new(0.5) # sample 50% of the traces | |||
Datadog.tracer.configure(sampler: sampler) | |||
|
|||
#### Priority sampling | |||
|
|||
Priority sampling consists in deciding if a trace will be kept by using a priority attribute that will be propagated for distributed traces. Its value gives indication to the Agent and to the backend on how important the trace is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important to outline that priority sampling is interesting in the case of distributed tracing. Something as Priority sampling should be turned on when using [distributed tracing](#Distributed_Tracing), as it ensures all parts of a given trace are kept together and sent to the backend.
maybe ? Just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ufoot I think that's a good idea. Do you think we should add that paragraph after the current one or replaced it altogether?
8397d4f
to
fd3a914
Compare
lib/ddtrace/context.rb
Outdated
@@ -50,6 +50,7 @@ def sampling_priority | |||
|
|||
def sampling_priority=(priority) | |||
@mutex.synchronize do | |||
@sampled = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work, for 2 reasons:
- I think it would put
sampled
to true even if the arg (priority
) isnil
or0
- in all cases it breaks the stats, because it alters the numbers of traces we send without changing the
sample_rate
metric (from which we deduce the weight). In short, it breaks the stats.
f1e7723
to
fd3a914
Compare
fd3a914
to
1038ae1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GTG
Overview
Introduces Priority Sampling to force the sampler to keep a trace (or to discard it). This includes Distributed Tracing propagation for the priority sampling attribute.