
Enable priority sampling #229

Merged (11 commits into master from pedro/service-sampler, Nov 28, 2017)
Conversation

@p-lambert (Member) commented Oct 23, 2017

Overview

Introduces Priority Sampling, which can force the sampler to keep a trace (or to discard it). This includes Distributed Tracing propagation of the priority sampling attribute.
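A minimal sketch of the idea, using the priority values that appear later in this thread (0 drops the trace, 1 means the sampler kept it, 2 means the user asked to keep it). The constant names and the DemoContext class here are illustrative stand-ins, not the library's real API.

```ruby
# Illustrative priority values (0 = drop, 1 = sampler kept, 2 = user kept).
SAMPLING_PRIORITY_AUTO_REJECT = 0
SAMPLING_PRIORITY_AUTO_KEEP   = 1
SAMPLING_PRIORITY_USER_KEEP   = 2

# Stand-in for the trace context carrying the priority attribute.
class DemoContext
  attr_accessor :sampling_priority
end

ctx = DemoContext.new
ctx.sampling_priority = SAMPLING_PRIORITY_USER_KEEP # force this trace to be kept
```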

@palazzem added this to the 0.10.0 milestone Oct 24, 2017
@palazzem added the core (Involves Datadog core libraries) label Oct 24, 2017
@p-lambert changed the title from "Add RateByServiceSampler" to "Enable priority sampling" Nov 3, 2017
@p-lambert added the do-not-merge/WIP (Not ready for merge) label Nov 3, 2017
@ufoot self-requested a review November 9, 2017 19:45
@ufoot (Contributor) left a comment:

I think the approach is fine; I left a few comments but need to see this in action. Basically, the missing brick is getting the actual rates from the agent and injecting them into the priority sampler. But that could be done in another PR.

def initialize(rate = 1.0, opts = {})
  @env = opts.fetch(:env, Datadog.tracer.tags[:env])
  @mutex = Mutex.new
  @fallback = RateSampler.new(rate)
Contributor:

It's perfectly fine to have a fallback and identify it as such; however, keep in mind that the agent will report a value for this (typically with a key service:,env:), and that value should be used, since the library can't really know this default "fallback" value unless it queries the agent. E.g.: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/sampler.py#L90

Member Author (@p-lambert):

Oh, I see! OK, I think I'll have to change this. Thanks for spotting that 👍

end

def sample(span)
  span.sampling_priority = 0
Contributor:

Generally speaking, while writing the Python implementation we acknowledged at some point that it was easier, and made more sense API-wise, to have all the sample() family of methods return true or false rather than modify the object in place. As in: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/sampler.py#L22. It's easier to test, the caller can make whatever decision on what to do, and span becomes a read-only parameter, which has some advantages. This involves quite a rewrite of code at this step (because the existing sampler should be updated, not only the priority one), but I think it's worth it. Might want to ask for @palazzem's advice here.
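The boolean-returning API suggested here can be sketched as follows. DemoRateSampler is a hypothetical stand-in, not the shipped class; only the shape of the interface matters.

```ruby
# Sketch of a sample() family method that returns true/false instead of
# mutating the span; the caller decides what to do with the result.
class DemoRateSampler
  def initialize(rate)
    @rate = rate.clamp(0.0, 1.0)
  end

  # The span stays a read-only parameter; only a boolean comes back.
  def sample?(_span)
    rand < @rate
  end
end

sampler = DemoRateSampler.new(1.0)
sampler.sample?(nil) # always true at rate 1.0, since rand is in [0, 1)
```

This is easy to unit-test without constructing real spans, which is exactly the advantage the comment describes.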

@p-lambert (Member, Author) commented Nov 13, 2017:

Hmm, I guess I'll have to wait for @palazzem, since I might be drifting too much from the Python implementation.

I agree that the #sample return value should be a boolean, but I'd rather have a function with side effects here than break encapsulation by having different code paths in the caller for each concrete implementation of the Sampler. If we have a common API for samplers, we should leverage it and make them transparent to the caller.

@p-lambert removed the do-not-merge/WIP (Not ready for merge) label Nov 14, 2017
@ufoot previously approved these changes Nov 14, 2017
@ufoot (Contributor) left a comment:

GTM. As a side note, I think we might need some more high-level integration tests here, as at this stage most tests rely on mocks. But this could be saved for the upcoming PR gluing all the priority sampling parts together.

@@ -118,11 +118,17 @@ def configure(options = {})
hostname = options.fetch(:hostname, nil)
port = options.fetch(:port, nil)
sampler = options.fetch(:sampler, nil)
priority_sampling = options[:priority_sampling]
Contributor:
👍

@enabled = enabled unless enabled.nil?
@writer.transport.hostname = hostname unless hostname.nil?
@writer.transport.port = port unless port.nil?
@sampler = sampler unless sampler.nil?

if priority_sampling
  @sampler = PrioritySampler.new(base_sampler: @sampler)
Contributor:
OK, cool, this way, we can chain them.
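The chaining the comment refers to can be sketched like this. Both classes are hypothetical stand-ins; the point is only that PrioritySampler wraps whatever base sampler was configured and delegates the initial decision to it.

```ruby
# Stand-in base sampler that keeps everything.
class AlwaysOnSampler
  def sample?(_span)
    true
  end
end

# Sketch of a priority sampler that chains onto a wrapped base sampler.
class DemoPrioritySampler
  def initialize(base_sampler:)
    @base_sampler = base_sampler
  end

  # Delegate the initial keep/drop decision to the wrapped sampler;
  # rate-by-service priority logic would refine it afterwards.
  def sample?(span)
    @base_sampler.sample?(span)
  end
end

chained = DemoPrioritySampler.new(base_sampler: AlwaysOnSampler.new)
chained.sample?(:span) # decision comes from the wrapped sampler
```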

true
end

def_delegators :@post_sampler, :update
Contributor:
OK, so that's the entry point to update data from the agent's JSON output.

@palazzem (Contributor) left a comment:

We should add some documentation for this, such as: http://pypi.datadoghq.com/trace/docs/#priority-sampling

end

def sample(span)
  span.sampling_priority = 0
Contributor:

The sampling_priority attribute must live in the Context rather than the Span. That way you can set it with something similar to:

span.context.sampling_priority = 2

And at the end, when the Context is finished, you just attach that metric to the root span (the first element of the trace, because it's a list). In Python we do: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/context.py#L148-L167
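The "attach the metric to the root span when the context finishes" step can be sketched as below. DemoSpan and attach_sampling_priority are illustrative names; the metric key is the one introduced by this PR's diff.

```ruby
# Metric key under which the priority is reported (from this PR's diff).
SAMPLING_PRIORITY_KEY = '_sampling_priority_v1'.freeze

# Minimal stand-in for a span carrying a metrics hash.
DemoSpan = Struct.new(:name, :metrics)

# When the context finishes, attach the priority as a metric on the
# root span only (the first element of the trace list).
def attach_sampling_priority(trace, priority)
  trace.first.metrics[SAMPLING_PRIORITY_KEY] = priority
end

trace = [DemoSpan.new('root', {}), DemoSpan.new('child', {})]
attach_sampling_priority(trace, 2)
```

Keeping the value once on the root span avoids copying it into every span of the trace.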

@@ -133,12 +133,14 @@ def sampled?
# can be re-used immediately.
#
# This operation is thread-safe.
def get
def get(priority_sampling = false)
Member Author (@p-lambert):
I turned this into an argument because it would be tricky to get the value of priority_sampling from the Tracer on every instantiation of Context objects. If you have any other ideas, let me know.

@@ -6,6 +6,7 @@ module DistributedTracing
HTTP_HEADER_TRACE_ID = 'x-datadog-trace-id'.freeze
HTTP_HEADER_PARENT_ID = 'x-datadog-parent-id'.freeze
HTTP_HEADER_SAMPLING_PRIORITY = 'x-datadog-sampling-priority'.freeze
SAMPLING_PRIORITY_KEY = '_sampling_priority_v1'.freeze
Member Author (@p-lambert):

I'm not sure if there is a better place for this constant. Let me know your thoughts.

Contributor:

Here is fine; when moving to OpenTracing we'll use their constants directly.

Contributor:

I'm fine with it being here, also not totally happy, but think it's OK.

Contributor:

Yes, we can simply use OpenTracing tags later on, once our clients have a proper migration.

headers[HTTP_HEADER_SAMPLING_PRIORITY] = span.sampling_priority.to_s
end
env.merge! headers
env.delete(HTTP_HEADER_SAMPLING_PRIORITY) unless span.sampling_priority
@p-lambert (Member, Author) commented Nov 21, 2017:
@ufoot I know it would be good to keep this #delete. Any ideas how we could do this now? Perhaps it would be easier once we have a globally accessible value for the priority_sampling flag.
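The behavior being discussed (set the priority header when a priority exists, otherwise delete any stale value already in env) can be sketched in isolation. demo_inject! and DemoCtx are hypothetical names; the header constant is the one from the diff above.

```ruby
# Header name from this PR's diff.
PRIORITY_HEADER = 'x-datadog-sampling-priority'.freeze

# Minimal stand-in for a context exposing a sampling priority.
DemoCtx = Struct.new(:sampling_priority)

# Set the header when a priority is present; otherwise remove any
# stale value already carried in the env hash.
def demo_inject!(context, env)
  if context.sampling_priority
    env[PRIORITY_HEADER] = context.sampling_priority.to_s
  else
    env.delete(PRIORITY_HEADER)
  end
  env
end

env = { PRIORITY_HEADER => 'stale' }
demo_inject!(DemoCtx.new(nil), env) # stale header is removed
```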

@ufoot (Contributor) left a comment:

Left a few questions, let's talk about this. You raise interesting issues BTW.

@mutex.synchronize do
  return nil, nil unless check_finished_spans

  trace = @trace
  sampled = @sampled
  attach_sampling_priority if sampled && priority_sampling
Contributor:
I'm not sure why you need to test priority_sampling here at all. The "is priority sampling enabled" flag is useful when creating the tracer, so that we know whether start_span should add a priority sampling arg; but once it's been set, it's been set, and I think there's no point in re-testing it here. What is the use case for this?

Member Author (@p-lambert):

Hmm, I thought we had to check it here so we can avoid a scenario like this:

In my custom instrumentation, I have a code line like this:

span.context.sampling_priority = 3

But at some point, I want to disable the priority sampling altogether:

Datadog.tracer.configure(priority_sampling: false)

If we don't check for the priority sampling flag before flushing the trace, that line setting a sampling_priority will still affect the sampling upstream, right?

I don't understand what happens outside the client, so I might be wrong. But if that's not the case, I would also prefer to remove this check. We can take the stance that "if you want to disable priority_sampling altogether, you shouldn't set it."

BTW, I thought this was the scenario you were trying to prevent in that delete call inside the HTTPPropagator.inject! method.


end
env.merge! headers
env.delete(HTTP_HEADER_SAMPLING_PRIORITY) unless span.sampling_priority
def self.inject!(span, context, env)
Contributor:
So I would not add the context as an arg to this function, because we're going to remove it some day, for sure. The final state is something like: the context has it all, trace_id, current span_id, and sampling priority. As in https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/context.py#L25

So in the current state, the Ruby implementation does not have full context support. Initially, I thought it was OK to keep the legacy "span-based" way of doing things. That actually works, only you copy the information many times across all spans when once in the context should be enough. My point is: we need to choose, either keep it all in the span or move it all into the context; something in between is really uncomfortable. In different places in the Python/Ruby code there's this "it could be a span or a context, let's test at run-time" pattern, as in https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/tracer.py#L159, and I think that's really fine. But two args (span + context) will be troublesome to migrate, I fear.

Contributor:

We should keep the compatibility that simplifies our migration to OpenTracing (while not breaking a lot of stuff). In that case, the public API expects a span_context, which in our case is our Context (this is an approximation, like @ufoot said). Eventually the Span will have its SpanContext and our (Trace)Context will only be an implementation detail. For instance, in Python we're providing the propagator this way: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/propagation/http.py

Reference for OpenTracing: https://github.com/opentracing/basictracer-python/blob/master/basictracer/tracer.py#L87

I'd suggest expecting a Context (i.e. span.context, which will be easy to migrate later) so that:

  • the API is compliant with our Python implementation + OpenTracing
  • there's no confusion, since users always pass a Context
  • since it's new functionality, there are no breaking changes in the future
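The context-only signature agreed on here can be sketched as below. context_inject! and PropagationContext are illustrative names; the header names are the constants from this PR's diff.

```ruby
# Header names from this PR's diff.
TRACE_ID_HEADER  = 'x-datadog-trace-id'.freeze
PARENT_ID_HEADER = 'x-datadog-parent-id'.freeze
SAMPLING_HEADER  = 'x-datadog-sampling-priority'.freeze

# Stand-in context: everything needed for propagation lives here,
# so no span argument is required.
PropagationContext = Struct.new(:trace_id, :span_id, :sampling_priority)

def context_inject!(context, env)
  env[TRACE_ID_HEADER]  = context.trace_id.to_s
  env[PARENT_ID_HEADER] = context.span_id.to_s
  env[SAMPLING_HEADER]  = context.sampling_priority.to_s if context.sampling_priority
  env
end

env = context_inject!(PropagationContext.new(100, 200, 1), {})
```

Callers then pass span.context, which keeps the API aligned with the Python implementation and OpenTracing.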

@p-lambert (Member, Author) commented Nov 22, 2017:

@ufoot @palazzem OK, makes total sense. I will change things so we receive only the context then.

@ufoot previously approved these changes Nov 22, 2017
@ufoot (Contributor) left a comment:

Once the CI issues are fixed, looks good to me.

@@ -22,7 +22,7 @@ class Span
:start_time, :end_time,
:span_id, :trace_id, :parent_id,
:status, :sampled,
:tracer, :context, :sampling_priority
:tracer, :context
Contributor:
👍

@@ -8,18 +8,20 @@ def test_inject!

tracer.trace('caller') do |span|
env = { 'something' => 'alien' }
Datadog::HTTPPropagator.inject!(span, env)
Datadog::HTTPPropagator.inject!(span.context, env)
Contributor:
Yes, span.context, perfect.

Contributor:
Yes, always use the context attached to the Span.

@ufoot previously approved these changes Nov 23, 2017
@palazzem (Contributor):
@ufoot @p-lambert before merging this one we should provide some documentation:

We should add some documentation for that such as:
http://pypi.datadoghq.com/trace/docs/#priority-sampling

* 1: The sampler automatically decided to keep the trace.
* 2: The user asked to keep the trace.

For now, priority sampling is disabled by default. Enabling it ensures that your sampled distributed traces will be complete. To enable the priorty sampling:
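The documentation draft above ends just before its code example. Based on the priority_sampling option added in this PR's configure diff, the enabling call would presumably look like this (configuration fragment; requires the ddtrace gem):

```ruby
require 'ddtrace'

# Priority sampling is off by default; opt in at configuration time.
Datadog.tracer.configure(priority_sampling: true)
```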
Contributor:
Typo priorty -> priority

@@ -719,6 +719,32 @@ overhead.
sampler = Datadog::RateSampler.new(0.5) # sample 50% of the traces
Datadog.tracer.configure(sampler: sampler)

#### Priority sampling

Priority sampling consists in deciding whether a trace will be kept, using a priority attribute that is propagated for distributed traces. Its value indicates to the Agent and to the backend how important the trace is.
Contributor:
I think it's important to outline that priority sampling is interesting in the case of distributed tracing. Something like Priority sampling should be turned on when using [distributed tracing](#Distributed_Tracing), as it ensures all parts of a given trace are kept together and sent to the backend, maybe? Just a thought.

Member Author (@p-lambert):
@ufoot I think that's a good idea. Do you think we should add that paragraph after the current one or replace it altogether?

@@ -50,6 +50,7 @@ def sampling_priority

def sampling_priority=(priority)
  @mutex.synchronize do
    @sampled = true
Contributor:
This does not work, for 2 reasons:

  • I think it would set sampled to true even if the arg (priority) is nil or 0
  • it alters the number of traces we send without changing the sample_rate metric (from which we deduce the weight), so in all cases it breaks the stats
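The first point (guarding against nil or 0) can be sketched as below; GuardedContext is a hypothetical minimal setter, and the second concern about the sample_rate metric is deliberately out of scope here.

```ruby
# Sketch of a setter that only marks the trace sampled for a
# meaningful (non-nil, non-zero) priority.
class GuardedContext
  attr_reader :sampled, :sampling_priority

  def sampling_priority=(priority)
    @sampling_priority = priority
    # nil or 0 must not flip the sampled flag.
    @sampled = true unless priority.nil? || priority.zero?
  end
end

kept = GuardedContext.new
kept.sampling_priority = 2    # marks the context sampled

dropped = GuardedContext.new
dropped.sampling_priority = 0 # leaves sampled untouched
```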

@ufoot (Contributor) left a comment:
GTG

@palazzem merged commit 2078ab4 into master Nov 28, 2017
@palazzem deleted the pedro/service-sampler branch November 28, 2017 09:34
Labels: core (Involves Datadog core libraries)
3 participants