
Use most recent event for trace resource name #1695

Merged: 4 commits into master on Sep 27, 2021

Conversation

@delner delner (Contributor) commented Sep 20, 2021

What does it change?

This pull request changes profiling to use the most recent event to derive the resource name for a particular trace.

When the pprof encoder for the profiler is given a set of events to export, it reduces them down to a minimal set that represents all the events produced. Since the most recent event for any given trace should reflect the most accurate resource name, we can use that as the canonical resource name for all events on the trace. So during this pprof building process, we can simply find which event is the most recent one for any given trace, and use that for the resource label.
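
As a rough illustration of that selection step (a minimal sketch with simplified, made-up event and field names, not the actual pprof encoder code):

# Minimal sketch: given sampled stack events that carry a trace id, a timestamp,
# and the trace resource observed at sampling time, pick the resource from the
# most recent event of each trace. Names are illustrative only.
StackEvent = Struct.new(:trace_id, :timestamp, :trace_resource)

def most_recent_resource_by_trace(events)
  events
    .reject { |event| event.trace_id.nil? }
    .group_by(&:trace_id)
    .transform_values { |trace_events| trace_events.max_by(&:timestamp).trace_resource }
end

events = [
  StackEvent.new(42, 1.0, nil),       # sampled during Rack, before the resource was set
  StackEvent.new(42, 2.0, 'GET 200')  # sampled later, once the resource was known
]

most_recent_resource_by_trace(events) # => { 42 => "GET 200" }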

In doing so, it makes the TraceResourceContainer obsolete, and thus reverts that change as well.

Why this change?

Overall, it's meant to simplify the trace internals and the contract between tracing and profiling, while preserving (and even enhancing) the robustness of this profiling resource feature.

More specifically, the TraceResourceContainer has some drawbacks, and this solution is meant to address them:

  • It's not thread-safe (race condition exists if resource mutates)
  • It's not Ractor compatible (objects cannot be shared across Ractors)
  • The object reference is easily broken by dup (if we copy a Span, for example)
  • It makes Span more complicated (struct over value)
  • It acts as an invisible dependency on trace internals for profiling (no contract)
    • Removing/changing trace code around this won't break tracing, but it will break profiling.

@delner delner added core Involves Datadog core libraries profiling Involves Datadog profiling labels Sep 20, 2021
@delner delner self-assigned this Sep 20, 2021
@delner delner requested a review from a team September 20, 2021 20:31
@marcotc marcotc (Member) left a comment

Looks great overall. I left a minor suggestion.

Co-authored-by: Marco Costa <marco.costa@datadoghq.com>
@ivoanjo ivoanjo (Member) commented Sep 22, 2021

Hey!

One important thing to notice is that this approach is not equivalent to the previous one. In particular, the description states:

Since the most recent event for any given trace should reflect the most accurate resource name, we can use that as the canonical resource name for all events on the trace

This assumption can be broken if the profiler is not sampling "often enough". This is because for integrations where the resource name is set at the end, there's only a small-ish window of opportunity between the resource name being set and the span finishing, and the profiler MUST sample during that time, or otherwise it will miss getting the information.

The previously implemented approach covers this gap, because it only requires that profile encoding occurs after the resource name is set, which is a much larger window of opportunity.

Here's an experiment that shows this happening for an example application:


Example application:

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'

  gem 'google-protobuf'
  gem 'ddtrace', path: '.'
  gem 'rack'
  gem 'open-uri'
  gem 'dogstatsd-ruby'
end

Datadog.configure do |c|
  c.use :rack
end

def foo_request
  URI.open("http://127.0.0.1:8080/foo").read
end

def start_web_server
  print "Starting http server"

  Thread.new do
    # Rack app that sleeps 2 seconds per request, wrapped in ddtrace's Rack middleware
    Rack::Handler::WEBrick.run(
      Datadog::Contrib::Rack::TraceMiddleware.new(proc { |env| [200, {"Content-Type" => "text/plain"}, ["#{sleep 2}\n"]]}),
      Port: 8080,
      AccessLog: [],
    )
  end

  while true
    begin
      foo_request
      puts "\nWeb server started!"
      return
    rescue Errno::ECONNREFUSED
      print '.'
    end
  end
end

start_web_server

puts "Running request loop..."
loop do
  foo_request
end

I additionally modified StackSample#build_sample_labels with the following diff:

diff --git a/lib/ddtrace/profiling/pprof/stack_sample.rb b/lib/ddtrace/profiling/pprof/stack_sample.rb
index 1d61bf34..fd6f8986 100644
--- a/lib/ddtrace/profiling/pprof/stack_sample.rb
+++ b/lib/ddtrace/profiling/pprof/stack_sample.rb
@@ -124,16 +124,18 @@ module Datadog
             ).trace_resource

             if trace_resource && !trace_resource.empty?
               labels << Perftools::Profiles::Label.new(
                 key: builder.string_table.fetch(Datadog::Ext::Profiling::Pprof::LABEL_KEY_TRACE_ENDPOINT),
                 str: builder.string_table.fetch(trace_resource)
               )
             end
+
+            puts "For #{span_id} => #{trace_resource.inspect}"
           end

           labels
         end

         def debug_statistics
           "unique stacks: #{@processed_unique_stacks}, of which had active traces: #{@processed_with_trace_ids}"
         end

Here's how it looks with this branch applied:

ivo.anjo@macieira:~/datadog/dd-trace-rb$ DD_PROFILING_ENABLED=true bundle exec ruby rack_experiment.rb
I, [2021-09-22T09:39:09.343607 #23823]  INFO -- ddtrace: [ddtrace] CPU time profiling skipped because native CPU time is not supported: Feature requires Linux; macOS is not supported. Profiles containing Wall time will still be reported.
Starting http server......................................................[2021-09-22 09:39:09] INFO  WEBrick 1.6.1
[2021-09-22 09:39:09] INFO  ruby 2.7.4 (2021-07-07) [x86_64-darwin19]
...[2021-09-22 09:39:09] INFO  WEBrick::HTTPServer#start: pid=23823 port=8080
.
Web server started!
Running request loop...
I, [2021-09-22T09:39:11.427245 #23823]  INFO -- ddtrace: [ddtrace] DATADOG TRACER CONFIGURATION - {"date":"2021-09-22T09:39:11+01:00","os_name":"x86_64-apple-darwin19","version":"0.52.0","lang":"ruby","lang_version":"2.7.4","enabled":true,"service":"rack_experiment","agent_url":"http://127.0.0.1:8126?timeout=1","debug":false,"analytics_enabled":false,"runtime_metrics_enabled":false,"integrations_loaded":"rack@2.2.3","vm":"ruby-2.7.4","partial_flushing_enabled":false,"priority_sampling_enabled":false,"health_metrics_enabled":false,"integration_rack_analytics_enabled":"","integration_rack_analytics_sample_rate":"1.0","integration_rack_enabled":"true","integration_rack_service_name":"rack","integration_rack_application":"","integration_rack_distributed_tracing":"true","integration_rack_headers":"{:response=>[\"Content-Type\", \"X-Request-ID\"]}","integration_rack_middleware_names":"false","integration_rack_quantize":"{}","integration_rack_request_queuing":"false","integration_rack_web_service_name":"web-server"}
For 749678520078825844 => nil
For 2316789782469005996 => nil
For 1583226573609442396 => nil
For 3100147924493212649 => nil
For 562007843322198173 => nil
For 544296391292291153 => nil
For 293530024743266260 => nil
For 3691776758708145319 => nil
For 2514798690741101004 => nil
For 2553392391833009263 => nil
For 1131526994745616392 => nil
For 4453356940358683320 => nil
For 69045168188392044 => nil
For 2723363878259952183 => nil
For 1434258367995627202 => nil
For 42551590484985441 => nil

(I left it running for a while, and did not see a single resource that was non-nil)

...and here's how it looks on master:

ivo.anjo@macieira:~/datadog/dd-trace-rb$ DD_PROFILING_ENABLED=true bundle exec ruby rack_experiment.rb
I, [2021-09-22T09:49:21.092234 #26424]  INFO -- ddtrace: [ddtrace] CPU time profiling skipped because native CPU time is not supported: Feature requires Linux; macOS is not supported. Profiles containing Wall time will still be reported.
Starting http server.................................................[2021-09-22 09:49:21] INFO  WEBrick 1.6.1
.[2021-09-22 09:49:21] INFO  ruby 2.7.4 (2021-07-07) [x86_64-darwin19]
..[2021-09-22 09:49:21] INFO  WEBrick::HTTPServer#start: pid=26424 port=8080
.
Web server started!
Running request loop...
I, [2021-09-22T09:49:23.177359 #26424]  INFO -- ddtrace: [ddtrace] DATADOG TRACER CONFIGURATION - {"date":"2021-09-22T09:49:23+01:00","os_name":"x86_64-apple-darwin19","version":"0.52.0","lang":"ruby","lang_version":"2.7.4","enabled":true,"service":"rack_experiment","agent_url":"http://127.0.0.1:8126?timeout=1","debug":false,"analytics_enabled":false,"runtime_metrics_enabled":false,"integrations_loaded":"rack@2.2.3","vm":"ruby-2.7.4","partial_flushing_enabled":false,"priority_sampling_enabled":false,"health_metrics_enabled":false,"integration_rack_analytics_enabled":"","integration_rack_analytics_sample_rate":"1.0","integration_rack_enabled":"true","integration_rack_service_name":"rack","integration_rack_application":"","integration_rack_distributed_tracing":"true","integration_rack_headers":"{:response=>[\"Content-Type\", \"X-Request-ID\"]}","integration_rack_middleware_names":"false","integration_rack_quantize":"{}","integration_rack_request_queuing":"false","integration_rack_web_service_name":"web-server"}
For 2190166331295394257 => "GET 200"
For 2700571807253945682 => "GET 200"
For 1419275746186038688 => "GET 200"
For 390652459926636238 => "GET 200"
For 146431119711779366 => "GET 200"
For 893684581945810614 => "GET 200"
For 786565621700610041 => "GET 200"
For 1144803773872185860 => "GET 200"
For 275022136736026053 => "GET 200"
For 2292284912057451804 => "GET 200"
For 1089255061101111795 => "GET 200"
For 2634207838071436812 => "GET 200"
For 3684348059125441038 => "GET 200"
For 2046679472172153999 => "GET 200"
For 2046353689068673723 => "GET 200"
For 1637270986285703099 => "GET 200"
For 614709108939515668 => "GET 200"
For 1088400192739581003 => "GET 200"
For 3690265599822971 => "GET 200"
For 2953497614927980389 => "GET 200"
For 2307814280705489173 => "GET 200"
For 1144777468189407173 => "GET 200"
For 1766242280603874522 => "GET 200"
For 1358285738244170343 => "GET 200"
For 3696316452735536533 => "GET 200"
For 122935021505366516 => "GET 200"
For 734817720787335262 => "GET 200"
For 2652103272378377551 => "GET 200"
For 4062333245916814220 => "GET 200"
For 910153365030941294 => nil
For 910153365030941294 => "GET 200"
For 1737869053690554465 => "GET 200"
For 3514565887329137230 => "GET 200"
For 4301528839474538623 => "GET 200"
For 1790721576808039568 => "GET 200"
For 3654823393663511334 => "GET 200"
For 2442576803909413631 => "GET 200"
For 3742102961440109252 => "GET 200"
For 4517522177093681438 => "GET 200"
For 1749584294186826321 => "GET 200"
For 3648034057099128386 => "GET 200"
For 4013700346875269262 => "GET 200"
For 3678823377931434068 => "GET 200"
For 1354637603475189178 => "GET 200"
For 935967362775095166 => "GET 200"
For 383449528620776387 => "GET 200"
For 4232087700265520254 => "GET 200"
For 2762090194124076215 => "GET 200"
For 68699329025794762 => "GET 200"
For 3117183228805887869 => "GET 200"
For 4524077192074584762 => "GET 200"
For 4081873845212331918 => "GET 200"
For 3724448878703590177 => "GET 200"
For 1395270574726341750 => "GET 200"
For 338738308227543685 => "GET 200"
For 3641085576676206853 => "GET 200"
For 3468997940435422347 => "GET 200"
For 439902809776032281 => "GET 200"
For 2538842384598351446 => "GET 200"
For 4162539025977822840 => "GET 200"
For 387523005028363640 => nil

The two nils in there are the requests that were concurrent with profiler reporting -- they didn't make it in time.


Having said that, there's an additional consideration I believe is relevant. As discussed in this comment, right now we have only 5 integrations that generate spans of type web (the only ones where we sample right now): rails, sinatra, grape, grpc and rack.

And after I spent some time on it, out of the 5 integrations, only rack does not set the correct resource at the beginning of the request. Which is why that's the one I picked for my experiment above.

So if all we're doing is breaking rack, and rack is not very interesting anyway, perhaps we could just not care that we are not able to get this data for rack.
BUT, if we want to support more integrations (such as sidekiq requests), would this issue come up again? Or are we confident that most ddtrace integrations that represent entry points into the application would be able to set the correct resource at the beginning?

I guess our decision needs to take that into account. @delner @marcotc let me know your thoughts.

@delner delner (Contributor, Author) commented Sep 22, 2021

I think the point above about sampling is fair. However the additional consideration is the important one:

And after I spent some time on it, out of the 5 integrations, only rack does not set the correct resource at the beginning of the request. Which is why that's the one I picked for my experiment above.

We have plans to change the tracing rules such that you can't mutate the resource of a span that isn't active. This is because mutating spans (changing history) is incompatible with much of the distributed/async tracing we want to make possible in the tracer.

This mutation is what happens in the case of Rack: the Rails integration modifies the Rack span to try to give the trace a prettier name. Instead, we should just elect the resource name on the span after we've assessed what the "best" name is from among the spans in the trace (in this case, likely Rails).

This is why I've suggested that choosing a resource name for the profile shouldn't be handled in ddtrace, but downstream, because the best or "correct" trace name isn't guaranteed to be available from the spans that are within scope. We will have it 100% of the time once it's submitted to Datadog, but not before.

I would not expect this resource name mutation behavior to happen with integrations other than Rack, and hopefully soon Rack won't do this either. So if that isn't considered too big of a negative (which it isn't IMO), I would still opt for this change, and try to make this trace resource feature more robust by other means.

@ivoanjo ivoanjo (Member) commented Sep 23, 2021

I would not expect this resource name mutation behavior to happen with integrations other than Rack, and hopefully soon Rack won't do this either.

I haven't checked; do you remember if other entry-point-ish integrations (delayed job/sidekiq/resque/sucker punch/sneakers/shoryuken/que/qless/kafka/...) either set the resource at the beginning, or are able to?

I'm curious if we'd need to deal with them setting the resource at the end, or if it really is just rack.

So if that isn't considered too big of a negative (which it isn't IMO), I would still opt for this change, and try to make this trace resource feature more robust by other means.

To be honest, if it's just for rack, I would suggest an additional alternative: just assume that the resource is a constant for a given span, like the span id and trace id are. This would mean dropping the trace_resource_container but also not bothering to pick "the most recent resource" during pprof aggregation -- just use the first one we found, i.e. effectively half of this PR.

WDYT?

@ivoanjo ivoanjo (Member) commented Sep 23, 2021

Actually one extra question: what happens if customers are using the rack integration together with rails/grape/etc... Is the root span going to be the rails/grape/etc... span, or the rack one?

@delner delner (Contributor, Author) commented Sep 23, 2021

I haven't checked; do you remember if other entry-point-ish integrations (delayed job/sidekiq/resque/sucker punch/sneakers/shoryuken/que/qless/kafka/...) either set the resource at the beginning, or are able to?

Yes, they are typically capable of doing so, as resource is not something discovered post-operation, but describes the operation about to occur. This is especially important because if the operation were to fail, we'd want it to tag the resource anyway, so it's clear what resource failed: this tagging can only happen before the failure.

what happens if customers are using the rack integration together with rails/grape/etc... Is the root span going to be the rails/grape/etc... span, or the rack one?

The "local root span" will always be the initial span (typically Rack), but these frameworks will modify the resource of this local root span (typically Rack) in order to have it reflect a pretty trace name given some assumptions about how the UI works. This is technically incorrect (i.e. Rack doesn't know anything about Rails routes), even if convenient.

It's possible the "local root span" for a trace will be different if the trace crosses execution contexts, that is, if the trace spans processes or threads. This is because we don't copy trace spans between execution contexts, just the trace and parent IDs, thus a new "local root span" is made.

This last part about execution contexts may change in the near future, as we change our model for context management. So it isn't safe to make any assumptions about the structure of a trace long term; a feature that is agnostic and sticks to the "active span" API will be much more resilient against such changes.

@delner delner (Contributor, Author) commented Sep 23, 2021

To be honest, if it's just for rack, I would suggest an additional alternative: just assume that the resource is a constant for a given span, like the span id and trace id are. This would mean dropping the trace_resource_container but also not bothering to pick "the most recent resource" during pprof aggregation -- just use the first one we found, i.e. effectively half of this PR.

I think this is fine.

Can you clarify your desired aggregation behavior in this case? Do you still want to grab the first encountered sample for a trace? (Sans timestamp-based updates?) Or drop the aggregation behavior entirely?

@ivoanjo ivoanjo (Member) commented Sep 24, 2021

I haven't checked; do you remember if other entry-point-ish integrations (delayed job/sidekiq/resque/sucker punch/sneakers/shoryuken/que/qless/kafka/...) either set the resource at the beginning, or are able to?

Yes, they are typically capable of doing so, as resource is not something discovered post-operation, but describes the operation about to occur. This is especially important because if the operation were to fail, we'd want it to tag the resource anyway, so it's clear what resource failed: this tagging can only happen before the failure.

👍 this makes a lot of sense. I guess only Rack, with its "pass the bucket along and let's see who uses it" approach, ends up being weird because it conflates the decision to "own" the request with the actual execution.

what happens if customers are using the rack integration together with rails/grape/etc... Is the root span going to be the rails/grape/etc... span, or the rack one?

The "local root span" will always be the initial span (typically Rack), but these frameworks will modify the resource of this local root span (typically Rack) in order to have it reflect a pretty trace name given some assumptions about how the UI works. This is technically incorrect (i.e. Rack doesn't know anything about Rails routes), even if convenient.

Ah... I guess this throws a wrench into my simplification idea. If I'm reading your note correctly, when rails/sinatra/grape are in use alone, then due to (#1626, #1628, #1629) it's guaranteed that if the profiler sees a resource, it would see the correct resource. BUT if the rack integration is enabled alongside those, then there's still a brief period between the rack span being created and execution getting to rails/sinatra/grape, which then "backfill" the rack span.

That brings us back to needing some "backfilling mechanism", which so far is either what is in master or what is in this PR. Thanks for the extra info and for putting up with the detour; now back to the regular discussion of the contents of this PR 😄


To start with, I'm ok if we break the "only-rack" testcase I posted above, because in general it's a testcase for integrations that only set the resource at the end of the request, and let's assume we can fix all those we care about to set the correct resource at the beginning.

Thus, the focus right now is the behavior of web framework integrations when combined with rack (I'll call them rack+webf to make it clear from now on that these are what I mean).

For longer-running rack+webf spans (I'll put some numbers on what longer means below), the mechanism in master and the mechanism in this PR are equivalent. Even if the profiler samples the rack+webf span before rails/... sets the correct resource name (at the beginning of their own span), it will still take more samples over the lifetime of the request, so I'm confident we'll eventually observe the correct resource, and thus this PR's approach of using the last one seen works 💯🚀 .

I'm less confident on the approach in this PR for short-running rack+webf spans.

Let's get into some numbers: the profiler employs a dynamic sampling rate, which samples up to 100 times per second. But in reality, we sample way less than that: profiling ddtracerb exec ruby -e sleep on my mostly-idle machine gets me up to 2240 samples per minute, or 37 samples per second, which gives an average of a sample every 27 ms. Since cloud servers usually aren't a) idle and b) designed for peak single-threaded clock speed, I think it's fair to treat these numbers as an upper bound.
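
Just to sanity-check those numbers (plain arithmetic based on the figures quoted above, nothing new measured):

# 2240 samples per minute on a mostly-idle machine works out to roughly
# 37 samples per second, i.e. one sample about every 27 ms on average.
samples_per_minute = 2240
samples_per_second = samples_per_minute / 60.0   # => ~37.3
average_gap_ms     = 1000.0 / samples_per_second # => ~26.8
puts format('%.1f samples/s, one every %.1f ms', samples_per_second, average_gap_ms)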

So, what this means for short-running rack+webf spans is that we'll get at most a single sample per span. And as servers get more loaded, more threads are used, and the profiler slows down its dynamic sampling rate, the more likely this becomes.

My concern is therefore: what if for this one sample, we miss the resource? With the mechanism in master, this is handled; we can get the resource at any point we want in the future. But, with the mechanism in this PR, we'll miss the resource. Thus, I would prefer to replace the mechanism in master with something that still allows this "get resource at any point in the future" behavior, although I do realize that we could do more experiments to determine if in practice we still get the resource most of the time, even for short-running rack+webf spans.

Thoughts?

@ivoanjo ivoanjo (Member) commented Sep 24, 2021

Thus, I would prefer to replace the mechanism in master with something that still allows this "get resource at any point in the future" behavior

In particular, this doesn't have to look quite like what master has; for instance I'd be happy with providing a callback object that the tracer would call when the root span resource is set, or when the root span is finished, and doing the bookkeeping myself.

(or other such ideas that change the way the information is conveyed)
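
To make the callback idea a bit more concrete, here's a minimal sketch of what I mean (entirely hypothetical: the tracer exposes no such hook today, and all names below are made up for illustration):

# Hypothetical sketch: the tracer would notify a listener whenever a root span's
# resource is set, and profiling would keep its own trace_id => resource map to
# consult later, at pprof-encoding time. None of these hooks exist in ddtrace.
class TraceResourceListener
  def initialize
    @resources = {}
    @mutex = Mutex.new
  end

  # Would be called by the tracer when a root span's resource is set.
  def on_root_span_resource(trace_id, resource)
    @mutex.synchronize { @resources[trace_id] = resource }
  end

  # Consulted by the profiler when building the pprof, at any later point.
  def resource_for(trace_id)
    @mutex.synchronize { @resources[trace_id] }
  end
end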

@delner delner (Contributor, Author) commented Sep 24, 2021

So, pulling together some of the constraints/circumstances we have listed in each of our comments:

  • Resource on "rack" is mutated when a web framework is present today, but won't be in the future.
  • On master, if a trace is sampled once, during Rack but before Rails, and Rails occurs after that batch of samples are flushed from the recorder, it "might" get the updated resource from Rails if the mutation happens before the resource is accessed.
  • In this PR, in the same scenario, it will not receive the updated resource.

I think this is the biggest difference in functional behavior between the two.

Some thoughts about the pragmatism of the outcomes here:

  • It might be acceptable to use the Rack resource for a Rails endpoint in the above scenario IF you consider the end of a sampling period to be a true "cut off." The samples are in fact representative of what happened in that time frame. The behavior on master is a bit of revisionist history if you will: amending what happened in the past based on what happened in a future period. It wasn't doing Rails work, it was doing Rack work.
  • There's likely a sampling bias in our favor: if users are most concerned about long-running Rails requests, the odds of them being sampled during Rails are very high, meaning it will have the better resource name. Profiles of short running Rack/Rails requests are less likely to be of interest.

All things considered, I'm tempted to say we gain more by adopting this PR in the spirit it's written, as I think we would sacrifice little on the usage side. (Still open to more information I hadn't considered on this point, if there's more to be said, but this is my current leaning.)

However:

In particular, this doesn't have to look quite like what master has; for instance I'd be happy with providing a callback object that the tracer would call when the root span resource is set, or when the root span is finished, and doing the bookkeeping myself.

This does remind me we do have a processing pipeline mechanism in which you can add a proc to look at each output span. Could be done when the profiler sets up. I'm not sure this would be good enough, because the pipeline is only triggered when the whole trace is flushed (root span is finished), not when individual spans are closed.

(EDIT: actually worse, I think pipeline only runs when the writer runs, as often as every second, so this is probably no good.)
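
(For reference, hooking into that pipeline looks roughly like the sketch below, using the Datadog::Pipeline.before_flush API with a made-up record_trace_resource helper for bookkeeping; as noted, these processors only run at flush time, which is the problem.)

# Hypothetical bookkeeping store (illustration only).
TRACE_RESOURCES = {}

def record_trace_resource(trace_id, resource)
  TRACE_RESOURCES[trace_id] = resource
end

# Rough sketch of using the existing processing pipeline. The span processor
# only runs when finished traces are flushed by the writer, which is why this
# is probably too late for profiling's purposes.
Datadog::Pipeline.before_flush(
  Datadog::Pipeline::SpanProcessor.new { |span|
    # Record the final resource of local root spans (parent_id == 0) so
    # profiling could, hypothetically, look them up later.
    record_trace_resource(span.trace_id, span.resource) if span.parent_id == 0
  }
)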

As we consider changes to context management for tracing, there might be a similar opportunity to implement callbacks for spans. It might even be acceptable to somehow tie this into the SpanOperation implemented in #1675, and have this construct implement a callback for on_finish (something that Python more or less did already.)

The good news is that SpanOperation would be accessible for profiling via active_span, which it could retrieve, and subscribe to with a proc that updates the sample. Haven't thought through all the possible implications (and I may not love the idea when I have), but SpanOperation is meant to encapsulate all this kind of communication between components, so it would address my concern about making Span too complex.

All that said, I think my order of preference in resolving this is:

  1. Eliminating all of this code and doing aggregation on the Datadog side.
  2. Using this PR
  3. Possibly implementing callbacks with SpanOperation
  4. Doing nothing

Let me know what you think @ivoanjo.

@ivoanjo ivoanjo (Member) commented Sep 27, 2021

  • On master, if a trace is sampled once, during Rack but before Rails, and Rails occurs after that batch of samples are flushed from the recorder, it "might" get the updated resource from Rails if the mutation happens before the resource is accessed.

Not quite :)
The recorder doesn't really come into play here; the issue is purely on the sampling side: if, when we sample the request, we don't see the correct resource (even if it is set later, once it gets to Rails), then we've lost our chance. The ultimate case of this is setting the resource at the end (which leads to the resource being missed 100% of the time), but for rack+webf traces I think we need some data on what % is missed.

  • It might be acceptable to use the Rack resource for a Rails endpoint in the above scenario IF you consider the end of a sampling period to be a true "cut off." The samples are in fact representative of what happened in that time frame. The behavior on master is a bit of revisionist history if you will: amending what happened in the past based on what happened in a future period. It wasn't doing Rails work, it was doing Rack work.

Note that this is used as a grouping/slicing mechanism on the profiler side, not as a "blame Rails" mechanism. The stacks will obviously show exactly where in the code the cpu/wall-clock time was spent, so customers will be able to tell that, to serve this Rails endpoint, a lot of time was actually spent in some Rack middleware they have, even before it gets to Rails.

  • There's likely a sampling bias in our favor: if users are most concerned about long-running Rails requests, the odds of them being sampled during Rails are very high, meaning it will have the better resource name. Profiles of short running Rack/Rails requests are less likely to be of interest.

I'd say this is where tracing and profiling end up separating. We're interested in providing visibility into everything the app is doing, both short and long tasks. If an app does 99.99% short requests, and a few outlier long requests, we want to have information on all of them; hotspots data becomes available for requests longer than 10ms, and for the shorter ones users can look at individual profiles and/or aggregated data.

This does remind me we do have a processing pipeline mechanism in which you can add a proc to look at each output span. Could be done when the profiler sets up. I'm not sure this would be good enough, because the pipeline is only triggered when the whole trace is flushed (root span is finished), not when individual spans are closed.

(EDIT: actually worse, I think pipeline only runs when the writer runs, as often as every second, so this is probably no good.)

As we consider changes to context management for tracing, there might be a similar opportunity to implement callbacks for spans. It might even be acceptable to somehow tie this into the SpanOperation implemented in #1675, and have this construct implement a callback for on_finish (something that Python more or less did already.)

On closer inspection, my suggestion of "for instance I'd be happy with providing a callback object that the tracer would call when the root span resource is set, or when the root span is finished, and doing the bookkeeping myself" would actually only work if the callback was called when the root span resource is set, not when the span finishes.

Otherwise, we're back to the rack+webf issue I described -- if the callback is only called at the end of the span, that's already too late.

So an on_finish callback would not solve the issue; it really needs to fire when the resource is set.

All that said, I think my order of preference in resolving this is:

  1. Eliminating all of this code and doing aggregation on the Datadog side.
  2. Using this PR
  3. Possibly implementing callbacks with SpanOperation
  4. Doing nothing

I don't think #1 is technically feasible at this time; but I definitely would not object to it, as it would simplify a lot on the client side and effectively allow us to remove the whole resource-gathering/tracking code.

In my opinion, there are a bunch of things in flux in the planning for dd-trace-rb: refactoring spans (+ adding callbacks?), possibly stopping rack resource mutation, etc.; and thus I favor keeping the approach from master until we're far enough into those changes/redesigns to be able to add a replacement mechanism.

@delner delner (Contributor, Author) commented Sep 27, 2021

Note that this is used as a grouping/slicing mechanism on the profiler side, not as a "blame Rails" mechanism. The stacks will obviously show exactly where in the code the cpu/wall-clock time was spent, so customers will be able to tell that, to serve this Rails endpoint, a lot of time was actually spent in some Rack middleware they have, even before it gets to Rails.

But it hasn't reached Rails in that time period, and from the context of the time period, it may never reach Rails if the Rack middleware short-circuits. Thus modifying it retrospectively, after the time period, when it might be in Rails, is revising history, which is why I don't think this is accurate or good practice.

If we were aggregating all the samples from a trace's lifetime in the application, it would be more appropriate to look at the set and select the best resource name. It's not really the responsibility of an individual span (Rack) to decide & communicate what the trace name is, as it only has a small set of the trace information (which is why Rack doesn't set a better resource in the first place.)

In my opinion, there are a bunch of things in flux in the planning for dd-trace-rb: refactoring spans (+ adding callbacks?), possibly stopping rack resource mutation, etc.; and thus I favor keeping the approach from master until we're far enough into those changes/redesigns to be able to add a replacement mechanism.

IMO, this actually reduces the flux by changing Span back to what it was, which simplifies the design and behavior (as described in the original post.) Decisions about resource mutation, SpanOperation and so on are discussions to be had and decided upon in those PRs.

Having considered the pros (simpler design, more concurrency-safe code) and cons (an edge case where requests are sampled only during Rack but never during or after Rails), I think the stability trade-offs are worthwhile, so I intend to merge this PR. Thank you for your feedback; I think it's given us a thorough understanding of the behavior here, and it will be useful when considering another iteration on this feature.

@delner delner merged commit 08dfe89 into master Sep 27, 2021
@delner delner deleted the refactor/profiling_use_recent_event branch September 27, 2021 15:37
@github-actions github-actions bot added this to the 0.53.0 milestone Sep 27, 2021