Distributed tracing: continue multiple traces #2964

Open
HoneyryderChuck opened this issue Jul 13, 2023 · 7 comments
Labels: community (Was opened by a community member), feature-request (A request for a new feature or change to an existing one), tracing

Comments

@HoneyryderChuck
Contributor

Is your feature request related to a problem? Please describe.

The problem I'm having relates to a set of concurrent operations (under a given Datadog trace) which pipe a data structure into a queue; this queue is then flushed into a single batch, which I then send to Firehose via put_record_batch (AWS SDK Firehose).

Describe the goal of the feature

As per the description above: given a set of operations which propagate tracing context into the data structure in the intermediate queue, I'd like to resume all of their traces once they have been fetched into a batch, before making the AWS API call mentioned above (which will start a new aws-sdk related trace, and should pick up all the traces from the messages in the batch).

Describe alternatives you've considered

I didn't really think about an alternative. Not sure if it's a bug either, or something accomplished by another datadog feature.

How does ddtrace help you?

The distributed tracing feature certainly helps me keep tabs on a given flow split into chunks of separate concurrent / distributed workloads.

@HoneyryderChuck added the community and feature-request labels Jul 13, 2023
@delner
Contributor

delner commented Jul 13, 2023

I'm interested in your use case, but I'm having trouble understanding the trace flow you're describing, and visualizing the desired output in the UI. Particularly, I'm unfamiliar with "firehose" and how this batching procedure works.

Is there any way to diagram this? Or share pseudo-code that demonstrates how things should work? What have you tried using in the existing distributed tracing? Where did it fall short?

@delner self-assigned this Jul 13, 2023
@HoneyryderChuck
Contributor Author

The AWS SDK function I mean is this one.

Essentially, I have a background thread, let's say in a sidekiq process, collecting data from a queue, and then batch-sending:

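require "aws-sdk-firehose"

firehose = Aws::Firehose::Client.new
$queue = Queue.new

Thread.start do
  loop do
    # Block until at least one item arrives, then drain up to 40 in total.
    records = [{ data: $queue.pop }]
    while !$queue.empty? && records.size < 40
      records << { data: $queue.pop }
    end
    firehose.put_record_batch(
      delivery_stream_name: "my-stream", # placeholder stream name
      records: records
    )
  end
end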

Meanwhile, any sidekiq worker may be writing to that queue:

def perform(arg)
  data = heavy_computation(arg)
  $queue << data
end

What I would like is to link the trace from the sidekiq worker all the way down to the firehose put record call writing that data field. It's sorta-kinda what the distributed tracing modules do (i.e. pass trace id to the next request / job queue / smth else), but in this case, because the sdk call involves many units of work (in this case 40 of them), I'd like a way to say "please continue these 40 traces in the next call to the aws sdk which starts a trace":

def perform(arg)
  data = heavy_computation(arg)
  $queue << add_trace_context(data)
end

# and later

records = []
while !$queue.empty? && records.size < 40
  data = $queue.pop
  DDtrace.continue_trace(data) # proposed API, does not exist today; 40 calls at once!
  records << { data: data[:data] }
end
# and now the aws sdk call starts a new trace, continuing the 40 above.
firehose.put_record_batch(delivery_stream_name: "my-stream", records: records) # placeholder stream name
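
To make the shape of the request a bit more concrete, here is a minimal, self-contained sketch of the queue side only. It assumes the trace context travels in an envelope next to the payload; `TraceContext`, `Envelope`, `drain_batch`, and the two-argument `add_trace_context` are hypothetical names for illustration and are not part of ddtrace or the AWS SDK.

```ruby
# Hypothetical sketch: none of these names come from ddtrace or the AWS SDK.
TraceContext = Struct.new(:trace_id, :span_id)
Envelope = Struct.new(:context, :data)

# Wrap a payload together with the trace context of the job that produced it.
def add_trace_context(data, context)
  Envelope.new(context, data)
end

# Pop up to `limit` envelopes without blocking and return the payloads
# and their trace contexts as two parallel arrays.
def drain_batch(queue, limit: 40)
  records = []
  contexts = []
  while records.size < limit
    begin
      envelope = queue.pop(true) # non-blocking pop; raises ThreadError when empty
    rescue ThreadError
      break
    end
    records << envelope.data
    contexts << envelope.context
  end
  [records, contexts]
end

queue = Queue.new
3.times do |i|
  queue << add_trace_context("payload-#{i}", TraceContext.new(100 + i, 1))
end

records, contexts = drain_batch(queue)
# `contexts` is what a multi-trace "continue" API would need to receive
# before the batch call is made.
```

The payloads and contexts stay in the same order, so each record in the batch can still be matched to the trace that produced it.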

@delner
Contributor

delner commented Jul 19, 2023

I see; this definitely helps a lot!

If I'm understanding correctly, it's basically a "fan-in", where the trace of the process reading from the queue is downstream of 40 other traces. In other words, the trace has 40+ parents (the 40 sidekiq jobs and whatever started this process itself.)

firehose.process ---------------------------> queue.pop
                                         /
sidekiq.job -----------------------------
                                       /
sidekiq.job ---------------------------
                                     /
38 others... ------------------------

I think it effectively boils down to a problem of multiple inheritance... what is a parent? And if you have multiple parents (each of which is its own trace), which trace do you belong to?

The trace of the "queue reading operation" should almost certainly annotate all of its ancestors (the firehose process and the 40 Sidekiq jobs preceding its own work.) It's just a question of how it should be annotated, and how it should be displayed in the UX.

Regarding annotation...

  • It probably would make the most sense that the parent is any operation that synchronously preceded the current operation within the same execution context. This would mean there would only ever be, at most, one parent: in this case, the "firehose process".
  • Any operation that precedes the current in a separate execution context (aka asynchronously) would not be a parent. Instead the current operation would be annotated to have "followed from" that operation. In this case, the queue reading operation will have "followed from" 40 Sidekiq jobs.
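
The parent / "followed from" distinction above could be modelled roughly like this. It is only a sketch under assumed names: `SpanRef`, `SpanSketch`, and `follow_from` are made up for illustration and do not exist in ddtrace.

```ruby
# Hypothetical data model; not the ddtrace API.
SpanRef = Struct.new(:trace_id, :span_id)

class SpanSketch
  attr_reader :name, :parent, :follows_from

  def initialize(name, parent: nil)
    @name = name
    @parent = parent   # at most one synchronous parent
    @follows_from = [] # any number of asynchronous predecessors
  end

  # Record an operation from another execution context that preceded this one.
  def follow_from(ref)
    @follows_from << ref
    self
  end

  # The span belongs to its parent's trace (nil when it has no parent).
  def trace_id
    parent&.trace_id
  end
end

firehose_process = SpanRef.new(1, 10)
queue_read = SpanSketch.new("queue.pop", parent: firehose_process)
40.times { |i| queue_read.follow_from(SpanRef.new(1000 + i, 1)) }
```

In this model the queue reading operation stays in exactly one trace (the firehose process), while the 40 Sidekiq jobs are kept as references rather than as parents.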

Regarding display...

...I don't know if we have a good way of displaying this in a trace view (aka "flamegraph"). This visualization is great for showing synchronous processes, but doesn't do well with asynchronous behavior. And it definitely doesn't show multiple traces today. IMO, it would be more compelling to show a "trace graph" where you could see the "queue read operation" as a node on a directed graph, with the 40 sidekiq job nodes pointing to it. Drilling into each node would display the "flamegraph" for that individual trace.
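
As a rough illustration of the "trace graph" idea, the sketch below stores traces as nodes and "followed from" relations as directed edges. `TraceGraph` and the node names are hypothetical and only meant to show the data shape a UI could render from.

```ruby
# Hypothetical sketch of a "trace graph": nodes are traces,
# edges point from a preceding trace to the trace that followed from it.
class TraceGraph
  def initialize
    @incoming = Hash.new { |hash, key| hash[key] = [] }
  end

  # Record that `to` followed from `from`.
  def add_edge(from, to)
    @incoming[to] << from
    self
  end

  # All traces that point at `node`; empty for nodes with no incoming edges.
  def predecessors(node)
    @incoming.fetch(node, [])
  end
end

graph = TraceGraph.new
40.times { |i| graph.add_edge("sidekiq.job.#{i}", "queue.read") }
```

Drilling into a node would then mean looking up the flamegraph for that single trace, while the edges carry the fan-in.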


Today, this behavior doesn't exist, and it goes beyond the Ruby library. But it's an interesting use case; I'd love to float this to our internal team to have them examine it further and consider what we could do to support it. Any description of the visual output/UX of what you might expect to see by doing this would be helpful in contextualizing the feature request and making the value clear.

@HoneyryderChuck
Contributor Author

Thx for taking interest!

Visually, I don't expect anything wildly different from what exists today: for instance, if you jump into a "child trace" and zoom out, you may see multiple traces above, as the root trace may have spawned many "child traces", each with their own children, and you can only visually identify the hierarchy by the lines which vertically cut them. So in the case of a firehose trace in the example above, that line will connect with several other "uber traces", which aren't related to each other (which may be difficult considering the limitations of the current implementation).

@delner
Contributor

delner commented Jul 21, 2023

I tried brainstorming what a visualization could look like (granted I'm not a designer)...

[Image: Tracing use cases - Follows from flamegraph]

Does that roughly model what you're looking for? (For the sake of argument?)

I think the team has some interest in adding support for more interconnection between related traces like this. It could be a while before this sort of thing is supported first class in Ruby, pending those designs. I'll update you on this as there are developments.

@HoneyryderChuck
Contributor Author

It looks like a decent start, for sure 👌thx

@delner
Contributor

delner commented Aug 15, 2023

Just an update: our team is internally looking at ways to support this kind of behavior more broadly in tracing & APM. I'm sharing this example as a use case to help contextualize our designs. I'd like to see if we can figure out some first-class support on that end before implementing anything in Ruby. It could take a bit, though. We'll try to keep you updated when we have something to share!

Thanks for highlighting this @HoneyryderChuck!

@ivoanjo added the tracing label Nov 7, 2023
3 participants