Distributed tracing: continue multiple traces #2964
I'm interested in your use case, but I'm having trouble understanding the trace flow you're describing and visualizing the desired output in the UI. In particular, I'm unfamiliar with "firehose" and how this batching procedure works. Is there any way to diagram this? Share pseudo-code that demonstrates how things should work? What have you tried using in the existing distributed tracing? Where did it fall short?
The AWS SDK function I mean is this one. Essentially, I have a background thread, let's say in a Sidekiq process, collecting data from a queue and then batch-sending:

```ruby
firehose = Aws::Firehose::Client.new
$queue = Queue.new

Thread.start do
  loop do
    records = []
    while !$queue.empty? && records.size < 40
      records << $queue.pop
    end
    firehose.put_record_batch(records)
  end
end
```

Meanwhile, any Sidekiq worker may be writing to that queue:

```ruby
def perform(arg)
  data = heavy_computation(arg)
  $queue << data
end
```

What I would like is to link the trace from the Sidekiq worker all the way down to the firehose put-record call writing that data:

```ruby
def perform(arg)
  data = heavy_computation(arg)
  $queue << add_trace_context(data)
end

# and later
while !$queue.empty? && records.size < 40
  data = $queue.pop
  DDtrace.continue_trace(data) # 40 calls at once!
  records << data[:data]
end

# and now the aws-sdk call starts a new trace, continuing the 40 above.
firehose.put_record_batch(records)
```
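The propagation step above can be sketched in plain Ruby, independent of any tracing library. This is a minimal sketch under stated assumptions: `TraceEnvelope` and `add_trace_context` are hypothetical helpers invented here, and the integer trace/span ids stand in for whatever context `ddtrace` would actually expose.

```ruby
require "thread"

# Hypothetical envelope pairing a payload with its originating trace context.
TraceEnvelope = Struct.new(:trace_id, :span_id, :data)

# Sketch of add_trace_context: capture the caller's trace context
# (passed in explicitly here) alongside the data.
def add_trace_context(data, trace_id:, span_id:)
  TraceEnvelope.new(trace_id, span_id, data)
end

queue = Queue.new

# Producer side: each "worker" tags its payload with its own trace context.
3.times do |i|
  queue << add_trace_context({ value: i }, trace_id: 100 + i, span_id: 200 + i)
end

# Consumer side: drain into a batch, collecting the parent contexts so the
# batch-send span could later reference every originating trace.
records = []
parents = []
while !queue.empty? && records.size < 40
  envelope = queue.pop
  parents << [envelope.trace_id, envelope.span_id]
  records << envelope.data
end

# records now holds the raw payloads; parents holds one link per source trace.
```

The point of the envelope is that the batch reader ends up with both the payloads to send and the full list of upstream trace references, which is exactly the input a "continue 40 traces at once" API would need.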
I see; this definitely helps a lot! If I'm understanding correctly, it's basically a "fan-in", where the trace of the process reading from the queue is downstream of 40 other traces. In other words, the trace has 40+ parents (the 40 Sidekiq jobs, plus whatever started this process itself).
I think it effectively boils down to a problem of multiple inheritance... what is a parent? And if you have multiple parents (each of which is its own trace), which trace do you belong to? The trace of the "queue reading operation" should almost certainly annotate all of its ancestors (the firehose process and the 40 Sidekiq jobs preceding its own work). It's just a question of how it should be annotated, and how it should be displayed in the UX. Regarding annotation...
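One way to model the multiple-parents annotation, borrowing the "span links" idea from OpenTelemetry, is to keep a single structural parent and record every other ancestor as a link rather than forcing one parent relationship. A minimal sketch, with a hypothetical `Span` structure that is not a real ddtrace API:

```ruby
# Hypothetical span record: one primary parent, plus links to any number
# of other traces that causally precede this span (the fan-in case).
Span = Struct.new(:name, :parent_id, :links)

# The 40 Sidekiq job spans each end their own trace...
job_links = (1..40).map { |id| { trace_id: id, span_id: id * 10 } }

# ...and the queue-read span links back to all of them, while keeping the
# firehose process's own span as its structural parent.
read_span = Span.new("queue.read", 999, job_links)
```

This keeps the trace tree well-formed (every span still has exactly one parent) while preserving the causal relationship to all 40 upstream traces as queryable metadata.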
Regarding display... I don't know if we have a good way of displaying this in a trace view (aka "flamegraph"). This visualization is great for showing synchronous processes, but doesn't do well with asynchronous behavior. And it definitely doesn't show multiple traces today. IMO, it would be more compelling to show a "trace graph", where you could see the "queue read operation" as a node on a directed graph, with the 40 Sidekiq job nodes pointing to it. Drilling into each node would display the "flamegraph" for that individual trace. Today, this behavior doesn't exist, and it goes beyond the Ruby library. But it's an interesting use case; I'd love to float this to our internal team to have them examine it further and consider what we could do to support it. Any description of the visual output/UX of what you might expect to see by doing this would be helpful in contextualizing the feature request and making the value clear.
Thx for taking interest! Visually, I don't expect anything wildly different than what exists today: for instance, if you jump into a "child trace" and zoom out, you may see multiple traces above, as the root trace may have spawned many "child traces", each with their own children, and you can only visually identify the hierarchy by the lines which vertically cut them. So in the case of a firehose trace in the example above, that line would connect with several other "uber traces" which aren't related to each other (which may be difficult considering the limitations of the current implementation).
I tried brainstorming what a visualization could look like (granted, I'm not a designer)... Does that roughly model what you're looking for? (For the sake of argument?) I think the team has some interest in adding support for more interconnection between related traces like this. It could be a while before this sort of thing is supported first class in Ruby, pending those designs. I'll update you on this as there are developments.
It looks like a decent start, for sure 👌 thx
Just an update: our team is internally looking at ways to support this kind of behavior more broadly in tracing & APM. I'm sharing this example as a use case to help contextualize our designs. I'd like to see if we can figure out some first-class support on that end before implementing anything in Ruby. It could take a bit, though. We'll try to keep you updated when we have something to share! Thanks for highlighting this @HoneyryderChuck!
Is your feature request related to a problem? Please describe.
The problem I'm having relates to a set of concurrent operations (under a given Datadog trace) which pipe a data structure into a queue; this queue is then flushed into a single batch, which I then send to a firehose via `put_record_batch` (AWS SDK Firehose).

Describe the goal of the feature
As per the description above: given a set of operations which propagate tracing context into the data structure in the intermediate queue, I'd like to resume all of those traces once the items are fetched into a batch, before calling the mentioned AWS API (which will start a new aws-sdk related trace and should pick up all the traces from the messages in the batch).
Describe alternatives you've considered
I didn't really think about an alternative. Not sure if it's a bug either, or something accomplished by another datadog feature.
How does `ddtrace` help you?

The distributed tracing feature certainly helps me keep tabs on a given flow split into chunks of separate concurrent / distributed workloads.