Tail Sampling Processor is using async spans to count the latency duration #24912
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
To clarify: the trace took 505ms, where 5ms was processing time and 500ms was waiting time. Configuring the tail-sampling processor with a threshold of 10ms samples this trace, and you expect it not to. Is that correct?
@jpkrohling Greetings, fellow Brazilian! Thanks for being the code owner of the tail sampling processor.
Got it. I suppose there are two ways of interpreting that: one being the total amount of time as experienced by the user, computed from the root span's start timestamp to the latest timestamp+duration across all spans (which is the current algorithm, as far as I remember); the second being the sum of the span durations minus the overlap between them. It looks like you want the second. I'd be curious to see a PR for this. My feeling, without attempting it, is that it would take quite some cycles to compute an accurate duration for that.
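For illustration, here is a minimal sketch of the two interpretations discussed above, assuming spans are given as simple start/end pairs. This is not the processor's actual code, just a way to make the two definitions concrete:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type span struct{ start, end time.Time }

// wallClockDuration returns the time from the earliest span start to the
// latest span end -- the "total time as experienced by the user"
// interpretation. Assumes spans is non-empty.
func wallClockDuration(spans []span) time.Duration {
	earliest, latest := spans[0].start, spans[0].end
	for _, s := range spans[1:] {
		if s.start.Before(earliest) {
			earliest = s.start
		}
		if s.end.After(latest) {
			latest = s.end
		}
	}
	return latest.Sub(earliest)
}

// mergedDuration returns the sum of span durations with overlaps counted
// only once -- the "sum of durations minus overlap" interpretation.
// Assumes spans is non-empty.
func mergedDuration(spans []span) time.Duration {
	sorted := make([]span, len(spans))
	copy(sorted, spans)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start.Before(sorted[j].start) })

	var total time.Duration
	cur := sorted[0]
	for _, s := range sorted[1:] {
		if s.start.After(cur.end) {
			// Gap between intervals: close the current one.
			total += cur.end.Sub(cur.start)
			cur = s
		} else if s.end.After(cur.end) {
			// Overlapping span extends the current interval.
			cur.end = s.end
		}
	}
	return total + cur.end.Sub(cur.start)
}

func main() {
	t0 := time.Unix(0, 0)
	spans := []span{
		{t0, t0.Add(5 * time.Millisecond)},                             // request handling
		{t0.Add(4 * time.Millisecond), t0.Add(504 * time.Millisecond)}, // async Kafka produce
	}
	// Both interpretations yield ~504ms here, because the async span
	// overlaps the request span; neither reduces to the ~5ms request
	// lifetime that the reporter expects.
	fmt.Println(wallClockDuration(spans)) // 504ms
	fmt.Println(mergedDuration(spans))    // 504ms
}
```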
\o/
Hi @jpkrohling, in my initial message I talk about the use case of an async call at the end of request processing. The async call can finish long after the client has received the response!
You are absolutely right. Those are typically the same for sync processes, but likely different when async parts are involved. Anyway, I'd be interested in reviewing a PR for that.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@tonidas, are you still interested in working on this?
@jpkrohling thanks for the opportunity, but I don't think I'll have time to implement this for a while. |
That's alright. I'm closing this for now, but if you (or anyone else) is interested in contributing code for this feature, I can reopen. |
Component(s)
processor/tailsampling
What happened?
Description
The Tail Sampling Processor can sample more traces than it should if there is async activity at the end of the request trace.
Steps to Reproduce
Send a request trace that has an async Kafka produce at the end of the request (one that finishes after the response has already been sent). Let's assume the request had 5ms of execution time and the Kafka produce had 500ms of execution time.
Expected Result
If the Tail Sampling Processor is configured with a latency threshold of 10ms, I expect this trace not to be sampled.
Actual Result
If the Tail Sampling Processor is configured with a latency threshold of 10ms, this trace will get sampled.
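For reference, a latency policy of the kind described above is configured roughly like this. The values are illustrative and not the reporter's actual configuration:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 10
```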
Collector version
v0.8.x
Environment information
Environment
OS: Amazon Linux 2
Compiler(if manually compiled): N/A
OpenTelemetry Collector configuration
Log output
No response
Additional context
Right now I'm running a patched otel-collector-contrib that ignores Kafka producer spans in the Tail Sampling Processor's latency calculation, but it's not generic enough to be submitted as a pull request. Is there a general way to detect async spans?
I think the best approach is to provide a configuration option that lets the user choose between those two behaviours, with the default being to ignore async spans. The reasoning is that it's much more straightforward to think of latency as the duration of the request lifetime.
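A purely hypothetical sketch of what such an option could look like; the `duration_mode` field below does not exist in the processor today and is only meant to make the proposal concrete:

```yaml
policies:
  - name: slow-traces
    type: latency
    latency:
      threshold_ms: 10
      # Hypothetical option -- not implemented:
      # "trace" = earliest span start to latest span end (current behaviour)
      # "root"  = root span duration only, i.e. the request lifetime
      duration_mode: root
```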