-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-9263] Add experimental support for profiling code hotspots when used with opentelemetry ruby gem #3510
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! And so safe you should be designing baby furniture 😄
@@ -734,6 +765,11 @@ static void trigger_sample_for_thread( | |||
struct trace_identifiers trace_identifiers_result = {.valid = false, .trace_endpoint = Qnil}; | |||
trace_identifiers_for(state, thread, &trace_identifiers_result); | |||
|
|||
if (!trace_identifiers_result.valid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth/possible doing a bit of extra work at the start to arrive at a sticky decision here and short-circuit constantly failing trace_identifiers_for
with all its rb_var_get if ddtrace is not used for tracing at all?
Or thinking is that we want to support situations where there's a mix of ddtrace and pure-ot traces and/or the ability to change between one and the other dynamically (e.g. via a feature flag)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your suggestion makes sense.
My intent here in checking both is that the profiler may start quite early in the app lifecycle, so we may not know which one is going to be used yet.
Or thinking is that we want to support situations where there's a mix of ddtrace and pure-ot traces and/or the ability to change between one and the other dynamically (e.g. via a feature flag)?
I'm not sure mixing is even possible at this point, since the ddtrace otel support monkey patches itself pretty deep into opentelemetry (which is why I needed to contort a bit to be able to test both).
For that reason, and after our last discussion, I think it makes sense to stop checking opentelemetry once we see data coming from ddtrace traces.
The reverse is harder to figure out, actually. It would be weird, but not impossible, for an app that started with opentelemetry to then switch over to ddtrace.
TL;DR: I'll wait for feedback from our customer on how this is working before acting on this comment, just in case we end up going in a completely different direction BUT I'll definitely come back to it before marking the PR as non-draft.
We're waiting for a bit more feedback on if this is the right approach before going forward. If there's not a lot of movement in a few months we can close this PR. |
…tel gem Things missing: * Specs conflict with ddtrace otel specs (need to poke at appraisals) * Missing endpoint support
While we don't need the actual span object to read the span ids, we will need it to read the endpoint names.
… specs I'm... unhappy about this, but couldn't think of anything better that wouldn't involve refactoring the ddtrace tracing otel support and that seems even worse.
Sigh old rubies...
796ea12
to
19646f2
Compare
Update: I've rebased this PR on top of the v2.3.0 stable release to help testing. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3510 +/- ##
==========================================
- Coverage 97.86% 97.84% -0.03%
==========================================
Files 1271 1271
Lines 75976 76097 +121
Branches 3739 3746 +7
==========================================
+ Hits 74356 74455 +99
- Misses 1620 1642 +22 ☔ View full report in Codecov by Sentry. |
BenchmarksBenchmark execution time: 2024-10-01 15:28:53 Comparing candidate commit 6b67c03 in PR branch Found 1 performance improvements and 0 performance regressions! Performance is the same for 22 metrics, 2 unstable metrics. scenario:profiler - sample timeline=false
|
This PR is going to be superceded by #3984 . I'm going to leave it open for a while longer to give customers testing this feature time to migrate to a release containing the newer version of this PR. |
Erghh our automation was a bit too automated -- I've restored the branch, and removed this from the 2.4.0 milestone (since it was #3984 that got merged). |
What does this PR do?
This PR adds experimental support for getting profiling code hotspots data (including endpoint profiling) when profiling processes being traced using the opentelemetry ruby gem directly.
Note that this differs from the recommended way of using opentelemetry with the ddtrace library, which is to follow the instructions from https://docs.datadoghq.com/tracing/trace_collection/custom_instrumentation/otel_instrumentation/ruby/ .
The key difference is -- this PR makes code hotspots work even for setups that opt to not use
require 'datadog/opentelemetry'
(which is the recommended and easier way).The approach taken here is similar to #2342 and #3466: we peek inside the implementation of the opentelemetry gem to extract the information we need (namely the span id, local root span id, trace type, and trace endpoint). This approach is potentially brittle, which is why the code is written very defensively, with the aim of never breaking the application (or profiling) if something is off -- it just won't collect code hotspots.
Motivation:
We have a customer interested in running this setup, so hopefully they'll be able to test this PR and validate if it works for them.
Furthermore, I'm hoping to see if the opentelemetry Ruby folks would be open to tweaking their APIs to be more friendlier to tools such as the profiler, but for now I opted for getting our hands dirt.
Additional Notes:
I'm opening this PR as draft until we can get feedback from the customer and see if this works for them.
How to test the change?
On top of the added test coverage, I was able to see code hotspots working for the following sinatra example app:
After doing a few requests, here's how this looks: