
[PROF-9263] Add experimental support for profiling code hotspots when used with opentelemetry ruby gem #3510

Merged: 10 commits merged into master on Oct 10, 2024

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Mar 6, 2024

What does this PR do?

This PR adds experimental support for getting profiling code hotspots data (including endpoint profiling) when profiling processes being traced using the opentelemetry ruby gem directly.

Note that this differs from the recommended way of using opentelemetry with the ddtrace library, which is to follow the instructions from https://docs.datadoghq.com/tracing/trace_collection/custom_instrumentation/otel_instrumentation/ruby/ .

The key difference: this PR makes code hotspots work even for setups that opt not to use require 'datadog/opentelemetry' (which is the recommended and easier way).

The approach taken here is similar to #2342 and #3466: we peek inside the implementation of the opentelemetry gem to extract the information we need (namely the span id, local root span id, trace type, and trace endpoint). This approach is potentially brittle, which is why the code is written very defensively, with the aim of never breaking the application (or profiling) if something is off -- it just won't collect code hotspots.
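To illustrate the defensive style described above, here's a minimal sketch (not the actual dd-trace-rb implementation, which lives in the native extension): a reader that extracts span identifiers from an OpenTelemetry-like span object and never raises, so a mismatch in the gem's internals degrades to "no code hotspots" rather than an exception. The `DefensiveSpanReader` name and the stand-in `FakeSpan`/`FakeContext` objects are hypothetical; `hex_span_id`/`hex_trace_id` mirror the shape of the opentelemetry-api gem's `SpanContext`.

```ruby
# Hedged sketch of the "written very defensively" approach: every access is
# guarded, and any failure results in nil (skip code hotspots) rather than
# an exception propagating into the host application.
module DefensiveSpanReader
  # Returns [span_id, trace_id] as Integers, or nil if anything looks off.
  def self.identifiers_for(span)
    return nil unless span.respond_to?(:context)

    context = span.context
    return nil unless context.respond_to?(:hex_span_id) && context.respond_to?(:hex_trace_id)

    [Integer(context.hex_span_id, 16), Integer(context.hex_trace_id, 16)]
  rescue StandardError
    nil # Never break the application (or profiling); just collect no hotspots
  end
end

# Stand-ins mimicking the opentelemetry-api span/context shape
FakeContext = Struct.new(:hex_span_id, :hex_trace_id)
FakeSpan = Struct.new(:context)

span = FakeSpan.new(FakeContext.new('00f067aa0ba902b7', '4bf92f3577b34da6a3ce929d0e0e4736'))
p DefensiveSpanReader.identifiers_for(span)       # two integers
p DefensiveSpanReader.identifiers_for(Object.new) # nil: wrong shape, no crash
```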

Motivation:

We have a customer interested in running this setup, so hopefully they'll be able to test this PR and validate if it works for them.

Furthermore, I'm hoping to see if the opentelemetry Ruby folks would be open to tweaking their APIs to be friendlier to tools such as the profiler, but for now I opted for getting our hands dirty.

Additional Notes:

I'm opening this PR as draft until we can get feedback from the customer and see if this works for them.

How to test the change?

On top of the added test coverage, I was able to see code hotspots working for the following Sinatra example app:

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'

  gem 'rackup'
  gem 'dogstatsd-ruby'
  gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb', branch: 'ivoanjo/prof-9263-otlp-ruby-code-hotspots'
  gem 'sinatra'
  gem 'opentelemetry-api'
  gem 'opentelemetry-sdk'
  gem 'opentelemetry-instrumentation-sinatra'
  gem 'opentelemetry-exporter-otlp'
  gem 'pry'
end

require 'sinatra/base'
require 'opentelemetry/sdk'
require 'pry'

Datadog.configure do |c|
  c.service = 'ivoanjo-testing-opentelemetry-test'
  c.profiling.enabled = true
end

# Configure OpenTelemetry
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'ivoanjo-testing-opentelemetry-test'
  c.use 'OpenTelemetry::Instrumentation::Sinatra'
end

class MyApp < Sinatra::Base
  get '/' do
    OpenTelemetry::Trace.current_span.add_attributes({'runtime-id' => Datadog::Core::Environment::Identity.id})
    sleep 1
    'Hello, OpenTelemetry!'
  end
end

MyApp.run!

After doing a few requests, here's how this looks:

[screenshots of the resulting profiles]

@ivoanjo ivoanjo requested review from a team as code owners March 6, 2024 15:06
@ivoanjo ivoanjo marked this pull request as draft March 6, 2024 15:06
@github-actions github-actions bot added the profiling Involves Datadog profiling label Mar 6, 2024
Contributor

@AlexJF AlexJF left a comment


LGTM! And so safe you should be designing baby furniture 😄

@@ -734,6 +765,11 @@ static void trigger_sample_for_thread(
struct trace_identifiers trace_identifiers_result = {.valid = false, .trace_endpoint = Qnil};
trace_identifiers_for(state, thread, &trace_identifiers_result);

if (!trace_identifiers_result.valid) {
Contributor

@AlexJF AlexJF Mar 6, 2024


Would it be worth/possible to do a bit of extra work at the start to arrive at a sticky decision here, and short-circuit the constantly-failing trace_identifiers_for with all its rb_var_get calls when ddtrace is not used for tracing at all?

Or is the thinking that we want to support situations where there's a mix of ddtrace and pure-otel traces, and/or the ability to change between one and the other dynamically (e.g. via a feature flag)?

Member Author


I think your suggestion makes sense.

My intent here in checking both is that the profiler may start quite early in the app lifecycle, so we may not know which one is going to be used yet.

Or is the thinking that we want to support situations where there's a mix of ddtrace and pure-otel traces, and/or the ability to change between one and the other dynamically (e.g. via a feature flag)?

I'm not sure mixing is even possible at this point, since the ddtrace otel support monkey patches itself pretty deep into opentelemetry (which is why I needed to contort a bit to be able to test both).

For that reason, and after our last discussion, I think it makes sense to stop checking opentelemetry once we see data coming from ddtrace traces.

The reverse is harder to figure out, actually. It would be weird, but not impossible, for an app that started with opentelemetry to then switch over to ddtrace.


TL;DR: I'll wait for feedback from our customer on how this is working before acting on this comment, just in case we end up going in a completely different direction BUT I'll definitely come back to it before marking the PR as non-draft.
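The "sticky decision" idea from the thread above could be sketched roughly as follows (a hypothetical Ruby illustration of the control flow only; the real check lives in the C sampler). The premise from the discussion: once ddtrace traces are observed, it's safe to stop probing OpenTelemetry, since the ddtrace otel support monkey patches itself into opentelemetry and mixing is unlikely. The `TraceSourcePicker` name and the reader callables are assumptions made up for this sketch.

```ruby
# Hedged sketch: probe both trace sources until ddtrace produces data, then
# stick with ddtrace and stop paying the cost of the opentelemetry probe.
class TraceSourcePicker
  def initialize
    @use_only_ddtrace = false
  end

  # ddtrace_reader / otel_reader: callables returning identifiers or nil
  def identifiers(ddtrace_reader, otel_reader)
    result = ddtrace_reader.call
    if result
      @use_only_ddtrace = true # sticky: ddtrace is in use, skip otel from now on
      return result
    end
    return nil if @use_only_ddtrace

    otel_reader.call
  end
end

picker = TraceSourcePicker.new
p picker.identifiers(-> { nil }, -> { :otel_ids })     # no ddtrace yet: falls back to otel
p picker.identifiers(-> { :dd_ids }, -> { :otel_ids }) # ddtrace seen: decision becomes sticky
p picker.identifiers(-> { nil }, -> { :otel_ids })     # otel probe now skipped entirely
```

Note this sketch deliberately does not handle the reverse case discussed above (an app starting with opentelemetry and later switching to ddtrace), which the thread notes is harder to decide.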

@ivoanjo ivoanjo added do-not-merge/WIP Not ready for merge otel OpenTelemetry-related changes labels Jul 30, 2024
@ivoanjo
Member Author

ivoanjo commented Jul 30, 2024

We're waiting for a bit more feedback on whether this is the right approach before going forward. If there's not a lot of movement in a few months, we can close this PR.

…tel gem

Things missing:

* Specs conflict with ddtrace otel specs (need to poke at appraisals)
* Missing endpoint support
While we don't need the actual span object to read the span ids, we
will need it to read the endpoint names.
… specs

I'm... unhappy about this, but couldn't think of anything better that
wouldn't involve refactoring the ddtrace tracing otel support and
that seems even worse.
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-9263-otlp-ruby-code-hotspots branch from 796ea12 to 19646f2 Compare September 25, 2024 15:38
@ivoanjo
Member Author

ivoanjo commented Sep 25, 2024

Update: I've rebased this PR on top of the v2.3.0 stable release to help testing.

@codecov-commenter

codecov-commenter commented Sep 25, 2024

Codecov Report

Attention: Patch coverage is 99.20000% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.84%. Comparing base (c5ab063) to head (6b67c03).
Report is 346 commits behind head on master.

Files with missing lines Patch % Lines
...atadog/profiling/collectors/thread_context_spec.rb 99.09% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3510      +/-   ##
==========================================
- Coverage   97.86%   97.84%   -0.03%     
==========================================
  Files        1271     1271              
  Lines       75976    76097     +121     
  Branches     3739     3746       +7     
==========================================
+ Hits        74356    74455      +99     
- Misses       1620     1642      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pr-commenter

pr-commenter bot commented Sep 25, 2024

Benchmarks

Benchmark execution time: 2024-10-01 15:28:53

Comparing candidate commit 6b67c03 in PR branch ivoanjo/prof-9263-otlp-ruby-code-hotspots with baseline commit c5ab063 in branch master.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 22 metrics, 2 unstable metrics.

scenario:profiler - sample timeline=false

  • 🟩 throughput [+0.743op/s; +0.764op/s] or [+11.563%; +11.905%]

@ivoanjo
Member Author

ivoanjo commented Oct 9, 2024

This PR is going to be superseded by #3984. I'm going to leave it open for a while longer to give customers testing this feature time to migrate to a release containing the newer version of this PR.

@ivoanjo ivoanjo closed this pull request by merging all changes into master in 89d23d6 Oct 10, 2024
@ivoanjo ivoanjo deleted the ivoanjo/prof-9263-otlp-ruby-code-hotspots branch October 10, 2024 08:32
@github-actions github-actions bot added this to the 2.4.0 milestone Oct 10, 2024
@ivoanjo ivoanjo restored the ivoanjo/prof-9263-otlp-ruby-code-hotspots branch October 11, 2024 12:58
@ivoanjo
Member Author

ivoanjo commented Oct 11, 2024

Erghh our automation was a bit too automated -- I've restored the branch, and removed this from the 2.4.0 milestone (since it was #3984 that got merged).

@ivoanjo ivoanjo removed this from the 2.4.0 milestone Oct 11, 2024
Labels
do-not-merge/WIP Not ready for merge otel OpenTelemetry-related changes profiling Involves Datadog profiling