Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-4198] Gather CPU time in profiler without monkey patching Thread for all supported Ruby versions (2.1 to 3.x) #1740

Merged
merged 14 commits into from
Nov 5, 2021

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Oct 27, 2021

In #1735 I added initial support to gather CPU time information without relying on monkey patching for Ruby versions 2.6+.

As a reminder from that PR:

Our previous approach(the cthread.rb monkey patch) has a few downsides that complicate ours and our customer's lives that stem from the need to monkey patch Thread initialization.

  1. Some threads not to be instrumented because they are created in native code using the rb_thread_create() API (as used by the iodine web server and the ffi gem)

  2. Some other threads not to be instrumented because they are created BEFORE we can apply our monkey patch.

  3. Because of item 2. our onboarding is made harder because you can't "just" start the profiler at any point you want in the application life cycle -- you need to start the profiler as early as possible so that we can apply the monkey patch before any other thread gets created.

  4. Conflicts with other gems that monkey patch Thread. So far we found two: rollbar and logging. These conflicts break customer apps, and thus we must coordinate with the developers of those libraries, but still detect old versions.

  5. General noise because profiler frames show up at the base of (most but not all!) threads.

  6. A lot of awkwardness in testing since we may not want to monkey patch the RSpec process.

The new design uses native code to implement the NativeExtension.clock_id_for(thread) method that can retrieve the needed information at any point in the application life cycle, for any thread.

This PR then finishes the work started in #1735 by extending the profiling native extension to also support legacy Ruby versions (e.g. < 2.6), and thus meaning that we can use the profiling native extension on all supported Ruby versions.

We're still restricted to Linux, but that was also the case with the previous approach.

Because we can now use the profiling native extension for all Rubies, this PR also includes the next step: it wires the Stack collector to use the native extension directly, and then removes all references to our previous cthread.rb monkey patch. Hence most of this PR being deletions.

Because the cthread.rb was the only reason we depended on ffi, I've also gone ahead and removed that dependency.

Finally, but still quite importantly, this PR adds a new dependency to dd-trace-rb: the debase-ruby_core_source gem.

The debase-ruby_core_source gem is effectively a just tarball of Ruby VM internal headers for every Ruby version. Because these headers are not installed in regular Ruby deployments, we use this gem to get access to a copy of them, and thus to enable the profiling native extension to reference internal VM structures that are not otherwise exposed (see NativeExtensionDesign.md for more details).

I claim it's both ok and desirable to add this gem as a dependency because:

  1. It's a regular pure-Ruby gem, so it will work fine for all customers that upgrade to the latest dd-trace-rb
  2. It's maintained by the same team that maintains debase, which is the team working at JetBrains on the RubyMine IDE (debase is officially used by RubyMine for Ruby debugging)
  3. It allows us to move existing profiling customers over to the new, better CPU time approach without needing to take any manual steps. Just upgrade to the latest version => works even better. Otherwise, the upgrade experience would be bumpier: users would upgrade and suddenly lose cpu profiling.

…by < 2.6

By relying on the `debase-ruby_core_source` extension, we can get access
to the internal VM headers needed to implement this functionality on
all legacy Ruby versions we support.

Note that this does not yet change the profiler to use
`NativeExtension.clock_id_for` on these older Rubies.
This abstracts away the actual operation we want in our code -- getting
the cpu time for a given thread, rather than exposing the clock_id
and having callers handle it.
We now gather cpu time directly using the native extension, rather
than relying on the `cthread.rb` monkey patch.

This allows us to remove `cthread.rb` (next commit), as well as to
clean up the whole "let's try to detect when there's missing
instrumentation" approach we had going on.

It also enables us to simpify the tests. All great advantages!
We want to preserve this knowledge and tests, so before deleting
`cthread.rb` and its spec, let's move this to the stack sampler.
After we move CPU time to use the native extension, we no longer need
the "CPU extension" that monkey patches `Thread` creation.
It's been fully replaced by our usage of the native extension to get
this information.
`ffi` was only used by the profiling CPU time extension (`cthread.rb`)
which has been removed so let's get rid of it. Thanks `ffi` for your
service.
We still have a lot of other Windows-related failures, so let's get
back to this once 3.1 is released.
@ivoanjo ivoanjo requested a review from a team October 27, 2021 13:50
@codecov-commenter
Copy link

codecov-commenter commented Oct 27, 2021

Codecov Report

Merging #1740 (e1c4fa3) into master (6180c0e) will increase coverage by 0.03%.
The diff coverage is 98.27%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1740      +/-   ##
==========================================
+ Coverage   98.18%   98.21%   +0.03%     
==========================================
  Files         934      930       -4     
  Lines       45196    44844     -352     
==========================================
- Hits        44374    44042     -332     
+ Misses        822      802      -20     
Impacted Files Coverage Δ
lib/ddtrace/profiling.rb 100.00% <ø> (ø)
spec/ddtrace/profiling/spec_helper.rb 100.00% <ø> (+13.51%) ⬆️
spec/ddtrace/profiling/collectors/stack_spec.rb 98.07% <96.07%> (-0.57%) ⬇️
spec/ddtrace/profiling/native_extension_spec.rb 96.59% <98.11%> (-1.03%) ⬇️
lib/ddtrace/profiling/collectors/stack.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/native_extension.rb 83.33% <100.00%> (+13.33%) ⬆️
lib/ddtrace/profiling/tasks/setup.rb 95.65% <100.00%> (+0.53%) ⬆️
spec/ddtrace/contrib/rails/rack_spec.rb 100.00% <100.00%> (ø)
spec/ddtrace/profiling/integration_spec.rb 97.29% <100.00%> (+2.20%) ⬆️
spec/ddtrace/profiling/tasks/setup_spec.rb 100.00% <100.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6180c0e...e1c4fa3. Read the comment docs.

@ivoanjo
Copy link
Member Author

ivoanjo commented Oct 27, 2021

Tip: This is one of those PRs where "✅. Hide whitespace" helps because in a bunch of the specs I had to adjust the indentation level and the resulting diff looks really weird without this option.

Screen Shot 2021-10-27 at 14 52 05

ddtrace.gemspec Outdated
@@ -47,7 +47,7 @@ Gem::Specification.new do |spec|
end

# Used by the profiler
spec.add_dependency 'ffi', '~> 1.0'
spec.add_dependency 'debase-ruby_core_source', '>= 0.10.12'
Copy link
Member

@marcotc marcotc Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for users with versions of Ruby not in the list, for example 2.5.6p201? https://github.com/ruby-debug/debase-ruby_core_source/tree/master/lib/debase/ruby_core_source

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It falls back to the closest lower version:

$ ruby -v
ruby 2.5.9p229 (2021-04-05 revision 67939) [x86_64-darwin19]
$ bundle exec rake clean compile
(...)
creating extconf.h
checking for vm_core.h... no
checking for vm_core.h... no
**************************************************************************
No source for ruby-2.5.9-p229 (revision 67939) provided with
debase-ruby_core_source gem. Falling back to ruby-2.5.1-p57.
**************************************************************************
checking for vm_core.h... yes

Currently they have the first version of every Ruby release (all the .0). For some versions, they're missing the latest point release (they have 2.2.9 but not 2.2.10), but these core VM apis seem to rarely shift, and we also test with the latest versions in our CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major concern here is would this dependency in anyway prevent users who use tracing on supported versions, platforms and engines from installing & using tracing, if they were to upgrade ddtrace and use this dependency?

Sounds like there's some kind of fallback mechanism; is this a sufficient guarantee against the prior question?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replied in context in #1740 (comment) since I was already touching on this subject.

Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just left a question about debase-ruby_core_source headers.

@ivoanjo
Copy link
Member Author

ivoanjo commented Oct 28, 2021

Any concerns, @delner ?

@ivoanjo ivoanjo requested a review from delner October 28, 2021 09:36
@delner
Copy link
Contributor

delner commented Nov 3, 2021

@ivoanjo Sorry, will take a look at this!

Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat being it's hard to wrap my head around all the details here, especially on the JIT-side, much of the changes in Ruby seem fine to me. I think it's a pretty strong sales pitch to drop the Thread monkey patch for a simpler, more effective C implementation.

If I'm understanding the strategy here; we used to use the Thread monkey-patch to access the native thread ID, since only a Thread could access its own native thread ID given the Ruby C API. However, because we're adding headers through JIT or using debase (somehow?), it makes this once inaccessible API available on the C-side, thus nullifying the need for the monkey patch. Do I have that right?

Only two major concerns I have, although they may already be adequately addressed, are:

  1. ddtrace users who don't need or want profiling. Does the debase extension at all prevent & negatively impact their usage of tracing, CI, or other products present in ddtrace? (Your description makes it sound like this is not the case, but just want to be totally clear on that.)
  2. Security: injecting C headers from debase seems like a potential vulnerability. It helps that you say this repo is maintained by a reputable group. But it might not be a bad idea to check with a security expert on this first, just to make sure its kosher.

I think if we can just clarify those two things, this should be good on my end.

ddtrace.gemspec Outdated
@@ -47,7 +47,7 @@ Gem::Specification.new do |spec|
end

# Used by the profiler
spec.add_dependency 'ffi', '~> 1.0'
spec.add_dependency 'debase-ruby_core_source', '>= 0.10.12'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major concern here is would this dependency in anyway prevent users who use tracing on supported versions, platforms and engines from installing & using tracing, if they were to upgrade ddtrace and use this dependency?

Sounds like there's some kind of fallback mechanism; is this a sufficient guarantee against the prior question?

@@ -32,7 +34,8 @@ class Stack < Worker # rubocop:disable Metrics/ClassLength
:trace_identifiers_helper,
:ignore_thread,
:max_time_usage_pct,
:thread_api
:thread_api,
:cpu_time_provider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this abstraction.

@ivoanjo
Copy link
Member Author

ivoanjo commented Nov 4, 2021

If I'm understanding the strategy here; we used to use the Thread monkey-patch to access the native thread ID, since only a Thread could access its own native thread ID given the Ruby C API. However, because we're adding headers through JIT or using debase (somehow?), it makes this once inaccessible API available on the C-side, thus nullifying the need for the monkey patch. Do I have that right?

Yes. There's quite a bit of boilerplate and handling of Ruby versions and OS's, but in the end it all boils down to building this pthread_id_for function:

static inline struct rb_thread_struct *thread_struct_from_object(VALUE thread) {
static const rb_data_type_t *thread_data_type = NULL;
if (thread_data_type == NULL) thread_data_type = RTYPEDDATA_TYPE(rb_thread_current());
return (struct rb_thread_struct *) rb_check_typeddata(thread, thread_data_type);
}
rb_nativethread_id_t pthread_id_for(VALUE thread) {
return thread_struct_from_object(thread)->thread_id;
}

...that given any Thread instance (at at any point in its life cycle), gets us back the same native thread id we previously were getting through the monkey patching approach, and that MRI Ruby keeps, but does not publicly expose.

I did change a bit how the Stack sampler accesses the data to simplify it, but the crux of the change is those 9 lines.

Only two major concerns I have, although they may already be adequately addressed, are:

  1. ddtrace users who don't need or want profiling. Does the debase extension at all prevent & negatively impact their usage of tracing, CI, or other products present in ddtrace? (Your description makes it sound like this is not the case, but just want to be totally clear on that.)

I expect no impact on other products.

We have been shipping and compiling an empty native extension (#1584) since 0.52.0 (released early August). So in ~3 months of it being out there, no customers have reported any issues with this, and validating this was my intent when I shipped an empty extension.

This was the intended result, because we depend on both msgpack and ffi which need to be compiled as well, so I fully expected customers to have a working build environment, and it's great to be able to validate it.

Specifically for debase-ruby_core_source, do note that by itself, it does not provide any extension -- it only provides the missing Ruby headers so the profiling native extension can be compiled. This distinction is quite important because it means that we are in control of its usage, and we can choose to not use it entirely:

  • On Ruby 2.6+ we never use debase-ruby_core_source and instead use the built-it MJIT header
  • On JRuby/TruffleRuby (and soon, I expect, Windows) we entirely skip building our extension
  • Users can also manually use DD_PROFILING_NO_EXTENSION=true to disable building and using the extension (including debase-ruby_core_source). I hope this never needs to be used, but it provides a last big lever that can be used to make sure we never impact ddtrace users.

Finally, but quite importantly, loading the profiling native extension is carefully guarded and checked:

private_class_method def self.native_library_failed_to_load?
success, exception = try_loading_native_library
unless success
if exception
'There was an error loading the profiling native extension due to ' \
"'#{exception.message}' at '#{exception.backtrace.first}'"
else
'The profiling native extension did not load correctly. ' \
'If the error persists, please contact support via <https://docs.datadoghq.com/help/> or ' \
'file a bug at <https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug>.'
end
end
end
private_class_method def self.try_loading_native_library
if Datadog::Core::Environment::VariableHelpers.env_to_bool('DD_PROFILING_NO_EXTENSION', false)
SKIPPED_NATIVE_EXTENSION_ONLY_ONCE.run do
Kernel.warn(
'[DDTRACE] Skipped loading of profiling native extension due to DD_PROFILING_NO_EXTENSION environment ' \
'variable being set. ' \
'This option is experimental and will lead to the profiler not working in future releases. ' \
'If you needed to use this, please tell us why on <https://github.com/DataDog/dd-trace-rb/issues/new>.'
)
end
return [true, nil]
end
begin
require "ddtrace_profiling_native_extension.#{RUBY_VERSION}_#{RUBY_PLATFORM}"
success =
defined?(Datadog::Profiling::NativeExtension) && Datadog::Profiling::NativeExtension.send(:native_working?)
[success, nil]
rescue StandardError, LoadError => e
[false, e]
end
end

...to ensure that if the native extension fails to load (or fails its self-check), we gracefully handle this with a simple log message and allow the user code to keep running with no impact.

Major concern here is would this dependency in anyway prevent users who use tracing on supported versions, platforms and engines from installing & using tracing, if they were to upgrade ddtrace and use this dependency?

Sounds like there's some kind of fallback mechanism; is this a sufficient guarantee against the prior question?

I alluded to this above, but to clarify, this dependency is a pure-Ruby gem, which only ships a bit of Ruby code, and then the Ruby headers, which unless directly used by another gem (like ddtrace) just lie dormant.

So I can't think of any way for this gem, by itself, of ever failing to install on any platform, version, or engine. Of course, this gem is useless if you are on JRuby or TruffleRuby or Rubinius, but still cleanly installs on those.

When used by dd-trace-rb, we have full control over when to use it (as explained above), so we can turn it on only for the setups that make sense.

  1. Security: injecting C headers from debase seems like a potential vulnerability. It helps that you say this repo is maintained by a reputable group. But it might not be a bad idea to check with a security expert on this first, just to make sure its kosher.

I will follow up on this, thanks for raising it.

Worst case, we can fork the package just to make sure we have control of the release, but I hope it won't come to that.

Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linter is failing, but once that's fixed this is good to go on my end.

Thanks for the thorough and detailed explanations! I think this will be a good change.

@ivoanjo
Copy link
Member Author

ivoanjo commented Nov 5, 2021

Linter broke due to a new Rubocop version:

Offenses:

lib/ddtrace/contrib/delayed_job/plugin.rb:12:18: C: [Correctable] Performance/RedundantBlockCall: Use yield instead of block.call.
          return block.call(job) unless tracer && tracer.enabled
                 ^^^^^^^^^^^^^^^
lib/ddtrace/contrib/delayed_job/plugin.rb:32:18: C: [Correctable] Performance/RedundantBlockCall: Use yield instead of block.call.
          return block.call(job) unless tracer && tracer.enabled
                 ^^^^^^^^^^^^^^^

944 files inspected, 2 offenses detected, 2 offenses auto-correctable
RuboCop failed!

This fix is already in master a7801f8 . I'll merge this in and confirm that the master build is green.

@ivoanjo ivoanjo merged commit a1d917b into master Nov 5, 2021
@ivoanjo ivoanjo deleted the ivoanjo/prof-4198-native-cpu-time-on-older-rubies branch November 5, 2021 16:50
@github-actions github-actions bot added this to the 0.54.0 milestone Nov 5, 2021
@ivoanjo
Copy link
Member Author

ivoanjo commented Nov 5, 2021

Master build successful: [1], [2], [3]

ivoanjo added a commit that referenced this pull request Nov 15, 2022
Our `Thread` monkey patch, including `#update_native_ids` was removed
in #1740.

This code was left over in `setup.rb` since we had a fallback for
`Thread`s without the monkey patch, but effectively never runs.
ivoanjo added a commit that referenced this pull request Nov 28, 2022
**What does this PR do?**:

Allow the latest `debase-ruby_core_source` version to be used by
dd-trace-rb.

**Motivation**:

As discussed in #1740, we decided to pin the range of versions of
`debase-ruby_core_source` we use, so that we can validate each
version as it comes out.

The changes on 0.10.18
(<https://my.diffend.io/gems/debase-ruby_core_source/0.10.17/0.10.18>)
don't impact dd-trace-rb, because they only affect Ruby 3.2-preview3
and we don't use `debase-ruby_core_source` on Ruby >= 2.6.

Nevertheless, it's important we keep our version up-to-date as
customers may need the latest version of `debase-ruby_core_source`
for other gems they use.

**Additional Notes**:

(None)

**How to test the change?**:

CI should still be green for all supported Ruby versions.
ivoanjo added a commit that referenced this pull request Jan 5, 2023
**What does this PR do?**:

Allow using the latest `debase-ruby_core_source` together with
dd-trace-rb.

**Motivation**:

As discussed in #1740, we decided to pin the range of versions of
`debase-ruby_core_source` we use, so that we can validate each version
as it comes out.

The changes on 3.2.0
(<https://my.diffend.io/gems/debase-ruby_core_source/0.10.18/3.2.0>)
don't impact dd-trace-rb, because they only affect Ruby 3.2.0 and 2.0.0
and we don't use `debase-ruby_core_source` on Ruby >= 2.6 (nor support
Ruby 2.0).

Nevertheless, it's important we keep our version up-to-date as customers
may need the latest version of `debase-ruby_core_source` for other gems
they use.

**Additional Notes**:

(None)

**How to test the change?**:

CI should still be green for all supported Ruby versions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants