-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-10588] Support GVL profiling preview on Ruby 3.2 #3939
Conversation
…reads This is a first step towards supporting GVL profiling in Ruby 3.2: for Ruby 3.2 we'll do an entirely different implementation of this interface, as a replacement for `rb_internal_thread_specific_*` which were only added on 3.3.
**What does this PR do?** This PR builds atop #3929 that added support for GVL profiling for Ruby 3.3+ and makes GVL profiling also work for Ruby 3.2 . Supporting GVL profiling on Ruby 3.2 needed special additional work. That's because while in Ruby 3.2 we have the GVL instrumentation API giving us the events we need to profile the GVL, we're missing: 1. Getting the Ruby thread `VALUE` as an argument in GVL instrumentation API events 2. The `rb_internal_thread_specific` API that allows us to attach in a thread-safe way data to Ruby thread objects Both 1 and 2 were only introduced in Ruby 3.3, and our implementation of GVL profiling relies/relied on them. This PR... reimplements 1 & 2 in an alternative way, allowing us to keep our existing design for 3.3+, while also supporting the older Ruby version. I've split it into two commits: i. Abstracting access and management of 1 & 2 into a new set of files (`gvl_profiling_helper.c`/`gvl_profiling_helper.h`). These new files are zero-overhead abstractions for most situations. ii. Implementing 1 & 2 for Ruby 3.2. **Motivation:** We believe GVL profiling is quite an important observability feature for Ruby, and thus we want to support it on all Ruby versions that provide the GVL instrumentation API. **Additional Notes:** To solve 1, we're using native level thread-locals (GCC's `__thread`) to keep a pointer to the underlying Ruby `rb_thread_t` structure. This is more complex than than "just keep it on a thread-local" because: a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the underlying native thread around for a bit, and if another Ruby thread is born very quickly after the previous one, Ruby will reuse the native thread and attach it to the new Ruby thread. To avoid incorrectly reusing the thread-locals, we install an event hook on Ruby thread start, and make sure to clean any native thread-locals when a new thread stats. b) Some of the GVL instrumentation API events are emitted while the thread does not have the GVL and so we need to be careful when we can and cannot read VM information. Thus, we only initialize the thread-local during the `RUBY_INTERNAL_THREAD_EVENT_RESUMED` which is emitted while the thread owns the GVL. c) Since we don't get the current thread in events, we need to get a bit... creative. Thus, what we do is in `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, because we know the current thread MUST own the GVL, we read from the internal Ruby VM state which thread is the GVL owner to find the info we need. With a + b + c together we are able to keep a pointer to the underlying `rb_thread_t` up-to-date in a native thread local, thus replacing the need to get a `VALUE thread` as an argument. To solve 2, we rely on an important observation: there's a `VALUE stat_insn_usage` field inside `rb_thread_t` that's unused and seems to have effectively been forgotten about. There's nowhere in the VM code that's writing or reading it (other than marking it for GC), and not even git history reveals a time where this field was used. I could not find any other references to this field anywhere else. Thus, we make use of this field to store the information we need, as a replacement for `rb_internal_thread_specific`. Since presumably Ruby 3.2 will never see this field either removed or used during its remaining maintenance release period this should work fine, and we have a nice clean solution for 3.3+. **How to test the change?** Happily, with the changes on this PR, the existing test coverage we added for GVL profiling on 3.3 is also green on 3.2! :)
…state This is a "just in case" change -- `thread_context_collector_sample_after_gvl_running_with_thread` _should_ always be called with a thread that has finished Waiting for the GVL. ...but just in case it ended up being called by the wrong thread, let's not touch the context until we're sure we're in the right thread.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3939 +/- ##
=======================================
Coverage 97.77% 97.78%
=======================================
Files 1297 1297
Lines 77975 77978 +3
Branches 3888 3888
=======================================
+ Hits 76240 76250 +10
+ Misses 1735 1728 -7 ☔ View full report in Codecov by Sentry. |
BenchmarksBenchmark execution time: 2024-10-04 11:36:09 Comparing candidate commit 00b5047 in PR branch Found 1 performance improvements and 0 performance regressions! Performance is the same for 22 metrics, 2 unstable metrics. scenario:profiler - Allocations ()
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥
Update: I'm still finishing up some testing, and if all goes well I plan to merge this in next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#3929 has markdown docs describing limitations of 3.2, should that be updated in this PR now that 3.2 is usable?
Good catch, I've updated the development docs in 97751d7 to reflect that we now support 3.2 (and how), based on the description of this PR. |
I saw the following failure in CI https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/16646/workflows/5a843db3-b63e-4855-8ce9-432c0405e0bb/jobs/597972 ``` Failures: 1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working when GVL profiling is enabled records Waiting for GVL samples Failure/Error: expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time) expected 402884285 to be within 5% of 492523454 # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:472:in `block (5 levels) in <top (required)>' # ./spec/spec_helper.rb:231:in `block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:116:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/webmock-3.24.0/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>' ``` and I'm not sure why the time accounting is not as expected. The difference close to ~100ms seems close to the usual 100ms that Ruby gives each thread to run on CPU, but I'm not quite sure how it would factor in here. To be able to further investigate the next time this shows up, I'm adding extra information that hopefully will point the way.
What does this PR do?
This PR builds atop #3929 that added support for GVL profiling for Ruby 3.3+ and makes GVL profiling also work for Ruby 3.2 .
Supporting GVL profiling on Ruby 3.2 needed special additional work. That's because while in Ruby 3.2 we have the GVL instrumentation API giving us the events we need to profile the GVL, we're missing:
VALUE
as an argument in GVL instrumentation API eventsrb_internal_thread_specific
API that allows us to attach in a thread-safe way data to Ruby thread objectsBoth 1 and 2 were only introduced in Ruby 3.3, and our implementation of GVL profiling relies/relied on them.
This PR... reimplements 1 & 2 in an alternative way, allowing us to keep our existing design for 3.3+, while also supporting the older Ruby version.
I've split it into two commits:
i. Abstracting access and management of 1 & 2 into a new set of files (
gvl_profiling_helper.c
/gvl_profiling_helper.h
). These new files are zero-overhead abstractions for most situations.ii. Implementing 1 & 2 for Ruby 3.2.
Motivation:
We believe GVL profiling is quite an important observability feature for Ruby, and thus we want to support it on all Ruby versions that provide the GVL instrumentation API.
Additional Notes:
To solve 1, we're using native level thread-locals (GCC's
__thread
) to keep a pointer to the underlying Rubyrb_thread_t
structure.This is more complex than than "just keep it on a thread-local" because:
a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the underlying native thread around for a bit, and if another Ruby thread is born very quickly after the previous one, Ruby will reuse the native thread and attach it to the new Ruby thread.
To avoid incorrectly reusing the thread-locals, we install an event hook on Ruby thread start, and make sure to clean any native thread-locals when a new thread stats.
b) Some of the GVL instrumentation API events are emitted while the thread does not have the GVL and so we need to be careful when we can and cannot read VM information.
Thus, we only initialize the thread-local during the
RUBY_INTERNAL_THREAD_EVENT_RESUMED
which is emitted while the thread owns the GVL.c) Since we don't get the current thread in events, we need to get a bit... creative. Thus, what we do is in
RUBY_INTERNAL_THREAD_EVENT_RESUMED
, because we know the current thread MUST own the GVL, we read from the internal Ruby VM state which thread is the GVL owner to find the info we need.With a + b + c together we are able to keep a pointer to the underlying
rb_thread_t
up-to-date in a native thread local, thus replacing the need to get aVALUE thread
as an argument.To solve 2, we rely on an important observation: there's a
VALUE stat_insn_usage
field insiderb_thread_t
that's unused and seems to have effectively been forgotten about.There's nowhere in the VM code that's writing or reading it (other than marking it for GC), and not even git history reveals a time where this field was used. I could not find any other references to this field anywhere else. Thus, we make use of this field to store the information we need, as a replacement for
rb_internal_thread_specific
.Since presumably Ruby 3.2 will never see this field either removed or used during its remaining maintenance release period this should work fine, and we have a nice clean solution for 3.3+.
How to test the change?
Happily, with the changes on this PR, the existing test coverage we added for GVL profiling on 3.3 is also green on 3.2! :)
And here's a few screenshots of the same test scripts I've included in #3929, now with Ruby 3.2:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/example1.rb:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/example4.rb:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example2.rb:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example4.rb:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example6.rb: