Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10588] Support GVL profiling preview on Ruby 3.2 #3939

Merged
merged 9 commits into from
Oct 4, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Sep 23, 2024

What does this PR do?

This PR builds atop #3929 that added support for GVL profiling for Ruby 3.3+ and makes GVL profiling also work for Ruby 3.2 .

Supporting GVL profiling on Ruby 3.2 needed special additional work. That's because while in Ruby 3.2 we have the GVL instrumentation API giving us the events we need to profile the GVL, we're missing:

  1. Getting the Ruby thread VALUE as an argument in GVL instrumentation API events
  2. The rb_internal_thread_specific API that allows us to attach in a thread-safe way data to Ruby thread objects

Both 1 and 2 were only introduced in Ruby 3.3, and our implementation of GVL profiling relies/relied on them.

This PR... reimplements 1 & 2 in an alternative way, allowing us to keep our existing design for 3.3+, while also supporting the older Ruby version.

I've split it into two commits:

i. Abstracting access and management of 1 & 2 into a new set of files (gvl_profiling_helper.c/gvl_profiling_helper.h). These new files are zero-overhead abstractions for most situations.

ii. Implementing 1 & 2 for Ruby 3.2.

Motivation:

We believe GVL profiling is quite an important observability feature for Ruby, and thus we want to support it on all Ruby versions that provide the GVL instrumentation API.

Additional Notes:

To solve 1, we're using native level thread-locals (GCC's __thread) to keep a pointer to the underlying Ruby rb_thread_t structure.

This is more complex than than "just keep it on a thread-local" because:

a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the underlying native thread around for a bit, and if another Ruby thread is born very quickly after the previous one, Ruby will reuse the native thread and attach it to the new Ruby thread.

To avoid incorrectly reusing the thread-locals, we install an event hook on Ruby thread start, and make sure to clean any native thread-locals when a new thread stats.

b) Some of the GVL instrumentation API events are emitted while the thread does not have the GVL and so we need to be careful when we can and cannot read VM information.

Thus, we only initialize the thread-local during the RUBY_INTERNAL_THREAD_EVENT_RESUMED which is emitted while the thread owns the GVL.

c) Since we don't get the current thread in events, we need to get a bit... creative. Thus, what we do is in RUBY_INTERNAL_THREAD_EVENT_RESUMED, because we know the current thread MUST own the GVL, we read from the internal Ruby VM state which thread is the GVL owner to find the info we need.

With a + b + c together we are able to keep a pointer to the underlying rb_thread_t up-to-date in a native thread local, thus replacing the need to get a VALUE thread as an argument.

To solve 2, we rely on an important observation: there's a VALUE stat_insn_usage field inside rb_thread_t that's unused and seems to have effectively been forgotten about.

There's nowhere in the VM code that's writing or reading it (other than marking it for GC), and not even git history reveals a time where this field was used. I could not find any other references to this field anywhere else. Thus, we make use of this field to store the information we need, as a replacement for rb_internal_thread_specific.

Since presumably Ruby 3.2 will never see this field either removed or used during its remaining maintenance release period this should work fine, and we have a nice clean solution for 3.3+.

How to test the change?

Happily, with the changes on this PR, the existing test coverage we added for GVL profiling on 3.3 is also green on 3.2! :)

And here's a few screenshots of the same test scripts I've included in #3929, now with Ruby 3.2:

…reads

This is a first step towards supporting GVL profiling in Ruby 3.2:
for Ruby 3.2 we'll do an entirely different implementation of this
interface, as a replacement for `rb_internal_thread_specific_*` which
were only added on 3.3.
**What does this PR do?**

This PR builds atop #3929 that added support for GVL profiling for
Ruby 3.3+ and makes GVL profiling also work for Ruby 3.2 .

Supporting GVL profiling on Ruby 3.2 needed special additional work.
That's because while in Ruby 3.2 we have the GVL instrumentation API
giving us the events we need to profile the GVL, we're missing:

1. Getting the Ruby thread `VALUE` as an argument in GVL instrumentation
   API events
2. The `rb_internal_thread_specific` API that allows us to attach in
   a thread-safe way data to Ruby thread objects

Both 1 and 2 were only introduced in Ruby 3.3, and our implementation
of GVL profiling relies/relied on them.

This PR... reimplements 1 & 2 in an alternative way, allowing us to
keep our existing design for 3.3+, while also supporting the older
Ruby version.

I've split it into two commits:
 i. Abstracting access and management of 1 & 2 into a new set of files
    (`gvl_profiling_helper.c`/`gvl_profiling_helper.h`). These new files
    are zero-overhead abstractions for most situations.
ii. Implementing 1 & 2 for Ruby 3.2.

**Motivation:**

We believe GVL profiling is quite an important observability feature
for Ruby, and thus we want to support it on all Ruby versions that
provide the GVL instrumentation API.

**Additional Notes:**

To solve 1, we're using native level thread-locals (GCC's
`__thread`) to keep a pointer to the underlying Ruby `rb_thread_t`
structure.

This is more complex than than "just keep it on a thread-local"
because:

a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the
   underlying native thread around for a bit, and if another Ruby
   thread is born very quickly after the previous one, Ruby
   will reuse the native thread and attach it to the new Ruby thread.

   To avoid incorrectly reusing the thread-locals, we install an event
   hook on Ruby thread start, and make sure to clean any native
   thread-locals when a new thread stats.

b) Some of the GVL instrumentation API events are emitted while the
   thread does not have the GVL and so we need to be careful when we can
   and cannot read VM information.

   Thus, we only initialize the thread-local during the
   `RUBY_INTERNAL_THREAD_EVENT_RESUMED` which is emitted while the
   thread owns the GVL.

c) Since we don't get the current thread in events, we need to get a
   bit... creative. Thus, what we do is in
   `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, because we know the current
   thread MUST own the GVL, we read from the internal Ruby VM state
   which thread is the GVL owner to find the info we need.

With a + b + c together we are able to keep a pointer to the underlying
`rb_thread_t` up-to-date in a native thread local, thus replacing the
need to get a `VALUE thread` as an argument.

To solve 2, we rely on an important observation: there's a
`VALUE stat_insn_usage` field inside `rb_thread_t` that's unused and
seems to have effectively been forgotten about.

There's nowhere in the VM code that's writing or reading it (other
than marking it for GC), and not even git history reveals a time
where this field was used. I could not find any other references to
this field anywhere else. Thus, we make use of this field to store
the information we need, as a replacement for
`rb_internal_thread_specific`.

Since presumably Ruby 3.2 will never see this field either removed or used
during its remaining maintenance release period this should work fine,
and we have a nice clean solution for 3.3+.

**How to test the change?**

Happily, with the changes on this PR, the existing test coverage
we added for GVL profiling on 3.3 is also green on 3.2! :)
…state

This is a "just in case" change --
`thread_context_collector_sample_after_gvl_running_with_thread`
_should_ always be called with a thread that has finished Waiting for
the GVL.

...but just in case it ended up being called by the wrong thread, let's
not touch the context until we're sure we're in the right thread.
@ivoanjo ivoanjo requested a review from a team as a code owner September 23, 2024 10:35
@ivoanjo ivoanjo added feature Involves a product feature profiling Involves Datadog profiling labels Sep 23, 2024
@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.78%. Comparing base (be8882a) to head (00b5047).
Report is 57 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3939   +/-   ##
=======================================
  Coverage   97.77%   97.78%           
=======================================
  Files        1297     1297           
  Lines       77975    77978    +3     
  Branches     3888     3888           
=======================================
+ Hits        76240    76250   +10     
+ Misses       1735     1728    -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pr-commenter
Copy link

pr-commenter bot commented Sep 23, 2024

Benchmarks

Benchmark execution time: 2024-10-04 11:36:09

Comparing candidate commit 00b5047 in PR branch ivoanjo/prof-10588-gvl-profiling-ruby32 with baseline commit be8882a in branch master.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 22 metrics, 2 unstable metrics.

scenario:profiler - Allocations ()

  • 🟩 throughput [+188548.077op/s; +243006.895op/s] or [+5.988%; +7.718%]

Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@ivoanjo
Copy link
Member Author

ivoanjo commented Sep 25, 2024

Update: I'm still finishing up some testing, and if all goes well I plan to merge this in next week.

Copy link
Contributor

@p-datadog p-datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3929 has markdown docs describing limitations of 3.2, should that be updated in this PR now that 3.2 is usable?

@ivoanjo ivoanjo requested a review from a team as a code owner October 2, 2024 12:32
@ivoanjo
Copy link
Member Author

ivoanjo commented Oct 2, 2024

Good catch, I've updated the development docs in 97751d7 to reflect that we now support 3.2 (and how), based on the description of this PR.

I saw the following failure in CI
https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/16646/workflows/5a843db3-b63e-4855-8ce9-432c0405e0bb/jobs/597972

```
Failures:

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working when GVL profiling is enabled records Waiting for GVL samples
     Failure/Error: expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)
       expected 402884285 to be within 5% of 492523454
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:472:in `block (5 levels) in <top (required)>'
     # ./spec/spec_helper.rb:231:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:116:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.24.0/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>'
```

and I'm not sure why the time accounting is not as expected.

The difference close to ~100ms seems close to the usual 100ms that Ruby
gives each thread to run on CPU, but I'm not quite sure how it would
factor in here.

To be able to further investigate the next time this shows up, I'm
adding extra information that hopefully will point the way.
@ivoanjo ivoanjo merged commit 2cde72a into master Oct 4, 2024
199 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10588-gvl-profiling-ruby32 branch October 4, 2024 15:23
@github-actions github-actions bot added this to the 2.4.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Involves a product feature profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants