Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-4904] New Stack collector is feature-complete #2000

Merged
merged 58 commits into from
May 24, 2022

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented May 4, 2022

The new stack collector introduced in #1966 gets feature-complete in this PR.

By feature-complete I mean:

  • There is extensive test coverage with many corner cases being checked
  • Our custom rb_profile_frames got quite a lot of fixes to exactly match the Ruby reference stack trace api (Thread#backtrace_locations)
  • It works for all supported Rubies, including 2.1 and 2.2 which were a though nut to crack
  • It matches the behavior of the old profiling codepaths, e.g. it has placeholders for threads with empty stacks that are running in native code and for threads with too deep stacks
  • etc...

I don't quite like how long and how complex this PR got. Thanks for putting up with it 😅 , and feel free to challenge me if this should be further broken down. Reviewing it commit-by-commit may make more sense, helping to show what bits were added why.

Note: This PR builds on top of #1966.
Note 2: I'm aware that specs for "Datadog::Profiling::Collectors::Stack when sampling a thread with a stack that is deeper than the configured max_frames" are failing. I missed this previously as it did not happen on macOS (works fine on macOS). I'll push a fix for it ASAP, but otherwise this is ready for reviewing.

@ivoanjo ivoanjo requested a review from a team May 4, 2022 10:53
@ivoanjo ivoanjo changed the title [PROF-4904] New Stack collector feature-complete [PROF-4904] New Stack collector is feature-complete May 4, 2022
@ivoanjo ivoanjo marked this pull request as draft May 5, 2022 07:31
@ivoanjo
Copy link
Member Author

ivoanjo commented May 5, 2022

Ooops mistakenly forgot to open this as a draft!

@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4904-native-stack-collector-continued branch from 31a46cf to 5814cf1 Compare May 18, 2022 09:35
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4904-native-stack-collector-part3 branch from 1fb0dea to bb89cce Compare May 18, 2022 09:53
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4904-native-stack-collector-continued branch from 5814cf1 to 9d35f1e Compare May 19, 2022 13:39
ivoanjo added a commit that referenced this pull request May 19, 2022
… in next PR)

Making the stack collector work on older Rubies is quite a lot of work,
which I'd like to submit as a separate PR.

To avoid our CI breaking, let's disable a few profiler things,
which we will re-enable in
<#2000>.

Note that *NONE* of the components being disabled or messed
with are enabled for customers -- they're still not on in any
situation -- so this strategy of disabling stuff is just an
enabler for incremental development.
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4904-native-stack-collector-part3 branch from bb89cce to 20c1c4b Compare May 20, 2022 10:14
@ivoanjo ivoanjo marked this pull request as ready for review May 20, 2022 11:17
Comment on lines +111 to +112
// * If we don't move it into a different thread, does releasing the GVL on a Ruby thread mean that we're introducing
// a new thread switch point where there previously was none?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my first thought when I was reading this method and thinking of moving out of the GVL. Of course, profiling the code performance here is will tell us, but I do agree that it's very possible that the complexity introduced might negate any performance gains.

Copy link
Member Author

@ivoanjo ivoanjo May 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Part of the awkwardness of releasing the GVL or even sending the sample to a different thread that operates without the GVL is that we'd need to duplicate a lot of the Ruby-level stuff, because it requires the GVL being held to access it. That's annoying because it would introduce an extra copy; tracking of a bunch of heap-allocated stuff, etc.

There's been some discussion with the other profiler libraries about possibly changing libddprof/libdatadog to expose its string table. I've been thinking it could be used to "solve" this issue -- e.g. we could copy all the Ruby strings into the string table (with the GVL being held), avoiding any extra copies, and then we'd have a lot less data to send to a background thread.

So yeah, definitely I'll be keeping an eye on this one.

Comment on lines +223 to +224
// To give customers visibility into these threads, rather than reporting an empty stack, we replace the empty stack
// with one containing a placeholder frame, so that these threads are properly represented in the UX.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice usability feature for users. 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! We already had this for the old profiler as well: #1719 :)

if (max_frames < 5) rb_raise(rb_eArgError, "Invalid max_frames: value must be >= 5");
if (max_frames > MAX_FRAMES_LIMIT) rb_raise(rb_eArgError, "Invalid max_frames: value must be <= " MAX_FRAMES_LIMIT_AS_STRING);

sampling_buffer* buffer = xcalloc(1, sizeof(sampling_buffer));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the "checked memory allocation" variants (e.g. xcalloc) recommended for the Ruby VM? I'm assuming so and the Ruby VM has a proper global error handler configured.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually thanks for calling me out on this!

xcalloc is indeed provided by Ruby -- it's aliased to ruby_xcalloc:

 * Allocates a  storage instance.  It is  largely the same as  system malloc(),
 * except:
 *
 *   - It raises Ruby exceptions instead of returning NULL, and
 *   - In case of `ENOMEM` it tries to GC to make some room.

https://github.com/ruby/ruby/blob/0293f8ca8e1fbe8e4202c2d211bb11f4e0ffa5fd/include/ruby/internal/xmalloc.h#L53-L147

The part I had missed thus far is this one:

 * @note       It doesn't return NULL.

Because Ruby will call ruby_memerror aka try to raise our friend the NoMemoryError exception in case of NULL. So I've been adding more NULL-checks than needed. I'll do a pass an clean-up any such unneeded checks :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 68e4bc5

if (max_frames > MAX_FRAMES_LIMIT) rb_raise(rb_eArgError, "Invalid max_frames: value must be <= " MAX_FRAMES_LIMIT_AS_STRING);

sampling_buffer* buffer = xcalloc(1, sizeof(sampling_buffer));
if (buffer == NULL) rb_memerror();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the comment above, if we are calling the x variant of calloc, will this code be reachable with a buffer == NULL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're 💯% right, will remove :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 68e4bc5

Comment on lines 276 to 277
// The only case where any of the underlying arrays are NULL is when initial allocation failed; otherwise they
// can be assumed to be not-null.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More than one system I've worked with in the past used this pattern of free if not NULL. I think it's a fair safety pattern, if a bit defensive. 👍

ptrdiff_t stack_depth_for(VALUE thread) {
#ifndef USE_THREAD_INSTEAD_OF_EXECUTION_CONTEXT // Modern Rubies
const rb_execution_context_t *ec = thread_struct_from_object(thread)->ec;
#else // Ruby < 2.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to say "thank you" for documenting what parts of our code can be dropped in the future when we deprecate support for specific versions of Ruby.
There are self evident sections if RUBY_VERSION < '2.3' and those are great, but I want to highlight the importance of documenting the non-self evident ones. So yeah, thank you again!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm looking forward to dropping support for 2.1 older Rubies on this part of the code.

I also plan to submit everything I can upstream, so that hopefully at some point in the 3.x series, we will no longer need private_vm_api_access.c for those Rubies :)

const static ddprof_ffi_ValueType enabled_value_types[] = {CPU_TIME_VALUE, CPU_SAMPLES_VALUE, WALL_TIME_VALUE};
static const ddprof_ffi_ValueType enabled_value_types[] = {CPU_TIME_VALUE, CPU_SAMPLES_VALUE, WALL_TIME_VALUE};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the most important change right here!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:hurtrealbad:

include IbhModuleD
end

$ibh_a_proc = proc { IbhClassC.new.hello }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Globals! How about slapping all these $ bad boys into a module in this file, like IbhGlobal?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear there, my suggestion will likely accomplish the same level of code isolation, but I just find it jarring to see these $ around me. It makes me slightly confused when I glance a random $ while reading code here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ce1c971

Base automatically changed from ivoanjo/prof-4904-native-stack-collector-continued to master May 23, 2022 09:31
ivoanjo added 16 commits May 23, 2022 10:46
…reverted in next PR)"

This reverts commit 2334216.

The commit being reverted was only added so that we could merge the
previous PR and still have CI happy and green.
Stack allocation for arrays is... weird and complicated so I've
extracted all of the buffers needed for sampling into a
dynamically-allocated struct.

This is also the first step towards making the maximum stack depth
configurable from Ruby code.
A frame representing a method implemented using native code needs to
be handled in a special way:

* we need to use a different API to get the method name
* we need to use the path and line number of its direct/indirect Ruby
  caller (if we want to match the Ruby backtrace APIs at least)

To implement this, I further extended our custom `rb_profile_frames`
to support flagging a method as being a Ruby frame or not.

In terms of memory we could probably pack things a bit more (e.g.
not use a space-wasting array of bool) but I'll leave such
optimizations for later.

Finally, the `rb_profile_frame_method_name` method only returns
the actual method name for a native method on Ruby 3.0+.

To support older Rubies, a further commit will introduce a few more
copy-pasted backported code.
…bies

Ruby 3.0 finally added support for showing CFUNC frames (frames for
methods written using native code) in stack traces gathered via
`rb_profile_frames` (ruby/ruby#3299).

To access this information on older Rubies, beyond using our custom
`ddtrace_rb_profile_frames`, we also need to backport the Ruby 3.0+
version of `rb_profile_frame_method_name`.
It turns out the awful hack we had thus far to change the `extconf.h`
file is not actually needed, and can be simplified.
It's a good idea for `private_vm_api_access.c` to include
`private_vm_api_access.h` so that the compiler can complain at us if
they got out of sync.

We previously were not doing it because `private_vm_api_access.c`
includes the `RUBY_MJIT_HEADER` which can conflict with other Ruby
headers that were being included in `private_vm_api_access.h`.

To fix this, I've introduced an extra define --
`PRIVATE_VM_API_ACCESS_SKIP_RUBY_INCLUDES` which is used in
`private_vm_api_access.h` to decide if the Ruby includes should or
should not be included.

(I also did a minor cleanup to introduce some indentation on the
preprocessor definitions to make them more readable, especially
when nested.)
We always specify `lines`, so there's no need to carry code
designed to make that argument optional.
A few compatibility fixes to support older Rubies:
* Use `RHASH_SIZE` instead of `rb_hash_size_num`
* Fix a few warnings regarding integer sizes
* Import `iseq.h` internal header, it's needed on older Rubies
  to support our custom `rb_profile_frames`
* Import `rb_vm_frame_method_entry`, it's needed on older Rubies
  to support our custom `rb_profile_frames` (this included importing
  a few more things needed to support that function)
The 3.20 series dropped support for 2.4 but somehow rubygems/bundler
still try to install it.
A few compatibility fixes to support older Rubies:
* Add missing header to be able to use bool
* Use `rb_thread_t` instead of `rb_execution_context_t`. The latter
  was extracted from the former starting on Ruby 2.5, so on older
  Rubies we use the former directly.
A lot of macros in `rb_vm_frame_method_entry` did not exist back in
2.3 so I decided to import the Ruby 2.3 version of
`rb_vm_frame_method_entry`.
…elow

The `rb_profile_frames` function changed quite a bit between Ruby 2.2
and 2.3.

Since the change was quite complex I opted not to try to extend support
to Ruby 2.2 and below using the same custom function, and instead I
started anew from the Ruby 2.2 version of the function, applying some
of the same fixes that we have for the modern version.

This gets us closer to a green profiler test suite on 2.2 and 2.1
(there's at least one other compatibility fix needed elsewhere which
will come in a later commit).
…elow

While I hesistate to invest too much on supporting older Rubies, this
seems like a worthy trade-off, because otherwise we'd be stuck
creating alternative versions for every test in `stack_spec.rb`.
ivoanjo added 22 commits May 23, 2022 10:49
This is becoming a common-enough operation that it makes sense to
have a nice wrapper for it.
This is similar to the one we had in
`Datadog::Profiling::Encoding::Profile::Protobuf`, but for the new
libddprof-based `StackRecorder`.

Unfortunately we don't get as many debug stats with libddprof as we
had with the previous code; at some point we may want to take a stab
at exposing that info in libddprof.
Whenever something is dynamically allocated, we must ensure that either
no exceptions can be thrown or that we clean it up before throwing
an exception.
The `TypedData_Get_Struct` API already checks that a Ruby object has
the correct types, so there's no need to do it twice.
Make sure to clean up the `serialized_profile` before calling into
`ruby_time_from`, in case any exceptions are raised.
With everything being cached, it's kinda hard to interactively
experiment inside e.g. a pry session.
…g frame from eval/instance_eval

See comments in `private_vm_api_acccess.c` for the gritty details.
The TL;DR is that we needed to add yet another change to our custom
`rb_profile_frames` in order to match what the Ruby reference
stack trace API returns.
…r specs

This diff looks weirder because of the whitespace changes; I recommend
looking at it with withespace being ignored.
No further changes needed to support this, but I want to add the
extra coverage for weird things so that in the future we avoid any
regressions.
…it for testing the stack collector

(As the creator and sole copyright owner of that file in the
`backtracie` gem, taken from
<https://github.com/ivoanjo/backtracie/blob/1f6116ff5b07e87fe3e5be95a22c6a8e1210de8b/spec/unit/interesting_backtrace_helper.rb>
I am relicensing the file and its copyrights to belong to Datadog
similarly to the rest of the dd-trace-rb codebase).

This file is used as a single mega-complex test for the stack
sampling code, and is quite useful to spot regressions.
What we're doing in these tests is inherently a race: we want the
thread to be stopped at the `sleep`, but have no way to atomically
signal the `ready_queue` and go to `sleep`, and thus we kinda
pray that Ruby just won't schedule us away between both operations.

This seems to work so far, but to avoid creating any flaky tests,
let's try to signal readyness as close as possible to calling `sleep`.

(Note that if this ever gets us into trouble and ends up creating
test flakiness, maybe we could implement a native method that
signals the queue AND calls sleep and ensures the Global VM
Lock is not released.)
As pointed out by Marco in
<#2019 (comment)>
it's possible that when memory allocation fails, we can be in such
a state that casually raising an exception may not work at all.

Instead, let's use a function that Ruby exposes for this exact
situation -- `rb_memerror()` -- which is built to withstand such
memory issues.
As pointed out by Marco during review
(<#2000 (comment)>)
the `xcalloc` function already tests for `NULL` and thus we don't need
to repeat the test.

To make it even more obvious that we're using the Ruby-provided
functions for managing memory, I've decided to directly call
`ruby_xcalloc` and `ruby_xfree` instead of their shorted aliases.
Not that it makes any difference, because they are still globals.
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4904-native-stack-collector-part3 branch from 98fe521 to ce1c971 Compare May 23, 2022 09:49
@ivoanjo
Copy link
Member Author

ivoanjo commented May 23, 2022

One last rebase to solve a last-minute conflict with #1966 and this PR is now on top of master!

Just need to chase down that CI failure relating to max_frames before merging :)

The `frames_omitted_message` CAN be stack allocated but MUST live
as long as the call to `record_sample`.

Bizarrely enough, the previous approach accidentally worked on macOS,
which is why I had not spotted it before -- it worked, and tests
passed, so surely it was ok... right? Nope!
@ivoanjo
Copy link
Member Author

ivoanjo commented May 24, 2022

CI is beautiful green, merging away!

@ivoanjo ivoanjo merged commit 4e9061e into master May 24, 2022
@ivoanjo ivoanjo deleted the ivoanjo/prof-4904-native-stack-collector-part3 branch May 24, 2022 09:41
@github-actions github-actions bot added this to the 1.1.0 milestone May 24, 2022
ivoanjo added a commit that referenced this pull request May 24, 2022
As pointed out by Marco during review
(<#2000 (comment)>)
the `xcalloc` function already tests for `NULL` and thus we don't need
to repeat the test.

To make it even more obvious that we're using the Ruby-provided
functions for managing memory, I've decided to directly call
`ruby_xcalloc` and `ruby_xfree` instead of their shorted aliases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants