[PROF-10128] Refactor and speed up profiler stack sampling #3837

ivoanjo · 2024-08-09T11:39:39Z

What does this PR do?

This PR refactors the way the profiler samples the Ruby stack, and then builds on that refactoring to speed up sampling.

It also opens the door to future speedups once libdatadog allows us to skip re-interning strings, frames and stacks that we know have been sampled before.

Specifically this PR does the following:

It tweaks the data that ddtrace_rb_profile_frames returns from sampling the Ruby stack so that the stack collector can use cheaper APIs to get the same information as before.
It changes the sampling buffer we put stack traces in to be per-thread. This allows us to reuse some of the work we do from sample to sample, in cases where frames on the stack are the same. Thus sampling the same stack over and over will be slightly cheaper.
It adds a few more micro-optimizations that I spotted from profiling the profiler :)
When we sample stacks, we expose a flag that can be used to determine if a frame was unchanged from the last sample. This is what opens the door for future optimizations using libdatadog.

Here's the results of running benchmarks/profiler_sample_loop_v2.rb with these changes on my local machine:

Warming up --------------------------------------
stack collector improved-sampling
                       783.000 i/100ms
Calculating -------------------------------------
stack collector improved-sampling
                          7.828k (± 3.3%) i/s -    234.900k in  30.042969s

Warming up --------------------------------------
stack collector master
                       746.000 i/100ms
Calculating -------------------------------------
stack collector master
                          7.057k (± 3.3%) i/s -    211.864k in  30.058848s

Comparison:
stack collector improved-sampling:     7827.8 i/s
stack collector master:     7056.8 i/s - 1.11x  slower

Motivation:

This PR further reduces the overhead of profiling, especially more complex/deep stacks. I started looking into this with the goal of improving allocation profiling, but stack sampling is shared with cpu/wall-time profiling as well, so those will also be able to take advantage of the improvement.

Additional Notes:

There's a lot of small parts moving around in this PR. I recommend reviewing commit-by-commit, as I tried to break every change into a small standalone step that can be reviewed in a much nicer way.

How to test the change?

Our existing tests cover these changes.

They were out-of-sync >_>

This will be more useful later once freeing a per_thread_context gets more complex.

Recording a placeholder stack does not involve any sampling, so it doesn't actually need a sampling_buffer.

This will be used by the thread context collector to validate max frames during initialization, separately from creating new sampling buffers.

This doesn't yet change anything; this refactor will later allow us to reuse data from the sampling_buffers from sample to sample. This does slightly increase memory usage, but not by as much as may be expected. Ignoring the locations (these will get shared in a later commit), a sampling buffer has max_frames * (VALUE, int, bool) which is less than 10KiB per thread.

The locations, unlike the rest of the sampling buffer, do not (yet?) need to be per-thread, so let's save some memory.

Not really to save memory, but semantically it's closer to what we want.

Instead of storing the `last_ruby_frame` just to later need to re-retrieve the filename, let's directly store the `last_ruby_frame_filename`.

As explained in the added comment, for our use this is expected to be equivalent (and our integration tests that compare our output with the Ruby stack trace APIs still pass). This gets us a bit less of work during sampling (but not a lot), but more importantly it will make it easier to add caching in a later commit.

Rather than having separate arrays for each kind of information we want to get from a stack (VALUE *stack_buffer + int *lines_buffer + bool *is_ruby_frame), introduce a new `frame_info` struct to contain this information. This makes it easy to then add or change members in this struct without having to think about allocating yet more arrays, and managing their lifetimes and passing them in as arguments. (Spoiler: We'll be using this in a commit very soon to introduce caching) As a side bonus of this change, I noticed that we could just expose the method name for native methods as an `ID`, rather than handing over the `rb_callable_method_entry_t` ("cme") as we did before. This has a few advantages: * It's effectively a way to inline the work done by the previous call to `ddtrace_rb_profile_frame_method_name`. Because we had to copy/paste some code from the Ruby sources to support `ddtrace_rb_profile_frame_method_name` on legacy Rubies, we no longer need that code and can delete it! * An `ID` is interned/immutable, so it's much easier to later cache if we want to. (Although no plans yet...)

The intuition here is that in many apps, the same stack (or at least the bottom of the stack) won't change very often -- e.g. during an HTTP request we may take a bunch of samples, and all of them will have a common stack that's webserver + rails routing + endpoint doing work. While right now this mostly avoids us repeating calls to `calc_lineno`, (calculating the line number for Ruby frames) one of the most interesting changes in this PR is the new `same_frame` flag. This `same_frame` flag allows the caller to know when the stack has changed, and I'm hoping to later connect this to libdatadog improvements, e.g. if we store the ids for each stack entry in libdatadog, we can know if the stack changed or not, and thus if we can just re-feed the same ids to libdatadog or need to record new ones.

Looking at the implementation of `StringValuePtr`, it does a bunch of checks that we don't really need because we already have the `ENFORCE_TYPE` making sure we do have a string. So we can just use the string without further ado.

I don't often use this setting so it had bitrotten a bit 😅 * `sampler_readjust` no longer has the `readjustment_reason`. We can recompute for debugging if needed, but for now I just removed the reference * When building at `-O0` it looks like my version of gcc is refusing to inline `monotonic_wall_time_now_ns` which is a problem because there's no non-inline version of that function, and thus it breaks running the tests with > symbol lookup error: datadog_profiling_native_extension.3.1.4_x86_64-linux.so: undefined symbol: monotonic_wall_time_now_ns `-O1` is probably fine for development as well so let's go with it for now?

…ile_ stuff Since we now always return an iseq from ddtrace_rb_profile_frames (it's in the name of the struct element even!), we can directly use the `rb_iseq_` functions. Looking at the `rb_profile_` functions what they do is check if they have an iseq or something else, and when they have an iseq, they call these functions. So this allows us to skip the extra tests because we know we have iseqs. (FYI an "iseq" is an "instruction sequence" -- it's the object that represents the YARV VM bytecode)

This was a really weird one. I don't like handwaving it away, but debugging any more of this would need me to go really deep into Ruby's GC to understand where exactly Ruby is deciding to keep this object alive and I'm not sure such a time investment is worth it for this bug, so I've put in a workaround for now.

pr-commenter · 2024-08-09T11:50:35Z

Benchmarks

Benchmark execution time: 2024-08-13 11:08:39

Comparing candidate commit 607d5d5 in PR branch ivoanjo/prof-10128-stack-sampling-improvements with baseline commit 850462b in branch master.

Found 3 performance improvements and 0 performance regressions! Performance is the same for 20 metrics, 2 unstable metrics.

scenario:sample timeline=false

🟩 throughput [+0.464op/s; +0.498op/s] or [+6.796%; +7.293%]

scenario:sample+serialize retain_every=10 heap_samples=false heap_size=false heap_sample_every=1 skip_end_gc=false

🟩 throughput [+0.174op/s; +0.185op/s] or [+6.692%; +7.117%]

scenario:stack collector

🟩 throughput [+207.015op/s; +208.782op/s] or [+7.565%; +7.630%]

marcotc · 2024-08-09T19:54:03Z

ext/datadog_profiling_native_extension/collectors_stack.c

-  buffer->is_ruby_frame = ruby_xcalloc(max_frames, sizeof(bool));
-  buffer->locations     = ruby_xcalloc(max_frames, sizeof(ddog_prof_Location));
+  buffer->locations = locations;
+  buffer->stack_buffer = ruby_xcalloc(max_frames, sizeof(frame_info));


Nice consolidation here!

marcotc · 2024-08-09T20:14:10Z

ext/datadog_profiling_native_extension/private_vm_api_access.c

+            //
+            // cme = rb_vm_frame_method_entry(cfp);
+
+            // if (cme && cme->def->type == VM_METHOD_TYPE_ISEQ &&


Can you add a comment here explicitly explaining what this commented out code section means (so we know when to restore it if needed, or remove it)

Good idea, left note in e6c9122

marcotc · 2024-08-09T20:24:15Z

ext/libdatadog_api/extconf.rb

@@ -67,7 +67,7 @@ def skip_building_extension!(reason)

 if ENV['DDTRACE_DEBUG'] == 'true'
  $defs << '-DDD_DEBUG'
-  CONFIG['optflags'] = '-O0'


Is O0 not necessary to get the debug information we need?

I've used "-O1" with gdb and didn't see any issue/missing info, so I think it's fine, and it's easy to tweak locally -- I just wanted to avoid a default config that was broken (at least on my Ubuntu/gcc version?).

marcotc

Looks great overall!

codecov-commenter · 2024-08-12T13:54:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.84%. Comparing base (850462b) to head (607d5d5).
Report is 18 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3837      +/-   ##
==========================================
- Coverage   97.85%   97.84%   -0.01%     
==========================================
  Files        1264     1264              
  Lines       75731    75732       +1     
  Branches     3719     3719              
==========================================
- Hits        74103    74099       -4     
- Misses       1628     1633       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

AlexJF · 2024-08-12T12:50:43Z

ext/datadog_profiling_native_extension/private_vm_api_access.h

+    struct {
+      ID method_id;
+    } native_frame;
+  } as;


Such fluent! Much API! Wow!

Something something great artists copy.

AlexJF · 2024-08-12T13:02:57Z

ext/datadog_profiling_native_extension/private_vm_api_access.c

@@ -479,6 +480,16 @@ int ddtrace_rb_profile_frames(VALUE thread, int start, int limit, frame_info *st
                continue;
            }


So in the past, the same cfp would "eat" all the start whereas now it'll skip just one? Not sure if this was ever something we relied on but just highlighting this slight change in behaviour.

This is a really great catch!

I completely glossed over it because a) we don't use start and b) I remembered some discussion somewhere mentioning it was actually broken.

It's actually fixed nowadays https://bugs.ruby-lang.org/issues/14607 // ruby/ruby#8280 . I had spotted that upstream had moved the cfp = ... to the for header as well, but I hadn't spotted it was to fix start.

I did spot that upstream also added the start skip logic to the if (cme && cme->def->type == VM_METHOD_TYPE_CFUNC) branch below. I'll push a commit to add this one as well, so we keep closer to upstream in case we ever want to use start.

AlexJF · 2024-08-12T13:59:27Z

spec/datadog/profiling/stack_recorder_spec.rb

+        # I've bisected this and undoing 3d4b7fcf30b529b191ca737ae13629eb27b8ab63 also makes the flakiness
+        # go away, but again, that change doesn't seem to have anything to do with GC.
+        # As the weird behavior is transitory, e.g. it provably goes away on the next GC, I'll go with this
+        # workaround for now.


Yeah I seem to remember Ruby taking into account register values in its heuristic to detect potential usages of objects. May just be that with the extra caching of things like iseq and what not we make it slightly more likely to hit these edge cases?

Yeah this is my suspicion -- Ruby's GC is conservative, in that if anything "looks" like a VALUE, it assumes it is. Actually debugging this would mean going into the GC marking code and understand why exactly it's making this decision which is somewhat of a time investment which is why I took the slightly-easier way out 😅.

We haven't used the `start` parameter so far, but let's keep closer to upstream here. (We had already half-applied the change from upstream for another reason, so now we have the full fix).

ivoanjo added 15 commits August 8, 2024 11:56

Minor: Make argument name in header match actual function

d5062c5

They were out-of-sync >_>

Add helper function to clear per_thread_context

9c61d60

This will be more useful later once freeing a per_thread_context gets more complex.

Remove sampling_buffer argument from record_placeholder_stack

3d4b7fc

Recording a placeholder stack does not involve any sampling, so it doesn't actually need a sampling_buffer.

Extract checking max_frames into a separate function

5fd0460

This will be used by the thread context collector to validate max frames during initialization, separately from creating new sampling buffers.

Avoid allocating locations for every sampling_buffer/thread

99d5915

The locations, unlike the rest of the sampling buffer, do not (yet?) need to be per-thread, so let's save some memory.

Use smaller type for max_frames

05e552c

Not really to save memory, but semantically it's closer to what we want.

Avoid repeating call to rb_profile_frame_path for cframes

71839e0

Instead of storing the `last_ruby_frame` just to later need to re-retrieve the filename, let's directly store the `last_ruby_frame_filename`.

ivoanjo requested a review from a team as a code owner August 9, 2024 11:39

github-actions bot added the profiling Involves Datadog profiling label Aug 9, 2024

ivoanjo requested a review from AlexJF August 9, 2024 11:40

marcotc reviewed Aug 9, 2024

View reviewed changes

marcotc approved these changes Aug 9, 2024

View reviewed changes

Clarify why code is being left commented out

e6c9122

AlexJF approved these changes Aug 12, 2024

View reviewed changes

Pick up upstream fix for start parameter

607d5d5

We haven't used the `start` parameter so far, but let's keep closer to upstream here. (We had already half-applied the change from upstream for another reason, so now we have the full fix).

ivoanjo force-pushed the ivoanjo/prof-10128-stack-sampling-improvements branch from eb5af8c to 607d5d5 Compare August 13, 2024 10:31

ivoanjo merged commit 5296456 into master Aug 13, 2024
186 checks passed

ivoanjo deleted the ivoanjo/prof-10128-stack-sampling-improvements branch August 13, 2024 11:09

github-actions bot added this to the 2.3.0 milestone Aug 13, 2024

TonyCTHsu mentioned this pull request Aug 22, 2024

Bump to version 2.3.0 #3861

Merged

ivoanjo added the performance Involves performance (e.g. CPU, memory, etc) label Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PROF-10128] Refactor and speed up profiler stack sampling #3837

[PROF-10128] Refactor and speed up profiler stack sampling #3837

ivoanjo commented Aug 9, 2024

pr-commenter bot commented Aug 9, 2024 •

edited

Loading

marcotc Aug 9, 2024

marcotc Aug 9, 2024

ivoanjo Aug 12, 2024

marcotc Aug 9, 2024

ivoanjo Aug 12, 2024

marcotc left a comment

codecov-commenter commented Aug 12, 2024 •

edited

Loading

AlexJF Aug 12, 2024

ivoanjo Aug 13, 2024

AlexJF Aug 12, 2024 •

edited

Loading

ivoanjo Aug 13, 2024

AlexJF Aug 12, 2024 •

edited

Loading

ivoanjo Aug 13, 2024

		@@ -479,6 +480,16 @@ int ddtrace_rb_profile_frames(VALUE thread, int start, int limit, frame_info *st
		continue;
		}

[PROF-10128] Refactor and speed up profiler stack sampling #3837

[PROF-10128] Refactor and speed up profiler stack sampling #3837

Conversation

ivoanjo commented Aug 9, 2024

pr-commenter bot commented Aug 9, 2024 • edited Loading

Benchmarks

scenario:sample timeline=false

scenario:sample+serialize retain_every=10 heap_samples=false heap_size=false heap_sample_every=1 skip_end_gc=false

scenario:stack collector

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marcotc left a comment

Choose a reason for hiding this comment

codecov-commenter commented Aug 12, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexJF Aug 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexJF Aug 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pr-commenter bot commented Aug 9, 2024 •

edited

Loading

codecov-commenter commented Aug 12, 2024 •

edited

Loading

AlexJF Aug 12, 2024 •

edited

Loading

AlexJF Aug 12, 2024 •

edited

Loading