
[PROF-10588] Support GVL profiling preview on Ruby 3.2 #3939

Merged · 9 commits · Oct 4, 2024
51 changes: 48 additions & 3 deletions docs/ProfilingDevelopment.md
@@ -151,8 +151,7 @@ See comments at the top of `collectors_thread_context.c` for an explanation on h

Profiling the Ruby Global VM Lock (GVL) works by using the [GVL instrumentation API](https://github.com/ruby/ruby/pull/5500).

-This API currently only works on Ruby 3.2+, although there were a few changes and refinements on Ruby 3.3+ and as of this writing
-(September 2024) we _only_ support Ruby 3.3+.
+This API currently only works on Ruby 3.2+.

These blog posts are good starting points to understand this API:

@@ -240,7 +239,7 @@ The weirdest piece of this puzzle is done in the `ThreadContext` collector. Beca

Instead, we ask Ruby to hold the data we need for us using the `rb_internal_thread_specific` API. This API provides
an extremely low-level and limited thread-local storage mechanism, but one that's thread-safe and thus a good match
-for our needs. (This API is only available on Ruby 3.3+; to support Ruby 3.2 we'll need to figure out an alternative.)
+for our needs. (This API is only available on Ruby 3.3+; to support Ruby 3.2 we use an alternative implementation.)

So, here's how this works: the first time the `ThreadContext` collector sees a thread (e.g. when it creates a context
for it), it uses this API to tag this thread as a thread we want to know about:
@@ -292,3 +291,49 @@ There's some (lovely?) ASCII art in `handle_gvl_waiting` to explain how we creat
Once a thread is in the "Waiting for GVL" state, then all regular cpu/wall-time samples triggered by the `CpuAndWallTimeWorker`
will continue to mark the thread as being in this state, until `on_gvl_running` + `sample_after_gvl_running` happen and
clear the `per_thread_gvl_waiting_timestamp`, which will make samples revert back to the regular behavior.
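As a rough illustration (a hedged sketch, not the actual code — function names are made up, and the real transitions live in `collectors_thread_context.c`), the `per_thread_gvl_waiting_timestamp` can be thought of as a tiny state machine over a single `intptr_t`:

```c
#include <assert.h>
#include <stdint.h>

// Encoding of the per-thread slot:
//   0                         -> thread is not being tracked
//   GVL_WAITING_ENABLED_EMPTY -> tracked, but not currently waiting
//   positive timestamp (ns)   -> waiting for the GVL since that instant
//   negative timestamp (ns)   -> waiting just ended; next sample clears it
#define GVL_WAITING_ENABLED_EMPTY INTPTR_MAX

static intptr_t on_gvl_waiting(intptr_t state, intptr_t now_ns) {
  if (state == 0) return 0; // not a thread we track
  return now_ns;            // record when the wait started
}

static intptr_t on_gvl_running(intptr_t state) {
  if (state == 0 || state == GVL_WAITING_ENABLED_EMPTY) return state;
  return -state; // flip sign: thread resumed; keep timestamp for the next sample
                 // (the real code instead resets to empty when it decides not to sample)
}

static intptr_t sample_after_gvl_running(intptr_t state) {
  return state < 0 ? GVL_WAITING_ENABLED_EMPTY : state; // revert to regular behavior
}
```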

### Supporting Ruby 3.2

Supporting GVL profiling on Ruby 3.2 requires extra work. That's because, while Ruby 3.2 has the GVL
instrumentation API giving us the events we need to profile the GVL, we're missing:

1. Getting the Ruby thread `VALUE` as an argument in GVL instrumentation API events
2. The `rb_internal_thread_specific` API that allows us to attach data to Ruby thread objects in a thread-safe way

Both 1 and 2 were only introduced in Ruby 3.3, and our implementation of GVL profiling relies on them.

We bridge this gap with alternative implementations for Ruby 3.2:

To solve 1, we use native-level thread-locals (GCC's `__thread`) to keep a pointer to the underlying Ruby `rb_thread_t` structure.

This is more complex than "just keep it in a thread-local" because:

a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the underlying native thread around for a bit,
and if another Ruby thread is born very quickly after the previous one, Ruby will reuse the native thread and attach it
to the new Ruby thread.

To avoid incorrectly reusing the thread-locals, we install an event hook on Ruby thread start, and make sure to
clear any native thread-locals when a new thread starts.

b) Some of the GVL instrumentation API events are emitted while the thread does not hold the GVL, so we need to be
careful about when we can and cannot read VM information.

Thus, we only initialize the thread-local during `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, which is emitted while the
thread owns the GVL.

c) Since we don't get the current thread in events, we need to get a bit... creative. Because we know the current
thread MUST own the GVL during `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, we read from the internal Ruby VM state which
thread is the current GVL owner to find the info we need.

With a + b + c together, we are able to keep a pointer to the underlying `rb_thread_t` up-to-date in a native
thread-local, thus replacing the need to get a `VALUE thread` as an argument.

To solve 2, we rely on an important observation: there's a `VALUE stat_insn_usage` field inside `rb_thread_t` that's
unused and seems to have effectively been forgotten about.

Nowhere in the VM code writes or reads it (other than marking it for GC), and not even the git history reveals a
time when this field was used. I could not find any other references to this field anywhere else. Thus, we make use
of this field to store the information we need, as a replacement for `rb_internal_thread_specific`.

Since Ruby 3.2 will presumably never see this field removed or used during its remaining maintenance-release period,
this should work fine, and we keep the nicer, cleaner solution for 3.3+.
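As an illustration of the encoding this implies (a hedged sketch, not the actual code): because `stat_insn_usage` is GC-marked, whatever we store there must be a valid `VALUE`. Storing our `intptr_t` state encoded as a Ruby Fixnum keeps the GC happy, which also lines up with `GVL_WAITING_ENABLED_EMPTY` changing from `INTPTR_MAX` to `RUBY_FIXNUM_MAX` in this PR:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t VALUE; // stand-in for Ruby's VALUE type

// Ruby encodes a Fixnum n as (n << 1) | 1; any such value is an immediate,
// so GC-marking the field that holds it is a no-op.
static VALUE encode_state(intptr_t n) { return (((uintptr_t) n) << 1) | 1; }
static intptr_t decode_state(VALUE v) { return ((intptr_t) v) >> 1; }
```

Ruby's own `RB_INT2FIX`/`RB_FIX2LONG` macros implement this same encoding; whether the workaround stores values exactly this way is an assumption here.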
@@ -804,6 +804,10 @@ static VALUE release_gvl_and_run_sampling_trigger_loop(VALUE instance) {

if (state->gvl_profiling_enabled) {
#ifndef NO_GVL_INSTRUMENTATION
+#ifdef USE_GVL_PROFILING_3_2_WORKAROUNDS
+gvl_profiling_state_thread_tracking_workaround();
+#endif

state->gvl_profiling_hook = rb_internal_thread_add_event_hook(
on_gvl_event,
(
@@ -1304,7 +1308,7 @@ static VALUE _native_resume_signals(DDTRACE_UNUSED VALUE self) {
//
// In fact, the `target_thread` that this event is about may not even be the current thread. (So be careful with thread locals that
// are not directly tied to the `target_thread` object and the like)
-VALUE target_thread = event_data->thread;
+gvl_profiling_thread target_thread = thread_from_event(event_data);

if (event_id == RUBY_INTERNAL_THREAD_EVENT_READY) { /* waiting for gvl */
thread_context_collector_on_gvl_waiting(target_thread);
@@ -1315,11 +1319,19 @@ static VALUE _native_resume_signals(DDTRACE_UNUSED VALUE self) {
// it tags threads it's tracking, so if a thread is tagged then by definition we know that thread belongs to the main
// Ractor. Thus, if we really really wanted to access the state, we could do it after making sure we're on the correct Ractor.

+#ifdef USE_GVL_PROFILING_3_2_WORKAROUNDS
+target_thread = gvl_profiling_state_maybe_initialize();
+#endif

bool should_sample = thread_context_collector_on_gvl_running(target_thread);

if (should_sample) {
// should_sample is only true if a thread belongs to the main Ractor, so we're good to go
-rb_postponed_job_trigger(after_gvl_running_from_postponed_job_handle);
+#ifndef NO_POSTPONED_TRIGGER
+rb_postponed_job_trigger(after_gvl_running_from_postponed_job_handle);
+#else
+rb_postponed_job_register_one(0, after_gvl_running_from_postponed_job, NULL);
+#endif
}
} else {
// This is a very delicate time and it's hard for us to raise an exception so let's at least complain to stderr
@@ -1,5 +1,4 @@
#include <ruby.h>
-#include <ruby/thread.h>

#include "collectors_thread_context.h"
#include "clock_id.h"
@@ -80,7 +79,7 @@
// This is used as a placeholder to mark threads that are allowed to be profiled (enabled)
// (e.g. to avoid trying to gvl profile threads that are not from the main Ractor)
// and for which there's no data yet
-#define GVL_WAITING_ENABLED_EMPTY INTPTR_MAX
+#define GVL_WAITING_ENABLED_EMPTY RUBY_FIXNUM_MAX

static ID at_active_span_id; // id of :@active_span in Ruby
static ID at_active_trace_id; // id of :@active_trace in Ruby
@@ -92,10 +91,6 @@ static ID at_otel_values_id; // id of :@otel_values in Ruby
static ID at_parent_span_id_id; // id of :@parent_span_id in Ruby
static ID at_datadog_trace_id; // id of :@datadog_trace in Ruby

-#ifndef NO_GVL_INSTRUMENTATION
-static rb_internal_thread_specific_key_t per_thread_gvl_waiting_timestamp_key;
-#endif

// This is used by `thread_context_collector_on_gvl_running`. Because when that method gets called we're not sure if
// it's safe to access the state of the thread context collector, we store this setting as a global value. This does
// mean this setting is shared among all thread context collectors, and thus it's "last writer wins".
@@ -320,7 +315,7 @@ void collectors_thread_context_init(VALUE profiling_module) {

#ifndef NO_GVL_INSTRUMENTATION
// This will raise if Ruby already ran out of thread-local keys
-per_thread_gvl_waiting_timestamp_key = rb_internal_thread_specific_key_create();
+gvl_profiling_init();
#endif

gc_profiling_init();
@@ -1019,7 +1014,7 @@ static void initialize_context(VALUE thread, struct per_thread_context *thread_c
// (Setting this is potentially a race, but what we want is to avoid _stale_ data, so
// if this gets set concurrently with context initialization, then such a value will belong
// to the current profiler instance, so that's OK)
-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) GVL_WAITING_ENABLED_EMPTY);
+gvl_profiling_state_thread_object_set(thread, GVL_WAITING_ENABLED_EMPTY);
#endif
}

@@ -1083,7 +1078,7 @@ static int per_thread_context_as_ruby_hash(st_data_t key_thread, st_data_t value
ID2SYM(rb_intern("gc_tracking.wall_time_at_start_ns")), /* => */ LONG2NUM(thread_context->gc_tracking.wall_time_at_start_ns),

#ifndef NO_GVL_INSTRUMENTATION
-ID2SYM(rb_intern("gvl_waiting_at")), /* => */ LONG2NUM((intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key)),
+ID2SYM(rb_intern("gvl_waiting_at")), /* => */ LONG2NUM(gvl_profiling_state_thread_object_get(thread)),
#endif
};
for (long unsigned int i = 0; i < VALUE_COUNT(arguments); i += 2) rb_hash_aset(context_as_hash, arguments[i], arguments[i+1]);
@@ -1558,7 +1553,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self

#ifndef NO_GVL_INSTRUMENTATION
// This function can get called from outside the GVL and even on non-main Ractors
-void thread_context_collector_on_gvl_waiting(VALUE thread) {
+void thread_context_collector_on_gvl_waiting(gvl_profiling_thread thread) {
// Because this function gets called from a thread that is NOT holding the GVL, we avoid touching the
// per-thread context directly.
//
@@ -1567,19 +1562,19 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
//
// Also, this function can get called on the non-main Ractor. We deal with this by checking if the value in the context
// is non-zero, since only `initialize_context` ever sets the value from 0 to non-zero for threads it sees.
-intptr_t thread_being_profiled = (intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key);
+intptr_t thread_being_profiled = gvl_profiling_state_get(thread);
if (!thread_being_profiled) return;

long current_monotonic_wall_time_ns = monotonic_wall_time_now_ns(DO_NOT_RAISE_ON_FAILURE);
if (current_monotonic_wall_time_ns <= 0 || current_monotonic_wall_time_ns > GVL_WAITING_ENABLED_EMPTY) return;

-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) current_monotonic_wall_time_ns);
+gvl_profiling_state_set(thread, current_monotonic_wall_time_ns);
}

// This function can get called from outside the GVL and even on non-main Ractors
__attribute__((warn_unused_result))
-bool thread_context_collector_on_gvl_running_with_threshold(VALUE thread, uint32_t waiting_for_gvl_threshold_ns) {
-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key);
+bool thread_context_collector_on_gvl_running_with_threshold(gvl_profiling_thread thread, uint32_t waiting_for_gvl_threshold_ns) {
+intptr_t gvl_waiting_at = gvl_profiling_state_get(thread);

// Thread was not being profiled / not waiting on gvl
if (gvl_waiting_at == 0 || gvl_waiting_at == GVL_WAITING_ENABLED_EMPTY) return false;
@@ -1595,17 +1590,17 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
// We flip the gvl_waiting_at to negative to mark that the thread is now running and no longer waiting
intptr_t gvl_waiting_at_is_now_running = -gvl_waiting_at;

-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) gvl_waiting_at_is_now_running);
+gvl_profiling_state_set(thread, gvl_waiting_at_is_now_running);
} else {
// We decided not to sample. Let's mark the thread back to the initial "enabled but empty" state
-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) GVL_WAITING_ENABLED_EMPTY);
+gvl_profiling_state_set(thread, GVL_WAITING_ENABLED_EMPTY);
}

return should_sample;
}

__attribute__((warn_unused_result))
-bool thread_context_collector_on_gvl_running(VALUE thread) {
+bool thread_context_collector_on_gvl_running(gvl_profiling_thread thread) {
return thread_context_collector_on_gvl_running_with_threshold(thread, global_waiting_for_gvl_threshold_ns);
}

@@ -1643,9 +1638,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self

if (!state->timeline_enabled) rb_raise(rb_eRuntimeError, "GVL profiling requires timeline to be enabled");

-struct per_thread_context *thread_context = get_or_create_context_for(current_thread, state);
-
-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(current_thread, per_thread_gvl_waiting_timestamp_key);
+intptr_t gvl_waiting_at = gvl_profiling_state_thread_object_get(current_thread);

if (gvl_waiting_at >= 0) {
// @ivoanjo: I'm not sure if this can ever happen. This means that we're not on the same thread
@@ -1655,11 +1648,15 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
return Qfalse;
}

+struct per_thread_context *thread_context = get_or_create_context_for(current_thread, state);

// We don't actually account for cpu-time during Waiting for GVL. BUT, we may chose to push an
// extra sample to represent the period prior to Waiting for GVL. To support that, we retrieve the current
// cpu-time of the thread and let `update_metrics_and_sample` decide what to do with it.
long cpu_time_for_thread = cpu_time_now_ns(thread_context);

+// TODO: Should we update the dynamic sampling rate overhead tracking with this sample as well?

update_metrics_and_sample(
state,
/* thread_being_sampled: */ current_thread,
@@ -1688,7 +1685,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
sampling_buffer* sampling_buffer,
long current_cpu_time_ns
) {
-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(thread_being_sampled, per_thread_gvl_waiting_timestamp_key);
+intptr_t gvl_waiting_at = gvl_profiling_state_thread_object_get(thread_being_sampled);

bool is_gvl_waiting_state = gvl_waiting_at != 0 && gvl_waiting_at != GVL_WAITING_ENABLED_EMPTY;

@@ -1738,7 +1735,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self

if (gvl_waiting_at < 0) {
// Negative means the waiting for GVL just ended, so we clear the state, so next samples no longer represent waiting
-rb_internal_thread_specific_set(thread_being_sampled, per_thread_gvl_waiting_timestamp_key, (void *) GVL_WAITING_ENABLED_EMPTY);
+gvl_profiling_state_thread_object_set(thread_being_sampled, GVL_WAITING_ENABLED_EMPTY);
}

long gvl_waiting_started_wall_time_ns = labs(gvl_waiting_at);
@@ -1779,21 +1776,21 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
static VALUE _native_on_gvl_waiting(DDTRACE_UNUSED VALUE self, VALUE thread) {
ENFORCE_THREAD(thread);

-thread_context_collector_on_gvl_waiting(thread);
+thread_context_collector_on_gvl_waiting(thread_from_thread_object(thread));
return Qnil;
}

static VALUE _native_gvl_waiting_at_for(DDTRACE_UNUSED VALUE self, VALUE thread) {
ENFORCE_THREAD(thread);

-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key);
+intptr_t gvl_waiting_at = gvl_profiling_state_thread_object_get(thread);
return LONG2NUM(gvl_waiting_at);
}

static VALUE _native_on_gvl_running(DDTRACE_UNUSED VALUE self, VALUE thread) {
ENFORCE_THREAD(thread);

-return thread_context_collector_on_gvl_running(thread) ? Qtrue : Qfalse;
+return thread_context_collector_on_gvl_running(thread_from_thread_object(thread)) ? Qtrue : Qfalse;
}

static VALUE _native_sample_after_gvl_running(DDTRACE_UNUSED VALUE self, VALUE collector_instance, VALUE thread) {
@@ -3,6 +3,8 @@
#include <ruby.h>
#include <stdbool.h>

+#include "gvl_profiling_helper.h"

void thread_context_collector_sample(
VALUE self_instance,
long current_monotonic_wall_time_ns,
Expand All @@ -14,6 +16,9 @@ VALUE thread_context_collector_sample_after_gc(VALUE self_instance);
void thread_context_collector_on_gc_start(VALUE self_instance);
__attribute__((warn_unused_result)) bool thread_context_collector_on_gc_finish(VALUE self_instance);
VALUE enforce_thread_context_collector_instance(VALUE object);
-void thread_context_collector_on_gvl_waiting(VALUE thread);
-__attribute__((warn_unused_result)) bool thread_context_collector_on_gvl_running(VALUE thread);
-VALUE thread_context_collector_sample_after_gvl_running(VALUE self_instance);

+#ifndef NO_GVL_INSTRUMENTATION
+void thread_context_collector_on_gvl_waiting(gvl_profiling_thread thread);
+__attribute__((warn_unused_result)) bool thread_context_collector_on_gvl_running(gvl_profiling_thread thread);
+VALUE thread_context_collector_sample_after_gvl_running(VALUE self_instance);
+#endif
10 changes: 6 additions & 4 deletions ext/datadog_profiling_native_extension/extconf.rb
@@ -131,10 +131,6 @@ def skip_building_extension!(reason)

have_func "malloc_stats"

-# On older Rubies, there was no GVL instrumentation API and APIs created to support it
-# TODO: We can probably support Ruby 3.2 as well here, but we haven't done that work yet
-$defs << "-DNO_GVL_INSTRUMENTATION" if RUBY_VERSION < "3.3"

# On older Rubies, rb_postponed_job_preregister/rb_postponed_job_trigger did not exist
$defs << "-DNO_POSTPONED_TRIGGER" if RUBY_VERSION < "3.3"

@@ -147,6 +143,12 @@ def skip_building_extension!(reason)
# On older Rubies, some of the Ractor internal APIs were directly accessible
$defs << "-DUSE_RACTOR_INTERNAL_APIS_DIRECTLY" if RUBY_VERSION < "3.3"

+# On older Rubies, there was no GVL instrumentation API and APIs created to support it
+$defs << "-DNO_GVL_INSTRUMENTATION" if RUBY_VERSION < "3.2"

+# Supporting GVL instrumentation on 3.2 needs some workarounds
+$defs << "-DUSE_GVL_PROFILING_3_2_WORKAROUNDS" if RUBY_VERSION.start_with?("3.2")

# On older Rubies, there was no struct rb_native_thread. See private_vm_api_acccess.c for details.
$defs << "-DNO_RB_NATIVE_THREAD" if RUBY_VERSION < "3.2"
