
[PROF-10588] Support GVL profiling preview on Ruby 3.2 #3939

Merged · 9 commits · Oct 4, 2024
51 changes: 48 additions & 3 deletions docs/ProfilingDevelopment.md
@@ -151,8 +151,7 @@ See comments at the top of `collectors_thread_context.c` for an explanation on h

Profiling the Ruby Global VM Lock (GVL) works by using the [GVL instrumentation API](https://github.com/ruby/ruby/pull/5500).

-This API currently only works on Ruby 3.2+, although there were a few changes and refinements on Ruby 3.3+ and as of this writing
-(September 2024) we _only_ support Ruby 3.3+.
+This API currently only works on Ruby 3.2+.

These blog posts are good starting points to understand this API:

@@ -240,7 +239,7 @@ The weirdest piece of this puzzle is done in the `ThreadContext` collector. Beca

Instead, we ask Ruby to hold the data we need for us using the `rb_internal_thread_specific` API. This API provides
an extremely low-level and limited thread-local storage mechanism, but one that's thread-safe and thus a good match
-for our needs. (This API is only available on Ruby 3.3+; to support Ruby 3.2 we'll need to figure out an alternative.)
+for our needs. (This API is only available on Ruby 3.3+; to support Ruby 3.2 we use an alternative implementation.)

So, here's how this works: the first time the `ThreadContext` collector sees a thread (e.g. when it creates a context
for it), it uses this API to tag this thread as a thread we want to know about:
@@ -292,3 +291,49 @@ There's some (lovely?) ASCII art in `handle_gvl_waiting` to explain how we creat
Once a thread is in the "Waiting for GVL" state, then all regular cpu/wall-time samples triggered by the `CpuAndWallTimeWorker`
will continue to mark the thread as being in this state, until `on_gvl_running` + `sample_after_gvl_running` happen and
clear the `per_thread_gvl_waiting_timestamp`, which will make samples revert back to the regular behavior.
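As a rough illustration (a hedged sketch, not the actual code — function names are made up, and the real transitions live in `collectors_thread_context.c`), the `per_thread_gvl_waiting_timestamp` can be thought of as a tiny state machine over a single `intptr_t`:

```c
#include <assert.h>
#include <stdint.h>

// Encoding of the per-thread slot:
//   0                         -> thread is not being tracked
//   GVL_WAITING_ENABLED_EMPTY -> tracked, but not currently waiting
//   positive timestamp (ns)   -> waiting for the GVL since that instant
//   negative timestamp (ns)   -> waiting just ended; next sample clears it
#define GVL_WAITING_ENABLED_EMPTY INTPTR_MAX

static intptr_t on_gvl_waiting(intptr_t state, intptr_t now_ns) {
  if (state == 0) return 0; // not a thread we track
  return now_ns;            // record when the wait started
}

static intptr_t on_gvl_running(intptr_t state) {
  if (state == 0 || state == GVL_WAITING_ENABLED_EMPTY) return state;
  return -state; // flip sign: thread resumed; keep timestamp for the next sample
                 // (the real code instead resets to empty when it decides not to sample)
}

static intptr_t sample_after_gvl_running(intptr_t state) {
  return state < 0 ? GVL_WAITING_ENABLED_EMPTY : state; // revert to regular behavior
}
```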

### Supporting Ruby 3.2

Supporting GVL profiling on Ruby 3.2 requires extra work. That's because, while Ruby 3.2 has the GVL
instrumentation API giving us the events we need to profile the GVL, we're missing:

1. Getting the Ruby thread `VALUE` as an argument in GVL instrumentation API events
2. The `rb_internal_thread_specific` API that allows us to attach data to Ruby thread objects in a thread-safe way

Both 1 and 2 were only introduced in Ruby 3.3, and our implementation of GVL profiling relies on them.

We bridge this gap with alternative implementations for Ruby 3.2:

To solve 1, we use native-level thread-locals (GCC's `__thread`) to keep a pointer to the underlying Ruby `rb_thread_t` structure.

This is more complex than "just keep it in a thread-local" because:

a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the underlying native thread around for a bit,
and if another Ruby thread is born very quickly after the previous one, Ruby will reuse the native thread and attach it
to the new Ruby thread.

To avoid incorrectly reusing the thread-locals, we install an event hook on Ruby thread start, and make sure to
clear any native thread-locals when a new thread starts.

b) Some of the GVL instrumentation API events are emitted while the thread does not hold the GVL, so we need to be
careful about when we can and cannot read VM information.

Thus, we only initialize the thread-local during `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, which is emitted while the
thread owns the GVL.

c) Since we don't get the current thread in events, we need to get a bit... creative. Because we know the current
thread MUST own the GVL during `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, we read from the internal Ruby VM state which
thread is the current GVL owner to find the info we need.

With a + b + c together, we are able to keep a pointer to the underlying `rb_thread_t` up-to-date in a native
thread-local, thus replacing the need to get a `VALUE thread` as an argument.

To solve 2, we rely on an important observation: there's a `VALUE stat_insn_usage` field inside `rb_thread_t` that's
unused and seems to have effectively been forgotten about.

Nowhere in the VM code writes or reads it (other than marking it for GC), and not even the git history reveals a
time when this field was used. I could not find any other references to this field anywhere else. Thus, we make use
of this field to store the information we need, as a replacement for `rb_internal_thread_specific`.

Since Ruby 3.2 will presumably never see this field removed or used during its remaining maintenance-release period,
this should work fine, and we keep the nicer, cleaner solution for 3.3+.
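As an illustration of the encoding this implies (a hedged sketch, not the actual code): because `stat_insn_usage` is GC-marked, whatever we store there must be a valid `VALUE`. Storing our `intptr_t` state encoded as a Ruby Fixnum keeps the GC happy, which also lines up with `GVL_WAITING_ENABLED_EMPTY` changing from `INTPTR_MAX` to `RUBY_FIXNUM_MAX` in this PR:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t VALUE; // stand-in for Ruby's VALUE type

// Ruby encodes a Fixnum n as (n << 1) | 1; any such value is an immediate,
// so GC-marking the field that holds it is a no-op.
static VALUE encode_state(intptr_t n) { return (((uintptr_t) n) << 1) | 1; }
static intptr_t decode_state(VALUE v) { return ((intptr_t) v) >> 1; }
```

Ruby's own `RB_INT2FIX`/`RB_FIX2LONG` macros implement this same encoding; whether the workaround stores values exactly this way is an assumption here.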
@@ -804,6 +804,10 @@ static VALUE release_gvl_and_run_sampling_trigger_loop(VALUE instance) {

if (state->gvl_profiling_enabled) {
#ifndef NO_GVL_INSTRUMENTATION
+#ifdef USE_GVL_PROFILING_3_2_WORKAROUNDS
+gvl_profiling_state_thread_tracking_workaround();
+#endif

state->gvl_profiling_hook = rb_internal_thread_add_event_hook(
on_gvl_event,
(
@@ -1304,7 +1308,7 @@ static VALUE _native_resume_signals(DDTRACE_UNUSED VALUE self) {
//
// In fact, the `target_thread` that this event is about may not even be the current thread. (So be careful with thread locals that
// are not directly tied to the `target_thread` object and the like)
-VALUE target_thread = event_data->thread;
+gvl_profiling_thread target_thread = thread_from_event(event_data);

if (event_id == RUBY_INTERNAL_THREAD_EVENT_READY) { /* waiting for gvl */
thread_context_collector_on_gvl_waiting(target_thread);
@@ -1315,11 +1319,19 @@ static VALUE _native_resume_signals(DDTRACE_UNUSED VALUE self) {
// it tags threads it's tracking, so if a thread is tagged then by definition we know that thread belongs to the main
// Ractor. Thus, if we really really wanted to access the state, we could do it after making sure we're on the correct Ractor.

+#ifdef USE_GVL_PROFILING_3_2_WORKAROUNDS
+target_thread = gvl_profiling_state_maybe_initialize();
+#endif

bool should_sample = thread_context_collector_on_gvl_running(target_thread);

if (should_sample) {
// should_sample is only true if a thread belongs to the main Ractor, so we're good to go
-rb_postponed_job_trigger(after_gvl_running_from_postponed_job_handle);
+#ifndef NO_POSTPONED_TRIGGER
+rb_postponed_job_trigger(after_gvl_running_from_postponed_job_handle);
+#else
+rb_postponed_job_register_one(0, after_gvl_running_from_postponed_job, NULL);
+#endif
}
} else {
// This is a very delicate time and it's hard for us to raise an exception so let's at least complain to stderr
@@ -1,5 +1,4 @@
#include <ruby.h>
-#include <ruby/thread.h>

#include "collectors_thread_context.h"
#include "clock_id.h"
@@ -80,7 +79,7 @@
// This is used as a placeholder to mark threads that are allowed to be profiled (enabled)
// (e.g. to avoid trying to gvl profile threads that are not from the main Ractor)
// and for which there's no data yet
-#define GVL_WAITING_ENABLED_EMPTY INTPTR_MAX
+#define GVL_WAITING_ENABLED_EMPTY RUBY_FIXNUM_MAX

static ID at_active_span_id; // id of :@active_span in Ruby
static ID at_active_trace_id; // id of :@active_trace in Ruby
@@ -92,10 +91,6 @@ static ID at_otel_values_id; // id of :@otel_values in Ruby
static ID at_parent_span_id_id; // id of :@parent_span_id in Ruby
static ID at_datadog_trace_id; // id of :@datadog_trace in Ruby

-#ifndef NO_GVL_INSTRUMENTATION
-static rb_internal_thread_specific_key_t per_thread_gvl_waiting_timestamp_key;
-#endif

// This is used by `thread_context_collector_on_gvl_running`. Because when that method gets called we're not sure if
// it's safe to access the state of the thread context collector, we store this setting as a global value. This does
// mean this setting is shared among all thread context collectors, and thus it's "last writer wins".
@@ -320,7 +315,7 @@ void collectors_thread_context_init(VALUE profiling_module) {

#ifndef NO_GVL_INSTRUMENTATION
// This will raise if Ruby already ran out of thread-local keys
-per_thread_gvl_waiting_timestamp_key = rb_internal_thread_specific_key_create();
+gvl_profiling_init();
#endif

gc_profiling_init();
@@ -1019,7 +1014,7 @@ static void initialize_context(VALUE thread, struct per_thread_context *thread_c
// (Setting this is potentially a race, but what we want is to avoid _stale_ data, so
// if this gets set concurrently with context initialization, then such a value will belong
// to the current profiler instance, so that's OK)
-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) GVL_WAITING_ENABLED_EMPTY);
+gvl_profiling_state_thread_object_set(thread, GVL_WAITING_ENABLED_EMPTY);
#endif
}

@@ -1083,7 +1078,7 @@ static int per_thread_context_as_ruby_hash(st_data_t key_thread, st_data_t value
ID2SYM(rb_intern("gc_tracking.wall_time_at_start_ns")), /* => */ LONG2NUM(thread_context->gc_tracking.wall_time_at_start_ns),

#ifndef NO_GVL_INSTRUMENTATION
-ID2SYM(rb_intern("gvl_waiting_at")), /* => */ LONG2NUM((intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key)),
+ID2SYM(rb_intern("gvl_waiting_at")), /* => */ LONG2NUM(gvl_profiling_state_thread_object_get(thread)),
#endif
};
for (long unsigned int i = 0; i < VALUE_COUNT(arguments); i += 2) rb_hash_aset(context_as_hash, arguments[i], arguments[i+1]);
@@ -1558,7 +1553,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self

#ifndef NO_GVL_INSTRUMENTATION
// This function can get called from outside the GVL and even on non-main Ractors
-void thread_context_collector_on_gvl_waiting(VALUE thread) {
+void thread_context_collector_on_gvl_waiting(gvl_profiling_thread thread) {
// Because this function gets called from a thread that is NOT holding the GVL, we avoid touching the
// per-thread context directly.
//
@@ -1567,19 +1562,19 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
//
// Also, this function can get called on the non-main Ractor. We deal with this by checking if the value in the context
// is non-zero, since only `initialize_context` ever sets the value from 0 to non-zero for threads it sees.
-intptr_t thread_being_profiled = (intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key);
+intptr_t thread_being_profiled = gvl_profiling_state_get(thread);
if (!thread_being_profiled) return;

long current_monotonic_wall_time_ns = monotonic_wall_time_now_ns(DO_NOT_RAISE_ON_FAILURE);
if (current_monotonic_wall_time_ns <= 0 || current_monotonic_wall_time_ns > GVL_WAITING_ENABLED_EMPTY) return;

-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) current_monotonic_wall_time_ns);
+gvl_profiling_state_set(thread, current_monotonic_wall_time_ns);
}

// This function can get called from outside the GVL and even on non-main Ractors
__attribute__((warn_unused_result))
-bool thread_context_collector_on_gvl_running_with_threshold(VALUE thread, uint32_t waiting_for_gvl_threshold_ns) {
-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key);
+bool thread_context_collector_on_gvl_running_with_threshold(gvl_profiling_thread thread, uint32_t waiting_for_gvl_threshold_ns) {
+intptr_t gvl_waiting_at = gvl_profiling_state_get(thread);

// Thread was not being profiled / not waiting on gvl
if (gvl_waiting_at == 0 || gvl_waiting_at == GVL_WAITING_ENABLED_EMPTY) return false;
@@ -1595,17 +1590,17 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
// We flip the gvl_waiting_at to negative to mark that the thread is now running and no longer waiting
intptr_t gvl_waiting_at_is_now_running = -gvl_waiting_at;

-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) gvl_waiting_at_is_now_running);
+gvl_profiling_state_set(thread, gvl_waiting_at_is_now_running);
} else {
// We decided not to sample. Let's mark the thread back to the initial "enabled but empty" state
-rb_internal_thread_specific_set(thread, per_thread_gvl_waiting_timestamp_key, (void *) GVL_WAITING_ENABLED_EMPTY);
+gvl_profiling_state_set(thread, GVL_WAITING_ENABLED_EMPTY);
}

return should_sample;
}

__attribute__((warn_unused_result))
-bool thread_context_collector_on_gvl_running(VALUE thread) {
+bool thread_context_collector_on_gvl_running(gvl_profiling_thread thread) {
return thread_context_collector_on_gvl_running_with_threshold(thread, global_waiting_for_gvl_threshold_ns);
}

@@ -1643,9 +1638,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self

if (!state->timeline_enabled) rb_raise(rb_eRuntimeError, "GVL profiling requires timeline to be enabled");

-struct per_thread_context *thread_context = get_or_create_context_for(current_thread, state);
-
-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(current_thread, per_thread_gvl_waiting_timestamp_key);
+intptr_t gvl_waiting_at = gvl_profiling_state_thread_object_get(current_thread);

if (gvl_waiting_at >= 0) {
// @ivoanjo: I'm not sure if this can ever happen. This means that we're not on the same thread
@@ -1655,11 +1648,15 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
return Qfalse;
}

+struct per_thread_context *thread_context = get_or_create_context_for(current_thread, state);

// We don't actually account for cpu-time during Waiting for GVL. BUT, we may chose to push an
// extra sample to represent the period prior to Waiting for GVL. To support that, we retrieve the current
// cpu-time of the thread and let `update_metrics_and_sample` decide what to do with it.
long cpu_time_for_thread = cpu_time_now_ns(thread_context);

+// TODO: Should we update the dynamic sampling rate overhead tracking with this sample as well?

update_metrics_and_sample(
state,
/* thread_being_sampled: */ current_thread,
@@ -1688,7 +1685,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
sampling_buffer* sampling_buffer,
long current_cpu_time_ns
) {
-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(thread_being_sampled, per_thread_gvl_waiting_timestamp_key);
+intptr_t gvl_waiting_at = gvl_profiling_state_thread_object_get(thread_being_sampled);

bool is_gvl_waiting_state = gvl_waiting_at != 0 && gvl_waiting_at != GVL_WAITING_ENABLED_EMPTY;

@@ -1738,7 +1735,7 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self

if (gvl_waiting_at < 0) {
// Negative means the waiting for GVL just ended, so we clear the state, so next samples no longer represent waiting
-rb_internal_thread_specific_set(thread_being_sampled, per_thread_gvl_waiting_timestamp_key, (void *) GVL_WAITING_ENABLED_EMPTY);
+gvl_profiling_state_thread_object_set(thread_being_sampled, GVL_WAITING_ENABLED_EMPTY);
}

long gvl_waiting_started_wall_time_ns = labs(gvl_waiting_at);
@@ -1779,21 +1776,21 @@ static VALUE _native_sample_skipped_allocation_samples(DDTRACE_UNUSED VALUE self
static VALUE _native_on_gvl_waiting(DDTRACE_UNUSED VALUE self, VALUE thread) {
ENFORCE_THREAD(thread);

-thread_context_collector_on_gvl_waiting(thread);
+thread_context_collector_on_gvl_waiting(thread_from_thread_object(thread));
return Qnil;
}

static VALUE _native_gvl_waiting_at_for(DDTRACE_UNUSED VALUE self, VALUE thread) {
ENFORCE_THREAD(thread);

-intptr_t gvl_waiting_at = (intptr_t) rb_internal_thread_specific_get(thread, per_thread_gvl_waiting_timestamp_key);
+intptr_t gvl_waiting_at = gvl_profiling_state_thread_object_get(thread);
return LONG2NUM(gvl_waiting_at);
}

static VALUE _native_on_gvl_running(DDTRACE_UNUSED VALUE self, VALUE thread) {
ENFORCE_THREAD(thread);

-return thread_context_collector_on_gvl_running(thread) ? Qtrue : Qfalse;
+return thread_context_collector_on_gvl_running(thread_from_thread_object(thread)) ? Qtrue : Qfalse;
}

static VALUE _native_sample_after_gvl_running(DDTRACE_UNUSED VALUE self, VALUE collector_instance, VALUE thread) {
@@ -3,6 +3,8 @@
#include <ruby.h>
#include <stdbool.h>

+#include "gvl_profiling_helper.h"

void thread_context_collector_sample(
VALUE self_instance,
long current_monotonic_wall_time_ns,
Expand All @@ -14,6 +16,9 @@ VALUE thread_context_collector_sample_after_gc(VALUE self_instance);
void thread_context_collector_on_gc_start(VALUE self_instance);
__attribute__((warn_unused_result)) bool thread_context_collector_on_gc_finish(VALUE self_instance);
VALUE enforce_thread_context_collector_instance(VALUE object);
-void thread_context_collector_on_gvl_waiting(VALUE thread);
-__attribute__((warn_unused_result)) bool thread_context_collector_on_gvl_running(VALUE thread);
-VALUE thread_context_collector_sample_after_gvl_running(VALUE self_instance);

+#ifndef NO_GVL_INSTRUMENTATION
+void thread_context_collector_on_gvl_waiting(gvl_profiling_thread thread);
+__attribute__((warn_unused_result)) bool thread_context_collector_on_gvl_running(gvl_profiling_thread thread);
+VALUE thread_context_collector_sample_after_gvl_running(VALUE self_instance);
+#endif
10 changes: 6 additions & 4 deletions ext/datadog_profiling_native_extension/extconf.rb
@@ -131,10 +131,6 @@ def skip_building_extension!(reason)

have_func "malloc_stats"

-# On older Rubies, there was no GVL instrumentation API and APIs created to support it
-# TODO: We can probably support Ruby 3.2 as well here, but we haven't done that work yet
-$defs << "-DNO_GVL_INSTRUMENTATION" if RUBY_VERSION < "3.3"

# On older Rubies, rb_postponed_job_preregister/rb_postponed_job_trigger did not exist
$defs << "-DNO_POSTPONED_TRIGGER" if RUBY_VERSION < "3.3"

@@ -147,6 +143,12 @@ def skip_building_extension!(reason)
# On older Rubies, some of the Ractor internal APIs were directly accessible
$defs << "-DUSE_RACTOR_INTERNAL_APIS_DIRECTLY" if RUBY_VERSION < "3.3"

+# On older Rubies, there was no GVL instrumentation API and APIs created to support it
+$defs << "-DNO_GVL_INSTRUMENTATION" if RUBY_VERSION < "3.2"

+# Supporting GVL instrumentation on 3.2 needs some workarounds
+$defs << "-DUSE_GVL_PROFILING_3_2_WORKAROUNDS" if RUBY_VERSION.start_with?("3.2")

# On older Rubies, there was no struct rb_native_thread. See private_vm_api_acccess.c for details.
$defs << "-DNO_RB_NATIVE_THREAD" if RUBY_VERSION < "3.2"
