[PROF-5943] Clear leftover state on new profiler after Ruby VM forks #2367

ivoanjo · 2022-11-15T17:29:14Z

What does this PR do?:

This PR is a follow-up from work started in #2359 and continued in #2362 for fixing and improving support for Ruby VM forks in the new profiler.

When a Ruby application forks (such as the puma webserver in multiprocess mode), we automatically restart the profiler in the child processes.

But, as part of restarting we should also make sure to reset any state that is leftover from the parent process.

This PR makes sure that, for the new profiler:

1. Resetting the state is done in a safe way (e.g. without concurrency issues). This required a refactoring of how we do it.
2. There is no leftover state from the parent.

While the old profiler codepath was also affected by the refactoring in this PR, it was already doing the two steps above correctly; the changes in this PR are just to make it easier to support the feature in the new profiler.
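To illustrate why leftover state matters, here is a minimal sketch of how in-memory state is inherited by a forked child. `SketchSampleBuffer` and `run_in_child` are hypothetical names made up for this example and are not part of the profiler; the sketch assumes a Ruby with `Process.fork` support (e.g. MRI on Linux or macOS, not JRuby).

```ruby
# Hypothetical stand-in for a profiler component that holds in-memory state.
class SketchSampleBuffer
  attr_reader :samples

  def initialize
    @samples = []
  end

  def record(sample)
    @samples << sample
  end

  # Drops anything inherited from the parent process.
  def reset_after_fork
    @samples.clear
  end
end

# Hypothetical helper: runs the block in a forked child and returns,
# as a string, whatever the block evaluated to.
def run_in_child
  reader, writer = IO.pipe
  pid = Process.fork do
    reader.close
    writer.write(yield.to_s)
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end
  writer.close
  output = reader.read
  reader.close
  Process.wait(pid)
  output
end

buffer = SketchSampleBuffer.new
buffer.record(:parent_sample)

# Without a reset, the child still sees the parent's sample.
inherited = run_in_child { buffer.samples.size }

# With an explicit reset, the child starts from a clean slate.
after_reset = run_in_child do
  buffer.reset_after_fork
  buffer.samples.size
end
```

In this sketch the parent's own buffer is untouched by the child's reset, since each process has its own copy of the memory after the fork.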

Motivation:

It's common for Ruby apps to make use of fork, and this is a configuration we expect to continue to support in the new Ruby profiler.

Additional Notes:

This PR is easier to review commit-by-commit.

How to test the change?:

Besides the included test coverage, this can be manually seen in action by validating that the profiler (both on the old codepath and the new codepath) restarts correctly on child processes after a fork and that profiles reported by those child processes are correct and do not contain left-over state from the parent.

Our `Thread` monkey patch, including `#update_native_ids`, was removed in #1740. This code was left over in `setup.rb` since we had a fallback for `Thread`s without the monkey patch, but it effectively never runs.

For the old profiler codepath, there are 3 pieces of state that need to be reset after a fork:

1. The profiling data inside the `OldRecorder`
2. The `@last_flush_finish_at` timestamp inside the `Exporter`
3. The cpu-time tracking inside the `Collectors::OldStack` (unchanged in this PR)

Previously, 1) and 2) were triggered "indirectly" after a fork because the `Profiler` called `Scheduler#start`, which due to shared Datadog workers code called `Scheduler#after_fork`, which then called `Exporter#clear`.

This worked fine, but for the new profiler codepath, this posed a few complications -- in particular, that cleaning up 1) and 2) could happen after the collector had already restarted, so it would need to take into account the potential concurrency issues.

To simplify the new profiler codepath, let's instead make all of the state resetting:

a) Explicit behind a call to `#reset_after_fork` that every component gets
b) Sequential -- by making sure that `#reset_after_fork` gets called in the collectors before the scheduler AND before any components are restarted
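The explicit and sequential ordering described in this commit message can be sketched roughly as follows. This is a simplified illustration, not the actual `Profiler` code: `SketchComponent`, `SketchProfiler` and `restart_after_fork` are hypothetical names, and only `#reset_after_fork` comes from the PR itself.

```ruby
# Hypothetical component that records its lifecycle calls in a shared log,
# so the ordering can be inspected afterwards.
class SketchComponent
  def initialize(name, log)
    @name = name
    @log = log
  end

  def reset_after_fork
    @log << "#{@name}.reset_after_fork"
  end

  def start
    @log << "#{@name}.start"
  end
end

# Hypothetical profiler that owns the collectors and the scheduler.
class SketchProfiler
  def initialize(collectors:, scheduler:)
    @collectors = collectors
    @scheduler = scheduler
  end

  # Assumed to be called in the child process after a fork.
  def restart_after_fork
    # Explicit and sequential: collectors are reset first, then the scheduler...
    @collectors.each(&:reset_after_fork)
    @scheduler.reset_after_fork

    # ...and only after every reset has finished is anything restarted,
    # so no reset can race with a component that is already running.
    @collectors.each(&:start)
    @scheduler.start
  end
end

log = []
profiler = SketchProfiler.new(
  collectors: [SketchComponent.new("collector", log)],
  scheduler: SketchComponent.new("scheduler", log)
)
profiler.restart_after_fork
```

Because nothing is running while the resets happen, each `#reset_after_fork` in this sketch can be written without locking.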

In the previous commit, we refactored the old profiler codepath so that the `Profiler` would call `#reset_after_fork` on collectors before restarting them in a forking process.

In this commit, we use the added hook to propagate the `#reset_after_fork` call to every component, so that they can clean up their internal state in the child process of a fork.

In particular:

* The chain starts on the `Collectors::CpuAndWallTimeWorker`, which must start by disabling its active `TracePoint`s so that something like GC doesn't trigger a sample while we're still resetting the profiler.
* Then the `Collectors::CpuAndWallTime` resets:
  a. Its per-thread tracking information. The native thread ids and CPU-time tracking of the thread that survived the fork need to be reset, and all other threads that did not survive the fork need to be cleared.
  b. Its internal stats.
* Then the `StackRecorder` resets:
  a. The active slot and its concurrency control. This is to avoid any issues if the fork happened while a serialization attempt was ongoing and left the concurrency control in an inconsistent state.
  b. The profiles, so there's no leftover data from the parent process in the child process profiles.

The `#reset_after_fork` approach made the `StackRecorder#clear` method added recently in #2362 unnecessary, so I went ahead and removed it.
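As a rough picture of the propagation chain described in this commit message, here is a pure-Ruby sketch. The real components are partly implemented in the native extension; the `Sketch*` classes, their fields, and the boolean standing in for the `TracePoint`s are all assumptions made for illustration. The sketch also simplifies the per-thread reset by dropping every entry, whereas the commit message describes resetting the surviving thread's entry and clearing the rest.

```ruby
# Hypothetical stand-in for the `StackRecorder`: two profile slots plus
# a marker for which one is active.
class SketchStackRecorder
  attr_reader :active_slot, :profiles

  def initialize
    @active_slot = 1
    @profiles = { 1 => [], 2 => [] }
  end

  def record(sample)
    @profiles[@active_slot] << sample
  end

  def reset_after_fork
    @active_slot = 1                # back to a known-good slot
    @profiles.each_value(&:clear)   # no parent data in child profiles
  end
end

# Hypothetical stand-in for `Collectors::CpuAndWallTime`.
class SketchCpuAndWallTime
  attr_reader :per_thread_context, :stats

  def initialize(recorder)
    @recorder = recorder
    @per_thread_context = {}
    @stats = Hash.new(0)
  end

  def sample(thread_name)
    @per_thread_context[thread_name] ||= { cpu_time_at_previous_sample: 0 }
    @stats[:samples] += 1
    @recorder.record(thread_name)
  end

  def reset_after_fork
    @per_thread_context.clear # simplification: drop all per-thread tracking
    @stats.clear
    @recorder.reset_after_fork
  end
end

# Hypothetical stand-in for `Collectors::CpuAndWallTimeWorker`.
class SketchCpuAndWallTimeWorker
  attr_reader :tracepoints_enabled

  def initialize(collector)
    @collector = collector
    @tracepoints_enabled = true # stands in for the active `TracePoint`s
  end

  def reset_after_fork
    # Disable sampling triggers first, so nothing samples mid-reset.
    @tracepoints_enabled = false
    @collector.reset_after_fork
  end
end

recorder = SketchStackRecorder.new
collector = SketchCpuAndWallTime.new(recorder)
worker = SketchCpuAndWallTimeWorker.new(collector)

collector.sample("main")
collector.sample("background")

# One call at the top of the chain resets every component below it.
worker.reset_after_fork
```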

While working on validating that the new profiler is in good shape for Ruby apps that use `fork`, I observed that it's not safe to call `ddog_Profile_reset` without the Global VM Lock being held because that means it can be "interrupted" by a VM `fork` and left in an inconsistent state. To avoid this, I've moved the reset operation to be performed before we release the Global VM Lock.

ivoanjo · 2022-11-15T17:30:40Z

ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time.c

@@ -91,7 +91,7 @@ struct cpu_and_wall_time_collector_state {
  // is not (just) a stat.
  unsigned int sample_count;

-  struct {
+  struct stats {


This struct was previously anonymous, but I needed to name it so I could reference it from `_native_reset_after_fork`.

**What does this PR do?**: This PR adds type signatures and enables type checking for a number of profiling classes (see `Steepfile` for the list of files that are no longer ignored).

**Motivation**: This was discussed/requested in #2697.

**Additional Notes**: I wish the steep errors were a bit more user-friendly. The `StackRecorder#clear` method should not exist anymore, and would break because it called `_native_clear`, which was deleted in #2367. (Ooops) Other than that, there are a few minor changes to make the code more type-checkable, but nothing of note.

**How to test the change?**: Validate that CI is still green and typechecking passes.

ivoanjo added 5 commits November 14, 2022 17:08

Remove leftover code for resetting thread native ids

e81b17e

Skip spec that needs fork on JRuby

f54abd1

ivoanjo requested a review from a team November 15, 2022 17:29

github-actions bot added the profiling Involves Datadog profiling label Nov 15, 2022

ivoanjo commented Nov 15, 2022

marcotc approved these changes Nov 15, 2022

Base automatically changed from ivoanjo/better-profile-clearing to master November 16, 2022 08:41

ivoanjo merged commit dc92156 into master Nov 17, 2022

ivoanjo deleted the ivoanjo/prof-5943-handle-vm-forking-part2 branch November 17, 2022 08:48

github-actions bot added this to the 1.7.0 milestone Nov 17, 2022

ivoanjo mentioned this pull request Mar 17, 2023

Add type signatures for a bunch of profiling classes #2700

Merged
