Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-5943] Clear leftover state on new profiler after Ruby VM forks #2367

Merged
merged 5 commits into from
Nov 17, 2022

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Nov 15, 2022

What does this PR do?:

This PR is a follow-up from work started in #2359 and continued in #2362 for fixing and improving support for Ruby VM forks in the new profiler.

When a Ruby application forks (such as the puma webserver in multiprocess mode), we automatically restart the profiler in the child processes.

But, as part of restarting we should also make sure to reset any state that is leftover from the parent process.

This PR makes sure that, for the new profiler:

  1. Resetting the state is done in a safe way (e.g. without concurrency issues). This required a refactoring of how we do it.
  2. There is no leftover state from the parent.

While the old profiler codepath was also affected by the refactoring in this PR, it was already doing the two steps above correctly; the changes in this PR are just to make it easier to support the feature in the new profiler.

Motivation:

It's common for Ruby apps to make use of fork, and this is a configuration we expect to continue to support in the new Ruby profiler.

Additional Notes:

This PR is easier to review commit-by-commit.

How to test the change?:

Besides the included test coverage, this can be manually seen in action by validating that the profiler (both on the old codepath and the new codepath) restarts correctly on child processes after a fork and that profiles reported by those child processes are correct and do not contain left-over state from the parent.

Our `Thread` monkey patch, including `#update_native_ids` was removed
in #1740.

This code was left over in `setup.rb` since we had a fallback for
`Thread`s without the monkey patch, but effectively never runs.
For the old profiler codepath, there's 3 pieces of state that need to
be reset after a fork:

1. The profiling data inside the `OldRecorder`
2. The `@last_flush_finish_at` timestamp inside the `Exporter`
3. The cpu-time tracking inside the `Collectors::OldStack` (unchanged
   in this PR)

Previously, 1) and 2) were triggered "indirectly" after a fork
because the `Profiler` called `Scheduler#start`, which due to shared
Datadog workers code calledd `Scheduler#after_fork`, which then
called `Exporter#clear`.

This worked fine, but for the new profiler codepath, this posed a
few complications -- in particular, that cleaning up 1) and 2) could
happen after the collector had already restarted so would need to
take into account the potential concurrency issues.

To simplify the new profiler codepath, let's instead make all of the
state resetting:

a) Explicit behind a call to `#reset_after_fork` that every component
   gets
b) Sequential -- by making sure that `#reset_after_fork` gets called
   in the collectors before the scheduler AND before any components
   are restarted
In the previous commit, we refactored the old profiler codepath so that
the `Profiler` would call `#reset_after_fork` on collectors before
restarting them in a forking process.

In this commit, we use the added hook to propagate the
`#reset_after_fork` call to every component, so that they can clean up
their internal state in the child process of a fork.

In particular:
* The chain starts on the `Collectors::CpuAndWallTimeWorker` which must
  start by disabling its active `TracePoint`s so that something like
  GC doesn't trigger a sample while we're still resetting the profiler.

* Then the `Collectors::CpuAndWallTime` resets:
  a. Its per-thread tracking information. The native thread ids and
     CPU-time tracking of the thread that survived the fork need to
     be reset + all other threads that did not survive the fork need
     to be cleared.
  b. The internal stats need to be reset as well

* Then the `StackRecorder` resets:
  a. The active slot and its concurrency control. This is to avoid any
     issues if the fork happened while a serialization attempt was
     ongoing and left the concurrency control in an inconsistent
     state.
  b. The profiles, so there's no left over data from the parent
     process in the child process profiles.

The `#reset_after_fork` approach actually made the
`StackRecorder#clear` method added recently in #2362 actually not
be needed anymore, so I went ahead and removed it.
While working on validating that the new profiler is in good shape for
Ruby apps that use `fork`, I observed that it's not safe to call
`ddog_Profile_reset` without the Global VM Lock being held because
that means it can be "interrupted" by a VM `fork` and left in an
inconsistent state.

To avoid this, I've moved the reset operation to be performed before
we release the Global VM Lock.
@ivoanjo ivoanjo requested a review from a team November 15, 2022 17:29
@github-actions github-actions bot added the profiling Involves Datadog profiling label Nov 15, 2022
@@ -91,7 +91,7 @@ struct cpu_and_wall_time_collector_state {
// is not (just) a stat.
unsigned int sample_count;

struct {
struct stats {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct was previously anonymous, but I needed to name it so I could reference it from _native_reset_after_fork

Base automatically changed from ivoanjo/better-profile-clearing to master November 16, 2022 08:41
@ivoanjo ivoanjo merged commit dc92156 into master Nov 17, 2022
@ivoanjo ivoanjo deleted the ivoanjo/prof-5943-handle-vm-forking-part2 branch November 17, 2022 08:48
@github-actions github-actions bot added this to the 1.7.0 milestone Nov 17, 2022
ivoanjo added a commit that referenced this pull request Mar 17, 2023
**What does this PR do?**:

This PR adds type signatures and enables type checking for a number
of profiling classes (see `Steepfile` for list of files that are no
longer ignored).

**Motivation**:

This was discussed/requested in #2697.

**Additional Notes**:

I wish the steep errors were a bit more user-friendly.

The `StackRecorder#clear` method actually should not exist anymore,
and would actually break because it called `_native_clear` which was
deleted in #2367. (Ooops)

Other than that, there's a few minor changes to code to make code more
type-checkable, but nothing of notice.

**How to test the change?**:

Validate that CI is still green and typechecking passes.
ivoanjo added a commit that referenced this pull request Mar 20, 2023
**What does this PR do?**:

This PR adds type signatures and enables type checking for a number
of profiling classes (see `Steepfile` for list of files that are no
longer ignored).

**Motivation**:

This was discussed/requested in #2697.

**Additional Notes**:

I wish the steep errors were a bit more user-friendly.

The `StackRecorder#clear` method actually should not exist anymore,
and would actually break because it called `_native_clear` which was
deleted in #2367. (Ooops)

Other than that, there's a few minor changes to code to make code more
type-checkable, but nothing of notice.

**How to test the change?**:

Validate that CI is still green and typechecking passes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants