-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-5943] Clear leftover state on new profiler after Ruby VM forks #2367
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Our `Thread` monkey patch, including `#update_native_ids` was removed in #1740. This code was left over in `setup.rb` since we had a fallback for `Thread`s without the monkey patch, but effectively never runs.
For the old profiler codepath, there's 3 pieces of state that need to be reset after a fork: 1. The profiling data inside the `OldRecorder` 2. The `@last_flush_finish_at` timestamp inside the `Exporter` 3. The cpu-time tracking inside the `Collectors::OldStack` (unchanged in this PR) Previously, 1) and 2) were triggered "indirectly" after a fork because the `Profiler` called `Scheduler#start`, which due to shared Datadog workers code calledd `Scheduler#after_fork`, which then called `Exporter#clear`. This worked fine, but for the new profiler codepath, this posed a few complications -- in particular, that cleaning up 1) and 2) could happen after the collector had already restarted so would need to take into account the potential concurrency issues. To simplify the new profiler codepath, let's instead make all of the state resetting: a) Explicit behind a call to `#reset_after_fork` that every component gets b) Sequential -- by making sure that `#reset_after_fork` gets called in the collectors before the scheduler AND before any components are restarted
In the previous commit, we refactored the old profiler codepath so that the `Profiler` would call `#reset_after_fork` on collectors before restarting them in a forking process. In this commit, we use the added hook to propagate the `#reset_after_fork` call to every component, so that they can clean up their internal state in the child process of a fork. In particular: * The chain starts on the `Collectors::CpuAndWallTimeWorker` which must start by disabling its active `TracePoint`s so that something like GC doesn't trigger a sample while we're still resetting the profiler. * Then the `Collectors::CpuAndWallTime` resets: a. Its per-thread tracking information. The native thread ids and CPU-time tracking of the thread that survived the fork need to be reset + all other threads that did not survive the fork need to be cleared. b. The internal stats need to be reset as well * Then the `StackRecorder` resets: a. The active slot and its concurrency control. This is to avoid any issues if the fork happened while a serialization attempt was ongoing and left the concurrency control in an inconsistent state. b. The profiles, so there's no left over data from the parent process in the child process profiles. The `#reset_after_fork` approach actually made the `StackRecorder#clear` method added recently in #2362 actually not be needed anymore, so I went ahead and removed it.
While working on validating that the new profiler is in good shape for Ruby apps that use `fork`, I observed that it's not safe to call `ddog_Profile_reset` without the Global VM Lock being held because that means it can be "interrupted" by a VM `fork` and left in an inconsistent state. To avoid this, I've moved the reset operation to be performed before we release the Global VM Lock.
ivoanjo
commented
Nov 15, 2022
@@ -91,7 +91,7 @@ struct cpu_and_wall_time_collector_state { | |||
// is not (just) a stat. | |||
unsigned int sample_count; | |||
|
|||
struct { | |||
struct stats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This struct was previously anonymous, but I needed to name it so I could reference it from _native_reset_after_fork
marcotc
approved these changes
Nov 15, 2022
ivoanjo
added a commit
that referenced
this pull request
Mar 17, 2023
**What does this PR do?**: This PR adds type signatures and enables type checking for a number of profiling classes (see `Steepfile` for list of files that are no longer ignored). **Motivation**: This was discussed/requested in #2697. **Additional Notes**: I wish the steep errors were a bit more user-friendly. The `StackRecorder#clear` method actually should not exist anymore, and would actually break because it called `_native_clear` which was deleted in #2367. (Ooops) Other than that, there's a few minor changes to code to make code more type-checkable, but nothing of notice. **How to test the change?**: Validate that CI is still green and typechecking passes.
ivoanjo
added a commit
that referenced
this pull request
Mar 20, 2023
**What does this PR do?**: This PR adds type signatures and enables type checking for a number of profiling classes (see `Steepfile` for list of files that are no longer ignored). **Motivation**: This was discussed/requested in #2697. **Additional Notes**: I wish the steep errors were a bit more user-friendly. The `StackRecorder#clear` method actually should not exist anymore, and would actually break because it called `_native_clear` which was deleted in #2367. (Ooops) Other than that, there's a few minor changes to code to make code more type-checkable, but nothing of notice. **How to test the change?**: Validate that CI is still green and typechecking passes.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What does this PR do?:
This PR is a follow-up from work started in #2359 and continued in #2362 for fixing and improving support for Ruby VM forks in the new profiler.
When a Ruby application forks (such as the puma webserver in multiprocess mode), we automatically restart the profiler in the child processes.
But, as part of restarting we should also make sure to reset any state that is leftover from the parent process.
This PR makes sure that, for the new profiler:
While the old profiler codepath was also affected by the refactoring in this PR, it was already doing the two steps above correctly; the changes in this PR are just to make it easier to support the feature in the new profiler.
Motivation:
It's common for Ruby apps to make use of fork, and this is a configuration we expect to continue to support in the new Ruby profiler.
Additional Notes:
This PR is easier to review commit-by-commit.
How to test the change?:
Besides the included test coverage, this can be manually seen in action by validating that the profiler (both on the old codepath and the new codepath) restarts correctly on child processes after a fork and that profiles reported by those child processes are correct and do not contain left-over state from the parent.