-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-7307] Add support for allocation samples to ThreadContext
#2657
Conversation
This comes in handy while debugging.
**What does this PR do?**: This PR adds the `thread_context_collector_sample_allocation` API to the `ThreadContext` collector. This will be used to support allocation profiling (but right now it's not being called automatically when allocations happen, yet). **Motivation**: This is an incremental step towards supporting allocation profiling. I'm still working on the scaling/weighting strategy, and may end up adding that logic directly on the `ThreadContext`, haven't decided yet. **Additional Notes**: N/A **How to test the change?**: Change includes test coverage. End-to-end testing will come later once this is wired up in the `CpuAndWallTimeWorker`.
Codecov Report
@@ Coverage Diff @@
## master #2657 +/- ##
=======================================
Coverage 98.07% 98.07%
=======================================
Files 1166 1166
Lines 63832 63871 +39
Branches 2849 2849
=======================================
+ Hits 62603 62643 +40
+ Misses 1229 1228 -1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -1,5 +1,8 @@ | |||
$LOAD_PATH.unshift File.expand_path('..', __dir__) | |||
$LOAD_PATH.unshift File.expand_path('../lib', __dir__) | |||
|
|||
Thread.main.name = 'Thread.main' unless Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.3') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we adding this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to make it easier to debug when we have multiple threads -- if you print/list the threads it's sometimes not obvious which thread is the main one (and main is the one driving the RSpec framework). It's unrelated to the other changes (which is why I put it on a separate commit -- 67b7fa8 ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's one example where this is useful. This spec failed and printed the list of threads:
[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>]
This test failed in a flaky way in CI (<https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/9211/workflows/8431a38e-fa1d-4a93-a937-b8488b960a89/jobs/342577/tests#failed-test-0>). Interesting observation in the output: ``` Failure/Error: expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}" sample_count: 7, stats: {:trigger_sample_attempts=>8, :trigger_simulated_signal_delivery_attempts=>8, :simulated_signal_delivery=>8, :signal_handler_enqueued_sample=>8, :signal_handler_wrong_thread=>0, :sampled=>7, :sampling_time_ns_min=>122586, :sampling_time_ns_max=>2849779, :sampling_time_ns_total=>3664094, :sampling_time_ns_avg=>523442.0}, debug_failures: {:thread_list=>[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>], ``` There was at least one sample that was super slow (`sampling_time_ns_max` -- almost 3ms). Because of that, the profiler throttled sampling down (using the dynamic sampling rate component), and so did not take as many samples as we were expecting. This is by design, although it's really hard to know _why_ the profiler slowed down -- maybe the CI machine was busy? Maybe that leftover webrick thread leaked from another test was the culprit? Right now the dynamic sampling rate is by design not very configurable, so I've chosen to tweak the number of samples down yet again. If this last adjustment isn't enough, I'll bite the bullet and make it possible to disable the dynamic sampling rate (or tweak it), so as to not cause issues for this spec. (Like I mentioned in the comment -- I still think this test is worth the trouble, as it validates an important behavior of the profiler that otherwise may not get exercised in most validations we run -- profiling while idle).
**What does this PR do?** In #2657 we added support for the `Collectors::ThreadContext` to sample a given object allocation. But until now, triggering these kinds of allocations was only done in our test suite. This PR adds the missing part: actually sampling objects as Ruby allocates them. In the `Collectors::CpuAndWallTimeWorker`, we already had a `TracePoint` that is called on every new object allocation (`RUBY_INTERNAL_EVENT_NEWOBJ`), but we were only using it to count allocations. This PR changes the `TracePoint` code to actually call into the `Collectors::ThreadContext` and thus allow allocation sampling to happen. **Motivation:** This PR is another step towards the allocation profiling feature. **Additional Notes:** It's not yet possible to turn on allocation profiling via configuration. I am reserving that change for later. But it's easy to do: change the `profiling/component.rb` to use `alloc_samples_enabled: true` and `allocation_sample_every: 1`. This will result in a big overhead impact to the application. I'll be working on this in later PRs. In particular, the sampling approach we're using (`allocation_sample_every`), is only a placeholder; it will get replaced before we get to beta. This is why I don't want to add it as a real user-available setting -- because it's temporary. **How to test the change?** This change includes test coverage. You can also manually check it out, by following the instructions in the "Additional Notes". (There's a feature flag needed for this data to show up in the Datadog UX).
What does this PR do?:
This PR adds the
thread_context_collector_sample_allocation
API to theThreadContext
collector.This will be used to support allocation profiling (but right now it's not being called automatically when allocations happen, yet).
Motivation:
This is an incremental step towards supporting allocation profiling.
I'm still working on the scaling/weighting strategy, and may end up adding that logic directly on the
ThreadContext
, haven't decided yet.Additional Notes:
I'm opening this PR is on top of #2654 to avoid merge conflicts, but this PR is otherwise completely independent from it.
How to test the change?:
Change includes test coverage. End-to-end testing will come later once this is wired up in the
CpuAndWallTimeWorker
.