
[PROF-3814] Limit number of threads per sample #1699

Merged: ivoanjo merged 14 commits into master on Sep 29, 2021

Conversation

@ivoanjo (Member) commented Sep 24, 2021

Prior to this change, the profiler tried to sample all threads each time it decided to sample.

This could have a negative impact on the latency of Ruby services with a lot of threads, because each sampling pass would be quite expensive.

Just as an illustration, consider the `benchmarks/profiler_sample_loop` benchmark, which tries to sample 4 threads with very deep (500+ frame) stacks as fast as possible.

As the number of threads to sample increases, sampling performance degrades:

| Sampled threads | Iterations per second                        |
|-----------------|----------------------------------------------|
| 4 threads       | 390.921 (± 6.1%) i/s - 3.920k in 10.075624s  |
| 16 threads      |  97.037 (± 6.2%) i/s - 972.000 in 10.072298s |
| 32 threads      |  49.576 (± 4.0%) i/s - 496.000 in 10.038491s |
| 64 threads      |  24.060 (± 8.3%) i/s - 240.000 in 10.033481s |

(All numbers from my laptop: i7-1068NG7 + ruby 2.7.4p191 [x86_64-darwin19])

When combined with our dynamic sampling mechanism, if the profiler starts to take longer to sample, it just samples less often.
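
As a rough sketch of that idea (the names and the 1% target below are made up for illustration, not the actual ddtrace implementation):

```ruby
# Illustrative sketch only, not the actual ddtrace scheduler: bound
# profiling overhead by waiting longer when sampling passes are slow.
DEFAULT_INTERVAL_SECONDS = 1
TARGET_OVERHEAD_RATIO = 0.01 # illustrative: spend ~1% of time sampling

def interval_until_next_sample(last_sampling_duration_seconds)
  # E.g. if a sampling pass took 40ms, wait 4s instead of the usual 1s.
  [DEFAULT_INTERVAL_SECONDS,
   last_sampling_duration_seconds / TARGET_OVERHEAD_RATIO].max
end
```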

BUT this means that it can still impact service latency, because suddenly we may be spending 40ms+ of CPU time each time the profiler decides to sample (the 64-thread, 500-frame example above is very extreme, but at ~24 iterations per second each pass costs roughly 1/24 ≈ 42ms of CPU time), which from experience (#1511, #1522) is not good.

Instead, this PR changes the stack sampler to sample up to a set maximum of threads (16 by default), randomly selected among all threads.
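
As a minimal sketch of the selection step (illustrative names, not the exact ddtrace internals):

```ruby
# Minimal sketch (illustrative names): on each sampling pass, pick at
# most MAX_THREADS_SAMPLED threads, chosen at random from all threads.
MAX_THREADS_SAMPLED = 16

def threads_to_sample
  threads = Thread.list
  return threads if threads.size <= MAX_THREADS_SAMPLED

  # Array#sample picks without replacement, so no thread gets sampled
  # twice in the same pass.
  threads.sample(MAX_THREADS_SAMPLED)
end
```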

This means that adding more threads to the system causes very little degradation:

| Sampled threads | Iterations per second                        |
|-----------------|----------------------------------------------|
| 16 threads      | 104.859 (± 6.7%) i/s - 1.050k in 10.068781s  |
| 32 threads      | 100.957 (± 5.9%) i/s - 1.010k in 10.047250s  |
| 64 threads      |  98.098 (± 5.1%) i/s - 981.000 in 10.038244s |
| 256 threads     |  84.740 (± 8.3%) i/s - 848.000 in 10.074037s |

There's still a bit of degradation that I suspect is from pure VM overhead -- 256+ threads on a single Ruby VM adds up.

Because we pick the threads to sample randomly, we'll eventually sample all threads -- just not at once.

Finally, regarding the dynamic sampling mechanism: because the profiler no longer takes longer to sample, it will sample more often, and thus over a longer period we should gather roughly the same samples.

One downside of this approach is that if there really are many threads, the resulting wall clock times in a one-minute profile may "drift" around the 60-second mark. E.g. maybe we sampled a given thread only once per second and only 59 times, so we report 59s; but the next report includes the missing sample, so the result then becomes 61s.
I've observed 60 ± 1.68s for an app with ~65 threads, given the default maximum of 16 threads.
This seems a reasonable enough margin of error given the improvement to latency (especially on such a large application, where trying to sample all threads would have an even bigger latency impact).

Also included is a change to track wall clock time per-thread, bringing it in line with what we already did for CPU time (otherwise the resulting profile would be incorrect).

Once this limit is hit, we drop part of the profiling data to make
space for more events, which means the resulting profiles will be
biased.

(I've also created a ticket to track this issue and improve this
mechanism.)
I plan to use it to track per-thread wall clock time as well.
Previously, we kept the `last_wall_time` as an instance variable of
the Stack collector, and on every call to `#collect_events` we would
gather the current wall time, subtract from it the last wall time, and
use that as the `wall_time_interval_ns` for all sampled threads.

As preparation for being able to not sample all threads at once, we
need to start storing the last wall time per-thread, so that if
a thread is not chosen on a call to `#collect_events` to be sampled,
but is chosen on the next one, the resulting wall clock count is
still accurate.

E.g. consider a sleeping thread in a one-minute profile, where we
sample once per second BUT pick that thread only every other second.
Without this change, the thread would be assigned 30s of wall time
(incorrect); with this change, it is correctly assigned 60s.

Note that effectively we already did this for cpu time calculation,
and hence the refactoring basically just changed the wall time to
use the common `#get_elapsed_since_last_sample_and_set_value` helper
function.
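
A minimal sketch of this bookkeeping (the thread variable key and
method shape below are illustrative, not the exact ddtrace code):

```ruby
# Minimal sketch (illustrative names): store the last sampled wall
# time on the thread itself, so the interval stays correct across
# passes where the thread is not picked for sampling.
LAST_WALL_TIME_KEY = :datadog_last_wall_time_ns # illustrative key

def wall_time_interval_ns(thread, current_wall_time_ns)
  last = thread.thread_variable_get(LAST_WALL_TIME_KEY)
  thread.thread_variable_set(LAST_WALL_TIME_KEY, current_wall_time_ns)
  return 0 if last.nil? # first time we see this thread: no interval yet

  current_wall_time_ns - last
end
```
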
If it helps, I've used this trivial Ruby app for testing with many
threads:

```ruby
# Spawn 32 threads, each parked in a distinctly-named method, so that
# every thread shows a different stack in the resulting profiles.
32.times do |n|
  eval("
    def sleeper#{n}
      sleep
    end
  ")

  Thread.new { send(:"sleeper#{n}") }
end

sleep # park the main thread too
```

So it seems like our friend the `Process::Waiter` crash is back with
a twist (see #1466).

TL;DR, `Process::Waiter` is a buggy subclass of `Thread` in Ruby 2.3 to
2.6 and we need to be careful when interacting with instances of this
class on these Ruby versions.

In particular, the per-thread wall clock time accounting relies on
using `Thread#thread_variable_get` and `Thread#thread_variable_set`,
which cause instant VM segfaults on instances of `Process::Waiter`.

To work around this, let's simply not track wall-clock time for
these buggy threads.
We currently don't support JRuby (and JRuby was complaining about
our referencing ::Process::Waiter, which MRI allows).
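
A sketch of the shape of the workaround (illustrative; the actual
guard in the PR may differ):

```ruby
# Illustrative sketch: Process::Waiter only exists on MRI (hence the
# defined? guard, which also keeps JRuby happy), and its instances
# segfault on thread variable access on Ruby 2.3 to 2.6.
PROCESS_WAITER_IS_BUGGY =
  defined?(::Process::Waiter) &&
    Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.7')

def skip_wall_time_for?(thread)
  PROCESS_WAITER_IS_BUGGY && thread.is_a?(::Process::Waiter)
end
```
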
@ivoanjo ivoanjo requested a review from a team September 24, 2021 12:00
@marcotc (Member) left a comment

One question: why 16 threads?

@ivoanjo (Member, Author) commented Sep 27, 2021

> One question: why 16 threads?

Good question. This seemed a reasonable value to me for a Ruby application, given that our target is MRI -- if you're using more threads than that (and I've certainly worked on and seen such Ruby apps), then we want to be especially careful not to "make things worse".

I'll add a note to the code to mention that this value was "reasonable at time of writing", but definitely not something we should be locked into.

This was causing some tests to fail on platforms where the values
ended up as non-integer floating-point values.
@codecov-commenter commented

Codecov Report

Merging #1699 (3508e0b) into master (80af92a) will increase coverage by 0.00%.
The diff coverage is 96.21%.


@@           Coverage Diff            @@
##           master    #1699    +/-   ##
========================================
  Coverage   98.31%   98.31%            
========================================
  Files         927      928     +1     
  Lines       44617    44760   +143     
========================================
+ Hits        43864    44006   +142     
- Misses        753      754     +1     
| Impacted Files | Coverage Δ |
|----------------|------------|
| lib/ddtrace/configuration/settings.rb | 100.00% <ø> (ø) |
| lib/ddtrace/contrib/rails/framework.rb | 100.00% <ø> (ø) |
| lib/ddtrace/profiling/encoding/profile.rb | 95.65% <80.00%> (-4.35%) ⬇️ |
| lib/ddtrace/contrib/active_job/log_injection.rb | 90.00% <90.00%> (ø) |
| spec/ddtrace/profiling/collectors/stack_spec.rb | 98.62% <93.84%> (-0.43%) ⬇️ |
| ...ib/ddtrace/contrib/action_cable/instrumentation.rb | 94.28% <94.73%> (+0.53%) ⬆️ |
| lib/datadog/ci/ext/environment.rb | 99.39% <100.00%> (ø) |
| lib/ddtrace/contrib/action_cable/patcher.rb | 100.00% <100.00%> (ø) |
| ...trace/contrib/active_job/configuration/settings.rb | 100.00% <100.00%> (ø) |
| lib/ddtrace/contrib/active_job/patcher.rb | 100.00% <100.00%> (ø) |

...and 9 more.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ivoanjo (Member, Author) commented Sep 28, 2021

Third push's the charm!

Sorry for the noise; it took a bit of investigation to pin down the source of the flaky CI tests I was seeing, but it was indeed a real issue: the pprof format we use for profiles doesn't accept floating-point values for wall-clock time, yet bizarrely accepts something like 1.0. Thus, whether the test failed depended on the actual values of the system timer.

Fixed in 9559c4c, and now I have a sea of green. I've also double-checked on the reliability environment with GitLab.
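
For reference, the general shape of such a fix (a hypothetical sketch, not necessarily what 9559c4c does):

```ruby
# Hypothetical sketch: ensure wall-clock values handed to the pprof
# encoder are Integers, since the format rejects non-integral floats.
def pprof_safe(wall_time_ns)
  wall_time_ns.round # Float#round with no arguments returns an Integer
end
```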

@ivoanjo ivoanjo merged commit b1bd0c0 into master Sep 29, 2021
@ivoanjo ivoanjo deleted the ivoanjo/prof-3814-limit-number-of-threads-sampled-v2 branch September 29, 2021 09:37
@github-actions github-actions bot added this to the 0.53.0 milestone Sep 29, 2021