-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-3814] Limit number of threads per sample #1699
[PROF-3814] Limit number of threads per sample #1699
Commits on Sep 24, 2021
-
Configuration menu - View commit details
-
Copy full SHA for 3b0e617 - Browse repository at this point
Copy the full SHA 3b0e617View commit details -
Add debug note when max_events limit was hit
Once this limit is hit, we drop part of the profiling data to make space for more events, which means the resulting profiles will be biased. (I've also created a ticket to track this issue and improve this mechanism.)
Configuration menu - View commit details
-
Copy full SHA for cb577e8 - Browse repository at this point
Copy the full SHA cb577e8View commit details -
Extract interval calculation method
I plan to use it to track per-thread wall clock time as well.
Configuration menu - View commit details
-
Copy full SHA for 1c7ac81 - Browse repository at this point
Copy the full SHA 1c7ac81View commit details -
Track last wall time per-thread via thread-local variables
Previously, we kept the `last_wall_time` as an instance variable of the Stack collector, and on every call to `#collecet_events` we would gather the current wall time, subtract from it the last wall time, and use that as the `wall_time_interval_ns` for all sampled threads. As preparation for being able to not sample all threads at once, we need to start storing the last wall time per-thread, so that if a thread is not chosen on a call to `#collect_events` to be sampled, but is chosen on the next one, the resulting wall clock count is still accurate. E.g. consider if a thread is sleeping, and we sample once per second BUT only sample that thread every other second. Without this change, the thread would be assigned 30s of wall time (incorrect); with this change, it is correctly assigned 60s of wall time. Note that effectively we already did this for cpu time calculation, and hence the refactoring basically just changed the wall time to use the common `#get_elapsed_since_last_sample_and_set_value` helper function.
Configuration menu - View commit details
-
Copy full SHA for 5b5b06f - Browse repository at this point
Copy the full SHA 5b5b06fView commit details -
[PROF-3814] Limit number of threads per sample
Prior to this change, the profiler tried to sample all threads each time it decided to sample. This could have a negative impact on the latency of Ruby services with a lot of threads, because each sampling pass would be quite expensive. Just as an illustration, consider the `benchmarks/profiler_sample_loop` benchmark which tries to sample as fast as possible 4 threads with very deep (500 frames+) stacks. As the number of threads to sample increases, so does the performance of sampling degrade: | Sampled threads | Iterations per second | |-----------------|-----------------------------------------------------| | 4 threads | 390.921 (± 6.1%) i/s - 3.920k in 10.075624s | | 16 threads | 97.037 (± 6.2%) i/s - 972.000 in 10.072298s | | 32 threads | 49.576 (± 4.0%) i/s - 496.000 in 10.038491s | | 64 threads | 24.060 (± 8.3%) i/s - 240.000 in 10.033481s | (All numbers from my laptop: i7-1068NG7 + ruby 2.7.4p191 [x86_64-darwin19]) When combined with our dynamic sampling mechanism, if the profiler starts to take longer to sample, it just samples less often. BUT this means that it can impact service latency because suddenly we may be spending 40ms+ of CPU time (64 thread example + 500 frames, very extreme) which from experience (#1511, #1522) is not good. Instead, this PR changes the stack sampler to sample up to a set maximum of threads (16), randomly selected among all threads. This means that adding more threads to the system means very little degradation: | Sampled threads | Iterations per second | |-----------------|----------------------------------------------------| | 16 threads | 104.859 (± 6.7%) i/s - 1.050k in 10.068781s | | 32 threads | 100.957 (± 5.9%) i/s - 1.010k in 10.047250s | | 64 threads | 98.098 (± 5.1%) i/s - 981.000 in 10.038244s | | 256 threads | 84.740 (± 8.3%) i/s - 848.000 in 10.074037s | There's still a bit of degradation that I suspect is from pure VM overhead -- 256+ threads on a single Ruby VM adds up. Because we pick the threads to sample randomly, we'll eventually sample all threads -- just not at once. Finally, regarding the dynamic sampling mechanism, because the profiler will not take longer to sample, it will sample more often, and thus over a longer period we should take sample roughly the same samples. One downside of this approach is that if there really are many threads, the resulting wall clock times in a one minute profile may "drift" around the 60 second mark, e.g. maybe we only sampled a thread once per second and only 59 times, so we'll report 59s, but on the next report we'll include the missing one, so then the result will be 61s. I've observed 60 +- 1.68 secs for an app with ~65 threads, given the default maximum of 16 threads. This seems a reasonable enough margin of error given the improvement to latency (especially on such a large application! -> even bigger latency impact if we tried to sample all threads). If it helps, I've used this trivial Ruby app for testing with many threads: ```ruby 32.times do |n| eval(" def sleeper#{n} sleep end ") Thread.new { send(:"sleeper#{n}") } end sleep ```
Configuration menu - View commit details
-
Copy full SHA for ea6e5f7 - Browse repository at this point
Copy the full SHA ea6e5f7View commit details -
Workaround Process::Waiter crash due to per-thread wall clock time ac…
…counting So it seems like our friend the `Process::Waiter` crash is back with a twist (see #1466). TL;DR, `Process::Waiter` is a buggy subclass of `Thread` in Ruby 2.3 to 2.6 and we need to be careful when interacting with instances of this class on these Ruby versions. In particular, the per-thread wall clock time accounting relies on using `Thread#thread_variable_get` and `Thread#thread_variable_set`, which cause instant VM segfaults on instances of `Process::Waiter`. To work around this, let's just not care for wall-clock time from these buggy threads.
Configuration menu - View commit details
-
Copy full SHA for d308809 - Browse repository at this point
Copy the full SHA d308809View commit details -
Do not run Collectors::Stack specs on JRuby
We currently don't support JRuby (and JRuby was complaining about our referencing ::Process::Waiter, which MRI allows).
Configuration menu - View commit details
-
Copy full SHA for 2268a7a - Browse repository at this point
Copy the full SHA 2268a7aView commit details
Commits on Sep 27, 2021
-
Configuration menu - View commit details
-
Copy full SHA for 64128d8 - Browse repository at this point
Copy the full SHA 64128d8View commit details -
Configuration menu - View commit details
-
Copy full SHA for f2d8b83 - Browse repository at this point
Copy the full SHA f2d8b83View commit details -
Fail profiler integration tests if profiler isn't loaded correctly
Otherwise the specs will fail with other cryptic errors.
Configuration menu - View commit details
-
Copy full SHA for e86a735 - Browse repository at this point
Copy the full SHA e86a735View commit details
Commits on Sep 28, 2021
-
Configuration menu - View commit details
-
Copy full SHA for d7a7e56 - Browse repository at this point
Copy the full SHA d7a7e56View commit details -
Ensure that wall clock timestamps are integers
This was causing some tests to fail in platforms where the values ended up as non-integer floating point values.
Configuration menu - View commit details
-
Copy full SHA for 9559c4c - Browse repository at this point
Copy the full SHA 9559c4cView commit details -
Configuration menu - View commit details
-
Copy full SHA for 3508e0b - Browse repository at this point
Copy the full SHA 3508e0bView commit details -
Configuration menu - View commit details
-
Copy full SHA for 216daa4 - Browse repository at this point
Copy the full SHA 216daa4View commit details