
[PROF-3814] Limit number of threads per sample #1699

Merged
merged 14 commits on Sep 29, 2021

Commits on Sep 24, 2021

  1. 3b0e617
  2. Add debug note when max_events limit was hit

    Once this limit is hit, we drop part of the profiling data to make
    space for more events, which means the resulting profiles will be
    biased.
    
    (I've also created a ticket to track this issue and improve this
    mechanism.)
    ivoanjo committed Sep 24, 2021
    cb577e8
  3. Extract interval calculation method

    I plan to use it to track per-thread wall clock time as well.
    ivoanjo committed Sep 24, 2021
    1c7ac81
  4. Track last wall time per-thread via thread-local variables

    Previously, we kept the `last_wall_time` as an instance variable of
    the Stack collector, and on every call to `#collect_events` we would
    gather the current wall time, subtract from it the last wall time, and
    use that as the `wall_time_interval_ns` for all sampled threads.
    
    As preparation for being able to not sample all threads at once, we
    need to start storing the last wall time per-thread, so that if
    a thread is not chosen on a call to `#collect_events` to be sampled,
    but is chosen on the next one, the resulting wall clock count is
    still accurate.
    
    E.g. consider a thread that sleeps for a full minute, while we sample
    once per second BUT only sample that thread every other second. Without
    this change, the thread would be assigned 30s of wall time (incorrect);
    with this change, it is correctly assigned 60s of wall time.
    
    Note that we effectively already did this for the cpu time calculation,
    and hence the refactoring basically just changed the wall time
    calculation to use the common
    `#get_elapsed_since_last_sample_and_set_value` helper function.
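
    A minimal sketch of this idea -- the helper name comes from this
    refactoring, but the exact signature and thread variable key below are
    only illustrative assumptions, not the actual collector code:

    ```ruby
    # Stores the current reading in a thread variable and returns how much
    # has elapsed since the previous reading (0 on a thread's first sample).
    def get_elapsed_since_last_sample_and_set_value(thread, key, current_value)
      last_value = thread.thread_variable_get(key) || current_value
      thread.thread_variable_set(key, current_value)
      current_value - last_value
    end

    Thread.list.each do |thread|
      wall_time_now_ns = (Time.now.to_f * 1_000_000_000).to_i
      # Per-thread interval, accurate even if this thread was skipped on
      # previous sampling passes.
      wall_time_interval_ns =
        get_elapsed_since_last_sample_and_set_value(thread, :last_wall_time_ns, wall_time_now_ns)
    end
    ```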
    ivoanjo committed Sep 24, 2021
    5b5b06f
  5. [PROF-3814] Limit number of threads per sample

    Prior to this change, the profiler tried to sample all threads each
    time it decided to sample.
    
    This could have a negative impact on the latency of Ruby services with
    a lot of threads, because each sampling pass would be quite expensive.
    
    Just as an illustration, consider the
    `benchmarks/profiler_sample_loop` benchmark, which tries to sample 4
    threads with very deep (500+ frame) stacks as fast as possible.
    
    As the number of threads to sample increases, sampling performance
    degrades:
    
    | Sampled threads | Iterations per second                               |
    |-----------------|-----------------------------------------------------|
    | 4 threads       | 390.921  (± 6.1%) i/s -      3.920k in  10.075624s  |
    | 16 threads      |  97.037  (± 6.2%) i/s -    972.000  in  10.072298s  |
    | 32 threads      |  49.576  (± 4.0%) i/s -    496.000  in  10.038491s  |
    | 64 threads      |  24.060  (± 8.3%) i/s -    240.000  in  10.033481s  |
    
    (All numbers from my laptop: i7-1068NG7 + ruby 2.7.4p191 [x86_64-darwin19])
    
    When combined with our dynamic sampling mechanism, if the profiler
    starts to take longer to sample, it just samples less often.
    
    BUT this means that it can impact service latency, because a single
    sampling pass may suddenly consume 40ms+ of CPU time (the 64-thread,
    500-frame case above is admittedly extreme), which from experience
    (#1511, #1522) is not good.
    
    Instead, this PR changes the stack sampler to sample up to a set
    maximum of threads (16), randomly selected among all threads.
    
    This means that adding more threads to the system causes very little
    degradation:
    
    | Sampled threads | Iterations per second                              |
    |-----------------|----------------------------------------------------|
    | 16 threads      | 104.859  (± 6.7%) i/s -      1.050k in  10.068781s |
    | 32 threads      | 100.957  (± 5.9%) i/s -      1.010k in  10.047250s |
    | 64 threads      |  98.098  (± 5.1%) i/s -    981.000  in  10.038244s |
    | 256 threads     |  84.740  (± 8.3%) i/s -    848.000  in  10.074037s |
    
    There's still a bit of degradation that I suspect is from pure VM
    overhead -- 256+ threads on a single Ruby VM adds up.
    
    Because we pick the threads to sample randomly, we'll eventually sample
    all threads -- just not at once.
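
    As a rough sketch of the selection step (the method name below is
    illustrative and the constant just mirrors the default described above,
    so this is an assumption rather than the actual collector code):

    ```ruby
    MAX_THREADS_SAMPLED = 16

    def threads_to_sample
      threads = Thread.list
      return threads if threads.size <= MAX_THREADS_SAMPLED

      # Array#sample picks distinct random elements, so across many sampling
      # passes every thread eventually gets picked -- just not all at once.
      threads.sample(MAX_THREADS_SAMPLED)
    end
    ```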
    
    Finally, regarding the dynamic sampling mechanism: because the profiler
    no longer takes longer to sample, it will sample more often, and thus
    over a longer period we should end up taking roughly the same number of
    samples.
    
    One downside of this approach is that if there really are many threads,
    the resulting wall clock times in a one-minute profile may "drift" around
    the 60-second mark, e.g. maybe we sampled a given thread roughly once per
    second but only 59 times, so we'll report 59s, and the next report will
    include the missing sample, so its result will be 61s.
    I've observed 60 +- 1.68 secs for an app with ~65 threads, given the
    default maximum of 16 threads.
    This seems a reasonable enough margin of error given the improvement to
    latency (especially on such a large application! -> even bigger latency
    impact if we tried to sample all threads).
    
    If it helps, I've used this trivial Ruby app for testing with many
    threads:
    
    ```ruby
    32.times do |n|
      eval("
        def sleeper#{n}
          sleep
        end
      ")
    
      Thread.new { send(:"sleeper#{n}") }
    end
    
    sleep
    ```
    ivoanjo committed Sep 24, 2021
    ea6e5f7
  6. Workaround Process::Waiter crash due to per-thread wall clock time accounting
    
    So it seems like our friend the `Process::Waiter` crash is back with
    a twist (see #1466).
    
    TL;DR, `Process::Waiter` is a buggy subclass of `Thread` in Ruby 2.3 to
    2.6 and we need to be careful when interacting with instances of this
    class on these Ruby versions.
    
    In particular, the per-thread wall clock time accounting relies on
    using `Thread#thread_variable_get` and `Thread#thread_variable_set`,
    which cause instant VM segfaults on instances of `Process::Waiter`.
    
    To work around this, let's simply not track wall-clock time for these
    buggy threads.
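
    A minimal sketch of the kind of guard this implies, reusing the helper
    sketched earlier (the detection check and version test are assumptions,
    not necessarily the exact code in this commit):

    ```ruby
    # On Ruby 2.3 to 2.6, Process::Waiter instances crash the VM when thread
    # variables are used, so skip wall-clock accounting for those threads.
    ON_BUGGY_WAITER_RUBY =
      Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.7')

    def wall_time_interval_ns_for(thread, wall_time_now_ns)
      if ON_BUGGY_WAITER_RUBY && defined?(::Process::Waiter) && thread.is_a?(::Process::Waiter)
        return 0
      end

      get_elapsed_since_last_sample_and_set_value(thread, :last_wall_time_ns, wall_time_now_ns)
    end
    ```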
    ivoanjo committed Sep 24, 2021
    d308809
  7. Do not run Collectors::Stack specs on JRuby

    We currently don't support JRuby (and JRuby was complaining about
    our referencing ::Process::Waiter, which MRI allows).
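
    For illustration, one common way to express this kind of exclusion in an
    RSpec suite (the exact mechanism used here may differ):

    ```ruby
    RSpec.describe 'Collectors::Stack' do
      before do
        skip 'Collectors::Stack is not supported on JRuby' if RUBY_ENGINE == 'jruby'
      end

      # ... specs ...
    end
    ```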
    ivoanjo committed Sep 24, 2021
    2268a7a

Commits on Sep 27, 2021

  1. 64128d8
  2. f2d8b83
  3. Fail profiler integration tests if profiler isn't loaded correctly

    Otherwise the specs will fail with other cryptic errors.
    ivoanjo committed Sep 27, 2021
    e86a735

Commits on Sep 28, 2021

  1. d7a7e56
  2. Ensure that wall clock timestamps are integers

    This was causing some tests to fail on platforms where the values ended
    up as non-integer floating point numbers.
    ivoanjo committed Sep 28, 2021
    9559c4c
  3. Minor rubocop cleanup

    ivoanjo committed Sep 28, 2021
    3508e0b
  4. Fix broken English in spec

    ivoanjo committed Sep 28, 2021
    216daa4