Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[PROF-3814] Limit number of threads per sample
Prior to this change, the profiler tried to sample all threads each time it decided to sample. This could have a negative impact on the latency of Ruby services with a lot of threads, because each sampling pass would be quite expensive. Just as an illustration, consider the `benchmarks/profiler_sample_loop` benchmark which tries to sample as fast as possible 4 threads with very deep (500 frames+) stacks. As the number of threads to sample increases, so does the performance of sampling degrade: | Sampled threads | Iterations per second | |-----------------|-----------------------------------------------------| | 4 threads | 390.921 (± 6.1%) i/s - 3.920k in 10.075624s | | 16 threads | 97.037 (± 6.2%) i/s - 972.000 in 10.072298s | | 32 threads | 49.576 (± 4.0%) i/s - 496.000 in 10.038491s | | 64 threads | 24.060 (± 8.3%) i/s - 240.000 in 10.033481s | (All numbers from my laptop: i7-1068NG7 + ruby 2.7.4p191 [x86_64-darwin19]) When combined with our dynamic sampling mechanism, if the profiler starts to take longer to sample, it just samples less often. BUT this means that it can impact service latency because suddenly we may be spending 40ms+ of CPU time (64 thread example + 500 frames, very extreme) which from experience (#1511, #1522) is not good. Instead, this PR changes the stack sampler to sample up to a set maximum of threads (16), randomly selected among all threads. This means that adding more threads to the system means very little degradation: | Sampled threads | Iterations per second | |-----------------|----------------------------------------------------| | 16 threads | 104.859 (± 6.7%) i/s - 1.050k in 10.068781s | | 32 threads | 100.957 (± 5.9%) i/s - 1.010k in 10.047250s | | 64 threads | 98.098 (± 5.1%) i/s - 981.000 in 10.038244s | | 256 threads | 84.740 (± 8.3%) i/s - 848.000 in 10.074037s | There's still a bit of degradation that I suspect is from pure VM overhead -- 256+ threads on a single Ruby VM adds up. Because we pick the threads to sample randomly, we'll eventually sample all threads -- just not at once. Finally, regarding the dynamic sampling mechanism, because the profiler will not take longer to sample, it will sample more often, and thus over a longer period we should take sample roughly the same samples.
- Loading branch information