More profiler pprof encoding performance improvements #1512
Conversation
One of the things I've learned since I started doing performance work in the profiler is how expensive it is to hash arrays of items, as `MessageSet` (`ObjectSet`) does. While looking for improvements to pprof encoding, I realized that the locations `MessageSet` was implementing a hashmap of `Profiling::BacktraceLocation`s to pprof `Location` objects in a rather roundabout way -- it used the full contents of the `BacktraceLocation` objects as a key, rather than the object itself. By replacing this arrangement with a regular Ruby hash that uses the `BacktraceLocation` as a key, I was able to considerably improve the performance of pprof encoding.

In the <benchmarks/profiler_submission.rb> benchmark (Ruby 2.7.3/macOS 10.15.7/i7-1068NG7) I'm getting a > 2x speedup when compared with current master (or a 3.4x speedup over dd-trace-rb 0.49.0):

```
exporter locations-hashmap 9.241 (± 0.0%) i/s - 93.000 in 10.081019s

Comparison:
exporter locations-hashmap: 9.2 i/s
exporter cached-build-location-v2 (current master): 4.4 i/s - 2.12x (± 0.00) slower
exporter baseline (before any work): 2.7 i/s - 3.42x (± 0.00) slower
```
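As a minimal sketch of the idea (the `Struct` stand-in and the `build_pprof_location` lambda below are illustrative, not the actual dd-trace-rb internals):

```ruby
# BacktraceLocation stand-in; Struct gives value-based #hash/#eql? for free.
BacktraceLocation = Struct.new(:path, :lineno, :base_label)

# Hypothetical stand-in for the real pprof Location builder.
build_pprof_location = ->(bl) { { path: bl.path, line: bl.lineno, function: bl.base_label } }

# A plain Ruby Hash keyed on the BacktraceLocation object itself: the
# default block lazily builds and caches the pprof Location on first
# lookup, with no MessageSet-style hashing of the full contents as a key.
locations = Hash.new { |hash, bl| hash[bl] = build_pprof_location.call(bl) }

loc = BacktraceLocation.new('foo.rb', 42, 'bar')
locations[loc]     # builds the pprof Location and caches it
locations[loc.dup] # equal contents => same hash key => cache hit
```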
When trying to read values out of protobuf objects, there are no regular getters -- only `[]` and `method_missing`. Using `location['id']` instead of `location.id` yields a small but noticeable speedup in the <benchmarks/profiler_submission.rb> benchmark (Ruby 2.7.3/macOS 10.15.7/i7-1068NG7):

```
exporter locations-avoid-method-missing 9.944 (± 0.0%) i/s - 100.000 in 10.065687s

Comparison:
exporter locations-avoid-method-missing: 9.9 i/s
exporter locations-hashmap: 9.2 i/s - 1.08x (± 0.00) slower
exporter cached-build-location-v2 (current master): 4.4 i/s - 2.28x (± 0.00) slower
exporter baseline (before any work): 2.7 i/s - 3.68x (± 0.00) slower
```
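In code, the difference is just the accessor (a sketch; assumes `location` is a generated pprof `Location` message from the google-protobuf gem):

```ruby
location.id    # dispatched through method_missing on these generated objects
location['id'] # direct [] lookup, skipping the method_missing hop
```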
The performance improvement here is negligible, but why not:

```
exporter objectset-yield 10.028 (± 0.0%) i/s - 101.000 in 10.079207s

Comparison:
exporter objectset-yield: 10.0 i/s
exporter locations-avoid-method-missing: 9.9 i/s - 1.01x (± 0.00) slower
exporter locations-hashmap: 9.2 i/s - 1.09x (± 0.00) slower
exporter cached-build-location-v2 (current master): 4.4 i/s - 2.30x (± 0.00) slower
exporter baseline (before any work): 2.7 i/s - 3.72x (± 0.00) slower
```
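The "objectset-yield" name suggests the usual Ruby micro-optimization of using an implicit block with `yield` rather than capturing `&block`; a generic sketch (not the literal dd-trace-rb change):

```ruby
# Capturing &block materializes a Proc object on every call:
def fetch_with_block_capture(cache, key, &block)
  cache[key] ||= block.call(key)
end

# An implicit block with yield avoids that allocation:
def fetch_with_yield(cache, key)
  cache[key] ||= yield(key)
end

cache = {}
fetch_with_yield(cache, :a) { |k| "built #{k}" } # => "built a"
```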
I've benchmarked these improvements (together with my earlier PR #1511) and their impact on p100 latency on Discourse. On this benchmark, pprof encoding was having a non-trivial impact on latency. The spikes you see on the left occur once per minute -- on the request(s) that are concurrent with pprof encoding. With this PR + #1511 (right), there are still measurable latency spikes, but their magnitude is greatly reduced. (Datadog-internal links to full results: before, after.)

The review comment below is on this change:

```
-          "#{omitted} #{desc}",
-          &@build_location
-        )
+        locations << @locations[Profiling::BacktraceLocation.new(''.freeze, 0, "#{omitted} #{desc}")]
```
I recommend trying to remove `#freeze`, and instead adding the `frozen_string_literal` magic comment atop the file.
My experience is that the magic comment is measurably faster than invoking `#freeze`, due to its compilation to a specialized instruction.
I understand that this only works on Ruby 2.3+, but I don't think it makes sense to penalize performance for 2.3+ users, and it's also too hard to maintain two versions of this method (one with `#freeze` and the other without).
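Concretely, the suggestion amounts to something like this (a sketch mirroring the line from the diff above; the surrounding names come from that snippet, not from a self-contained program):

```ruby
# frozen_string_literal: true

# With the magic comment above (Ruby 2.3+), plain string literals such
# as '' are frozen automatically, so the explicit #freeze can be dropped:
locations << @locations[Profiling::BacktraceLocation.new('', 0, "#{omitted} #{desc}")]
```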
Changed in c01d6dd. At least for the empty string, there's no change in the benchmark:
```
exporter inline-freeze
    9.208 (± 0.0%) i/s - 92.000 in 10.001706s
exporter pragma-freeze
    9.181 (± 0.0%) i/s - 92.000 in 10.030500s

Comparison:
exporter inline-freeze: 9.2 i/s
exporter pragma-freeze: 9.2 i/s - 1.00x (± 0.00) slower
```
but I don't mind either :)
Seems to not make any difference in the <benchmarks/profiler_submission.rb> benchmark (Ruby 2.7.3/macOS 10.15.7/i7-1068NG7):

```
exporter inline-freeze 9.208 (± 0.0%) i/s - 92.000 in 10.001706s
exporter pragma-freeze 9.181 (± 0.0%) i/s - 92.000 in 10.030500s

Comparison:
exporter inline-freeze: 9.2 i/s
exporter pragma-freeze: 9.2 i/s - 1.00x (± 0.00) slower
```

but I like to use the more modern approach anyway.
Force-pushed from c01d6dd to 7c29190.
A second round of performance improvements to pprof encoding. Compared to dd-trace-rb 0.49.0, there's a 3.7x speedup, or a 2.27x speedup when compared with current master.
This takes care of the low-hanging fruit I could identify in pprof encoding.