Alloc profile performance improvements #6

Merged
merged 4 commits into from
Dec 21, 2021

Commits on Dec 21, 2021

  1. Remove file println debugging

    NHDaly committed Dec 21, 2021
    6c36541
  2. Add precompile statements to AllocProfile package

    This _drastically_ speeds up the tests, for reasons I don't exactly
    understand. I wonder if it was messing up some heuristics and deciding
    to interpret the code instead of compiling it, and had some weird
    corner cases in the interpreted code or something? I dunno!
    
    But anyway, this drastically speeds it up, so 🤷 sounds like not our
    problem 😊
    NHDaly committed Dec 21, 2021
    8da8384
  3. e314827
  4. Malloc right-sized buffers for backtraces.

    Instead of allocating a maximum-sized buffer for each backtrace, we keep
    a single max-sized buffer as a scratch space, write the backtrace to it,
    and then once we know the size, we allocate a right-sized buffer for the
    backtrace and copy it over.
    
    Benchmark results (measured time for profiling allocations on internal
    Arroyo benchmark, with `skip_every=0`):
    
    This only slightly improves the time to record an allocations profile:
    - Before: 275.082525 seconds
    - After: 245.891006 seconds
    
    But it drastically improves the memory usage once the profiling is
    completed, according to System Activity Monitor:
    - Before: 17.35 GB
    - After: 6.92 GB
    - (Compared to 350 MB for the same task without profiling)
    
    We could probably slightly improve the time overhead still further by
    using a single big vector instead of a bunch of individually allocated
    buffers, but this is probably about the best we could do in terms of
    space usage. A single vector would allow us to eliminate the redundant
    copying, and would also amortize away the allocations of the buffers,
    both of which should reduce the performance impact. But I'm guessing
    the time is mostly dominated by just how long the stack traces are, and
    there's no getting around that. At best, we could expect maybe a 2x-3x
    improvement from those changes, I think.
    NHDaly committed Dec 21, 2021
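
For readers who want a concrete picture of the right-sized-buffer idea in commit 4, here is a minimal sketch in C. It is not the code from this PR: `MAX_FRAMES`, `stored_bt_t`, `fake_capture`, `record_backtrace`, and `free_backtrace` are all hypothetical names, and `fake_capture` is only a stand-in for whatever stack walker the runtime really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FRAMES 4096 /* assumed upper bound on backtrace depth */

/* A stored backtrace: exactly `size` frames, owned by this struct. */
typedef struct {
    uintptr_t *frames;
    size_t size;
} stored_bt_t;

/* One reusable max-sized scratch buffer shared by every capture. */
static uintptr_t scratch[MAX_FRAMES];

/* Hypothetical stand-in for the real stack walker: writes up to `max`
 * frames into `out` and returns how many it wrote. */
size_t fake_capture(uintptr_t *out, size_t max, size_t depth)
{
    size_t n = depth < max ? depth : max;
    for (size_t i = 0; i < n; i++)
        out[i] = (uintptr_t)(0x1000 + i);
    return n;
}

/* Capture into the scratch buffer first; once the real length is known,
 * allocate a buffer of exactly that size and copy the frames over. */
stored_bt_t record_backtrace(size_t depth)
{
    stored_bt_t bt = {NULL, 0};
    size_t n = fake_capture(scratch, MAX_FRAMES, depth);
    assert(n <= MAX_FRAMES);
    if (n == 0)
        return bt; /* nothing captured, nothing to allocate */
    bt.frames = malloc(n * sizeof(uintptr_t));
    if (bt.frames == NULL)
        return bt; /* allocation failed: report an empty trace */
    memcpy(bt.frames, scratch, n * sizeof(uintptr_t));
    bt.size = n;
    return bt;
}

/* Release a stored backtrace and reset it to the empty state. */
void free_backtrace(stored_bt_t *bt)
{
    free(bt->frames);
    bt->frames = NULL;
    bt->size = 0;
}
```

In this sketch each stored trace costs `n * sizeof(uintptr_t)` bytes rather than `MAX_FRAMES * sizeof(uintptr_t)`, at the price of one extra `memcpy` per capture. The shared scratch buffer also means the sketch is not thread-safe as written; a real implementation would presumably need one scratch buffer per thread or a lock.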
    cae4480