Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocation profiler: get stacks for pool allocs via wrapper function #43868

Merged
merged 14 commits into from
Feb 7, 2022

Conversation

vilterp
Copy link
Contributor

@vilterp vilterp commented Jan 19, 2022

followup to #42768
fixes #43688 (gets the stacks, though not the types, of pool allocs)

Add a wrapper around jl_gc_pool_alloc to allow us to instrument it.

Before

graphviz (1)

*: instrumented; red: currently missed

graphviz
digraph G {
  node [shape=box];

  generated_code [label="<generated code>"];

  new_array;
  jl_gc_alloc_string [label="jl_gc_alloc_string *"];
  jl_gc_alloc [label="jl_gc_alloc *"];
  jl_gc_managed_malloc [label="jl_managed_malloc *"]; // https://github.com/vilterp/julia/pull/20
  jl_gc_pool_alloc;
  jl_gc_big_alloc;

  {
    rank=same;
    jl_gc_pool_alloc;
    jl_gc_managed_malloc;
    jl_gc_big_alloc;
  }

  generated_code -> jl_gc_alloc_string;
  generated_code -> new_array;
  generated_code -> jl_gc_alloc;
  generated_code -> jl_gc_big_alloc [color=red];
  generated_code -> jl_gc_pool_alloc [color=red];

  jl_gc_alloc -> jl_gc_pool_alloc;
  jl_gc_alloc -> jl_gc_big_alloc;
  new_array -> jl_gc_alloc;
  new_array -> jl_gc_managed_malloc;
  jl_gc_alloc_string -> jl_gc_pool_alloc;
  jl_gc_alloc_string -> jl_gc_big_alloc;
}

After

graphviz (2)

*: instrumented; red: currently missed; green: newly introduced function

graphviz
digraph G {
  node [shape=box];

  generated_code [label="<generated code>"];

  new_array;
  jl_gc_alloc_string [label="jl_gc_alloc_string *"];
  jl_gc_alloc [label="jl_gc_alloc *"];
  jl_gc_managed_malloc [label="jl_managed_malloc *"]; // https://github.com/vilterp/julia/pull/20
  jl_gc_pool_alloc [fillcolor="#90ed8f" style=filled label="jl_gc_pool_alloc *"];
  jl_gc_pool_alloc_noinline [fillcolor="#90ed8f" style=filled label="jl_gc_pool_alloc_noinline"];
  jl_gc_pool_alloc_inner;
  jl_gc_big_alloc;

  {
    rank=same;
    jl_gc_managed_malloc;
    jl_gc_pool_alloc_noinline;
    jl_gc_big_alloc;
  }

  generated_code -> jl_gc_alloc_string;
  generated_code -> new_array;
  generated_code -> jl_gc_alloc;
  generated_code -> jl_gc_big_alloc [color=red];
  generated_code -> jl_gc_pool_alloc;
  jl_gc_pool_alloc -> jl_gc_pool_alloc_inner  [style=dotted label="inlined"];
  jl_gc_pool_alloc_noinline -> jl_gc_pool_alloc_inner  [style=dotted  label="inlined"];


  jl_gc_alloc -> jl_gc_pool_alloc_noinline;
  jl_gc_alloc -> jl_gc_big_alloc;
  new_array -> jl_gc_alloc;
  new_array -> jl_gc_managed_malloc;
  jl_gc_alloc_string -> jl_gc_pool_alloc_noinline;
  jl_gc_alloc_string -> jl_gc_big_alloc;
}

Note that from the callers' perspectives, the code is unchanged. The only difference is that one call path now has an additional single (cold) branch, which can be enabled for profiling:

    if (__unlikely(g_alloc_profile_enabled)) {
        _maybe_record_alloc_to_profile(val, size, jl_gc_unknown_type_tag);
    }

@vilterp
Copy link
Contributor Author

vilterp commented Jan 19, 2022

Discussion carried over from vilterp#17:
@vchuravy:

I would prefer keeping the same name here, and renaming the inner internal implementation, which then also shouldn't be exported (and ideally inlined into this function).

Maybe call that one then jl_gc_pool_alloc_impl

@NHDaly:

We talked about inlining it too, but we weren't sure about how to inline it only into this function..

If we mark that one inline, it'll end up inlining into the normal callers of it, though.. So the C functions (indicated in Pete's graph, above) that calls jl_gc_pool_alloc would now call jl_gc_pool_alloc_impl instead, and then it would be inlined.

So it'd become inlined into jl_gc_alloc_ and jl_gc_alloc_string. Maybe that's okay?

Or would you want us to create yet a third separate wrapper function that gets called by those? So we would have three functions:

  • jl_gc_pool_alloc_from_generated: called from generated code
  • jl_gc_pool_alloc: called from jl_gc_alloc_ and jl_gc_alloc_string
  • jl_gc_pool_alloc_impl: called from jl_gc_pool_alloc
    ? Or is that too weird? And the extra inlining is actually fine?

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon discussion, we don't think this is doable now, until we switch to OrcJIT v2 and have swappable function stub support available.

src/gc.c Outdated Show resolved Hide resolved
vilterp and others added 5 commits January 19, 2022 16:45
@vilterp
Copy link
Contributor Author

vilterp commented Jan 20, 2022

Thanks for the info @vtjnash. I think I see how OrcJIT v2 to would make things more elegant. To make sure I understand:

  • OrcJIT v2 will allow us to swap to an instrumented function without a wrapper, and without recompiling code?
  • The approach in this PR seems bad because it introduces an additional function call to a hot codepath?

What's the expected timeline on the OrcJIT v2 migration? Is there an issue we can follow?

We'd like to merge something in the meantime, if possible… Perhaps we could get rid of the wrapper function either by

  • adding a boolean parameter instrument_alloc to jl_gc_pool_alloc which is always true from generated code, and always false from internal usages, to avoid double counting when jl_gc_pool_alloc is called from paths that are already instrumented (jl_gc_alloc)
  • giving up on recording types for the time being, and only instrumenting the leaves of the call graph above (i.e. jl_gc_pool_alloc, jl_gc_managed_malloc, and jl_gc_big_alloc)

Does either of these approaches seem viable? Thanks.

@vchuravy
Copy link
Member

OrcJIT v2 will allow us to swap to an instrumented function without a wrapper, and without recompiling code?

Yeah that would be to goal. It might even be not to hard to create a prototype for that, and would be needed for an alternative backend to the probe points in #43453.

The approach in this PR seems bad because it introduces an additional function call to a hot codepath?

Yes, this is on the performance sensitive part of the allocator (and Jameson mentioned in offline discussion that we would like to inline the pool_alloc function fully). So any changes to that hot-path need to be performance neutral. Any additional call indirection is likely to be visible (hence my comment a while back that ideally we would need to callsite inline pool_alloc).

adding a boolean parameter instrument_alloc to jl_gc_pool_alloc

Might work, but I would like to see a performance study of that. i.e. we would like to avoid using an additional register xD

giving up on recording types for the time being, and only instrumenting the leaves of the call graph above

That could actually work, but is probably a lot less useful.

@NHDaly
Copy link
Member

NHDaly commented Jan 20, 2022

Cool, thanks for the explanation! Yeah, those concerns make sense. 👍 Definitely happy to work through the perf concerns, for sure 👍 👍

@vilterp
Copy link
Contributor Author

vilterp commented Jan 20, 2022

Note from discussion w/ @JeffBezanson and @NHDaly:

We're going to try the last approach I proposed:

giving up on recording types for the time being, and only instrumenting the leaves of the call graph above

In the meantime before OrcJIT v2, we'd prefer to have 100% of stacks, even without types, than missing 40% of allocations altogether. While types give a good overview of the workload, stacks give the essential info for drilling down and optimizing.

PR coming soon.

Edit: we don't want to give up on types just yet, so we're trying an approach (diagrammed above) based on instrumenting where we still have types when possible, using (inlined) wrapper functions.

@vilterp
Copy link
Contributor Author

vilterp commented Jan 25, 2022

Working on an allocations benchmark to try to capture perf overhead of this (and other approaches we might try). Initial results haven't shown a perf impact, although it may not be a sufficiently good benchmark: JuliaCI/BaseBenchmarks.jl#291 (comment)

vilterp and others added 2 commits January 29, 2022 14:47
* add a wrapper and inline into it

* review suggestions

Co-authored-by: Nathan Daly <nhdaly@gmail.com>
@vilterp
Copy link
Contributor Author

vilterp commented Jan 31, 2022

Btw, this appears to be the spot in the Go memory allocator where they sample the stack trace if allocation profiling is enabled: https://github.com/golang/go/blob/c27a3592aec5f46ae18f7fd3d9ba18e69e2f1dcb/src/runtime/malloc.go#L1166-L1173

Nothing fancy (no compile-time flag), just an if statement (that gets branch predicted) like we're doing here.

@vilterp
Copy link
Contributor Author

vilterp commented Jan 31, 2022

@nanosoldier runbenchmarks("alloc", vs = ":master")

@NHDaly
Copy link
Member

NHDaly commented Jan 31, 2022

I'm not sure if i have the permissions to kick off nanosoldier either... I haven't had great luck with it in the past. But I'll give it a shot:
@nanosoldier runbenchmarks("alloc", vs = ":master")

@NHDaly
Copy link
Member

NHDaly commented Jan 31, 2022

@nanosoldier runbenchmarks("alloc", vs = ":master")

@NHDaly
Copy link
Member

NHDaly commented Jan 31, 2022

Okay, i should have permissions now (thanks @vchuravy and @oscardssmith ).

@nanosoldier runbenchmarks("alloc", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@NHDaly
Copy link
Member

NHDaly commented Feb 2, 2022

Okay! :) Nanosoldier has run and showed no regressions! In fact, it showed that the allocations benchmark was slightly faster with this PR, which is of course nonsense, but it means this change is in the noise, as we predicted.

Thanks for pushing for the benchmarking, that makes me feel much better knowing we've measured the impact carefully! 😊

@vtjnash and @vchuravy can you give a final review now? Thanks very much!

src/gc.c Outdated Show resolved Hide resolved
Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @vilterp for pushing this forward. This is a nice, clean implementation. 👍
(I co-authored though, of course, so we still need another review.)

Thanks!

@vilterp vilterp mentioned this pull request Feb 3, 2022
29 tasks
NHDaly added a commit that referenced this pull request Feb 4, 2022
…cations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to #43868

Finishes fixing #43688 (gets the stacks, though not the types, of big allocs).
@NHDaly
Copy link
Member

NHDaly commented Feb 7, 2022

@vtjnash: We talked offline on slack last week and you said you weren't sure if you'd be able to get to reviewing this.

I'm going to merge it now, since we've had reviews from Valentin and talked it through with you in-person! Please still feel free to leave any review feedback you have, and we can always get to it in follow up PRs. ❤️ thanks again for your help! 😊

@vtjnash vtjnash merged commit 48b7d49 into JuliaLang:master Feb 7, 2022
NHDaly added a commit that referenced this pull request Feb 7, 2022
…cations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to #43868

Finishes fixing #43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner, make it inlined
- add jl_gc_pool_alloc_noinline that calls jl_gc_big_alloc_inner
    - replace all internal calls to jl_gc_big_alloc to call this instead
- add a new jl_gc_pool_alloc that calls jl_gc_big_alloc_inner
    - This one is instrumented, and is meant to be called by generated
      code.
NHDaly added a commit that referenced this pull request Feb 8, 2022
…cations Profiler (#44043)

* Instrument jl_gc_big_alloc() when called from generated code for Allocations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to #43868

Finishes fixing #43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner, make it inlined
- add jl_gc_pool_alloc_noinline that calls jl_gc_big_alloc_inner
    - replace all internal calls to jl_gc_big_alloc to call this instead
- add a new jl_gc_pool_alloc that calls jl_gc_big_alloc_inner
    - This one is instrumented, and is meant to be called by generated
      code.

Co-authored-by: Pete Vilter <7341+vilterp@users.noreply.github.com>
NHDaly added a commit that referenced this pull request Feb 10, 2022
…cations Profiler (#44043)

* Instrument jl_gc_big_alloc() when called from generated code for Allocations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to #43868

Finishes fixing #43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner, make it inlined
- add jl_gc_pool_alloc_noinline that calls jl_gc_big_alloc_inner
    - replace all internal calls to jl_gc_big_alloc to call this instead
- add a new jl_gc_pool_alloc that calls jl_gc_big_alloc_inner
    - This one is instrumented, and is meant to be called by generated
      code.

Co-authored-by: Pete Vilter <7341+vilterp@users.noreply.github.com>
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request Feb 17, 2022
…cations Profiler (JuliaLang#44043)

* Instrument jl_gc_big_alloc() when called from generated code for Allocations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to JuliaLang#43868

Finishes fixing JuliaLang#43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner, make it inlined
- add jl_gc_pool_alloc_noinline that calls jl_gc_big_alloc_inner
    - replace all internal calls to jl_gc_big_alloc to call this instead
- add a new jl_gc_pool_alloc that calls jl_gc_big_alloc_inner
    - This one is instrumented, and is meant to be called by generated
      code.

Co-authored-by: Pete Vilter <7341+vilterp@users.noreply.github.com>
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
…cations Profiler (JuliaLang#44043)

* Instrument jl_gc_big_alloc() when called from generated code for Allocations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to JuliaLang#43868

Finishes fixing JuliaLang#43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner, make it inlined
- add jl_gc_pool_alloc_noinline that calls jl_gc_big_alloc_inner
    - replace all internal calls to jl_gc_big_alloc to call this instead
- add a new jl_gc_pool_alloc that calls jl_gc_big_alloc_inner
    - This one is instrumented, and is meant to be called by generated
      code.

Co-authored-by: Pete Vilter <7341+vilterp@users.noreply.github.com>
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
…cations Profiler (JuliaLang#44043)

* Instrument jl_gc_big_alloc() when called from generated code for Allocations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to JuliaLang#43868

Finishes fixing JuliaLang#43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner, make it inlined
- add jl_gc_pool_alloc_noinline that calls jl_gc_big_alloc_inner
    - replace all internal calls to jl_gc_big_alloc to call this instead
- add a new jl_gc_pool_alloc that calls jl_gc_big_alloc_inner
    - This one is instrumented, and is meant to be called by generated
      code.

Co-authored-by: Pete Vilter <7341+vilterp@users.noreply.github.com>
NHDaly added a commit that referenced this pull request Apr 5, 2022
…cations Profiler (#44043)

* Instrument jl_gc_big_alloc() when called from generated code for Allocations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to
allow us to instrument it in the Allocations Profiler.

Followup to #43868

Finishes fixing #43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner, make it inlined
- add jl_gc_pool_alloc_noinline that calls jl_gc_big_alloc_inner
    - replace all internal calls to jl_gc_big_alloc to call this instead
- add a new jl_gc_pool_alloc that calls jl_gc_big_alloc_inner
    - This one is instrumented, and is meant to be called by generated
      code.

Co-authored-by: Pete Vilter <7341+vilterp@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

allocs profiler: extend to record Types for all allocations - (fix UnknownType in allocations results)
6 participants