Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory profiler (continuation of #31534) #33467

Closed
wants to merge 7 commits into from
Closed

Conversation

maleadt
Copy link
Member

@maleadt maleadt commented Oct 4, 2019

Rebase of #31534. Putting it in a separate PR because the rebase was pretty painful and stuff might have broken (notably the profile buffer overflow handling).

I reverted a couple of renames and generally tried to minimize the diff to make it easier to rebase, but we can now continue any API rework on this branch of course.

cc @staticfloat

@maleadt maleadt force-pushed the tb/memprofiler branch 9 times, most recently from ec85f1d to 884c28e Compare October 4, 2019 13:43
base/error.jl Outdated Show resolved Hide resolved
src/gc.c Outdated
// so that things like Interpreter frame objects do not disappear.
jl_gc_queue_bt_buf(gc_cache, &sp, ptls2->bt_data, ptls2->bt_size);
jl_gc_queue_bt_buf(gc_cache, &sp, (uintptr_t *)jl_profile_get_data(), jl_profile_len_data());
jl_gc_queue_bt_buf(gc_cache, &sp, (uintptr_t *)jl_memprofile_get_bt_data(), jl_memprofile_len_bt_data());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done for every thread — should be moved out of the loop?

Copy link
Member

@c42f c42f Oct 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we're marking the profile data here as well as the memprofile data.

How big does the profile data buffer get? Should be worried about a performance hit here?

@maleadt maleadt force-pushed the tb/memprofiler branch 7 times, most recently from 5fa9c30 to 200452c Compare October 7, 2019 15:06
@timholy
Copy link
Member

timholy commented Oct 7, 2019

Has this addressed review comments on #31534?

@maleadt
Copy link
Member Author

maleadt commented Oct 8, 2019

Has this addressed review comments on #31534?

Not yet, there are some issues with the PR that I've been focusing on first (recent profiler/backtrace changes, memory corruption with realloc, gc analyzer false positives).

@timholy
Copy link
Member

timholy commented Oct 8, 2019

Sounds fine. Feel free to ping me if you want a re-review once that's done.


julia> @profile myfunc()
julia> Time.@profile myfunc()
Copy link
Member

@KristofferC KristofferC Oct 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can do this change so we should probably have a backwards compatible API. Something like

Profile.@memprofile
Profile.memprint

maybe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was just a quick fix to have all the functionality exposed. I'll have a look at a better api, maybe @profile target=:memory ex and have the other functions dispatch to the correct module based on the data that is being gathered.

src/gc.c Outdated
@@ -938,10 +940,44 @@ void jl_gc_track_malloced_array(jl_ptls_t ptls, jl_array_t *a) JL_NOTSAFEPOINT
ptls->heap.mallocarrays = ma;
}

void jl_gc_count_allocd(size_t sz) JL_NOTSAFEPOINT
void jl_gc_count_allocd(void * addr, size_t sz, uint16_t tag) JL_NOTSAFEPOINT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in the hot path, should be at least a static inline function.

src/memprofile.c Outdated
return memprof_alloc_data_size_max;
}

JL_DLLEXPORT int jl_memprofile_running(void) JL_NOTSAFEPOINT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this one.

@maleadt maleadt force-pushed the tb/memprofiler branch 3 times, most recently from 27688c9 to 179cdc8 Compare October 9, 2019 09:47
@maleadt
Copy link
Member Author

maleadt commented Oct 18, 2019

Reworked the C parts of the profiler based on the generalized backtrace format (left the stdlib changes out for now, while trying to share some more code and data between both profilers)
Maybe a call for comments by @c42f since you touched/wrote much of that code:

Copy link
Member

@c42f c42f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a somewhat cursory look over this. I like how you've used the backtrace buffers for serializing the allocation info; I think that makes sense. My only reservation there is whether scanning those buffers for roots will become expensive and whether there's anything we can do about it. How big do the buffers get in a typical profiling run?

Regarding the other implementation detail, it strikes me that writing the type of the allocation info is curiously non-atomic compared to writing the other data (cf, jl_memprofile_set_typeof vs jl_memprofile_track_alloc). I guess this falls out of how we currently do codegen for allocation? However I do think it's a potential sign of "trouble" (performance or code size impact) that this needs changes to codegen at all; it seems like the best outcome would be to have all the memory profiling stuff hidden away behind a single if (__unlikely(jl_memprofile_is_running())) inside some runtime GC function call(s) which are already emitted. Is that possible with some rearrangement?

@@ -119,8 +167,11 @@ end
Get the backtrace of the current exception, for use within `catch` blocks.
"""
function catch_backtrace()
bt, bt2 = ccall(:jl_get_backtrace, Ref{SimpleVector}, ())
return _reformat_bt(bt::Vector{Ptr{Cvoid}}, bt2::Vector{Any})
bt1, bt2 = ccall(:jl_get_backtrace, Ref{SimpleVector}, ())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Note that bt2 is actually redundant here because the elements of bt2 are rooted in the task's current catch stack until the current catch block exists. So we could actually change jl_get_backtrace to remove bt2 entirely. (Could be a separate PR though. I'd be happy to review a PR which just did this and some of the _reformat_bt changes and get that done in short order if you felt like pulling those out separately.)

ret = Vector{Union{InterpreterIP,Ptr{Cvoid}}}()
i, j = 1, 1
struct AllocationInfo
T::Union{Nothing,Type}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could call this field type?

// 2.3. mark any managed objects in the backtrace buffers,
// so that things like Interpreter frame objects do not disappear.
jl_gc_queue_bt_buf(gc_cache, &sp, ptls2->bt_data, ptls2->bt_size);
jl_gc_queue_bt_buf(gc_cache, &sp, (jl_bt_element_t *)jl_profile_get_data(), jl_profile_len_data());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The profile data is global right? This line should go outside the jl_n_threads loop.

#ifdef LIBOSXUNWIND
size_t rec_backtrace_ctx_dwarf(jl_bt_element_t *bt_data, size_t maxsize, bt_context_t *ctx, int add_interp_frames) JL_NOTSAFEPOINT;
size_t rec_backtrace_ctx_dwarf(jl_bt_element_t *bt_data, size_t *bt_size, size_t maxsize,
bt_context_t *ctx, int add_interp_frames) JL_NOTSAFEPOINT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Should return an int as a flag by convention, rather than size_t I think.



//
// Shared infrastructure
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to see this consolidated.

Could do with a comment about where the non-shared stuff is kept. (I assume in the signals-*.c?)

bt_entry[5].uintptr = allocsz;
// Used to "tag" this allocation within a particular domain (CPU, GPU, other)
// or within a particular allocator (Pool, std, malloc), or as a free instead.
bt_entry[6].uintptr = tag;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be more natural to put all this into the buffer ahead of the actual backtrace? That way you know that the following backtrace refers to an allocation when reading through the buffer later. (Or maybe this is known implicitly?)

(Edit: oh, I see — jl_memprofile_set_typeof() needs access to bt_entry[2].jlvalue. That does seem kind of unfortunate but I haven't read enough of what's here to know whether that's necessary.)

// The location of the data in memory, used to match allocations with deallocations.
bt_entry[3].uintptr = (uintptr_t) v;
// The time at which this happened
bt_entry[4].uintptr = jl_clock_now(); // FIXME: double to a potentially 32-bit uintptr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, you'll need to pack it into one word (#ifdef _P64) or two otherwise. I think that will do for now, but it does emphasize that we're drifting toward using the "backtrace" buffer for somewhat general purpose (but gc-aware) serialization. I don't think that's bad but it's food for thought.

//

// data structures
volatile jl_bt_element_t *bt_data_prof = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that allocation can be done in parallel we'll presumably need a lock around this profile data buffer for memory profiling. This is different from the performance profiling case which uses sampling and only touches the profile data buffer from the signal handler thread.

(Alternatively we could use separate memory profile data buffers per thread which could be merged at the end, at the cost of storage. Might be fine though if we never initialize that memory so that it can be backed lazily by the OS page table?)

assert(bt_size_cur < bt_size_max);
}

JL_DLLEXPORT void jl_memprofile_set_typeof(void * v, void * ty) JL_NOTSAFEPOINT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that jl_memprofile_set_typeof is done separately from jl_memprofile_track_alloc makes this non-atomic with respect to any locking scheme we want to use within these functions. TBH that seems likely to lead to problems and it would be great to figure out how to make writing the allocation trace and info more atomic.

@c42f
Copy link
Member

c42f commented Nov 2, 2019

My only reservation there is whether scanning those buffers for roots will become expensive and whether there's anything we can do about it.

Speculation: the buffer is append-only during a profiling run so I guess we shouldn't need to sweep the whole thing except during a full collection. Instead maintain an index to the first young object within the buffer, or some such thing.

staticfloat and others added 7 commits November 27, 2019 14:25
Also adds `bt_overflow` flag instead of spitting out messages in the
middle of profiling, to be used by client profiling code.

This change allows for better checking of stack frames that could be
incomplete due to insufficient backtrace buffer space.  Realistically, a
single truncated stack trace in the case of a sampling time profiler is
unlikely to create large problems.  However when taking backtraces for
things such as a memory profiler, it is critical that all backtraces be
accurate, and so we allow client code to be somewhat stricter here.
This adds C support for a memory profiler within the GC, tracking
locations of allocations, deallocations, etc...  It operates in a
similar manner as the time profiler with single large buffers setup
beforehand through an initialization function, reducing the need for
expensive allocations while the program being measured is running.

The memory profiler instruments the GC in all locations that the GC
statistics themselves are being modified (e.g. `gc_num.allocd` and
`gc_num.freed`) by introducing new helper functions
`jl_gc_count_{allocd,freed,reallocd}()`.  Those utility functions call
the `jl_memprofile_track_{de,}alloc()` method to register an address,
a size and a tag with the memory profiler.  We also track type
information as this can be critically helpful when debugging, and to do
so without breaking API guarantees we insert methods to set the type of
a chunk of memory after allocating it where necessary.

The tagging system allows the memory profiler to disambiguate, at
profile time, between e.g. pooled allocations and the "big" allocator.
It also allows the memory allocator to support tracking multiple "memory
domains", e.g. a GPU support package could manually call
`jl_memprofile_track_alloc()` any time a chunk of memory is allocated on
the GPU so as to use the same system.  By default, all values are
tracked, however one can set a `memprof_tag_filter` value to track only
the values you are most interested in.  (E.g. only CPU domain big
allocations)
Maybe makes it possible to get rid of set_typeof call in generated code.
@quinnj
Copy link
Member

quinnj commented Jun 16, 2020

It'd be great to have this; I'm happy to help test/push this along as much as I can if no one else can (who originally worked on it). But if someone could at least rebase, that'd be nice, since it seems there have been some backtrace changes since the changes proposed here.

@maleadt
Copy link
Member Author

maleadt commented Jun 18, 2020

Yeah sorry for leaving this PR in a bad state -- I didn't like the codegen part (ideally there shouldn't be any profiler-related code being generated all the time), and the stdlib would need some very careful redesign to not break the API of the current time profiler without just duplicating all code into a memory profiler. At that point I had found my memory leak so ran out of reasons to look at this 🙂 I can push the WIP stdlib if you're interested in reviving this though.

@c42f
Copy link
Member

c42f commented Jun 22, 2020

I didn't like the codegen part (ideally there shouldn't be any profiler-related code being generated all the time)

Agreed. One thought I had was that perhaps we could treat this more like code coverage: just start julia in a special mode where codegen inserts memory profiling code during codegen. To be honest I don't think this is ideal — better to have memory profiling dynamically available — but it would make the whole thing a lot less intrusive to normal usage and allow us to merge something soon with less concern about performance impact.

@IanButterworth
Copy link
Member

IanButterworth commented Mar 8, 2021

+1 to this being a great feature to have

Does anyone recall if this was in a working state? I tried to make on my MacOS machine but failed, as CI did

@vilterp vilterp mentioned this pull request Sep 22, 2021
29 tasks
@timholy timholy mentioned this pull request Oct 15, 2021
4 tasks
@vilterp vilterp mentioned this pull request Oct 22, 2021
7 tasks
NHDaly added a commit that referenced this pull request Jan 19, 2022
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: #33467
Complementary to garbage profiler PR: #42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from #33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@Profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <nhdaly@gmail.com>
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: JuliaLang#33467
Complementary to garbage profiler PR: JuliaLang#42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from JuliaLang#33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@Profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <nhdaly@gmail.com>
@ViralBShah
Copy link
Member

Given the recent memory profiling work - should we close this?

@maleadt maleadt closed this Feb 25, 2022
@giordano giordano deleted the tb/memprofiler branch February 26, 2022 02:54
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: JuliaLang#33467
Complementary to garbage profiler PR: JuliaLang#42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from JuliaLang#33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@Profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <nhdaly@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants