[WIP] Memory Profiler #31534
Conversation
Must do a blog post to announce this as well!

Does this replace the …

Great writeup! Could just be the text of the blog post. Since this is a diagnostic tool, I think there's a little more leeway for API breakage, for what it's worth.
```julia
of an allocated and then freed piece of memory. This function has also been known to
create highly competitive dota agents.
"""
function open_AI(a::allocation_info_t, alloc_stacktrace)
```
Why not just extend `Base.open`/`Base.close` for this?
Because I don't really like that kind of overloading. It feels like a type pun; there's nothing about this kind of "opening" and "closing" that is similar to `Base.open` or `Base.close`, and so there's no reason for things like docstrings, method tables, etc. to be shared between them.
There is definitely some overlap, but I don't fully grok what kinds of memory the … I'm not sure what the use cases are for …
Also adds a `bt_overflow` flag, to be used by client profiling code, instead of spitting out messages in the middle of profiling. This change lets clients check for stack frames that may be incomplete due to insufficient backtrace buffer space. Realistically, a single truncated stack trace from a sampling time profiler is unlikely to cause large problems. However, when taking backtraces for something like a memory profiler, it is critical that all backtraces be accurate, so we allow client code to be somewhat stricter here.
This adds C support for a memory profiler within the GC, tracking the locations of allocations, deallocations, etc. It operates in a similar manner to the time profiler, with single large buffers set up beforehand through an initialization function, reducing the need for expensive allocations while the program being measured is running.

The memory profiler instruments the GC in all locations where the GC statistics themselves are modified (e.g. `gc_num.allocd` and `gc_num.freed`) by introducing new helper functions `jl_gc_count_{allocd,freed,reallocd}()`. Those utility functions call the `jl_memprofile_track_{de,}alloc()` method to register an address, a size and a tag with the memory profiler. We also track type information, as this can be critically helpful when debugging; to do so without breaking API guarantees, we insert methods to set the type of a chunk of memory after allocating it where necessary.

The tagging system allows the memory profiler to disambiguate, at profile time, between e.g. pooled allocations and the "big" allocator. It also allows the memory allocator to support tracking multiple "memory domains": e.g. a GPU support package could manually call `jl_memprofile_track_alloc()` any time a chunk of memory is allocated on the GPU, so as to use the same system. By default, all values are tracked, but one can set a `memprof_tag_filter` value to track only the values of interest (e.g. only CPU-domain big allocations).

To disambiguate the memory and time profilers, we split them out into separate modules.
Co-Authored-By: staticfloat <staticfloat@gmail.com>
Force-pushed from 9e476b3 to a32f2a6
I really like the re-use of the time-profiling infrastructure here, very nice design.
It's been ages since I poked around the gc, and some time since I thought about profiling, so I am not perfectly suited to give feedback here. But overall I think the API needs to be a bit more layered and designed with "readout performance" in mind. I haven't played with this, but I wouldn't be surprised if just "collecting" memprofile data can take a long time; IIUC (and I may not) I think some relatively simple steps could drastically shorten that time.
EDIT: to expand on that, if you have `m` allocations that occur with the same backtrace of length `n` instruction pointers, this implementation is `O(m*n)` in both memory and time, and the construction of the stacktraces (with all the lookup) will be the slow step. If instead you aggregate into sets of unique backtraces (using just the instruction pointers, prior to the lookup) then you get something that's `O(m+n)` in the slow step. You can modify your `AllocationInfo` to allow list(s) of event times and allocation sizes so as to aggregate efficiently.
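A minimal sketch of that aggregation, assuming each recorded event carries its raw backtrace as a vector of instruction pointers (the `backtrace` field name is illustrative):

```julia
# Group events by their raw instruction-pointer vector (cheap hashing, no
# symbol lookup), then symbolicate each *unique* backtrace exactly once:
# O(m + n) in the slow step instead of O(m * n).
function aggregate_by_backtrace(events)
    groups = Dict{Vector{Ptr{Cvoid}},Vector{Int}}()
    for (i, ev) in enumerate(events)
        push!(get!(groups, ev.backtrace, Int[]), i)  # assumed `backtrace` field
    end
    # One expensive lookup per unique backtrace.
    resolved = Dict(ips => mapreduce(Base.StackTraces.lookup, vcat, ips;
                                     init = Base.StackTraces.StackFrame[])
                    for ips in keys(groups))
    return groups, resolved
end
```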
```c
// Tags applied to memory allocations to specify which domain the memory is
// stored on, and also which "kind" of memory allocator was used.
// When filtering, a filter tag value of `0xffff` means "accept everything".
// We support the "CPU", "GPU" and "External" (e.g. "other") domains.
```
Where does `mmap` fall in this scheme?
```c
}

// Next up; store allocation/type information
memprof_alloc_data[memprof_alloc_data_size].memory_location = v;
```
Maybe hoist the indexing operation for slightly faster performance?
```julia
@@ -0,0 +1,9 @@
# Compatibility shims for old users who aren't used to the `Time` and `Memory` sub-modules
function init(args...; kwargs...)
    @warn("Profile.init() is deprecated, use Profile.Time.init() or Profile.Memory.init() directly")
```
I worry this much scoping will be irritating to users. How about transitioning to an all-keyword interface?
There's also some concern about deprecations; I have the impression we're trying to avoid those. If so, it might be necessary to preserve the current interface as-is and just add to it.
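For what it's worth, a sketch of what an all-keyword interface could look like (the `memory` keyword is hypothetical, not something in this PR, and it assumes both sub-module `init` functions accept keywords):

```julia
# One keyword-driven entry point that forwards to the sub-modules, so users
# never have to spell out `Profile.Time` or `Profile.Memory` themselves.
function init(; memory::Bool = false, kwargs...)
    return memory ? Memory.init(; kwargs...) : Time.init(; kwargs...)
end
```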
```julia
Build a memory profile tag filter that will capture only events matching the given
filter parameters. This method is meant to mirror the `#define`'d `JL_MEMPROF_TAG_*`
values within `julia_internal.h`. Valid memory domains are `:cpu`, `:gpu`, `:external`
```
Better not to encourage users to open up `julia_internal.h`; better just to give the valid values here (and then you can add the more detailed info about the correspondence as a comment somewhere).
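E.g., something along these lines, keeping the values in the Julia source so the docstring can simply list them. The bit values below are placeholders, not the real `JL_MEMPROF_TAG_*` values, and would have to be kept in sync by hand:

```julia
# Placeholder bit values -- must mirror the JL_MEMPROF_TAG_* defines in C.
const MEMPROF_DOMAINS    = Dict(:cpu => 0x0001, :gpu => 0x0002, :external => 0x0004)
const MEMPROF_ALLOCATORS = Dict(:std => 0x0010, :pool => 0x0020, :big => 0x0040)

# Build a tag filter from user-facing symbols; OR results together to accept
# multiple kinds of events.
memprofile_tag_filter(domain::Symbol, allocator::Symbol) =
    UInt16(MEMPROF_DOMAINS[domain] | MEMPROF_ALLOCATORS[allocator])
```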
```julia
Create an `AllocationInfo` that represents an allocated (and not yet deallocated) chunk
of memory. This `AllocationInfo` will later be "closed" to represent the full lifecycle
of an allocated and then freed piece of memory. This function has also been known to
create highly competitive dota agents.
```
Cool, a chemistry reference 😉 https://en.wikipedia.org/wiki/DOTA_(chelator)
Less frivolously, I think this design encourages serious performance bottlenecks, see comments below in `read_and_coalesce_memprofile_data`.
""" | ||
read_and_coalesce_memprofile_data() | ||
|
||
Load in the backtrace and allocation data buffers from the C side of Julia-land, returns two |
I count 3. Spell out the meaning of `ghost_chunks` too.
```julia
@info("Opening $(chunk_id) $(a.T) $(a.allocsz)")

# Assert that we're not inserting an identical chunk
@assert !(chunk_id in keys(open_chunks_map)) "Doubly-opened memory chunk!"
```
Initially I thought, "can't this happen just from re-use of pooled memory?" Needs a comment re the deletion below.
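Something like the following near the deletion would make that invariant explicit (helper names here are illustrative, not the PR's actual code):

```julia
# A free() pops the chunk from the map, which is exactly what makes re-use of
# a pooled address safe with respect to the "Doubly-opened" assertion above.
if is_dealloc(a.tag)  # hypothetical predicate on the event tag
    chunk = pop!(open_chunks_map, chunk_id, nothing)
    if chunk === nothing
        push!(ghost_chunks, (a, bt))   # freed, but allocation was never seen
    else
        push!(closed_chunks, chunk)    # full alloc/free lifecycle recorded
    end
end
```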
```julia
# Any leftover "open" chunks are just plain old leaked memory.
open_chunks = sort(collect(values(open_chunks_map)); by = x -> x.alloc_time)
closed_chunks = sort(closed_chunks, by = x -> x.alloc_time)
```
`sort!`? (for both)
```julia
# If this `a` represents a free(), let's see if we're closing a previously opened chunk.
if !(chunk_id in keys(open_chunks_map))
    push!(ghost_chunks, (a, bt))
```
These aren't ghosts, because ghosts were once alive 😄 More like "lost souls"
```julia
get_memprofile_tag_filter() = UInt16(ccall(:jl_memprofile_tag_filter, Cint, ()))

"""
    read_and_coalesce_memprofile_data()
```
As an API this feels awkward. The name suggests too much is being crammed into one function. If some users will only want detail on the "open" allocations, then the time spent constructing stacktraces for all the closed events will be a huge unnecessary load. Might be best to (1) parcel into types, (2) look for duplicates in the backtraces, and (3) look up only unique backtraces.
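Roughly, the layering could look like this (names are illustrative, not the PR's API): a cheap raw readout, with symbolication as a separate, explicit step run only on the events the user cares about.

```julia
# Raw event as copied straight out of the C buffers -- no symbol lookup yet.
struct RawAllocEvent
    address::Ptr{Cvoid}
    allocsz::UInt64
    tag::UInt16
    backtrace::Vector{Ptr{Cvoid}}  # raw instruction pointers
end

# Expensive step, invoked only where needed (e.g. just the "open" events):
resolve(ev::RawAllocEvent) =
    mapreduce(Base.StackTraces.lookup, vcat, ev.backtrace;
              init = Base.StackTraces.StackFrame[])
```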
```c
void jl_gc_count_allocd(void * addr, size_t sz, uint16_t tag) JL_NOTSAFEPOINT
{
#ifdef JULIA_ENABLE_THREADING
    jl_atomic_fetch_add(&gc_num.allocd, sz);
```
Please don't. This kills performance. The TODO in the code was wrong and this should be made thread-local instead. The function that wants to read these values is responsible for adding everything up. The only thread synchronization needed on the allocating thread should be a relaxed atomic load and store with a normal increment.
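The suggested pattern, sketched in Julia purely for illustration (the real fix would of course live in the C runtime): each thread bumps its own counter on the hot path, and only the (rare) reader sums across threads.

```julia
# Per-thread counters: no contended atomic read-modify-write per allocation.
const allocd_per_thread = zeros(Int, Threads.nthreads())

count_allocd!(sz) = @inbounds allocd_per_thread[Threads.threadid()] += sz

# The reader is responsible for adding everything up.
total_allocd() = sum(allocd_per_thread)
```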
For my application, I needed to know the resident memory high-water mark and processing time in order to do resource/workload planning.
What I discovered using the hack above was helpful to me. On my 32GB box, performance is quick until I hit about 5GB, at which point the GC kicks in to keep my memory usage down (but increases processing time, accounting for 50-80% of the total). I could use this hack to see just how much CPU and resident memory I need per unit of work, so that I could do resource planning. Moreover, I was surprised to see just how high GC costs are for my workloads. This also let me speculate about just how important vectored representations of fixed-size data could be with regard to garbage collection costs; e.g. `Vector{SubString}` as pointers into a huge memory-mapped JSON file rather than thousands of small strings.

As a side note, I didn't see how I could open a process and have its output buffered; hence, I used a temporary file. Moreover, I didn't see how I could easily fork() the process, to have one process monitor memory and the other run the computation; hence, I used bash.
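For reference, a rough version of that measurement using only existing Base APIs (peak resident memory via `Sys.maxrss()`, GC time via `@timed`); the workload closure is just an example:

```julia
# Run a workload and report elapsed time, GC share, and peak resident memory.
function measure(work)
    val, elapsed, _, gctime = @timed work()
    gc_pct = round(100 * gctime / elapsed; digits = 1)
    println("elapsed: $(elapsed)s, gc: $(gctime)s ($(gc_pct)%), ",
            "max RSS: $(Sys.maxrss() ÷ 2^20) MiB")
    return val
end

measure(() -> [zeros(UInt8, 1 << 20) for _ in 1:100])  # ~100 MiB of churn
```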
Overview

This branch adds a Memory Profiler to the `Profile` standard library. Accordingly, it re-arranges things such that the time profile code is stored within `Profile.Time` (but `@profile` and friends are exported, so it should be a mostly non-breaking change).

Memory Profiler internals
The memory profiler works much the same as the time profiler: it creates large buffers to store event backtraces as well as metadata about each allocation/deallocation. These large buffers are later examined from Julia code, and timelines are reconstructed to pair allocations and deallocations appropriately, determining object lifetimes, objects that are left "live" at the end of the instrumented region, etc.
Allocations are coarsely classified based upon their allocation domain (`CPU`, `GPU`, `External`, with room for this to grow) and their allocator (`big`, `std` or `pool`, relating to the three allocation code paths within Julia). Because a user may be primarily interested in certain allocations and not at all interested in others (and because Julia may generate many, many allocations over the course of a computation), a rudimentary filtering system exists, where the user may set up a bitmask for domain and allocator, allowing only certain events to be collected. Note that at the time of writing, only the `CPU` domain is tested.

The recorded allocation information (represented in Julia below):
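Roughly speaking (a sketch assembled from the field descriptions below; exact names and types in the PR may differ):

```julia
# Sketch of the per-event record; field types are approximate.
struct AllocationInfo
    address::Ptr{Cvoid}  # pointer to the allocated object
    T::Any               # type of the allocated object (if known)
    time::Float64        # wallclock time of the event
    allocsz::UInt64      # size of the allocation, in bytes
    tag::UInt16          # domain | allocator | alloc-vs-dealloc bits
end
```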
- `address` represents the pointer to the object that was allocated. If it remains live at the end of the computation, the user can actually inspect the object. If it is no longer live, accessing that memory is a fun way to randomly crash Julia.
- `T` represents the type of the object that was allocated. This value (unless it is `C_NULL`) is always valid to dereference, as types are never destroyed. Type information is possibly unavailable; however, this should only be the case when an external program is directly calling Julia's GC functions and is unaware of methods such as `jl_memprofile_set_typeof(addr, T)`.
- `time` is the wallclock time at which this event occurred.
- `allocsz` is the number of bytes this allocation was made for.
- `tag` is the OR-wise combination of domain, allocator, and whether this event was an allocation or a deallocation.

Julia-facing API
The Julia side of things revolves around the macro `@memprofile ex`, which runs `gc()`, turns on memory profiling, then runs the enclosed expression, `gc()`'s again, and finally turns off memory profiling. Once this has been done, the user can analyze the buffers by running `Profile.Memory.read_and_coalesce_memprofile_data()`, which returns three values: `open_chunks`, `closed_chunks` and `ghost_chunks`, where an "open" chunk is one that was allocated but never freed, a "closed" chunk is one that was allocated and freed, and a "ghost" chunk is one that was freed but never allocated.

These outputs are all "flat" in that there is no collation of results across backtrace locations, or anything like that. There are a host of interesting visualizations and data output formats that could be built upon this information; I just do not have time to do it right now.
TODO

- Integrate `flat` -> `tree` tools for both Time and Memory profilers. The time profiler already contains some tools to convert the "flat" data arrays into a "tree" format that is substantially more readable and useful. We should be able to abstract that out a bit and create a useful "flat" -> "tree" collator that can deal with both of these kinds of data.
- Add `NEWS.md` once the user-facing API is stabilized.