Instrument GC with memory profiler implementation
This adds C support for a memory profiler within the GC, tracking the
locations of allocations, deallocations, and so on.  It operates in a
similar manner to the time profiler: single large buffers are set up
beforehand through an initialization function, avoiding expensive
allocations while the program being measured is running.
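As a rough standalone sketch of that buffer-first design (illustrative only: `memprof_record_t`, `memprof_init`, and `memprof_track` are hypothetical names, not the runtime's actual API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified model of a preallocated profiling buffer: one large
 * allocation up front, then cheap appends while profiling runs. */
typedef struct {
    void    *addr;   /* address of the tracked allocation */
    size_t   size;   /* size in bytes */
    uint16_t tag;    /* domain/allocator tag */
} memprof_record_t;

static memprof_record_t *records = NULL;
static size_t records_len = 0;   /* records written so far */
static size_t records_cap = 0;   /* capacity fixed at init time */

/* One expensive allocation before profiling starts. */
int memprof_init(size_t capacity) {
    records = malloc(capacity * sizeof(memprof_record_t));
    if (!records)
        return -1;
    records_cap = capacity;
    records_len = 0;
    return 0;
}

/* Appending a record is just a store and an index bump -- no
 * allocation, so the measured program is barely perturbed. */
int memprof_track(void *addr, size_t size, uint16_t tag) {
    if (records_len >= records_cap)
        return -1; /* buffer full; caller can warn and stop tracking */
    records[records_len++] = (memprof_record_t){ addr, size, tag };
    return 0;
}
```

When the fixed-size buffer fills, tracking simply stops rather than reallocating, mirroring how the time profiler behaves when its backtrace buffer is exhausted.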

The memory profiler instruments the GC at every location where the GC
statistics themselves are modified (e.g. `gc_num.allocd` and
`gc_num.freed`) by introducing new helper functions
`jl_gc_count_{allocd,freed,reallocd}()`.  Those utility functions call
the `jl_memprofile_track_{de,}alloc()` functions to register an address,
a size, and a tag with the memory profiler.  We also track type
information, as this can be critically helpful when debugging; to do
so without breaking API guarantees, we insert calls that set the type of
a chunk of memory after allocating it, where necessary.
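The pattern can be sketched as follows (a simplified standalone model, not the actual GC code: the real helpers live inside the GC and update `gc_num`, and the profiler bodies here are stubs that only count events):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag bits as named elsewhere in this commit. */
#define JL_MEMPROF_TAG_DOMAIN_CPU     0x0001
#define JL_MEMPROF_TAG_ALLOC_STDALLOC 0x0100

/* Stand-ins for gc_num.allocd / gc_num.freed. */
static size_t gc_allocd = 0;
static size_t gc_freed  = 0;
static size_t profiler_events = 0;

/* Stub profiler entry points; the commit's jl_memprofile_track_{de,}alloc
 * record the address, size, and tag into the profiling buffer. */
static void jl_memprofile_track_alloc(void *addr, size_t sz, uint16_t tag) {
    (void)addr; (void)sz; (void)tag;
    profiler_events++;
}
static void jl_memprofile_track_dealloc(void *addr, size_t sz, uint16_t tag) {
    (void)addr; (void)sz; (void)tag;
    profiler_events++;
}

/* Each helper updates the GC statistic and notifies the profiler in one
 * place, so every statistics update is also a profiling event. */
static void jl_gc_count_allocd(void *addr, size_t sz, uint16_t tag) {
    gc_allocd += sz;
    jl_memprofile_track_alloc(addr, sz, tag);
}
static void jl_gc_count_freed(void *addr, size_t sz, uint16_t tag) {
    gc_freed += sz;
    jl_memprofile_track_dealloc(addr, sz, tag);
}
```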

The tagging system allows the memory profiler to disambiguate, at
profile time, between e.g. pooled allocations and the "big" allocator.
It also allows the memory profiler to track multiple "memory
domains": e.g. a GPU support package could manually call
`jl_memprofile_track_alloc()` any time a chunk of memory is allocated on
the GPU, so as to use the same system.  By default, all values are
tracked; however, one can set a `memprof_tag_filter` value to track only
the values you are most interested in (e.g. only CPU-domain big
allocations).
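One plausible shape for the filter check (a sketch under assumptions: only `JL_MEMPROF_TAG_DOMAIN_CPU` and `JL_MEMPROF_TAG_ALLOC_STDALLOC` appear in this commit; the GPU and "big" tag values and the exact matching rule are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative tag layout: low bits = memory domain, high bits =
 * allocator kind.  The GPU and BIG values below are placeholders. */
#define JL_MEMPROF_TAG_DOMAIN_CPU     0x0001
#define JL_MEMPROF_TAG_DOMAIN_GPU     0x0002  /* hypothetical */
#define JL_MEMPROF_TAG_ALLOC_STDALLOC 0x0100
#define JL_MEMPROF_TAG_ALLOC_BIG      0x0200  /* hypothetical */

/* Default filter: every bit set, so all events are tracked. */
static uint16_t memprof_tag_filter = 0xFFFF;

/* An event is recorded only if all of its tag bits are enabled in the
 * filter (one plausible rule; the real implementation may differ). */
static int memprof_tag_accepts(uint16_t tag) {
    return (tag & memprof_tag_filter) == tag;
}
```

Setting the filter to `JL_MEMPROF_TAG_DOMAIN_CPU | JL_MEMPROF_TAG_ALLOC_BIG` would then drop GPU-domain and pooled events while keeping CPU big allocations.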

To disambiguate the memory and time profilers, we split them out into
separate modules.
staticfloat authored and maleadt committed Oct 4, 2019
1 parent 588795e commit ec85f1d
Showing 21 changed files with 1,875 additions and 994 deletions.
32 changes: 16 additions & 16 deletions doc/src/manual/profile.md
@@ -61,7 +61,7 @@ To see the profiling results, there is a [graphical browser](https://github.com/
available, but here we'll use the text-based display that comes with the standard library:

```julia-repl
-julia> Profile.print()
+julia> Profile.Time.print()
80 ./event.jl:73; (::Base.REPL.##1#2{Base.REPL.REPLBackend})()
80 ./REPL.jl:97; macro expansion
80 ./REPL.jl:66; eval_user_input(::Any, ::Base.REPL.REPLBackend)
@@ -130,7 +130,7 @@ collecting more samples:
```julia-repl
julia> @profile (for i = 1:100; myfunc(); end)
-julia> Profile.print()
+julia> Profile.Time.print()
[....]
3821 ./REPL[1]:2; myfunc()
3511 ./random.jl:431; rand!(::MersenneTwister, ::Array{Float64,3}, ::Int64, ::Type...
@@ -152,7 +152,7 @@ This illustrates the default "tree" dump; an alternative is the "flat" dump, whi
counts independent of their nesting:

```julia-repl
-julia> Profile.print(format=:flat)
+julia> Profile.Time.print(format=:flat)
Count File Line Function
6714 ./<missing> -1 anonymous
6714 ./REPL.jl 66 eval_user_input(::Any, ::Base.REPL.REPLBackend)
@@ -201,12 +201,12 @@ useful way to view the results.
## Accumulation and clearing

Results from [`@profile`](@ref) accumulate in a buffer; if you run multiple pieces of code under
-[`@profile`](@ref), then [`Profile.print()`](@ref) will show you the combined results. This can
-be very useful, but sometimes you want to start fresh; you can do so with [`Profile.clear()`](@ref).
+[`@profile`](@ref), then [`Profile.Time.print()`](@ref) will show you the combined results. This can
+be very useful, but sometimes you want to start fresh; you can do so with [`Profile.Time.clear()`](@ref).

## Options for controlling the display of profile results

-[`Profile.print`](@ref) has more options than we've described so far. Let's see the full declaration:
+[`Profile.Time.print`](@ref) has more options than we've described so far. Let's see the full declaration:

```julia
function print(io::IO = stdout, data = fetch(); kwargs...)
@@ -216,15 +216,15 @@ Let's first discuss the two positional arguments, and later the keyword argument

* `io` -- Allows you to save the results to a buffer, e.g. a file, but the default is to print to `stdout`
(the console).
-* `data` -- Contains the data you want to analyze; by default that is obtained from [`Profile.fetch()`](@ref),
+* `data` -- Contains the data you want to analyze; by default that is obtained from [`Profile.Time.fetch()`](@ref),
which pulls out the backtraces from a pre-allocated buffer. For example, if you want to profile
the profiler, you could say:

```julia
-data = copy(Profile.fetch())
-Profile.clear()
-@profile Profile.print(stdout, data) # Prints the previous results
-Profile.print() # Prints results from Profile.print()
+data = copy(Profile.Time.fetch())
+Profile.Time.clear()
+@profile Profile.Time.print(stdout, data) # Prints the previous results
+Profile.Time.print() # Prints results from Profile.Time.print()
```

The keyword arguments can be any combination of:
@@ -233,7 +233,7 @@ The keyword arguments can be any combination of:
with (default, `:tree`) or without (`:flat`) indentation indicating tree
structure.
* `C` -- If `true`, backtraces from C and Fortran code are shown (normally they are excluded). Try running the introductory
-example with `Profile.print(C = true)`. This can be extremely helpful in deciding whether it's
+example with `Profile.Time.print(C = true)`. This can be extremely helpful in deciding whether it's
Julia code or C code that is causing a bottleneck; setting `C = true` also improves the interpretability
of the nesting, at the cost of longer profile dumps.
* `combine` -- Some lines of code contain multiple operations; for example, `s += A[i]` contains both an array
@@ -256,13 +256,13 @@ to save to a file using a wide `displaysize` in an [`IOContext`](@ref):

```julia
open("/tmp/prof.txt", "w") do s
-Profile.print(IOContext(s, :displaysize => (24, 500)))
+Profile.Time.print(IOContext(s, :displaysize => (24, 500)))
end
```

## Configuration

-[`@profile`](@ref) just accumulates backtraces, and the analysis happens when you call [`Profile.print()`](@ref).
+[`@profile`](@ref) just accumulates backtraces, and the analysis happens when you call [`Profile.Time.print()`](@ref).
For a long-running computation, it's entirely possible that the pre-allocated buffer for storing
backtraces will be filled. If that happens, the backtraces stop but your computation continues.
As a consequence, you may miss some important profiling data (you will get a warning when that
@@ -271,8 +271,8 @@ happens).
You can obtain and configure the relevant parameters this way:

```julia
-Profile.init() # returns the current settings
-Profile.init(n = 10^7, delay = 0.01)
+Profile.Time.init() # returns the current settings
+Profile.Time.init(n = 10^7, delay = 0.01)
```

`n` is the total number of instruction pointers you can store, with a default value of `10^6`.
2 changes: 1 addition & 1 deletion src/Makefile
@@ -45,7 +45,7 @@ RUNTIME_C_SRCS := \
simplevector runtime_intrinsics precompile \
threading partr stackwalk gc gc-debug gc-pages gc-stacks method \
jlapi signal-handling safepoint jloptions timing subtype \
-crc32c
+crc32c memprofile
RUNTIME_SRCS := APInt-C runtime_ccall processor rtutils $(RUNTIME_C_SRCS)
SRCS := $(RUNTIME_SRCS)

12 changes: 10 additions & 2 deletions src/array.c
@@ -98,6 +98,7 @@ static jl_array_t *_new_array_(jl_value_t *atype, uint32_t ndims, size_t *dims,
tsz += tot;
tsz = JL_ARRAY_ALIGN(tsz, JL_SMALL_BYTE_ALIGNMENT); // align whole object
a = (jl_array_t*)jl_gc_alloc(ptls, tsz, atype);
+jl_memprofile_set_typeof(a, atype);
// No allocation or safepoint allowed after this
a->flags.how = 0;
data = (char*)a + doffs;
@@ -107,12 +108,14 @@ static jl_array_t *_new_array_(jl_value_t *atype, uint32_t ndims, size_t *dims,
else {
tsz = JL_ARRAY_ALIGN(tsz, JL_CACHE_BYTE_ALIGNMENT); // align whole object
data = jl_gc_managed_malloc(tot);
+jl_memprofile_set_typeof(data, atype);
// Allocate the Array **after** allocating the data
// to make sure the array is still young
a = (jl_array_t*)jl_gc_alloc(ptls, tsz, atype);
// No allocation or safepoint allowed after this
a->flags.how = 2;
jl_gc_track_malloced_array(ptls, a);
+jl_memprofile_set_typeof(a, atype);
if (!isunboxed || isunion)
// need to zero out isbits union array selector bytes to ensure a valid type index
memset(data, 0, tot);
@@ -334,7 +337,9 @@ JL_DLLEXPORT jl_array_t *jl_ptr_to_array_1d(jl_value_t *atype, void *data,
if (own_buffer) {
a->flags.how = 2;
jl_gc_track_malloced_array(ptls, a);
-jl_gc_count_allocd(nel*elsz + (elsz == 1 ? 1 : 0));
+jl_gc_count_allocd(a, nel*elsz + (elsz == 1 ? 1 : 0),
+                   JL_MEMPROF_TAG_DOMAIN_CPU | JL_MEMPROF_TAG_ALLOC_STDALLOC);
+jl_memprofile_set_typeof(a, atype);
}
else {
a->flags.how = 0;
@@ -401,7 +406,9 @@ JL_DLLEXPORT jl_array_t *jl_ptr_to_array(jl_value_t *atype, void *data,
if (own_buffer) {
a->flags.how = 2;
jl_gc_track_malloced_array(ptls, a);
-jl_gc_count_allocd(nel*elsz + (elsz == 1 ? 1 : 0));
+jl_gc_count_allocd(a, nel*elsz + (elsz == 1 ? 1 : 0),
+                   JL_MEMPROF_TAG_DOMAIN_CPU | JL_MEMPROF_TAG_ALLOC_STDALLOC);
+jl_memprofile_set_typeof(a, atype);
}
else {
a->flags.how = 0;
@@ -669,6 +676,7 @@ static int NOINLINE array_resize_buffer(jl_array_t *a, size_t newlen)
a->flags.how = 1;
jl_gc_wb_buf(a, a->data, nbytes);
}
+jl_memprofile_set_typeof(a->data, jl_typeof(a));
}
if (JL_ARRAY_IMPL_NUL && elsz == 1 && !isbitsunion)
memset((char*)a->data + oldnbytes - 1, 0, nbytes - oldnbytes + 1);