Description
This came up in the context of #22210, where I'm noticing a big performance hit on transpose for sparse matrices. A convenient test case comes from copying these lines to a separate file, annotating `_computecolptrs_halfperm!` with `@noinline` (not strictly necessary, since it doesn't inline on master), and then comparing the result of using either `@noinline` or `@inline` on `_distributevals_halfperm!`.
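For context, the kernel under test has a two-phase structure along the lines of the sketch below. This is a paraphrase, not the actual Base code (in particular, the real `_distributevals_halfperm!` avoids the scratch `ptrs` allocation); it is only meant to show the shape of the two phases and where the annotations are toggled.

```julia
# Paraphrased sketch of the two-phase sparse-transpose kernel; not the Base source.
@noinline function _computecolptrs_halfperm!(X::SparseMatrixCSC, A::SparseMatrixCSC)
    # Phase one: column j of X corresponds to row j of A, so count A's entries per
    # row and turn the counts into X.colptr via a cumulative sum.
    fill!(X.colptr, 0)
    @inbounds for k in 1:nnz(A)
        X.colptr[A.rowval[k] + 1] += 1
    end
    X.colptr[1] = 1
    @inbounds for j in 2:length(X.colptr)
        X.colptr[j] += X.colptr[j - 1]
    end
    return X
end

@inline function _distributevals_halfperm!(X::SparseMatrixCSC, A::SparseMatrixCSC,
                                           q::AbstractVector{<:Integer}, f)
    # Phase two: walk A's columns in the order given by q and scatter f(value) into X.
    # Toggling @inline/@noinline on this method is what produces the gap below.
    ptrs = copy(X.colptr)                # next free slot in each column of X
    @inbounds for (newrow, oldcol) in enumerate(q)
        for k in A.colptr[oldcol]:(A.colptr[oldcol + 1] - 1)
            i = A.rowval[k]              # row index in A == column index in X
            p = ptrs[i]
            X.rowval[p] = newrow
            X.nzval[p] = f(A.nzval[k])
            ptrs[i] += 1
        end
    end
    return X
end

function halfperm!(X::SparseMatrixCSC, A::SparseMatrixCSC,
                   q::AbstractVector{<:Integer}, f = identity)
    _computecolptrs_halfperm!(X, A)
    _distributevals_halfperm!(X, A, q, f)
    return X
end
```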
Demo:
```julia
A = sprand(600, 600, 0.01);
X = transpose(A);
using BenchmarkTools
```
With `@inline` on `_distributevals_halfperm!`:
```julia
julia> @benchmark halfperm!($X, $A, $(1:A.n), $(identity)) seconds=1
BenchmarkTools.Trial:
  memory estimate:  166.98 KiB
  allocs estimate:  10685
  --------------
  minimum time:     921.938 μs (0.00% GC)
  median time:      936.064 μs (0.00% GC)
  mean time:        954.923 μs (0.40% GC)
  maximum time:     1.627 ms (38.60% GC)
  --------------
  samples:          1046
  evals/sample:     1
```
With `@noinline` on `_distributevals_halfperm!`:
```julia
julia> @benchmark halfperm!($X, $A, $(1:A.n), $(identity)) seconds=1
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     23.175 μs (0.00% GC)
  median time:      23.390 μs (0.00% GC)
  mean time:        23.658 μs (0.00% GC)
  maximum time:     52.727 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
Inspection does not suggest an immediate reason for this 40x performance gap; profiling places all the blame on this line, at the function evaluation. It made me wonder whether there is some problem inlining the function call.
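For reference, the profile can be reproduced along these lines (a minimal sketch, not necessarily how I gathered it originally; the loop count is arbitrary, and `halfperm!`, `X`, `A` are assumed to be in scope from the demo above):

```julia
# Collect profile samples over repeated calls, then print the report.
Base.Profile.clear()
@profile for _ in 1:1000
    halfperm!(X, A, 1:A.n, identity)
end
Base.Profile.print()   # the samples concentrate on the line with the f(...) evaluation
```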
However, the truly bizarre part is that, with `@inline`, `@code_llvm _distributevals_halfperm!(X, A, 1:A.n, identity)` is, for all practical purposes that I can see, identical to `@code_llvm halfperm!(X, A, 1:A.n, identity)` (aside from the obvious call to `_computecolptrs_halfperm!`). I am not at all good at reading assembly, but even there the differences do not seem dramatic to me (there are some constant differences in `movq` statements that might be problematic?).
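To make that comparison easier to reproduce, the two listings can be dumped to files and diffed. This is only an illustrative sketch (the file names are made up; `code_llvm` with an `IO` argument is the programmatic form of `@code_llvm`):

```julia
# Write both LLVM IR listings to files so they can be compared outside the REPL.
# Assumes X, A, and the annotated functions from the demo above are in scope.
argtypes = Tuple{typeof(X), typeof(A), typeof(1:A.n), typeof(identity)}
open("llvm_inner.ll", "w") do io
    code_llvm(io, _distributevals_halfperm!, argtypes)
end
open("llvm_outer.ll", "w") do io
    code_llvm(io, halfperm!, argtypes)
end
# then, e.g., `diff llvm_inner.ll llvm_outer.ll` from a shell
```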
This seems really puzzling. An LLVM bug? The behavior is present at least on 0.6.0-rc3 and on master.