Opened on Aug 28, 2018
Hi all,
It's great to see all the work that's gone into the linear algebra support for 1.0! I've recently noticed one small area where I think significant performance is being left on the table: multiplication of sparse matrices with diagonal matrices. This was also mentioned last month on Discourse. Multiplication between diagonal matrices and sparse matrices is currently defined at:
julia/stdlib/SparseArrays/src/linalg.jl
Lines 129 to 136 in 255030e
These methods seem to fall back to fairly generic matrix multiplication methods rather than to any of the specialized methods defined in stdlib/LinearAlgebra/src/diagonal.jl. I've written a couple of potential replacements. The versions for right-multiplying a sparse matrix by a diagonal matrix are benchmarked below. Please let me know if there is a reason why something similar hasn't been implemented already. If it's just an oversight, I can prepare a PR to add the optimized methods. The benchmarking was run on a fresh clone of master from this morning:
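For background on why a specialized method can be so cheap: a diagonal on the right scales each column of the sparse matrix, and CSC storage keeps each column's nonzeros contiguous, so the product only needs one pass over the stored values. A quick, illustrative check of that identity (the sizes and names here are my own, not part of the proposal):

```julia
using LinearAlgebra, SparseArrays

A = sprandn(6, 6, 0.5)
D = Diagonal(randn(6))

# Right-multiplication by a Diagonal scales column j by D.diag[j].
AD = A * D
for j in 1:size(A, 2)
    @assert Vector(AD[:, j]) ≈ D.diag[j] .* Vector(A[:, j])
end
```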
julia> versioninfo()
Julia Version 1.1.0-DEV.129
Commit 59c7cedfc4 (2018-08-28 15:47 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libimf
LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
To benchmark, I wrote the following script and saved it to a file bnchmrkDiag.jl:
using LinearAlgebra,SparseArrays,BenchmarkTools
# Optimized matrix matrix product. Underlying in-place multiplication is non-allocating
function MTimesDiag(A::SparseMatrixCSC, D::Diagonal)
    T = Base.promote_op(*, eltype(D), eltype(A))
    C = LinearAlgebra.copy_oftype(A, T)
    MTimesDiag!(C, D)
end

function MTimesDiag!(A::SparseMatrixCSC, D::Diagonal)
    n = size(A, 2)
    if length(D.diag) != n
        error("Incompatible sizes")
    end
    # Right-multiplication by a Diagonal scales column j by D.diag[j].
    # In CSC storage, column j's nonzeros are nzval[colptr[j]:colptr[j+1]-1].
    for j = 1:n
        j1 = A.colptr[j]
        j2 = A.colptr[j+1] - 1
        dgj = D.diag[j]
        for k = j1:j2
            A.nzval[k] *= dgj
        end
    end
    return A
end

# Alternative implementation using rmul!. It's much better than what you currently get
# with * but doesn't compete with the above method.
function MTimesDiagRmul(A::SparseMatrixCSC, D::Diagonal)
    T = Base.promote_op(*, eltype(D), eltype(A))
    C = LinearAlgebra.copy_oftype(A, T)
    rmul!(C, D)
end
n = 1000;
B = sprandn(n,n,1e-3);
#n = 100000;
#B = sprandn(n,n,5e-4);
D = Diagonal(randn(n));
# Allocating versions
b1 = @benchmarkable $B * $D
b2 = @benchmarkable MTimesDiagRmul($B,$D)
b3 = @benchmarkable MTimesDiag($B,$D)
# In-place versions
b4 = @benchmarkable rmul!(C,$D) setup=(C = copy($B))
b5 = @benchmarkable MTimesDiag!(C,$D) setup=(C = copy($B))
Then, from the REPL, I did:
julia> include("bnchmrkDiag.jl")
Benchmark(evals=1, seconds=5.0, samples=10000)
julia> run(b1)
BenchmarkTools.Trial:
memory estimate: 24.13 KiB
allocs estimate: 10
--------------
minimum time: 750.135 ms (0.00% GC)
median time: 755.622 ms (0.00% GC)
mean time: 756.598 ms (0.00% GC)
maximum time: 763.333 ms (0.00% GC)
--------------
samples: 7
evals/sample: 1
julia> run(b2)
BenchmarkTools.Trial:
memory estimate: 120.09 KiB
allocs estimate: 30
--------------
minimum time: 36.090 μs (0.00% GC)
median time: 38.475 μs (0.00% GC)
mean time: 50.252 μs (17.26% GC)
maximum time: 41.074 ms (99.82% GC)
--------------
samples: 10000
evals/sample: 1
julia> run(b3)
BenchmarkTools.Trial:
memory estimate: 23.80 KiB
allocs estimate: 4
--------------
minimum time: 3.568 μs (0.00% GC)
median time: 4.060 μs (0.00% GC)
mean time: 10.303 μs (55.05% GC)
maximum time: 41.750 ms (99.95% GC)
--------------
samples: 10000
evals/sample: 1
julia> run(b4)
BenchmarkTools.Trial:
memory estimate: 96.30 KiB
allocs estimate: 26
--------------
minimum time: 34.150 μs (0.00% GC)
median time: 36.624 μs (0.00% GC)
mean time: 45.483 μs (16.58% GC)
maximum time: 41.385 ms (99.83% GC)
--------------
samples: 10000
evals/sample: 1
julia> run(b5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.841 μs (0.00% GC)
median time: 2.016 μs (0.00% GC)
mean time: 2.137 μs (0.00% GC)
maximum time: 21.772 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
As you can see, the simple B * D is incredibly slow compared to the others. As you would expect, the differences become more significant for larger matrices. I re-ran the benchmarks with
n = 100000;
B = sprandn(n,n,5e-4);
For reference, that gave nnz(B)/n = 50.00124. I got impatient and killed the B * D benchmark after about a minute. The results for the other methods were:
julia> run(b2)
BenchmarkTools.Trial:
memory estimate: 160.93 MiB
allocs estimate: 48
--------------
minimum time: 90.440 ms (15.96% GC)
median time: 123.624 ms (38.25% GC)
mean time: 110.487 ms (29.57% GC)
maximum time: 139.199 ms (41.40% GC)
--------------
samples: 46
evals/sample: 1
julia> run(b3)
BenchmarkTools.Trial:
memory estimate: 77.06 MiB
allocs estimate: 7
--------------
minimum time: 36.896 ms (10.34% GC)
median time: 37.501 ms (10.50% GC)
mean time: 38.607 ms (12.12% GC)
maximum time: 89.270 ms (62.05% GC)
--------------
samples: 130
evals/sample: 1
julia> run(b4)
BenchmarkTools.Trial:
memory estimate: 83.88 MiB
allocs estimate: 41
--------------
minimum time: 47.182 ms (1.35% GC)
median time: 52.857 ms (9.70% GC)
mean time: 53.374 ms (11.09% GC)
maximum time: 88.956 ms (46.19% GC)
--------------
samples: 46
evals/sample: 1
julia> run(b5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.496 ms (0.00% GC)
median time: 4.910 ms (0.00% GC)
mean time: 4.945 ms (0.00% GC)
maximum time: 6.768 ms (0.00% GC)
--------------
samples: 128
evals/sample: 1
You see about the same thing for left multiplication by a diagonal matrix.
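For completeness, here is a sketch of the analogous left-multiplication kernel. A diagonal on the left scales rows, so each stored entry is multiplied by the diagonal element for its row, looked up through rowval. This is an illustrative sketch (the name DiagTimesM! is mine), not the exact code I would put in a PR:

```julia
using LinearAlgebra, SparseArrays

# In-place D * A for CSC: entry k lives in row A.rowval[k], so it is
# scaled by D.diag[A.rowval[k]]. One pass over nzval, no allocation.
function DiagTimesM!(D::Diagonal, A::SparseMatrixCSC)
    if size(A, 1) != length(D.diag)
        error("Incompatible sizes")
    end
    for k = 1:nnz(A)
        A.nzval[k] *= D.diag[A.rowval[k]]
    end
    return A
end

A = sprandn(8, 8, 0.4)
D = Diagonal(randn(8))
C = copy(A)
DiagTimesM!(D, C)
@assert C ≈ Matrix(D) * Matrix(A)
```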
Thanks, Patrick