
Multiplication of sparse matrices with diagonal matrices is slow (fix suggested) #28934

Closed

Description

Hi all,
It's great to see all the work that's gone into the linear algebra support for 1.0! I've recently noticed one small area where I think significant performance is being left on the table: multiplication of sparse matrices with diagonal matrices. This was also mentioned last month on Discourse. Multiplication between diagonal matrices and sparse matrices is currently defined as:

function (*)(D::Diagonal, A::SparseMatrixCSC)
    T = Base.promote_op(*, eltype(D), eltype(A))
    mul!(LinearAlgebra.copy_oftype(A, T), D, A)
end
function (*)(A::SparseMatrixCSC, D::Diagonal)
    T = Base.promote_op(*, eltype(D), eltype(A))
    mul!(LinearAlgebra.copy_oftype(A, T), A, D)
end

These methods seem to fall back to fairly generic matrix-multiplication routines rather than any of the specialized methods defined in stdlib/LinearAlgebra/src/diagonal.jl. I've written a couple of potential replacements; the versions for right-multiplying a sparse matrix by a diagonal matrix are benchmarked below. Please let me know if there is a reason why something similar hasn't been implemented already. If it's just an oversight, I can prepare a PR to add the optimized methods. The benchmarks were run on a fresh clone of master from this morning:

julia> versioninfo()
Julia Version 1.1.0-DEV.129
Commit 59c7cedfc4 (2018-08-28 15:47 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)

To benchmark, I wrote the following script and saved it to a file bnchmrkDiag.jl:

using LinearAlgebra, SparseArrays, BenchmarkTools

# Optimized matrix-matrix product. The underlying in-place multiplication is non-allocating.
function MTimesDiag( A::SparseMatrixCSC, D::Diagonal )
    T = Base.promote_op(*, eltype(D), eltype(A))
    C = LinearAlgebra.copy_oftype(A, T)
    MTimesDiag!(C,D)
end

function MTimesDiag!( A::SparseMatrixCSC, D::Diagonal )

    n = size(A,2)
    if (length(D.diag) != n)
        error("Incompatible sizes")
    end

    # For A*D, column ir of A is scaled by D.diag[ir]; only the stored
    # nonzeros need to be touched.
    for ir = 1:n
        j1 = A.colptr[ir]
        j2 = A.colptr[ir+1] - 1
        dgir = D.diag[ir]
        for ic = j1:j2
            A.nzval[ic] *= dgir
        end
    end
    return A
end

# Alternative implementation using rmul!. It's much better than what you currently get
# with * but doesn't compete with the above method.
function MTimesDiagRmul(A::SparseMatrixCSC, D::Diagonal)
    T = Base.promote_op(*, eltype(D), eltype(A))
    C = LinearAlgebra.copy_oftype(A, T)
    rmul!(C, D)
end

n = 1000;
B = sprandn(n,n,1e-3);
#n = 100000;
#B = sprandn(n,n,5e-4);
D = Diagonal(randn(n));

# Allocating versions
b1 = @benchmarkable $B * $D
b2 = @benchmarkable MTimesDiagRmul($B,$D)
b3 = @benchmarkable MTimesDiag($B,$D)

# In-place versions
b4 = @benchmarkable rmul!(C,$D) setup=(C = copy($B))
b5 = @benchmarkable MTimesDiag!(C,$D) setup=(C = copy($B))
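As a quick sanity check (this was not part of the original script), the column-scaling kernel can be compared against the dense reference product. The sketch below re-declares the same loop under the illustrative name `scale_cols!` so it runs standalone:

```julia
using LinearAlgebra, SparseArrays

# Same column-scaling kernel as MTimesDiag! above: for A*D, column j of A
# is scaled by D.diag[j], touching only the stored nonzeros.
function scale_cols!(A::SparseMatrixCSC, D::Diagonal)
    n = size(A, 2)
    length(D.diag) == n || error("Incompatible sizes")
    for j = 1:n, k = A.colptr[j]:(A.colptr[j+1] - 1)
        A.nzval[k] *= D.diag[j]
    end
    return A
end

A = sprandn(50, 50, 0.1)
D = Diagonal(randn(50))
C = scale_cols!(copy(A), D)
```

Comparing `C` against `Matrix(A) * Matrix(D)` confirms that scaling the stored nonzeros column by column is equivalent to the full product.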

Then, from the REPL, I ran:

julia> include("bnchmrkDiag.jl")
Benchmark(evals=1, seconds=5.0, samples=10000)

julia> run(b1)
BenchmarkTools.Trial: 
  memory estimate:  24.13 KiB
  allocs estimate:  10
  --------------
  minimum time:     750.135 ms (0.00% GC)
  median time:      755.622 ms (0.00% GC)
  mean time:        756.598 ms (0.00% GC)
  maximum time:     763.333 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

julia> run(b2)
BenchmarkTools.Trial: 
  memory estimate:  120.09 KiB
  allocs estimate:  30
  --------------
  minimum time:     36.090 μs (0.00% GC)
  median time:      38.475 μs (0.00% GC)
  mean time:        50.252 μs (17.26% GC)
  maximum time:     41.074 ms (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(b3)
BenchmarkTools.Trial: 
  memory estimate:  23.80 KiB
  allocs estimate:  4
  --------------
  minimum time:     3.568 μs (0.00% GC)
  median time:      4.060 μs (0.00% GC)
  mean time:        10.303 μs (55.05% GC)
  maximum time:     41.750 ms (99.95% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(b4)
BenchmarkTools.Trial: 
  memory estimate:  96.30 KiB
  allocs estimate:  26
  --------------
  minimum time:     34.150 μs (0.00% GC)
  median time:      36.624 μs (0.00% GC)
  mean time:        45.483 μs (16.58% GC)
  maximum time:     41.385 ms (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(b5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.841 μs (0.00% GC)
  median time:      2.016 μs (0.00% GC)
  mean time:        2.137 μs (0.00% GC)
  maximum time:     21.772 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

As you can see, the simple B * D is incredibly slow compared to the others. As you would expect, the differences become more significant for larger matrices. I re-ran the benchmarks with:

n = 100000;
B = sprandn(n,n,5e-4);

For reference, that gave nnz(B)/n = 50.00124. I got impatient and killed the B * D benchmark after about a minute. The results for the other methods were:

julia> run(b2)
BenchmarkTools.Trial: 
  memory estimate:  160.93 MiB
  allocs estimate:  48
  --------------
  minimum time:     90.440 ms (15.96% GC)
  median time:      123.624 ms (38.25% GC)
  mean time:        110.487 ms (29.57% GC)
  maximum time:     139.199 ms (41.40% GC)
  --------------
  samples:          46
  evals/sample:     1

julia> run(b3)
BenchmarkTools.Trial: 
  memory estimate:  77.06 MiB
  allocs estimate:  7
  --------------
  minimum time:     36.896 ms (10.34% GC)
  median time:      37.501 ms (10.50% GC)
  mean time:        38.607 ms (12.12% GC)
  maximum time:     89.270 ms (62.05% GC)
  --------------
  samples:          130
  evals/sample:     1

julia> run(b4)
BenchmarkTools.Trial: 
  memory estimate:  83.88 MiB
  allocs estimate:  41
  --------------
  minimum time:     47.182 ms (1.35% GC)
  median time:      52.857 ms (9.70% GC)
  mean time:        53.374 ms (11.09% GC)
  maximum time:     88.956 ms (46.19% GC)
  --------------
  samples:          46
  evals/sample:     1

julia> run(b5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.496 ms (0.00% GC)
  median time:      4.910 ms (0.00% GC)
  mean time:        4.945 ms (0.00% GC)
  maximum time:     6.768 ms (0.00% GC)
  --------------
  samples:          128
  evals/sample:     1

You see about the same thing for left multiplication by a diagonal matrix.
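For completeness, a minimal sketch of the corresponding left-multiplication kernel (the name `DTimesM!` is illustrative and was not part of the benchmark script): for D * A, each stored entry A[i, j] is scaled by the diagonal entry for its row, D.diag[i].

```julia
using LinearAlgebra, SparseArrays

# Hypothetical left-multiplication analogue of MTimesDiag!: since D*A
# scales row i of A by D.diag[i], each stored nonzero is multiplied by the
# diagonal entry indexed by its row (A.rowval[k]).
function DTimesM!(D::Diagonal, A::SparseMatrixCSC)
    m = size(A, 1)
    length(D.diag) == m || error("Incompatible sizes")
    for k = 1:nnz(A)
        A.nzval[k] *= D.diag[A.rowval[k]]
    end
    return A
end

A = sprandn(50, 50, 0.1)
D = Diagonal(randn(50))
C = DTimesM!(D, copy(A))
```

Unlike the right-multiplication case, the row index must be looked up per nonzero rather than hoisting one diagonal entry per column, but the loop still touches only the stored values.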

Thanks, Patrick
