
[WIP] improved dense-sparse and sparse-dense matrix multiplication kernels #24045

Closed
wants to merge 2 commits

Conversation

Sacha0
Member

@Sacha0 Sacha0 commented Oct 7, 2017

This pull request provides improved dense-sparse matrix multiplication kernels, with equivalent kernels for sparse-dense matrix multiplication to follow. (Edit: the sparse-dense kernels have now landed.) Further optimization of the trickier cases in this pull request (see comments) will also follow.
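
For orientation, the loop structure these kernels exploit looks roughly like the following. This is a minimal sketch of the C = A*B case with A dense and B sparse in CSC format; the function name and exact loop order are illustrative, not the PR's actual code:

```julia
using SparseArrays

# Sketch of C = A*B with A dense and B sparse (CSC): for each stored
# entry B[k,j], accumulate B[k,j] * A[:,k] into C[:,j], so the innermost
# loop streams through contiguous columns of A and C.
function densesparsemul!(C::Matrix, A::Matrix, B::SparseMatrixCSC)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rows, vals = rowvals(B), nonzeros(B)
    for j in 1:size(B, 2)
        for p in nzrange(B, j)
            k, Bkj = rows[p], vals[p]
            @inbounds for i in 1:size(A, 1)
                C[i, j] += A[i, k] * Bkj
            end
        end
    end
    return C
end
```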

Performance in a machine learning application from @pevnak motivated this pull request (see slack/#linalg). Specifically, IIRC TensorFlow outperforms Flux on master on his problem, whereas with these kernels (and the forthcoming sparse-dense equivalents) the tables turn.

Comprehensive benchmarks forthcoming (via BaseBenchmarks PR JuliaCI/BaseBenchmarks.jl#128).

Best!

@Sacha0 Sacha0 added linear algebra Linear algebra potential benchmark Could make a good benchmark in BaseBenchmarks performance Must go faster sparse Sparse arrays labels Oct 7, 2017
@Sacha0 Sacha0 force-pushed the dsmult branch 3 times, most recently from cf44d90 to 302917c on October 8, 2017 19:24
@Sacha0 Sacha0 changed the title [WIP] improved dense-sparse matrix multiplication kernels [WIP] improved dense-sparse and sparse-dense matrix multiplication kernels Oct 8, 2017
@Sacha0
Member Author

Sacha0 commented Oct 8, 2017

Equivalent kernels for sparse-dense matrix multiplication are now live in the second commit. I have a third commit that consolidates the dense-sparse and sparse-dense methods, but would like to run nanosoldier before and after pushing that commit to check that the consolidation does not impact performance.

@ararslan
Member

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

@Sacha0
Member Author

Sacha0 commented Oct 10, 2017

From nanosoldier's logs,

ERROR: LoadError: MethodError: no method matching h5d_write(::Int32, ::Int32, ::Base.ReinterpretArray{UInt8,1,HDF5.HDF5ReferenceObj,Array{HDF5.HDF5ReferenceObj,1}})
Closest candidates are:
  h5d_write(::Any, ::Any, ::Any, !Matched::Any, !Matched::Any, !Matched::Any) at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:2053
  h5d_write(::Int32, ::Int32, !Matched::String) at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:1922
  h5d_write(::Int32, ::Int32, !Matched::Array{S<:String,N} where N) where S<:String at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:1929
  ...
Stacktrace:
 [1] writearray(::HDF5.HDF5Dataset, ::Int32, ::Base.ReinterpretArray{UInt8,1,HDF5.HDF5ReferenceObj,Array{HDF5.HDF5ReferenceObj,1}}) at /home/nanosoldier/.julia/v0.7/HDF5/src/HDF5.jl:1807
 [2] #_write#17(::Array{Any,1}, ::Function, ::JLD.JldFile, ::String, ::Array{Any,1}, ::JLD.JldWriteSession) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:574
 [3] _write(::JLD.JldFile, ::String, ::Array{Any,1}, ::JLD.JldWriteSession) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:559
 [4] #write#14(::Array{Any,1}, ::Function, ::JLD.JldFile, ::String, ::Any, ::JLD.JldWriteSession) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:512
 [5] #jldopen#9(::Bool, ::Bool, ::Bool, ::Function, ::String, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:178
 [6] (::getfield(JLD, Symbol("#kw##jldopen")))(::Array{Any,1}, ::typeof(JLD.jldopen), ::String, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./<missing>:0
 [7] #jldopen#10(::Bool, ::Bool, ::Bool, ::Function, ::String, ::String) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:231
 [8] (::getfield(JLD, Symbol("#kw##jldopen")))(::Array{Any,1}, ::typeof(JLD.jldopen), ::String, ::String) at ./<missing>:0
 [9] #jldopen#11(::Array{Any,1}, ::Function, ::getfield(JLD, Symbol("##34#35")){String,Dict{String,String},Tuple{String,BenchmarkTools.BenchmarkGroup}}, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:241
 [10] (::getfield(JLD, Symbol("#kw##jldopen")))(::Array{Any,1}, ::typeof(JLD.jldopen), ::Function, ::String, ::String) at ./<missing>:0
 [11] #save#33(::Bool, ::Bool, ::Function, ::FileIO.File{FileIO.DataFormat{:JLD}}, ::String, ::Any, ::Any, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:1220
 [12] save(::FileIO.File{FileIO.DataFormat{:JLD}}, ::String, ::Dict{String,String}, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/JLD/src/JLD.jl:1217
 [13] #save#14(::Array{Any,1}, ::Function, ::String, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/FileIO/src/loadsave.jl:61
 [14] save(::String, ::String, ::Dict{String,String}, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/FileIO/src/loadsave.jl:61
 [15] save(::String, ::String, ::Vararg{Any,N} where N) at /home/nanosoldier/.julia/v0.7/BenchmarkTools/src/serialization.jl:36
 [16] include_relative(::Module, ::String) at ./loading.jl:533
 [17] include(::Module, ::String) at ./sysimg.jl:14
 [18] process_options(::Base.JLOptions) at ./client.jl:325
 [19] _start() at ./client.jl:391
in expression starting at /home/nanosoldier/workdir/tmpra50k1/benchscript.jl:28

which looks like #23750 (comment)? (Update: The JLD issue appears resolved by JuliaIO/JLD.jl#192.)
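
For context, the MethodError above stems from Julia 0.7 changing `reinterpret` on arrays to return a lazy `Base.ReinterpretArray` rather than a new `Array`, so downstream methods dispatching on `::Array` (like the `h5d_write` candidates listed) no longer match. A minimal illustration:

```julia
x = UInt32[0x01020304]
r = reinterpret(UInt8, x)  # lazy view, not a new Array
typeof(r)   # Base.ReinterpretArray{UInt8,1,UInt32,Array{UInt32,1}}
r isa Array # false -- so methods expecting ::Array throw a MethodError
```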

@ararslan
Member

Yep, that's a JLD issue that just got fixed. Let's try this again.

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@Sacha0
Member Author

Sacha0 commented Oct 13, 2017

@nanosoldier runbenchmarks("sparse" && "matmul", vs=":master")

@Sacha0
Member Author

Sacha0 commented Oct 13, 2017

Nanosoldier reported meaningful regressions only in the kernel for A(t|c)_mul_B[!]([dense,] sparse, dense). Local testing indicated that an added @simd annotation, replacement of a z += x*y with a z = muladd(x, y, z), and the type assertion in CjAjB::eltype(C) = zero(eltype(C)) each negatively impacted performance, so I've removed each. If these changes fix the A(t|c)_mul_B[!]([dense,] sparse, dense) regressions per nanosoldier, I plan to merge this pull request and defer consolidation and additional improvements to later pull requests. Best!
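
For concreteness, the kernel shape in question looks roughly like this minimal sketch of the At_mul_B(sparse, dense) case. The function name and loop structure are illustrative, not the PR's actual code, and the adjoint variant would additionally conjugate the stored values:

```julia
using SparseArrays

# Sketch of C = transpose(A)*B with A sparse (CSC) and B dense: each
# C[j,l] is a dot product over A's stored column j. The plain `+=`
# accumulation below is the variant kept; `@simd`, `muladd`, and a
# `::eltype(C)` assertion on the accumulator each hurt performance
# in local testing.
function sptransdensemul!(C::Matrix, A::SparseMatrixCSC, B::Matrix)
    rows, vals = rowvals(A), nonzeros(A)
    @inbounds for l in 1:size(B, 2), j in 1:size(A, 2)
        CjlAjB = zero(eltype(C))
        for p in nzrange(A, j)
            CjlAjB += vals[p] * B[rows[p], l]  # plain +=, no muladd
        end
        C[j, l] = CjlAjB
    end
    return C
end
```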

@nanosoldier
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: stored type BenchmarkTools.ParametersPreV006 does not match currently loaded type

Logs and partial data can be found here
cc @ararslan

@andreasnoack
Member

Fixes #22615

@KristofferC
Member

@nanosoldier runbenchmarks("sparse" && "matmul", vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@ararslan
Member

ararslan commented Nov 8, 2017

The sparse matmul benchmarks have been pretty noisy on master, so it's kind of hard to determine how much is related and how much is noise.

@ViralBShah
Member

ViralBShah commented Mar 4, 2018

Should we try to bring back these improvements?

@Sacha0
Member Author

Sacha0 commented Mar 4, 2018

I'd love to. Regrettably I won't have time to see this through for the foreseeable future. Best!

@KristofferC
Member

KristofferC commented Apr 10, 2018

Is there anything "big" left to do here, @Sacha0? Getting these in would be very nice, so if it is just a matter of rebasing (A_mul_B! -> mul! etc.), checking Nanosoldier, and doing some small performance tweaks, I could look at it.

@Sacha0
Member Author

Sacha0 commented Apr 10, 2018

Is there anything "big" left to do here, @Sacha0? Getting these in would be very nice, so if it is just a matter of rebasing (A_mul_B! -> mul! etc.), checking Nanosoldier, and doing some small performance tweaks, I could look at it.

[Insert hazy memory disclaimer here.] Much thanks for looking into resuscitating this pull request! :) Rebasing through the lazy adjoint/transpose change and associated cleanup should make this mergeworthy again. Of course there remains potential downstream work, but none of it is pressing. (If this pull request is nonetheless still open by then, I hope to polish it off after submitting my dissertation and figuring out immediate next steps.) Best!

@jebej
Contributor

jebej commented May 9, 2018

It would be amazing to have this land :)

@carstenbauer
Member

carstenbauer commented Nov 6, 2018

This would fix https://discourse.julialang.org/t/asymmetric-speed-of-in-place-sparse-dense-matrix-product/10256/4 I guess.

@mschauer
Contributor

Currently, B*A, A*B, and B'*A create dense matrices and perform reasonably, though not perfectly.
Here A is dense and B is sparse.

A*B' freezes for some time while it creates a nearly full sparse matrix, which is likely unwanted. I just note that this PR would fix this too (I think this problem is less subtle than the other performance issues mentioned).
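
A minimal sketch of the behavior described, with illustrative sizes and density:

```julia
using SparseArrays

A = rand(1000, 1000)           # dense
B = sprand(1000, 1000, 0.01)   # sparse

B * A    # dense result, reasonable performance
A * B    # dense result, reasonable performance
B' * A   # dense result, reasonable performance

A * B'   # slow: builds an almost-full result in sparse format
```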

@ViralBShah
Member

@Sacha0 Would it be possible for you to revive this PR?

@Sacha0
Member Author

Sacha0 commented Dec 17, 2018

@Sacha0 Would it be possible for you to revive this PR?

That's the hope! :)

@ViralBShah
Member

Thank you. :-)

@vtjnash
Member

vtjnash commented Apr 13, 2021

IIUC, this was revived in #38876, but we might still need the tests from here, as it seems that was merged without tests? @Sacha0, @dkarrasch, or @ViralBShah, would you be able to confirm and close or update this, as appropriate?

@dkarrasch
Member

We are covering all but one case according to https://codecov.io/gh/JuliaLang/julia/src/master/stdlib/SparseArrays/src/linalg.jl. I'll add one as soon as I get to it. Other than that, this could be closed. The remaining aspect of this one was to (silently) materialize certain adjoints/transposes, but in #40089 I found this can sometimes be advantageous and sometimes not, so it doesn't look like a good general solution and materializing should be left to the user.
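
In practice that user-side choice looks like the following minimal sketch (sizes and density are illustrative):

```julia
using SparseArrays

A = sprand(2000, 2000, 0.01)
B = rand(2000, 2000)

A' * B        # lazy adjoint wrapper; no copy of A' is made
copy(A') * B  # explicitly materialize A' first -- per #40089,
              # sometimes faster, sometimes not
```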

@vtjnash vtjnash closed this Apr 13, 2021