make tensor products faster #58

Open · vpuri3 opened this issue Jun 17, 2022 · 22 comments · Fixed by #59

vpuri3 (Member) commented Jun 17, 2022

With the current @views-based implementation:

function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::AbstractVecOrMat)
    @assert L.isset "cache needs to be set up for operator of type $(typeof(L)).
    set up cache by calling cache_operator(L::AbstractSciMLOperator, u::AbstractArray)"
    mi, ni = size(L.inner)
    mo, no = size(L.outer)
    k = size(u, 2)
    C = L.cache
    U = _reshape(u, (ni, no*k))

    """
        v .= kron(B, A) * u
        V .= A * U * B'
    """

    # C .= A * U
    mul!(C, L.inner, U)

    # V .= C * B' <===> V' .= B * C'
    if k > 1
        V = _reshape(v, (mi, mo, k))
        C = _reshape(C, (mi, no, k))
        @views for i = 1:k
            mul!(transpose(V[:,:,i]), L.outer, transpose(C[:,:,i]))
        end
    else
        V = _reshape(v, (mi, mo))
        C = _reshape(C, (mi, no))
        mul!(transpose(V), L.outer, transpose(C))
    end

    v
end
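
The string literal in the middle of the function records the vec-trick identity everything hinges on; here is a quick standalone check, independent of the package:

using LinearAlgebra

# vec trick: kron(B, A) * vec(U) == vec(A * U * transpose(B))
A, B = rand(3, 3), rand(4, 4)
U = rand(3, 4)
kron(B, A) * vec(U) ≈ vec(A * U * transpose(B))  # true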

using SciMLOperators, LinearAlgebra
using BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = rand(12^3, 100)
v = rand(12^3, 100)

A = cache_operator(A, u)

mul!(v, A, u) # dummy
@btime mul!($v, $A, $u); # 16.901 ms (348801 allocations: 45.51 MiB)

B = convert(AbstractMatrix, A)
mul!(v, B, u) # dummy
@btime mul!($v, $B, $u); # 10.337 ms (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.8.0-rc1
Commit 6368fdc6565 (2022-05-27 18:33 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: 4 × Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_PRECOMPILE_TASKS = 4
  JULIA_DEPOT_PATH = /Users/vp/.julia
  JULIA_NUM_THREADS = 4

vpuri3 (Member, Author) commented Jun 17, 2022

using SciMLOperators, LinearAlgebra
using BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = [rand(12^3) for i=1:100]
v = [rand(12^3) for i=1:100]

A = cache_operator(A, u[1])

mul!(v[1], A, u[1]) # dummy
@btime for i=1:length(u); mul!($v[i], $A, $u[i]); end; # 7.333 ms (3301 allocations: 27.05 MiB)

B = convert(AbstractMatrix, A)
mul!(v[1], B, u[1]) # dummy
@btime for i=1:length(u); mul!($v[i], $B, $u[i]); end; # 129.053 ms (101 allocations: 3.16 KiB)

So applying the operator to the individual vectors is indeed faster, likely because we avoid the views call.

vpuri3 (Member, Author) commented Jun 17, 2022

@chriselrod @ChrisRackauckas, can you take a look please?

vpuri3 (Member, Author) commented Jun 17, 2022

I think the 'slow' part is that we are applying the operator to the column vectors of u one at a time:

@views for i=1:k
    mul!(transpose(V[:,:,i]), L.outer, transpose(C[:,:,i]))
end
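
For reference, a standalone sketch of one way to batch the k slices into a single matmul by permuting the middle dimension (this is the direction #59 takes); dense arrays and made-up names only:

using LinearAlgebra

# batch the k outer multiplications into one matmul by permuting
# the middle dimension to the front
mi, no, mo, k = 12, 12, 12, 100
C = rand(mi, no, k)
outer = rand(mo, no)
V = zeros(mi, mo, k)

Cp = permutedims(C, (2, 1, 3))                      # (no, mi, k)
W  = outer * reshape(Cp, no, mi * k)                # one big matmul, (mo, mi*k)
permutedims!(V, reshape(W, mo, mi, k), (2, 1, 3))   # back to (mi, mo, k)

# agrees with the per-slice loop above
Vref = similar(V)
@views for i = 1:k
    mul!(transpose(Vref[:, :, i]), outer, transpose(C[:, :, i]))
end
V ≈ Vref  # true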

vpuri3 (Member, Author) commented Jun 17, 2022

There's definitely something I'm doing wrong, because we are slower than LinearMaps even though LinearMaps allocates in its mul!.

julia> using LinearMaps, LinearAlgebra, BenchmarkTools; A = ⊗(rand(12,12), rand(12,12), rand(12,12)); u=rand(12^3,100); v=copy(u); @btime mul!($v, $A, $u);
  5.142 ms (8500 allocations: 27.11 MiB)

https://github.com/JuliaLinearAlgebra/LinearMaps.jl/blob/master/src/kronecker.jl#L117-L131

vpuri3 (Member, Author) commented Jun 17, 2022

@danielwe

chriselrod commented Jun 17, 2022

> I think the 'slow' part is that we are applying the operator to the column vectors of u one at a time:
>
> @views for i=1:k
>     mul!(transpose(V[:,:,i]), L.outer, transpose(C[:,:,i]))
> end

You could try replacing that with an @turbo or @tturbo loop.

vpuri3 (Member, Author) commented Jun 17, 2022

@turbo didn't recognize mul! when I checked. And L.outer is not necessarily an AbstractMatrix (it can be an AbstractSciMLOperator, since we nest operators: TensorProductOperator(A, B, C) === TensorProductOperator(A, TensorProductOperator(B, C))), so it can't necessarily be indexed. So we can't really get rid of the mul! unless we come up with a general implementation of TensorProductOperator for an arbitrary number of arguments.
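
For concreteness, a small sketch of that nesting (hypothetical sizes; only the constructor calls are from the package):

using SciMLOperators

A, B, C = rand(4, 4), rand(5, 5), rand(6, 6)

# the three-argument form nests one tensor product inside another, so one
# factor of the nested operator is itself an operator rather than a matrix
# and cannot be assumed to be indexable
T  = TensorProductOperator(A, B, C)
T2 = TensorProductOperator(A, TensorProductOperator(B, C))
size(T) == size(T2) == (4 * 5 * 6, 4 * 5 * 6)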

chriselrod commented:
I meant deleting mul! and replacing it with loops.

ChrisRackauckas (Member) commented:
You can't assume operators are indexable; it's not part of the interface (for a good reason: the operator may be matrix-free). If you need to do that, then we can have a trait for fast indexing (extending the ArrayInterface one), but I'm not sure that's useful, since in most cases the operator should be matrix-free.

chriselrod commented:
No indexing, but views are okay?

We could have a special overload for certain types.

vpuri3 (Member, Author) commented Jun 18, 2022

@chriselrod I'm not sure how a view without indexing would work.

@views for i=1:k
    mul!(transpose(V[:,:,i]), L.outer, transpose(C[:,:,i]))
end

In this case, the slices C[:,:,i] and V[:,:,i] are passed into mul! as SubArrays, but L.outer isn't guaranteed to be indexable.

vpuri3 (Member, Author) commented Jun 18, 2022

In #59 I've written an implementation using permutedims! that is faster than LinearMaps.jl and doesn't allocate. Could you take a look and suggest improvements?

ChrisRackauckas (Member) commented:
> No indexing, but views are okay?

Views are not okay. Views and indexing are not allowed on the operator; they are fine on the operand (the thing being acted on), though. It's A*v where A is a linear operator which may not necessarily have an easy matrix representation. A[i,:] may only be computable via matrix-vector multiplications (i.e. very slowly), so direct calculation of A*v is preferred. While there is MatrixOperator, that's probably one of the only cases where we can assume the operator has fast indexing.

danielwe commented:
> While there is MatrixOperator, that's probably one of the only cases where we can assume the operator has fast indexing.

What about GPU matrices? I don't think you can assume fast indexing for MatrixOperator either.

(I don't have anything to contribute with regard to the issue at hand, sorry.)

ChrisRackauckas (Member) commented:
Yeah, fast_indexing on MatrixOperator would have to check the ArrayInterface fast_indexing function.
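
Roughly, something like this sketch (hypothetical helper; assumes ArrayInterface's fast_scalar_indexing trait):

using ArrayInterface

# sketch: only claim fast indexing when the wrapped array says so;
# matrix-free operators fall through to the conservative default
fast_indexing(A::AbstractMatrix) = ArrayInterface.fast_scalar_indexing(typeof(A))
fast_indexing(A) = false

# idea: pick the indexing-based (@turbo) kernel only when fast_indexing(L.outer)
# holds, and keep the generic mul!-based path otherwise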

vpuri3 (Member, Author) commented Jun 18, 2022

With the current implementation:

function LinearAlgebra.ldiv!(v::AbstractVecOrMat, L::TensorProductOperator, u::AbstractVecOrMat)
    @assert L.isset "cache needs to be set up for operator of type $(typeof(L)).
    set up cache by calling cache_operator(L::AbstractSciMLOperator, u::AbstractArray)"
    mi, ni = size(L.inner)
    mo, no = size(L.outer)
    k = size(u, 2)
    perm = (2, 1, 3)
    C1, C2, C3, _ = L.cache
    U = _reshape(u, (ni, no*k))

    """
        v .= kron(B, A) ldiv u
        V .= (A ldiv U) / B'
    """

    # C .= A \ U
    ldiv!(C1, L.inner, U)

    # V .= C / B' <===> V' .= B \ C'
    if k > 1
        if L.outer isa IdentityOperator
            copyto!(v, C1)
        else
            C1 = _reshape(C1, (mi, no, k))
            permutedims!(C2, C1, perm)
            C2 = _reshape(C2, (no, mi*k))
            ldiv!(C3, L.outer, C2)
            C3 = _reshape(C3, (mo, mi, k))
            V  = _reshape(v , (mi, mo, k))
            permutedims!(V, C3, perm)
        end
    else
        V  = _reshape(v , (mi, mo))
        C1 = _reshape(C1, (mi, no))
        ldiv!(transpose(V), L.outer, transpose(C1))
    end

    v
end

using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)

A = rand(12, 12) # outer 
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)

@btime mul!($v, $T, $u); # 80.416 μs (7 allocations: 304 bytes)


# reference computation - a single matvec
u = reshape(u, (12, 12*100))
v = reshape(v, (12, 12*100))
@btime mul!($v, $A, $u); # 24.005 μs (0 allocations: 0 bytes)

gotta track those allocations.

vpuri3 (Member, Author) commented Jun 20, 2022

The two permutedims! calls together take about as much time as a single mul!(). So we should add more cases to the L.outer matvec computation to avoid the permutedims! wherever possible.
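
To check that in isolation, a standalone sketch with dense arrays of the sizes used above (names are illustrative):

using LinearAlgebra, BenchmarkTools

mi, no, mo, k = 12, 12, 12, 100
C1 = rand(mi, no, k)
C2 = similar(C1, no, mi, k)
W  = rand(mo, mi * k)
outer = rand(mo, no)

# the permutation step on its own
@btime permutedims!($C2, $C1, (2, 1, 3));

# the single big matmul it enables
@btime mul!($W, $outer, $(reshape(C2, no, mi * k)));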

vpuri3 (Member, Author) commented Jun 21, 2022

FYI currently we are at:

using SciMLOperators, LinearAlgebra, BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = rand(12^3, 100)
v = rand(12^3, 100)

A = cache_operator(A, u)

mul!(v, A, u) # dummy
@btime mul!($v, $A, $u); # 4.916 ms (17 allocations: 31.36 KiB)

using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)

A = rand(12, 12) # outer 
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)

@btime mul!($v, $T, $u); # 80.081 μs (7 allocations: 304 bytes)

nothing

chriselrod commented Jun 21, 2022

Locally, I tried using @turbo in one of the mul! functions in sciml.jl.
Master:

julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  44.266 μs … 123.030 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.537 μs ±   1.868 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▄▇█▇▅▃▁
  ▁▁▁▂▂▃▆███████▇▅▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  44.3 μs         Histogram: frequency by time         54.8 μs <

 Memory estimate: 304 bytes, allocs estimate: 7.

With @turbo:

julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.194 μs … 84.525 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.054 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.176 μs ±  1.133 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃▄▇██▇▇▄▃
  ▂▂▃▄▆███████████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▂ ▃
  18.2 μs         Histogram: frequency by time        22.9 μs <

 Memory estimate: 96 bytes, allocs estimate: 3.

This was with

using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)

A = rand(12, 12) # outer 
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)

Using @tturbo instead:

julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  8.935 μs … 101.421 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.533 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.667 μs ±   1.218 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▄▇██▅▃▃▂▃▄▂
  ▂▂▃▅█████████████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  8.94 μs         Histogram: frequency by time        12.5 μs <

 Memory estimate: 96 bytes, allocs estimate: 3.

About 5x faster. Note that LinearAlgebra.mul! is also multithreaded.

chriselrod commented Jun 21, 2022

The code would need the dispatches set up correctly to behave correctly as a function of the input types.

diff --git a/src/sciml.jl b/src/sciml.jl
index 5291b1d..f482517 100644
--- a/src/sciml.jl
+++ b/src/sciml.jl
@@ -593,7 +593,7 @@ function cache_internals(L::TensorProductOperator, u::AbstractVecOrMat) where{D}
     @set! L.outer = cache_operator(L.outer, uouter)
     L
 end
-
+using LoopVectorization
 function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::AbstractVecOrMat)
     @assert L.isset "cache needs to be set up for operator of type $(typeof(L)).
     set up cache by calling cache_operator(L::AbstractSciMLOperator, u::AbstractArray)"
@@ -605,7 +605,6 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
     perm = (2, 1, 3)
     C1, C2, C3, _ = L.cache
     U = _reshape(u, (ni, no*k))
-
     """
         v .= kron(B, A) * u
         V .= A * U * B'
@@ -619,13 +618,16 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
         if L.outer isa IdentityOperator
             copyto!(v, C1)
         else
-            C1 = _reshape(C1, (mi, no, k))
-            permutedims!(C2, C1, perm)
-            C2 = _reshape(C2, (no, mi*k))
-            mul!(C3, L.outer, C2)
-            C3 = _reshape(C3, (mo, mi, k))
-            V  = _reshape(v , (mi, mo, k))
-            permutedims!(V, C3, perm)
+          C1 = _reshape(C1, (mi, no, k))
+          Lo = L.outer
+          V  = _reshape(v , (mi, mo, k))
+          @turbo for i = 1:mi, j = 1:k, m = 1:mo
+            a = zero(Base.promote_eltype(V,Lo,C2))
+            for o = 1:no
+              a += Lo[m,o] * C1[i, o, j]
+            end
+            V[i,m,j] = a
+          end

With just this diff,

using SciMLOperators, LinearAlgebra, BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = rand(12^3, 100)
v = rand(12^3, 100)

A = cache_operator(A, u)

mul!(v, A, u) # dummy

fails, for example.

chriselrod commented:
I could prepare a PR, or someone else more familiar with all the dispatches could, so that only valid types get funneled to the @turbo methods.

The approach is simply to replace reshape + permutedims! + mul! with a single @turbo loop that does it all.
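
A standalone sketch of that fused kernel, with dense Float64 arrays and illustrative names (the real version would still need the dispatch guards discussed above):

using LoopVectorization, LinearAlgebra

# fuse reshape + permutedims! + mul! into one loop nest:
# V[i, m, j] = sum over o of Lo[m, o] * C1[i, o, j],
# i.e. for each slice j, V[:, :, j] = C1[:, :, j] * transpose(Lo)
function fused_outer_mul!(V, Lo, C1)
    mi, no, k = size(C1)
    mo = size(Lo, 1)
    @turbo for i in 1:mi, j in 1:k, m in 1:mo
        a = zero(eltype(V))
        for o in 1:no
            a += Lo[m, o] * C1[i, o, j]
        end
        V[i, m, j] = a
    end
    return V
end

# check against the per-slice formulation
C1 = rand(12, 12, 100); Lo = rand(12, 12); V = similar(C1)
fused_outer_mul!(V, Lo, C1)
all(V[:, :, j] ≈ C1[:, :, j] * transpose(Lo) for j in 1:100)  # true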

vpuri3 (Member, Author) commented Jun 5, 2023

New idea: try using PermutedDimsArray instead of permutedims/permutedims!.
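
The appeal is that PermutedDimsArray wraps its parent lazily instead of copying. A minimal sketch of the idea (whether it's actually faster depends on whether mul! still hits a BLAS path for the wrapped arrays, so it needs benchmarking):

using LinearAlgebra

mi, no, mo, k = 12, 12, 12, 100
C1 = rand(mi, no, k)
outer = rand(mo, no)
V = zeros(mi, mo, k)

# lazy permutations: no data is copied, indices are remapped on access
C1p = PermutedDimsArray(C1, (2, 1, 3))   # behaves like a (no, mi, k) array
Vp  = PermutedDimsArray(V,  (2, 1, 3))   # behaves like a (mo, mi, k) array

mul!(reshape(Vp, mo, mi * k), outer, reshape(C1p, no, mi * k))

# same result as the permutedims!-based path: V[:, :, j] == C1[:, :, j] * outer'
all(V[:, :, j] ≈ C1[:, :, j] * transpose(outer) for j in 1:k)  # true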
