# make tensor products faster #58
```julia
using SciMLOperators, LinearAlgebra
using BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))
u = [rand(12^3) for i=1:100]
v = [rand(12^3) for i=1:100]
A = cache_operator(A, u[1])

mul!(v[1], A, u[1]) # dummy call to compile
@btime for i=1:length(u); mul!($v[i], $A, $u[i]); end; # 7.333 ms (3301 allocations: 27.05 MiB)

B = convert(AbstractMatrix, A)
mul!(v[1], B, u[1]) # dummy call to compile
@btime for i=1:length(u); mul!($v[i], $B, $u[i]); end; # 129.053 ms (101 allocations: 3.16 KiB)
```

So applying the operator to individual vectors is indeed faster, likely because we are avoiding the dense matvec against the materialized Kronecker matrix.
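For intuition (a back-of-the-envelope count, not part of the original benchmark): materializing the three-factor operator yields a 1728×1728 dense matrix, so each matvec costs (12³)² multiply-adds, while the factored form only applies a 12×12 matrix along each of the three tensor modes:

```julia
n = 12
dense_flops    = (n^3)^2  # 2_985_984 multiply-adds per vector for the materialized kron
factored_flops = 3 * n^4  # 62_208: one n×n matmul along each of the three modes
dense_flops / factored_flops  # = 48, so ~48x fewer flops for the factored form
```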
@chriselrod @ChrisRackauckas can you take a look please?
I think the 'slow' part is that we are applying the operator to column vectors, in SciMLOperators.jl/src/sciml.jl, lines 608 to 610 at commit 746c0d5.
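To illustrate the point (a minimal sketch with the sizes from above, not the package's internal code): applying a factor matrix one column at a time incurs many small BLAS calls, while one batched `mul!` over all columns is a single large matmul:

```julia
using LinearAlgebra, BenchmarkTools

A = rand(12, 12)
U = rand(12, 12 * 100)  # all columns gathered into one matrix
V = similar(U)

# one mul! per column: many tiny BLAS calls
@btime foreach(j -> mul!(view($V, :, j), $A, view($U, :, j)), 1:size($U, 2));

# one batched mul!: a single 12x12 * 12x1200 matmul
@btime mul!($V, $A, $U);
```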
There's definitely something I'm doing wrong, because we are slower than LinearMaps, and LinearMaps has allocations in their implementation:

```julia
julia> using LinearMaps, LinearAlgebra, BenchmarkTools; A = ⊗(rand(12,12), rand(12,12), rand(12,12)); u=rand(12^3,100); v=copy(u); @btime mul!($v, $A, $u);
  5.142 ms (8500 allocations: 27.11 MiB)
```

https://github.com/JuliaLinearAlgebra/LinearMaps.jl/blob/master/src/kronecker.jl#L117-L131
You could try replacing that with an […]

I meant delete […]
You can't assume operators are indexable; it's not part of the interface (for a good reason: it may be a matrix-free operator). If you need to do that, then we can have a trait for fast indexing (extending the ArrayInterface one), but I'm not sure that's useful, since in most cases the operator should be matrix-free.
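A sketch of what such a trait could look like (the name `has_fast_indexing` is hypothetical and not part of SciMLOperators; `ArrayInterface.fast_scalar_indexing` is the existing trait being extended):

```julia
using ArrayInterface, SciMLOperators

# Hypothetical trait: default to false, i.e. assume matrix-free.
has_fast_indexing(L) = false

# A MatrixOperator wrapping a plain Array (not, say, a GPU array) could opt in.
has_fast_indexing(L::MatrixOperator) = ArrayInterface.fast_scalar_indexing(typeof(L.A))
```

An indexing-based fast path would then only be selected when the trait returns `true`, falling back to the generic `mul!` otherwise.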
No indexing, but […]. We could have a special overload for certain types.
@chriselrod I'm not sure how […]

In this case, the slices […]

In #59 I've written an implementation using `@views`.
Views are not okay. Views and indexing are not allowed on the operator. It's fine on the actor, though. It's […]
What about GPU matrices? I don't think you can assume fast indexing for MatrixOperator either. (I don't have anything to contribute with regard to the issue at hand, sorry.)
Yeah, […]
With the current implementation (SciMLOperators.jl/src/sciml.jl, lines 684 to 724 at commit c948888):

```julia
using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)
A = rand(12, 12) # outer
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)
@btime mul!($v, $T, $u); # 80.416 μs (7 allocations: 304 bytes)

# reference computation - a single dense matmul of the same size
u = reshape(u, (12, 12*100))
v = reshape(v, (12, 12*100))
@btime mul!($v, $A, $u); # 24.005 μs (0 allocations: 0 bytes)
```

Gotta track down those allocations.
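One way to track them down (a hedged sketch, not the package code: the four stages below mirror the reshape / `mul!` / `permutedims!` sequence the implementation performs, with illustrative variable names and the sizes from the benchmark above) is to time each stage separately and see which one allocates:

```julia
using LinearAlgebra, BenchmarkTools

mi = ni = mo = no = 12; k = 100
B_inner = rand(mi, ni)  # inner factor
A_outer = rand(mo, no)  # outer factor

U  = rand(ni, no * k)
C1 = zeros(mi, no * k); C1t = reshape(C1, (mi, no, k))
C2 = zeros(no, mi * k); C2t = reshape(C2, (no, mi, k))
C3 = zeros(mo, mi * k); C3t = reshape(C3, (mo, mi, k))
V  = zeros(mi, mo, k)

@btime mul!($C1, $B_inner, $U);              # stage 1: inner matmul
@btime permutedims!($C2t, $C1t, (2, 1, 3));  # stage 2: permute
@btime mul!($C3, $A_outer, $C2);             # stage 3: outer matmul
@btime permutedims!($V, $C3t, (2, 1, 3));    # stage 4: permute back
```

Whichever stage reports nonzero allocations is the culprit.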
The two […]
FYI currently we are at:

```julia
using SciMLOperators, LinearAlgebra, BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))
u = rand(12^3, 100)
v = rand(12^3, 100)
A = cache_operator(A, u)
mul!(v, A, u) # dummy call to compile
@btime mul!($v, $A, $u); # 4.916 ms (17 allocations: 31.36 KiB)
```

```julia
using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)
A = rand(12, 12) # outer
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)
@btime mul!($v, $T, $u); # 80.081 μs (7 allocations: 304 bytes)
nothing
```
Locally, I tried using […]:

```
julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  44.266 μs … 123.030 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.537 μs ±   1.868 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▄▇█▇▅▃▁
  ▁▁▁▂▂▃▆███████▇▅▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  44.3 μs         Histogram: frequency by time         54.8 μs <

 Memory estimate: 304 bytes, allocs estimate: 7.
```

With […]:

```
julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.194 μs … 84.525 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.054 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.176 μs ±  1.133 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃▄▇██▇▇▄▃
  ▂▂▃▄▆███████████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▂ ▃
  18.2 μs        Histogram: frequency by time        22.9 μs <

 Memory estimate: 96 bytes, allocs estimate: 3.
```

This was with:

```julia
using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)
A = rand(12, 12) # outer
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)
```

Using […]:

```
julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  8.935 μs … 101.421 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.533 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.667 μs ±   1.218 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▄▇██▅▃▃▂▃▄▂
  ▂▂▃▅█████████████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  8.94 μs        Histogram: frequency by time        12.5 μs <

 Memory estimate: 96 bytes, allocs estimate: 3.
```

About 5x faster. Note that […]
The code would need dispatches set up correctly to behave correctly as a function of input types.

```diff
diff --git a/src/sciml.jl b/src/sciml.jl
index 5291b1d..f482517 100644
--- a/src/sciml.jl
+++ b/src/sciml.jl
@@ -593,7 +593,7 @@ function cache_internals(L::TensorProductOperator, u::AbstractVecOrMat) where{D}
     @set! L.outer = cache_operator(L.outer, uouter)
     L
 end
-
+using LoopVectorization
 function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::AbstractVecOrMat)
     @assert L.isset "cache needs to be set up for operator of type $(typeof(L)).
     set up cache by calling cache_operator(L::AbstractSciMLOperator, u::AbstractArray)"
@@ -605,7 +605,6 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
     perm = (2, 1, 3)
     C1, C2, C3, _ = L.cache
     U = _reshape(u, (ni, no*k))
-
     """
         v .= kron(B, A) * u
         V .= A * U * B'
@@ -619,13 +618,16 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
     if L.outer isa IdentityOperator
         copyto!(v, C1)
     else
-        C1 = _reshape(C1, (mi, no, k))
-        permutedims!(C2, C1, perm)
-        C2 = _reshape(C2, (no, mi*k))
-        mul!(C3, L.outer, C2)
-        C3 = _reshape(C3, (mo, mi, k))
-        V = _reshape(v, (mi, mo, k))
-        permutedims!(V, C3, perm)
+        C1 = _reshape(C1, (mi, no, k))
+        Lo = L.outer
+        V  = _reshape(v, (mi, mo, k))
+        @turbo for i = 1:mi, j = 1:k, m = 1:mo
+            a = zero(Base.promote_eltype(V, Lo, C1))
+            for o = 1:no
+                a += Lo[m, o] * C1[i, o, j]
+            end
+            V[i, m, j] = a
+        end
```

With just this diff,

```julia
using SciMLOperators, LinearAlgebra, BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))
u = rand(12^3, 100)
v = rand(12^3, 100)
A = cache_operator(A, u)
mul!(v, A, u) # dummy call
```

fails, for example.
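The docstring in that hunk relies on the standard vec-trick identity `kron(B, A) * vec(U) == vec(A * U * B')`, which is easy to sanity-check numerically:

```julia
using LinearAlgebra

A = rand(3, 3)  # inner factor
B = rand(4, 4)  # outer factor
U = rand(3, 4)  # columns indexed by the outer dimension

kron(B, A) * vec(U) ≈ vec(A * U * transpose(B))  # true
```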
I could prepare a PR, or someone else more familiar with all the dispatches needed to funnel only valid types to the `@turbo` kernel could do it. The approach is simply to replace the `permutedims!` and `mul!` calls with `@turbo` loops.
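For reference, the fused kernel can be tried standalone (a minimal sketch assuming plain dense `Array`s, not the package's dispatch machinery; `outer_contract!` is a made-up name):

```julia
using LoopVectorization

# Contract the outer factor Lo (mo x no) against the middle axis of C1 (mi x no x k),
# writing into V (mi x mo x k). This fuses the permutedims! + mul! + permutedims!
# sequence into one loop nest, with no intermediate buffers.
function outer_contract!(V::Array{T,3}, Lo::Matrix{T}, C1::Array{T,3}) where {T}
    mi, no, k = size(C1)
    mo = size(Lo, 1)
    @turbo for i in 1:mi, j in 1:k, m in 1:mo
        a = zero(T)
        for o in 1:no
            a += Lo[m, o] * C1[i, o, j]
        end
        V[i, m, j] = a
    end
    return V
end

# usage with the sizes from the thread's benchmarks
V  = zeros(12, 12, 100)
Lo = rand(12, 12)
C1 = rand(12, 12, 100)
outer_contract!(V, Lo, C1)
```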
New idea: try using […]
With the current `@views`-based implementation: SciMLOperators.jl/src/sciml.jl, lines 584 to 618 at commit 746c0d5.