`StaticArrays` has heuristics that decide what code to generate for matrix multiplication. There seems to be one heuristic for `BlasFloat` and one for everything else (i.e. `Any`). The present `Any` heuristic makes bad choices for `Dual`: in the 4×4 example below, the default `*` is more than twice as slow as `mul_loop`. I think it would be straightforward to create a better heuristic for `ForwardDiff.Dual` by including the number of partials in the heuristic. The obvious issue is that this would make `ForwardDiff` a dependency of `StaticArrays`. Is there a good way to get this performance issue fixed?
```julia
using BenchmarkTools
using ForwardDiff
using StaticArrays

# Dual number with 26 partials (Float64 is used as the tag type here)
Type_Dual = ForwardDiff.Dual{Float64,Float64,26}
A = rand(SMatrix{4,4,Type_Dual,16})
B = rand(SMatrix{4,4,Type_Dual,16})

@btime $A * $B  # DEFAULT (same timing as mul_unrolled)
# 1.376 μs (0 allocations: 0 bytes)
@btime StaticArrays.mul_loop($(Size(A)), $(Size(B)), $A, $B)
# 614.142 ns (0 allocations: 0 bytes)
@btime StaticArrays.mul_unrolled_chunks($(Size(A)), $(Size(B)), $A, $B)
# 688.962 ns (0 allocations: 0 bytes)
@btime StaticArrays.mul_unrolled($(Size(A)), $(Size(B)), $A, $B)
# 1.382 μs (0 allocations: 0 bytes)
```
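A rough sketch of what an element-size-aware heuristic might look like, assuming the useful signal is how many `Float64` words each element occupies. Measuring this with `sizeof(T)` rather than the number of partials would avoid naming `ForwardDiff.Dual` at all, so `StaticArrays` would not need `ForwardDiff` as a dependency. Everything here is hypothetical: `choose_mul_strategy`, the `budget` threshold, the 14×14 cut-off and the stand-in type `FakeDual26` are illustrative assumptions, not values taken from the StaticArrays source. The returned symbols only name the kernels benchmarked above; the function does not call them.

```julia
# Hypothetical sketch, not StaticArrays' actual heuristic: pick a kernel for an
# (m×k) * (k×n) product based on both the shape and the element size.
function choose_mul_strategy(m::Integer, k::Integer, n::Integer, ::Type{T};
                             budget::Integer = 8 * 8 * 8) where {T}
    # Float64-sized words per element: 1 for Float64, 1 + N for an
    # isbits Dual with N Float64 partials (assumes an isbits element type).
    words = max(1, cld(sizeof(T), sizeof(Float64)))
    # Scalar multiply-adds in the product, weighted by element size.
    work = m * k * n * words
    if work <= budget
        return :mul_unrolled          # small product of thin elements: fully unroll
    elseif words == 1 && max(m, k, n) <= 14
        return :mul_unrolled_chunks   # thin elements, moderate size: partial unroll
    else
        return :mul_loop              # fat elements or large sizes: keep code small
    end
end

# Hypothetical stand-in with the same size as Dual{Tag,Float64,26}:
# one value plus 26 partials = 27 Float64 words (216 bytes).
struct FakeDual26
    vals::NTuple{27,Float64}
end

choose_mul_strategy(4, 4, 4, Float64)     # → :mul_unrolled
choose_mul_strategy(4, 4, 4, FakeDual26)  # → :mul_loop
```

With these illustrative numbers, a 4×4 `Float64` product stays fully unrolled, while the same shape with 27-word elements falls through to `:mul_loop`, which was the fastest option in the benchmark above.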