Description
Currently, batched_mul falls back to calling mul! if the element type doesn't have BLAS support, and the fallback mul! is quite slow. For example, batched_mul(randn(Float16, 1024, 1024, 32), randn(Float16, 1024, 1024, 32)) did not finish within 5 minutes on my computer. After replacing that mul! call with Octavian.matmul! and re-running, I get:
julia> a = randn(Float16, 1024, 1024, 32);
julia> b = randn(Float16, 1024, 1024, 32);
julia> @btime batched_mul($a, $b);
260.851 ms (2 allocations: 64.00 MiB)
julia> a32 = randn(Float32, 1024, 1024, 32);
julia> b32 = randn(Float32, 1024, 1024, 32);
julia> @btime batched_mul($a32, $b32);
128.949 ms (38 allocations: 128.00 MiB)
This is not optimal, but IMO it is good enough compared to the current state.
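
To make the proposal concrete, here is a minimal sketch of a per-slice fallback with a swappable kernel. It is not NNlib's actual implementation: the name batched_mul_sketch! and the kernel! keyword are hypothetical, and only LinearAlgebra.mul! from the standard library is used so the snippet runs without extra packages.

```julia
using LinearAlgebra

# Hypothetical helper (not NNlib's real code): multiply each batch slice
# with a pluggable matrix-multiply kernel. Defaults to the generic mul!.
function batched_mul_sketch!(C::AbstractArray{T,3}, A::AbstractArray{T,3},
                             B::AbstractArray{T,3}; kernel! = mul!) where {T}
    size(A, 3) == size(B, 3) == size(C, 3) ||
        throw(DimensionMismatch("batch dimensions must match"))
    for k in axes(C, 3)
        # Views avoid copying the k-th slice of each array.
        kernel!(view(C, :, :, k), view(A, :, :, k), view(B, :, :, k))
    end
    return C
end

# Small Float16 example so the generic fallback finishes quickly.
a = randn(Float16, 8, 8, 4)
b = randn(Float16, 8, 8, 4)
c = similar(a)
batched_mul_sketch!(c, a, b)
```

Assuming Octavian.jl is installed, passing kernel! = Octavian.matmul! should swap in the kernel used for the timings above. I have not checked how it behaves on array views, so treat that as an assumption.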