Description
TensorPrimitives delegates simple operators to vector intrinsics by default. This is fine for most operations, but integer division (IDIV) is an exception.
First, most (if not all) ISAs lack a vector integer division instruction. I checked AVX-512/AVX2 and SVE/AdvSimd and could not find one, so the vector intrinsic falls back to software emulation. On my CPU with AVX2, it is about 2.5x slower than a naive for-loop for int[1024] / int (scalar).
Second, when every element is divided by the same divisor, the widely used preinv (precomputed inverse) algorithm can turn the division into a cheaper multiplication, which can be vectorized on various ISAs.
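
To make the preinv idea concrete, here is a minimal scalar sketch for unsigned 32-bit values, following the classic multiply-and-shift scheme for division by an invariant divisor. `UInt32Divider` is a hypothetical helper, not an existing API and not a proposal for how TensorPrimitives should implement it. Signed `int` needs extra sign handling that is left out here. The per-element step uses only a 32x32 to 64-bit multiply, a subtract, an add, and shifts, which is what would make a vectorized version plausible.

```csharp
using System;
using System.Numerics;

// Hypothetical helper for illustration only; not part of TensorPrimitives.
public readonly struct UInt32Divider
{
    private readonly uint _multiplier;
    private readonly int _shift1;
    private readonly int _shift2;

    public UInt32Divider(uint divisor)
    {
        if (divisor == 0) throw new DivideByZeroException();

        // l = ceil(log2(divisor)); evaluates to 0 when divisor == 1.
        int l = 32 - BitOperations.LeadingZeroCount(divisor - 1);

        // multiplier = floor(2^32 * (2^l - divisor) / divisor) + 1, which fits in 32 bits.
        _multiplier = (uint)((((1UL << l) - divisor) << 32) / divisor + 1);
        _shift1 = Math.Min(l, 1);
        _shift2 = Math.Max(l - 1, 0);
    }

    // Computes n / divisor without a division instruction.
    public uint Divide(uint n)
    {
        // High 32 bits of the 64-bit product.
        uint t = (uint)(((ulong)_multiplier * n) >> 32);
        return (t + ((n - t) >> _shift1)) >> _shift2;
    }
}

public static class PreinvExample
{
    // Divides every element of x by the same divisor, paying for one real division up front.
    public static void Divide(ReadOnlySpan<uint> x, uint divisor, Span<uint> destination)
    {
        var divider = new UInt32Divider(divisor);
        for (int i = 0; i < x.Length; i++)
        {
            destination[i] = divider.Divide(x[i]);
        }
    }
}
```

The setup cost (one 64-bit division plus a leading-zero count) is paid once per call, so the approach only pays off when the same divisor is applied to many elements, as in the span / scalar case.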
I'm not sure integer division is popular enough to justify this optimization, but we should at least disable DivideOperator.Vectorizable for integer types, because it ends up using software emulation.