I am trying to figure out what to use as optimal kernel parameter for different architectures.
For example, it looks like blis is using 8x4 for Sandy Bridge, but 8x6 for Haswell. Why? What lead them to this setup? Specifically, because operations are usually on 4 doubles at a time, how does the 6 fit in there. Is Haswell able to separately execute a _mm256 and a _mm operation at the same time?
Furthermore, if we have non-square kernels like for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?