Run:
FC=gfortran cmake -DMATMUL_BLAS=OpenBLAS .
make
OMP_NUM_THREADS=1 ./matmul
The theoretical performance peak for matmul is just the cost of fma
, which is
0.125 clock cycles per double precision matrix element (fmla.2d v0, v0, v0
takes 0.25 cycles), and 0.0625 per single precision element.
Single precison (f32) matmul
peak = 0.0625 clock cycles
n OpenBlas
512 0.0768
1024 0.0672
2048 0.0640
4096 0.0632
8192 0.0631
To convert these clock cycles to seconds, multiply by n^3 and divide by 3.2GHz. For example the n=8192 case gives 10.84s:
>>> n = 8192; 0.0631*n**3 / 3.2e9
10.840497455104