Using the LIBXSMM MKL JIT backend
Fastor is a stand-alone library and does not depend on any external BLAS for its linear algebra routines. However, given that Fastor's primary focus is on speeding up operations on small tensors, it also provides the option of switching backends to libxsmm or MKL JIT for small matrix-matrix and matrix-vector multiplications, should the need arise. This is specifically useful in cases where performance portability is important.
To activate the libxsmm backend, first download and build libxsmm as a static or dynamic shared library, then compile your Fastor code with the `-DFASTOR_USE_LIBXSMM` flag. You also need to link to libxsmm with `-L/path/to/libxsmm/lib/ -lxsmm -lblas -ldl`. Fastor will then automatically dispatch all `matmul` routines to libxsmm.
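As a sketch, a compile line might look like the following (the paths, the source file name, and the extra `-I` and `-O3` flags are placeholders/assumptions; adjust them to your installation):

```shell
# Compile a Fastor translation unit with the libxsmm backend enabled.
# /path/to/libxsmm is a placeholder for your actual libxsmm prefix.
g++ -O3 -DFASTOR_USE_LIBXSMM my_code.cpp \
    -I/path/to/libxsmm/include -L/path/to/libxsmm/lib/ -lxsmm -lblas -ldl
```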
To activate the MKL JIT backend, first download and install the Intel MKL library, then compile your Fastor code with the `-DFASTOR_USE_MKL` flag. You also need to link to MKL with `-L/path/to/mkl/lib/ -lmkl_rt`. Fastor will then automatically dispatch all `matmul` routines to MKL.
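Analogously, a sketch of a compile line for the MKL backend (paths and the `-I` flag are placeholders; MKL installations commonly expose the prefix via the `MKLROOT` environment variable):

```shell
# Compile with the MKL JIT backend enabled.
# /path/to/mkl is a placeholder for your actual MKL prefix.
g++ -O3 -DFASTOR_USE_MKL my_code.cpp \
    -I/path/to/mkl/include -L/path/to/mkl/lib/ -lmkl_rt
```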
The switch can also be configured based on matrix size using the compiler flag `FASTOR_BLAS_SWITCH_MATRIX_SIZE`. The default value is 16 for square matrices, that is, matrix multiplications with `M=N=K>16` will be dispatched to BLAS if one is available. For non-square matrices, the default criterion is `cbrt(M*N*K)>16`. It might be best to experiment with this value on your own architecture.
Here is a Google Benchmark run of a complex Kalman filter problem implemented in Fastor (which uses a lot of `matmul` operations on square and non-square matrices), comparing the built-in `matmul` with the libxsmm-dispatched `matmul`:
BUILT-IN MATMUL:
```
Run on (8 X 2300 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 1.31, 1.23, 1.06
------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------
CovarianceUpdateFastor<8> 245 ns 244 ns 2574429
CovarianceUpdateFastor<16> 948 ns 948 ns 614202
CovarianceUpdateFastor<32> 5483 ns 5480 ns 99416
CovarianceUpdateFastor<64> 28338 ns 28306 ns 14543
CovarianceUpdateFastor<128> 221356 ns 221054 ns 1912
```
LIBXSMM DISPATCHED MATMUL:
```
Run on (8 X 2300 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 1.31, 1.23, 1.06
------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------
CovarianceUpdateFastor<8> 254 ns 254 ns 2614574
CovarianceUpdateFastor<16> 955 ns 955 ns 620507
CovarianceUpdateFastor<32> 5553 ns 5550 ns 123622
CovarianceUpdateFastor<64> 30319 ns 30296 ns 23204
CovarianceUpdateFastor<128> 218788 ns 218571 ns 3583
```

The default switches were used for dispatching in this analysis. Notice that Fastor is well tuned for small matrix-matrix multiplications; in this case libxsmm only takes over marginally at size 128.