Description
Motivation
Follow-up to #4301: we're now able to compile llama.cpp using Intel's oneAPI compiler and also enable Intel MKL.
Basically, the way Intel MKL works is to provide BLAS-like functions, for example `cblas_sgemm`, whose implementation contains Intel-specific optimized code.
In theory, that should give us better performance. In reality, however, it does not give any performance boost (and in some cases performance is even worse). For example, see #4816.
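For reference, here is a minimal sketch of what a call to such a BLAS-like function looks like. The matrices and sizes are toy values for illustration only, and it assumes a working MKL install (built with something like `icx sgemm_demo.c -qmkl`):

```c
// Minimal cblas_sgemm example: C = alpha*A*B + beta*C in single precision.
// Assumption: building against Intel MKL, e.g. `icx sgemm_demo.c -qmkl`.
#include <stdio.h>
#include <mkl_cblas.h>

int main(void) {
    // Row-major: A is 2x3, B is 3x2, so C is 2x2.
    const float A[6] = {1, 2, 3, 4, 5, 6};
    const float B[6] = {7, 8, 9, 10, 11, 12};
    float C[4] = {0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,        /* M, N, K       */
                1.0f, A, 3,     /* alpha, A, lda */
                B, 2,           /*        B, ldb */
                0.0f, C, 2);    /* beta,  C, ldc */

    printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```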
Benchmarks
NOTE:
- My rig: Framework Laptop 13 | 32GB RAM | Intel 1260P
- When compiling with Intel oneAPI, I always set the maximum optimization level: `-ipo -O3 -static -fp-model=fast` (see the example configure command below)
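In case it helps with reproducing the numbers, this is a sketch of the configure step I mean. It assumes oneAPI's `setvars.sh` lives at the default install path; the BLAS flags on the last `cmake` line apply only to the MKL builds listed below:

```sh
source /opt/intel/oneapi/setvars.sh          # assumption: default oneAPI install path
FLAGS="-ipo -O3 -static -fp-model=fast"
cmake -B build \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_C_FLAGS="$FLAGS" -DCMAKE_CXX_FLAGS="$FLAGS" \
  -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp   # only for the MKL runs
cmake --build build --config Release
```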
Here are some performance tests using different compile configurations:
- Baseline: Compile with gcc 11
llama_print_timings: load time = 638.31 ms
llama_print_timings: sample time = 20.02 ms / 48 runs ( 0.42 ms per token, 2397.24 tokens per second)
llama_print_timings: prompt eval time = 22078.89 ms / 159 tokens ( 138.86 ms per token, 7.20 tokens per second)
llama_print_timings: eval time = 8162.45 ms / 47 runs ( 173.67 ms per token, 5.76 tokens per second)
llama_print_timings: total time = 34068.72 ms / 206 tokens
- Compile with Intel MKL BLAS (`-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp`)
llama_print_timings: load time = 506.60 ms
llama_print_timings: sample time = 16.19 ms / 62 runs ( 0.26 ms per token, 3830.23 tokens per second)
llama_print_timings: prompt eval time = 26864.97 ms / 159 tokens ( 168.96 ms per token, 5.92 tokens per second)
llama_print_timings: eval time = 8180.71 ms / 61 runs ( 134.11 ms per token, 7.46 tokens per second)
llama_print_timings: total time = 45169.44 ms / 220 tokens
- Compile with Intel MKL BLAS (same as above, but also add `MKL_DIRECT_CALL` and `MKL_DIRECT_CALL_JIT`)
(No change compared to the test case above)
- Compile without Intel MKL BLAS (⭐ BEST PERFORMANCE)
llama_print_timings: load time = 505.97 ms
llama_print_timings: sample time = 28.74 ms / 111 runs ( 0.26 ms per token, 3862.48 tokens per second)
llama_print_timings: prompt eval time = 13399.88 ms / 159 tokens ( 84.28 ms per token, 11.87 tokens per second)
llama_print_timings: eval time = 13088.77 ms / 110 runs ( 118.99 ms per token, 8.40 tokens per second)
llama_print_timings: total time = 30486.68 ms / 269 tokens
- Purely for fun, with AVX and SSE both disabled:
llama_print_timings: load time = 1464.94 ms
llama_print_timings: sample time = 9.49 ms / 38 runs ( 0.25 ms per token, 4004.64 tokens per second)
llama_print_timings: prompt eval time = 87862.51 ms / 159 tokens ( 552.59 ms per token, 1.81 tokens per second)
llama_print_timings: eval time = 21428.97 ms / 37 runs ( 579.16 ms per token, 1.73 tokens per second)
llama_print_timings: total time = 229610.36 ms / 196 tokens
Possible problems
After some testing, what I can observe is:
- `cblas_sgemm` provided by Intel MKL does not use anything beyond AVX2 (my CPU does not support AVX-512, damn you Intel 12th gen)
- Our AVX implementation (`ggml_vec_dot_`) is already good enough
- We do, indeed, lose time on calling `cblas_sgemm`, because all the quantized weights must first be dequantized to float (`to_float`); see the sketch after this list
- Nevertheless, the good point is that the `-O3` optimization provided by the oneAPI compiler does give us some performance boost
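To make the dequantization point concrete, here is a simplified, self-contained sketch of the two paths. The helper names and the int8-per-row quantization scheme are made up for illustration (ggml's real block formats and `ggml_vec_dot_*` kernels are more involved); the point is that the BLAS route has to materialize a float copy of the weights before the multiply, while the direct route works on the quantized data and applies the scale at the end.

```c
// Simplified illustration of why calling BLAS on quantized weights costs extra.
// Hypothetical scheme: each weight row is int8 with one float scale per row.
#include <stdint.h>
#include <stdio.h>

// Path taken before a BLAS call: expand int8 weights into a float buffer.
static void dequantize_row(const int8_t *q, float scale, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = scale * (float)q[i];   // extra pass over memory + extra buffer
}

static float dot_f32(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

// Direct path (in the spirit of ggml_vec_dot_*): no float copy of the weights.
static float dot_q8_f32(const int8_t *q, float scale, const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += (float)q[i] * x[i];
    return scale * s;
}

int main(void) {
    enum { N = 8 };
    const int8_t w[N] = {1, -2, 3, -4, 5, -6, 7, -8};
    const float scale = 0.05f;
    const float x[N]  = {1, 1, 1, 1, 1, 1, 1, 1};

    float tmp[N];                       // the "to_float" staging buffer
    dequantize_row(w, scale, tmp, N);
    printf("dequantize + f32 dot: %f\n", dot_f32(tmp, x, N));
    printf("direct quantized dot: %f\n", dot_q8_f32(w, scale, x, N));
    return 0;
}
```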