Investigate: performance with Intel OneAPI (MKL) #5067

Closed
@ngxson

Description

Motivation

Following up on #4301, we're now able to compile llama.cpp with Intel's oneAPI compiler and to enable Intel MKL.

Basically, Intel MKL works by providing BLAS-style functions, for example cblas_sgemm, whose implementations contain Intel-specific optimized code.
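
For illustration, here is a minimal standalone CBLAS call (not code from llama.cpp; the matrix sizes are arbitrary):

#include <stdio.h>
#include <mkl.h>  // MKL's CBLAS interface; a generic BLAS would use <cblas.h>

int main(void) {
    // Computes C = alpha*A*B + beta*C on row-major, single-precision matrices.
    enum { M = 2, N = 2, K = 2 };
    float A[M*K] = {1, 2, 3, 4};
    float B[K*N] = {5, 6, 7, 8};
    float C[M*N] = {0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,   // alpha, A, lda
                      B, N,   // B, ldb
                0.0f, C, N);  // beta, C, ldc

    printf("%.0f %.0f %.0f %.0f\n", C[0], C[1], C[2], C[3]);  // 19 22 43 50
    return 0;
}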

In theory, that should give us better performance. In practice, however, it gives no performance boost (and even worse performance in some cases). For example, see: #4816

Benchmarks

NOTE:

  • My rig: Framework Laptop 13 | 32GB RAM | Intel 1260P
  • When compiling with Intel oneAPI, I always set the maximum optimization level: -ipo -O3 -static -fp-model=fast

Here are some performance tests using different compile configurations:

  • Baseline: Compile with gcc 11
llama_print_timings:        load time =     638.31 ms
llama_print_timings:      sample time =      20.02 ms /    48 runs   (    0.42 ms per token,  2397.24 tokens per second)
llama_print_timings: prompt eval time =   22078.89 ms /   159 tokens (  138.86 ms per token,     7.20 tokens per second)
llama_print_timings:        eval time =    8162.45 ms /    47 runs   (  173.67 ms per token,     5.76 tokens per second)
llama_print_timings:       total time =   34068.72 ms /   206 tokens
  • Compile with Intel MKL BLAS (-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp)
llama_print_timings:        load time =     506.60 ms
llama_print_timings:      sample time =      16.19 ms /    62 runs   (    0.26 ms per token,  3830.23 tokens per second)
llama_print_timings: prompt eval time =   26864.97 ms /   159 tokens (  168.96 ms per token,     5.92 tokens per second)
llama_print_timings:        eval time =    8180.71 ms /    61 runs   (  134.11 ms per token,     7.46 tokens per second)
llama_print_timings:       total time =   45169.44 ms /   220 tokens
  • Compile with Intel MKL BLAS (same as above, but also defining MKL_DIRECT_CALL and MKL_DIRECT_CALL_JIT; see the sketch after this list)

(No change compared to the test case above)

  • Compile without Intel MKL BLAS (⭐ BEST PERFORMANCE)
llama_print_timings:        load time =     505.97 ms
llama_print_timings:      sample time =      28.74 ms /   111 runs   (    0.26 ms per token,  3862.48 tokens per second)
llama_print_timings: prompt eval time =   13399.88 ms /   159 tokens (   84.28 ms per token,    11.87 tokens per second)
llama_print_timings:        eval time =   13088.77 ms /   110 runs   (  118.99 ms per token,     8.40 tokens per second)
llama_print_timings:       total time =   30486.68 ms /   269 tokens
  • Purely for fun, with AVX and SSE both disabled:
llama_print_timings:        load time =    1464.94 ms
llama_print_timings:      sample time =       9.49 ms /    38 runs   (    0.25 ms per token,  4004.64 tokens per second)
llama_print_timings: prompt eval time =   87862.51 ms /   159 tokens (  552.59 ms per token,     1.81 tokens per second)
llama_print_timings:        eval time =   21428.97 ms /    37 runs   (  579.16 ms per token,     1.73 tokens per second)
llama_print_timings:       total time =  229610.36 ms /   196 tokens
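
For context, MKL_DIRECT_CALL and MKL_DIRECT_CALL_JIT are Intel-documented preprocessor macros: when defined at compile time, small GEMM calls bypass MKL's generic dispatch layer, and the JIT variant additionally generates size-specialized kernels at runtime. Enabling them is just a compile-time define, along these lines (the build flags shown are illustrative):

/* Either pass -DMKL_DIRECT_CALL_JIT on the compiler command line,
 * or define the macro before including the MKL header: */
#define MKL_DIRECT_CALL_JIT
#include <mkl.h>
/* Subsequent cblas_sgemm calls on small matrices are now routed to
 * direct (JIT-compiled) kernels instead of MKL's dispatcher. Since
 * llama.cpp's GEMMs are large, it is plausible this makes no
 * measurable difference, which matches the result above. */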

Possible problems

After some testing, what we can observe is:

  • cblas_sgemm provided by Intel MKL does not use anything beyond AVX2 (my CPU does not support AVX-512; damn you, Intel 12th gen)
  • Our own AVX implementations (the ggml_vec_dot_ family of functions) are already good enough
  • We do, indeed, lose time on calling cblas_sgemm, because all the quantized weights must first be dequantized to float (to_float); see the sketch below
  • Nevertheless, the good news is that the -O3 optimization provided by the oneAPI compiler does give us some performance boost
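
To make the dequantization point concrete, here is a rough sketch of the extra "to_float" work the BLAS path performs before cblas_sgemm can even start (simplified and approximated from ggml's Q4_0 block format; the names and layout here are illustrative, not the exact ggml code). The non-BLAS path skips this entirely, because the ggml_vec_dot_ kernels consume the quantized blocks directly:

#include <stdint.h>
#include <stddef.h>

#define QK 32  // elements per quantization block (Q4_0 uses 32)

typedef struct {
    float   d;           // per-block scale factor
    uint8_t qs[QK / 2];  // 32 x 4-bit quants packed into 16 bytes
} block_q4_0;

// Expand one row of quantized blocks into plain floats. With the BLAS
// build, this runs over the whole weight matrix on every matmul before
// cblas_sgemm is called -- that is the overhead observed above.
static void dequantize_row(const block_q4_0 *x, float *y, size_t n) {
    for (size_t i = 0; i < n / QK; i++) {
        for (int j = 0; j < QK / 2; j++) {
            const int lo = (x[i].qs[j] & 0x0F) - 8;  // low nibble
            const int hi = (x[i].qs[j] >>   4) - 8;  // high nibble
            y[i*QK + j]        = lo * x[i].d;
            y[i*QK + j + QK/2] = hi * x[i].d;
        }
    }
}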
