Description
Motivation
Follow-up to #4301: we're now able to compile llama.cpp using Intel's oneAPI compiler and also enable Intel MKL.
Basically, the way Intel MKL works is to provide BLAS-like functions, for example `cblas_sgemm`, whose implementation contains Intel-specific optimized code.
In theory, that should give us better performance. In reality, however, it does not give any performance boost (and in some cases performance is even worse). For example, see #4816.
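For reference, here is a minimal sketch of what a call to such a BLAS-like function looks like. The matrices and sizes are toy values for illustration only, and it assumes a working MKL install (built with something like `icx sgemm_demo.c -qmkl`):

```c
// Minimal cblas_sgemm example: C = alpha*A*B + beta*C in single precision.
// Assumption: building against Intel MKL, e.g. `icx sgemm_demo.c -qmkl`.
#include <stdio.h>
#include <mkl_cblas.h>

int main(void) {
    // Row-major: A is 2x3, B is 3x2, so C is 2x2.
    const float A[6] = {1, 2, 3, 4, 5, 6};
    const float B[6] = {7, 8, 9, 10, 11, 12};
    float C[4] = {0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,        /* M, N, K       */
                1.0f, A, 3,     /* alpha, A, lda */
                B, 2,           /*        B, ldb */
                0.0f, C, 2);    /* beta,  C, ldc */

    printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```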
Benchmarks
NOTE:
- My rig: Framework Laptop 13 | 32GB RAM | Intel 1260P
- When compiling with Intel oneAPI, I always set the maximum optimization level: `-ipo -O3 -static -fp-model=fast` (see the example configure command below)
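In case it helps with reproducing the numbers, this is a sketch of the configure step I mean. It assumes oneAPI's `setvars.sh` lives at the default install path; the BLAS flags on the last `cmake` line apply only to the MKL builds listed below:

```sh
source /opt/intel/oneapi/setvars.sh          # assumption: default oneAPI install path
FLAGS="-ipo -O3 -static -fp-model=fast"
cmake -B build \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_C_FLAGS="$FLAGS" -DCMAKE_CXX_FLAGS="$FLAGS" \
  -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp   # only for the MKL runs
cmake --build build --config Release
```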
Here are some performance tests using different compile configurations:
- Baseline: Compile with gcc 11
llama_print_timings: load time = 638.31 ms
llama_print_timings: sample time = 20.02 ms / 48 runs ( 0.42 ms per token, 2397.24 tokens per second)
llama_print_timings: prompt eval time = 22078.89 ms / 159 tokens ( 138.86 ms per token, 7.20 tokens per second)
llama_print_timings: eval time = 8162.45 ms / 47 runs ( 173.67 ms per token, 5.76 tokens per second)
llama_print_timings: total time = 34068.72 ms / 206 tokens
- Compile with Intel MKL BLAS (`-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp`)
llama_print_timings: load time = 506.60 ms
llama_print_timings: sample time = 16.19 ms / 62 runs ( 0.26 ms per token, 3830.23 tokens per second)
llama_print_timings: prompt eval time = 26864.97 ms / 159 tokens ( 168.96 ms per token, 5.92 tokens per second)
llama_print_timings: eval time = 8180.71 ms / 61 runs ( 134.11 ms per token, 7.46 tokens per second)
llama_print_timings: total time = 45169.44 ms / 220 tokens
- Compile with Intel MKL BLAS (same as above, but also add `MKL_DIRECT_CALL` and `MKL_DIRECT_CALL_JIT`)
(No change compared to the test case above)
- Compile without Intel MKL BLAS (⭐ BEST PERFORMANCE)
llama_print_timings: load time = 505.97 ms
llama_print_timings: sample time = 28.74 ms / 111 runs ( 0.26 ms per token, 3862.48 tokens per second)
llama_print_timings: prompt eval time = 13399.88 ms / 159 tokens ( 84.28 ms per token, 11.87 tokens per second)
llama_print_timings: eval time = 13088.77 ms / 110 runs ( 118.99 ms per token, 8.40 tokens per second)
llama_print_timings: total time = 30486.68 ms / 269 tokens
- Purely for fun, with AVX and SSE both disabled:
llama_print_timings: load time = 1464.94 ms
llama_print_timings: sample time = 9.49 ms / 38 runs ( 0.25 ms per token, 4004.64 tokens per second)
llama_print_timings: prompt eval time = 87862.51 ms / 159 tokens ( 552.59 ms per token, 1.81 tokens per second)
llama_print_timings: eval time = 21428.97 ms / 37 runs ( 579.16 ms per token, 1.73 tokens per second)
llama_print_timings: total time = 229610.36 ms / 196 tokens
Possible problems
After some testing, what I can observe is:
- `cblas_sgemm` provided by Intel MKL does not use anything beyond AVX2 (my CPU does not support AVX-512, damn you Intel 12th gen)
- Our AVX implementation (`ggml_vec_dot_`) is already good enough
- We do, indeed, lose time on calling `cblas_sgemm`, because all the quantized weights must first be dequantized to float (`to_float`); see the sketch after this list
- Nevertheless, the good point is that the `-O3` optimization provided by the oneAPI compiler does give us some performance boost
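To make the dequantization point concrete, here is a simplified, self-contained sketch of the two paths. The helper names and the int8-per-row quantization scheme are made up for illustration (ggml's real block formats and `ggml_vec_dot_*` kernels are more involved); the point is that the BLAS route has to materialize a float copy of the weights before the multiply, while the direct route works on the quantized data and applies the scale at the end.

```c
// Simplified illustration of why calling BLAS on quantized weights costs extra.
// Hypothetical scheme: each weight row is int8 with one float scale per row.
#include <stdint.h>
#include <stdio.h>

// Path taken before a BLAS call: expand int8 weights into a float buffer.
static void dequantize_row(const int8_t *q, float scale, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = scale * (float)q[i];   // extra pass over memory + extra buffer
}

static float dot_f32(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

// Direct path (in the spirit of ggml_vec_dot_*): no float copy of the weights.
static float dot_q8_f32(const int8_t *q, float scale, const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += (float)q[i] * x[i];
    return scale * s;
}

int main(void) {
    enum { N = 8 };
    const int8_t w[N] = {1, -2, 3, -4, 5, -6, 7, -8};
    const float scale = 0.05f;
    const float x[N]  = {1, 1, 1, 1, 1, 1, 1, 1};

    float tmp[N];                       // the "to_float" staging buffer
    dequantize_row(w, scale, tmp, N);
    printf("dequantize + f32 dot: %f\n", dot_f32(tmp, x, N));
    printf("direct quantized dot: %f\n", dot_q8_f32(w, scale, x, N));
    return 0;
}
```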