
Investigate: performance with Intel OneAPI (MKL) #5067

Closed
ngxson opened this issue Jan 22, 2024 · 11 comments
@ngxson
Collaborator

ngxson commented Jan 22, 2024

Motivation

Follow-up to #4301: we're now able to compile llama.cpp using Intel's oneAPI compiler and also enable Intel MKL.

Basically, Intel MKL works by providing standard BLAS functions, for example cblas_sgemm, whose implementations contain Intel-specific optimized code.
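
For illustration, here is a minimal sketch of what a call through that CBLAS interface looks like (the matrix sizes and values are made up; this is not code from llama.cpp):

```c
// Minimal sketch of the CBLAS interface that MKL implements; sizes/values are illustrative.
#include <mkl.h>   // MKL's CBLAS header; a generic <cblas.h> exposes the same function

int main(void) {
    const int M = 2, N = 3, K = 4;
    float A[2*4] = { 1, 2, 3, 4,   5, 6, 7, 8 };                // M x K
    float B[4*3] = { 1, 0, 0,  0, 1, 0,  0, 0, 1,  1, 1, 1 };   // K x N
    float C[2*3] = { 0 };                                       // M x N

    // C = 1.0 * A * B + 0.0 * C, row-major storage, no transposition
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,   // lda = K for row-major, non-transposed A
                      B, N,   // ldb = N
                0.0f, C, N);  // ldc = N
    return 0;
}
```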

In theory, that should give us better performance. In reality, however, it does not give any performance boost (and in some cases the performance is even worse). For example, see #4816

Benchmarks

NOTE:

  • My rig: Framework Laptop 13 | 32GB RAM | Intel 1260P
  • When compiling with Intel oneAPI, I always set the maximum optimization level: -ipo -O3 -static -fp-model=fast

Here are some performance tests using different compile configurations (a sketch of the corresponding build commands follows this list):

  • Baseline: Compile with gcc 11
llama_print_timings:        load time =     638.31 ms
llama_print_timings:      sample time =      20.02 ms /    48 runs   (    0.42 ms per token,  2397.24 tokens per second)
llama_print_timings: prompt eval time =   22078.89 ms /   159 tokens (  138.86 ms per token,     7.20 tokens per second)
llama_print_timings:        eval time =    8162.45 ms /    47 runs   (  173.67 ms per token,     5.76 tokens per second)
llama_print_timings:       total time =   34068.72 ms /   206 tokens
  • Compile with Intel MKL BLAS (-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp)
llama_print_timings:        load time =     506.60 ms
llama_print_timings:      sample time =      16.19 ms /    62 runs   (    0.26 ms per token,  3830.23 tokens per second)
llama_print_timings: prompt eval time =   26864.97 ms /   159 tokens (  168.96 ms per token,     5.92 tokens per second)
llama_print_timings:        eval time =    8180.71 ms /    61 runs   (  134.11 ms per token,     7.46 tokens per second)
llama_print_timings:       total time =   45169.44 ms /   220 tokens
  • Compile with Intel MKL BLAS (same as above, but also define MKL_DIRECT_CALL and MKL_DIRECT_CALL_JIT; one way to pass these macros is shown in the build sketch after this list)

(No change compared to the test case above)

  • Compile without Intel MKL BLAS (⭐ BEST PERFORMANCE)
llama_print_timings:        load time =     505.97 ms
llama_print_timings:      sample time =      28.74 ms /   111 runs   (    0.26 ms per token,  3862.48 tokens per second)
llama_print_timings: prompt eval time =   13399.88 ms /   159 tokens (   84.28 ms per token,    11.87 tokens per second)
llama_print_timings:        eval time =   13088.77 ms /   110 runs   (  118.99 ms per token,     8.40 tokens per second)
llama_print_timings:       total time =   30486.68 ms /   269 tokens
  • Purely for fun, with AVX and SSE both disabled:
llama_print_timings:        load time =    1464.94 ms
llama_print_timings:      sample time =       9.49 ms /    38 runs   (    0.25 ms per token,  4004.64 tokens per second)
llama_print_timings: prompt eval time =   87862.51 ms /   159 tokens (  552.59 ms per token,     1.81 tokens per second)
llama_print_timings:        eval time =   21428.97 ms /    37 runs   (  579.16 ms per token,     1.73 tokens per second)
llama_print_timings:       total time =  229610.36 ms /   196 tokens
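
As referenced above, here is a sketch of how these configurations can be built. It is only a sketch, assuming the oneAPI icx/icpx compilers, a default oneAPI install path, and CMake; the exact commands and flag plumbing may differ from what I actually used:

```sh
# Load the oneAPI environment (assumed default install location)
source /opt/intel/oneapi/setvars.sh

# oneAPI compiler + Intel MKL BLAS, with the optimization flags from the NOTE above
cmake -B build \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_C_FLAGS="-ipo -O3 -static -fp-model=fast" \
  -DCMAKE_CXX_FLAGS="-ipo -O3 -static -fp-model=fast" \
  -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp
cmake --build build -j

# Third configuration: one way to add MKL's direct-call macros is to append
# them to the flags above, e.g.
#   -DCMAKE_C_FLAGS="-ipo -O3 -static -fp-model=fast -DMKL_DIRECT_CALL -DMKL_DIRECT_CALL_JIT"

# "Without Intel MKL BLAS": drop the two LLAMA_BLAS options above
```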

Possible problems

After some testing, what I can observe is:

  • cblas_sgemm provided by Intel MKL does not use anything beyond AVX2 (my CPU does not support AVX-512, damn you Intel 12th gen)
  • Our AVX implementation (ggml_vec_dot_*) is already good enough
  • We do, indeed, lose time on calling cblas_sgemm, because all the weights must first be dequantized to float (to_float); see the sketch below
  • Nevertheless, the good news is that the -O3 optimization provided by the oneAPI compiler does give us some performance boost
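
To make the dequantization point concrete, here is a hypothetical sketch (not the actual ggml code; dequantize is a made-up stand-in for the real to_float conversion) of why the BLAS path pays an extra cost: the quantized weights must first be expanded into a temporary f32 buffer before cblas_sgemm can run, whereas the ggml_vec_dot_* kernels operate on the quantized data directly.

```c
#include <mkl.h>      // for cblas_sgemm (any CBLAS implementation would do)
#include <stdint.h>
#include <stdlib.h>

// Hypothetical stand-in for ggml's to_float conversion: here the "quantized"
// weights are just int8 values with a single scale, purely for illustration.
static void dequantize(const int8_t *q, float scale, float *dst, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = scale * (float)q[i];
}

// y = W * x, where W is quantized (rows x cols) and x, y are f32 vectors
static void matmul_via_blas(const int8_t *w_quant, float scale,
                            const float *x, float *y, int rows, int cols) {
    // 1) extra work + memory traffic: dequantize the whole weight matrix first
    float *w_f32 = malloc((size_t)rows * cols * sizeof(float));
    dequantize(w_quant, scale, w_f32, (size_t)rows * cols);

    // 2) only then can the BLAS routine run
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                rows, /*N=*/1, cols,
                1.0f, w_f32, cols,
                      x,     1,
                0.0f, y,     1);

    free(w_f32);
}
```

Step (1) is what the to_float conversion amounts to on this path, and it is work the non-BLAS kernels never do.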
@ggerganov
Owner

Could you post the numbers using: #5045

You can also run llama-bench for a more accurate benchmark

@ReinForce-II
Contributor

I'm interested in CPU prompt-processing performance. In previous tests, I didn't see such a difference between ggml_vec_dot* and the BLAS routines. Could you provide more info about your model specs, etc.?

PS: several test results on Llama 7B


  • platform: zephyrus 16 | 32g ddr5 4800 | intel 12700h
  • powercfg: windows controlled, balanced preset
  • os: wsl2 (ubuntu 22.04) on win11
  • oneapi toolset: 2024.*

  • baseline : compiled with gcc11

› ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 1,2,4,8,14 -n 0

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 1 | pp 512 | 5.68 ± 0.04 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | pp 512 | 11.09 ± 0.12 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 4 | pp 512 | 19.96 ± 0.29 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | pp 512 | 20.28 ± 0.21 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 14 | pp 512 | 23.86 ± 0.09 |

build: b2d80e10 (1950)

  • cfg1 : compiled with oneapi dpcpp, w/o mkl, optimization flags not modified

› ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 1,2,4,8,14 -n 0

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 1 | pp 512 | 5.89 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | pp 512 | 11.61 ± 0.17 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 4 | pp 512 | 21.20 ± 0.31 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | pp 512 | 21.70 ± 0.76 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 14 | pp 512 | 26.22 ± 0.13 |

build: b2d80e10 (1950)

  • cfg2 : compiled with oneapi dpcpp, w/ mkl -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp, optimization flags not modified

› ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 1,2,4,8,14 -n 0

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | BLAS | 1 | pp 512 | 23.46 ± 0.57 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | BLAS | 2 | pp 512 | 23.78 ± 0.45 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | BLAS | 4 | pp 512 | 20.97 ± 1.91 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | BLAS | 8 | pp 512 | 23.93 ± 0.77 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | BLAS | 14 | pp 512 | 24.27 ± 0.22 |

build: b2d80e10 (1950)

@ngxson
Collaborator Author

ngxson commented Jan 22, 2024

@ggerganov Thanks for the suggestion. I updated my branch with the newest master to get #5045. Here are the benchmarks using llama-bench.

@ReinForce-II Interesting... BTW, as far as I know, WSL2 does give good performance, but not the best (I remember compiling the whole Android OS from source on WSL2 two years ago; the performance was terrible). Can you maybe try a Windows build (or bare-metal Linux)?

Also, I don't have an Intel desktop CPU, so I don't know whether it makes a difference on desktop or not...

My PC:

  • Framework Laptop 13
  • CPU Intel 1260P
  • RAM: 32GB DDR4 dual channel 3200MHz
  • OS: Fedora Silverblue 39 - Linux kernel 6.6.8-200.fc39.x86_64

Command: ./llama-bench -m ../dolphin-2.0-mistral-7b.Q4_K_M.gguf

(I did not specify -t, because the 1- and 2-thread configs are slow and I ran out of patience)

With GCC 11:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | pp 512 | 6.87 ± 0.04 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | tg 128 | 6.07 ± 0.03 |

Intel oneAPI compiler, with MKL BLAS, max optimization (-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DLLAMA_NATIVE=ON)

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | BLAS | 4 | pp 512 | 8.59 ± 0.06 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | BLAS | 8 | pp 512 | 8.33 ± 0.42 |

Intel oneAPI compiler, without MKL BLAS, max optimization

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 4 | pp 512 | 10.87 ± 0.44 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU | 8 | pp 512 | 11.65 ± 0.01 |

Edit: I did not record the performance before #5045, but the MKL BLAS build really does seem to gain performance from it (~8 t/s versus ~7 t/s)

@ReinForce-II
Contributor

I tried a Windows build. It looks like, without MKL, it somehow performs worse than WSL2, while with MKL it performs better than WSL2. Maybe it's caused by awareness of the P/E-core configuration on a bare-metal system?
Anyway, it's a point that produces noticeable changes; I will have a deeper look into this.

*w/o mkl, intel dpcpp compiler

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 4 | pp 512 | 18.29 ± 1.66 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | pp 512 | 19.31 ± 0.53 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 14 | pp 512 | 23.99 ± 0.28 |

build: b2d80e10 (1950)

*w/ mkl, intel dpcpp compiler

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | BLAS | 4 | pp 512 | 30.04 ± 0.47 |

build: b2d80e10 (1950)

Bare-metal Linux performance is always terrible on my laptop, because the EC caps the CPU to a low frequency.

@ngxson
Collaborator Author

ngxson commented Jan 23, 2024

@ReinForce-II Thanks! Interesting to see such a difference. I suppose Intel dpcpp on Windows is maybe not as optimized as on Linux? That's just my guess anyway.

Can you also test with max optimization? See the CMakeLists on my PR: https://github.com/ggerganov/llama.cpp/pull/5068/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20a

@ggerganov Just out of curiosity: I remember there was a tool to benchmark all ggml operations, but I cannot find it. Do you know where it is? (Or maybe it was removed at some point?) Thank you.

@ggerganov
Owner

make -j tests && ./tests/test-backend-ops perf

@ReinForce-II
Contributor

@ngxson
I enabled the optimization flags that have an equivalent in the dpcpp compiler for Windows.
It got slightly faster:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 4 | pp 512 | 19.05 ± 0.15 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | pp 512 | 20.76 ± 0.73 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 14 | pp 512 | 25.17 ± 0.15 |

@ngxson
Collaborator Author

ngxson commented Jan 27, 2024

I may need to update this investigation after #2690 is merged

Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024
@zhouwg
Contributor

zhouwg commented Apr 26, 2024

make -j tests && ./tests/test-backend-ops perf

This is really helpful.
