Investigate: performance with Intel OneAPI (MKL) #5067
Comments
Could you post the numbers using #5045? You can also run
I'm interested in CPU prompt processing performance. In previous tests, I didn't see such a difference between `ggml_vec_dot*` and the BLAS routine. Could you provide more info about your model specs, etc.? P.S. several test results on llama7b:
@ggerganov Thanks for the suggestion. I updated my branch with the newest master to get #5045. Here is the benchmark using

@ReinForce-II Interesting... BTW, as far as I know, WSL2 does give good performance, but not the best (I remember 2 years ago I compiled the whole Android OS from source on WSL2, and the performance was terrible). Can you maybe try using a Windows build? (or bare-metal Linux). Also, I don't have an Intel desktop CPU, so I don't know whether it makes a difference on desktop or not...

My PC:
Command: (I did not specify …)

With GCC 11:

Intel oneapi compiler, with MKL BLAS, max optimization:

Intel oneapi compiler, without MKL BLAS, max optimization:
Edit: I did not record the performance before #5045, but MKL BLAS really does seem to give more performance (~8 t/s versus ~7 t/s).
I tried with a Windows build. It looks like, without MKL, it somehow performs worse compared to WSL2, and better than WSL2 with MKL. Maybe it's caused by awareness of P/E-core configurations on a bare-metal system?

*w/o MKL, Intel dpcpp compiler

*w/ MKL, Intel dpcpp compiler

Bare Linux performance is always terrible on my laptop, because the EC caps the CPU to a low frequency.
@ReinForce-II Thanks! Interesting to see such a difference. I suppose maybe Intel dpcpp on Windows is not as optimized as on Linux? That's just my guess anyway. Can you also test with max optimization? See the CMakeLists on my PR: https://github.com/ggerganov/llama.cpp/pull/5068/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20a

@ggerganov Just out of curiosity: I remember there was a tool to benchmark all ggml operations, but I cannot find it. Do you know where it is? (or maybe it was removed at some point?) Thank you.
@ngxson
I may need to update this investigation after #2690 is merged
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
This is really helpful.
Motivation
Following up on #4301, we're now able to compile llama.cpp using Intel's oneAPI compiler and also enable Intel MKL.
Basically, the way Intel MKL works is to provide BLAS-like functions, for example `cblas_sgemm`, which internally are implemented with Intel-specific code. In theory, that should give us better performance. However, in reality, it does not give any performance boost (and even gives worse performance in some cases). For example, see #4816.
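For context, this is roughly what a call through that BLAS interface looks like; the matrix sizes and values below are made up purely for illustration and are not taken from llama.cpp:

```c
// Single-precision GEMM through the CBLAS interface that MKL implements:
// C = alpha * A * B + beta * C, with row-major storage.
#include <stdio.h>
#include <mkl.h>   // with a non-MKL BLAS, <cblas.h> would be included instead

int main(void) {
    const int M = 2, N = 2, K = 3;
    float A[6] = {1, 2, 3, 4, 5, 6};   // M x K
    float B[6] = {1, 0, 0, 1, 1, 1};   // K x N
    float C[4] = {0};                  // M x N

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,            // lda = K (row-major, no transpose)
                      B, N,            // ldb = N
                0.0f, C, N);           // ldc = N

    for (int i = 0; i < M * N; i++) printf("%.1f ", C[i]);
    printf("\n");
    return 0;
}
```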
Benchmarks
NOTE: "max optimization" refers to `-ipo -O3 -static -fp-model=fast`

Here are some performance tests using different compile configurations:

- With MKL BLAS (`-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp`)
- With `MKL_DIRECT_CALL` and `MKL_DIRECT_CALL_JIT` (no change compared to the test case above; see the sketch after this list)
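The `MKL_DIRECT_CALL` / `MKL_DIRECT_CALL_JIT` macros follow MKL's documented direct-call mechanism: they must be visible before `mkl.h` is included, which is usually done on the compiler command line. A minimal sketch of the idea (the wrapper function is made up for illustration):

```c
// Enable MKL "direct call": for small problem sizes, cblas_sgemm can bypass
// the generic dispatch/error-checking path, and with the _JIT variant MKL may
// generate a specialized kernel at runtime. In practice the macro is usually
// passed as -DMKL_DIRECT_CALL_JIT rather than defined in source.
#define MKL_DIRECT_CALL_JIT
#include <mkl.h>

// Illustrative wrapper: n x n single-precision GEMM, C = A * B.
void small_sgemm(const float *A, const float *B, float *C, int n) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}
```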
Possible problems
After some testing, what I can observe is:

- `cblas_sgemm` provided by Intel MKL does not use anything other than AVX2 (my CPU does not support AVX-512, damn you Intel 12th gen)
- `ggml_vec_dot_*` is already good enough compared to `cblas_sgemm`, because for the BLAS path all the floats must be dequantized (to_float) first (see the sketch below)
- `-O3` provided by the oneapi compiler does give us some performance boost
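To illustrate the dequantization point, here is a rough sketch of the extra work the BLAS path implies for quantized weights. The `q_row_t` layout and `dequantize_row` helper are hypothetical stand-ins (a simple int8 + per-row scale format), not ggml's actual types or API:

```c
// Sketch: routing quantized weights through BLAS requires a "to_float" pass
// over the whole weight matrix before the kernel can run, whereas the
// ggml_vec_dot_* kernels consume the quantized blocks directly.
#include <stdint.h>
#include <stdlib.h>
#include <mkl.h>

enum { COLS = 4096 };

typedef struct {
    float  scale;       // per-row scale (hypothetical quantization format)
    int8_t q[COLS];     // quantized weights
} q_row_t;

static void dequantize_row(const q_row_t *src, float *dst) {
    for (int i = 0; i < COLS; i++) dst[i] = src->scale * (float)src->q[i];
}

// y = W * x, where W is stored quantized with `rows` rows of COLS columns.
void matvec_via_blas(const q_row_t *Wq, const float *x, float *y, int rows) {
    float *Wf = malloc((size_t)rows * COLS * sizeof(float));
    for (int r = 0; r < rows; r++)              // extra pass over all weights
        dequantize_row(&Wq[r], Wf + (size_t)r * COLS);
    cblas_sgemv(CblasRowMajor, CblasNoTrans, rows, COLS,
                1.0f, Wf, COLS, x, 1, 0.0f, y, 1);
    free(Wf);
}
```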