Performance decreased between tags b1500 and b2581 on Windows ARM64 PC #6417
Comments
Which platform is that on? Snapdragon 8cx Gen 3 I presume? |
Hi LLAMA team, I use platform https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite. Thanks! |
If you want any chance of getting this fixed, do a git bisect to identify the exact commit that caused performance regression and notify the corresponding dev. |
Hi LLAMA team, I tried to do "git bisect" to find the root cause, but there are a huge number of patches between tags b1500 and b2581. Can you please help check and analyze this issue? Thanks! |
Download the model as the original weights and convert it from that at each git bisect iteration.
As I said, if you cannot identify the commit that is causing the performance regression this has basically no chance of getting fixed. This issue seems to be hardware-specific so without the corresponding hardware it is impossible to find the bad commit. If there was a performance regression for the hardware of any of the devs they would have already reported it. |
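For reference, a minimal bisect session for this case could look like the following (illustrative sketch; it assumes b1500 is the last known-fast tag, b2581 the first known-slow one, and that you rebuild and benchmark at each step):

```
git bisect start
git bisect bad b2581       # first tag known to be slow
git bisect good b1500      # last tag known to be fast
# build the commit git checks out, run the benchmark, then mark it:
#   git bisect good        # performance is still fine
#   git bisect bad         # performance has regressed
# repeat until git reports the first bad commit, then clean up:
git bisect reset
```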
I get significantly lower perf than expected on 8cx Gen 3 too, and can also get access to X Elite hardware. |
Hi LLAMA team, Is it related to the FP16 or NEON feature? Thanks! |
Hi LLAMA team, Did you find any more useful information for this issue? Thanks! |
As I've said two times already: without a git bisect from one of the affected people nothing is going to happen. |
Hi LLAMA team, I've found that commit "780e24a", between tags b1951 & b1952, caused the performance decrease on Windows ARM64 PC. Please keep helping on this issue. Thanks! [Commit] commit 780e24a
==============================================================================
Please refer to the test results below.
[Test results]
Command:
Prompt: I have 3 years of experience as a software developer. Now I got bored with coding and want to transition to another career. My education qualifications are B. Tech in computer science, and I am well-versed in understanding the business side of software as well. Suggest a list of career options that are easy for me to transition.
system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Tag b1951 results:
Tag b1952 results:
Tag b2581 default results:
Tag b2581 after removing commit 780e24a results: |
I am seeing virtually no difference with 6.6.19-1-MANJARO and a Ryzen 5950X. |
@Billzhong2022 what if you limit the cores used through |
@Billzhong2022 could you provide those testing results
|
Hi LLAMA team, Do you have a patch to fix this issue? Thanks! |
Hi LLAMA team, Any update? Did you use the macro "GGML_USE_OPENBLAS" for all of your modified code in commit 780e24a? @ReinForce-II Thanks! |
How about 1.? Thanks. |
Hi @ReinForce-II , How about 1.? ("better if you can run cmd /c start /b /affinity 0x1e main.exe -t 4 ... additionally") Please help on this issue with high priority. Thanks! |
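For context, a note on that command (my reading, not spelled out in the thread): 0x1e is a CPU affinity bitmask, binary 11110, i.e. logical cores 1-4, so it pins the four llama.cpp threads to a single 4-core cluster. Other masks can be tried the same way, for example:

```
rem Illustrative affinity masks for a 12-core machine (cores 0-11, assumed
rem here to be grouped into three 4-core clusters).
rem Cores 0-3:
cmd /c start /b /affinity 0x00F main.exe -t 4 ...
rem Cores 4-7:
cmd /c start /b /affinity 0x0F0 main.exe -t 4 ...
rem Cores 8-11:
cmd /c start /b /affinity 0xF00 main.exe -t 4 ...
```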
This issue might be specific to Qualcomm laptop platforms. I haven't been able to reproduce it on several other ARM-based platforms such as Ampere, Rockchip, and Graviton. Sorry, but I currently have no access to either an 8cx Gen 3 or an X Elite, so I have no idea about it for now. |
For reference: the X Elite has three 4-core clusters. The cores across the three clusters are identical, but the "lower power" cluster has half the fabric link of the other two and as such can only use half as much memory bandwidth. For applications that expect somewhat uniform throughput between the cores, that can cause a breakdown in expected versus realised performance. |
Hi @ReinForce-II , But after reverting commit 780e24a, the performance is much better on platform https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite. How do you explain that? Thanks! |
It might have something to do with the additional synchronization operations in that commit. |
Hi @ReinForce-II , Ok. Is it fixed? Thanks! |
Hi @ReinForce-II , Please help debug and fix this issue. Thank you very much! |
It would be a great help if you could kindly provide some sampling results from the Snapdragon Profiler. |
Hi, @Billzhong2022 Please take a look at 4ae60ad8 hope for your feedback. |
Hi @ReinForce-II , After applying patch 4ae60ad8, the llama2 performance is still not good on the Snapdragon X Elite device. Please dig into this issue further. Thanks! Logs: |
Hi @ReinForce-II , Can you please refine commit 780e24a directly? Thank you very much! |
@ReinForce-II |
In the previous code, ggml_compute_forward is called once in the GGML_TASK_INIT state by only one of the threads, while the remaining threads spin waiting for it to return. It is then called once in the GGML_TASK_COMPUTE state by all threads.
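A minimal sketch of that two-phase pattern (simplified, with hypothetical names; this is not the actual ggml code): one "leader" thread runs the INIT stage while the others spin on an atomic phase flag, then every thread runs the COMPUTE stage.

```c
#include <stdatomic.h>

enum task_phase { TASK_INIT, TASK_COMPUTE };

struct shared_state {
    atomic_int phase;   // current phase, advanced by the leader thread
    int        n_threads;
};

// placeholder for the per-node kernel (ggml_compute_forward in the real code)
static void compute_forward(enum task_phase phase, int ith, int nth) {
    (void) phase; (void) ith; (void) nth;
}

static void worker(struct shared_state * shared, int ith) {
    const int nth = shared->n_threads;

    if (ith == 0) {
        // only one thread runs the INIT stage ...
        compute_forward(TASK_INIT, ith, nth);
        // ... and then releases the other threads
        atomic_store(&shared->phase, TASK_COMPUTE);
    } else {
        // the remaining threads spin until INIT has finished
        while (atomic_load(&shared->phase) != TASK_COMPUTE) {
            // busy-wait (optionally yield the CPU here)
        }
    }

    // all threads take part in the COMPUTE stage
    compute_forward(TASK_COMPUTE, ith, nth);
}
```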
Thanks @ReinForce-II for your update!
|
@quic-zhanweiw I have checked the changes. It looks like they eliminate the extra sync stages (4 -> 2 per node) and run the init stage single-threaded. But I'm unable to determine from these changes why such a huge performance regression could happen. Could you provide a flamegraph with symbols, or something similar? Or consider examining the impact of applying only part of your changes.
|
@ReinForce-II |
@zhanweiw
|
Thanks @ReinForce-II very much! The patch below can fix the performance issue on WoA devices. May I get your support to mainline this change? Thanks in advance!
|
hmm, this would break support for ARMv8.0 devices (Snapdragon 835 and crew) on Windows 10
|
However, |
The command below works; may we add it to the 'README' for Windows on ARM compilation?
|
That's a solution without adding further complexity. It doesn't appear to be a long-term solution, though. |
Hi @ReinForce-II , As commit "780e24a22eb595b705cbe8284771e9ceff1c4dd2" is only for BLAS, what is the reason you didn't use a BLAS macro for the functions "ggml_graph_compute_thread_sync_node()" and "ggml_graph_compute_thread_sync_task()"? The function atomic_load() is called from both of them. Thanks! [Code snippets]
}

static void ggml_graph_compute_thread_sync_task(int * task_phase, struct ggml_compute_state * state, const bool do_yield) {
    while (true) {
        ...
    }
} |
@Billzhong2022 |
After enabling OpenBLAS, even with 'armv8.1' enabled, it gets stuck while handling the user input prompt; there is no response after several minutes. The 'prompt eval time' is very poor, but the 'eval time' is good. Any idea of the reason?
|
@zhanweiw eval time is not affected, since llama.cpp is not going to invoke the BLAS library for a mat mul with size < 32 (you have size 1 here) |
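A rough sketch of the gate being described (approximate; the function name and exact condition are simplified, not copied from ggml): BLAS is only used when every dimension of the mat mul reaches the threshold, so the size-1 matrix-vector products during token generation always stay on the quantized kernels.

```c
#include <stdint.h>
#include <stdbool.h>

// Simplified version of the dispatch rule described above: only hand a
// mat mul to BLAS when all of its dimensions reach the threshold.
static bool use_blas_for_mul_mat(int64_t rows, int64_t cols, int64_t inner) {
    return rows >= 32 && cols >= 32 && inner >= 32;
}
```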
After enabling LSE for both OpenBLAS & llama.cpp, the 'prompt eval time' is still poor. With LSE enabled in both OpenBLAS & llama.cpp:
With LSE enabled in llama.cpp, without OpenBLAS:
|
0.69 -> 4.32: it looks like the problem caused by the bus lock has gone. Please configure the number of threads allocated by OpenBLAS (you can use environment variables) plus the number allocated by llama.cpp (-t arg) to be no greater than the number of cores you have. BLAS libraries typically manage their own threads and spins; if you don't configure the thread counts carefully, it can put high pressure on context switching. P.S.: using a BLAS implementation is ultimately not likely to give good performance on Snapdragon platforms. It does the mat mul in single precision and performs better when a hardware sgemm accelerator (or something equivalent, such as a huge number of CPU cores) is present on the platform. |
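For example, a split like the following keeps the total at or below the 12 available cores (illustrative values; OPENBLAS_NUM_THREADS is OpenBLAS's own environment variable, -t is llama.cpp's thread count):

```
rem Give OpenBLAS 8 threads and llama.cpp 4, 12 in total on a 12-core machine.
set OPENBLAS_NUM_THREADS=8
main.exe -m llama-2-7b-chat.ggufv3.q4_0.bin -t 4 ...
```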
Thanks @ReinForce-II
It seems there is no benefit from enabling OpenBLAS for input prompt processing. |
@zhanweiw |
Thanks! Just tried 6+6, 4+6, and 6+4; almost the same result (12-core device):
|
Well, the platform may not provide much fp32 arithmetic power. How about 8+4, with 8 on the BLAS library? We'd better use llama-bench.exe to get some more detailed results. |
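Something along these lines would sweep several thread counts in one run (illustrative invocation; -t accepts a comma-separated list, -p and -n set the prompt and generation lengths):

```
llama-bench.exe -m llama-2-7b-chat.ggufv3.q4_0.bin -t 4,8,12 -p 128 -n 128
```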
From checking the OpenBLAS source code, the environment variable 'OPENBLAS_NUM_THREADS' seems to only work on Linux systems: |
It's my mistake, the environment variable is read here. |
But why, without OpenBLAS, can we get good performance (32 tokens/s) for 'prompt eval time'? |
Without OpenBLAS, you are running the dot products as quantized operations, not in fp32. |
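To illustrate the difference, a simplified sketch (not the actual ggml kernel or exact Q4_0 layout): the quantized path works on blocks of 4-bit integer weights with a per-block scale, so the inner loop is mostly cheap integer work instead of a full fp32 SGEMM.

```c
#include <stdint.h>

// Simplified Q4_0-style block: 32 weights stored as 4-bit values plus one
// per-block scale (illustrative; ggml stores the scale as fp16 and uses
// SIMD kernels in practice).
typedef struct {
    float   d;       // per-block scale
    uint8_t qs[16];  // 32 x 4-bit quantized weights, two per byte
} block_q4_0_ish;

// Dot product of one quantized block with 32 fp32 activations.
static float dot_q4_0_block(const block_q4_0_ish * b, const float * x) {
    float sum = 0.0f;
    for (int i = 0; i < 16; ++i) {
        const int w_lo = (b->qs[i] & 0x0F) - 8;  // low nibble  -> element i
        const int w_hi = (b->qs[i] >> 4)   - 8;  // high nibble -> element i + 16
        sum += w_lo * x[i] + w_hi * x[i + 16];
    }
    return sum * b->d;
}
```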
Got it. Thanks so much! |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Hi LLAMA team,
I use llama.cpp tag b2581 on a Windows ARM64 PC, and the performance is much lower than with the previous tag b1500. Please refer to the detailed information below. What is the reason? Please help on this issue.
Thanks a lot!
[Detailed information]
Command:
main.exe -m llama-2-7b-chat.ggufv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 10
Prompt: I have 3 years of experience as a software developer. Now I got bored with coding and want to transition to another career. My education qualifications are B. Tech in computer science, and I am well-versed in understanding the business side of software as well. Suggest a list of career options that are easy for me to transition.
system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Tag b1500 results:
llama_print_timings: load time = 723.53 ms
llama_print_timings: sample time = 925.29 ms / 624 runs ( 1.48 ms per token, 674.38 tokens per second)
llama_print_timings: prompt eval time = 2583.12 ms / 91 tokens ( 28.39 ms per token, 35.23 tokens per second)
llama_print_timings: eval time = 31693.17 ms / 625 runs ( 50.71 ms per token, 19.72 tokens per second)
llama_print_timings: total time = 51797.58 ms
Tag b2581 results:
llama_print_timings: load time = 963.25 ms
llama_print_timings: sample time = 416.14 ms / 586 runs ( 0.71 ms per token, 1408.17 tokens per second)
llama_print_timings: prompt eval time = 11847.94 ms / 94 tokens ( 126.04 ms per token, 7.93 tokens per second)
llama_print_timings: eval time = 68542.50 ms / 585 runs ( 117.17 ms per token, 8.53 tokens per second)
llama_print_timings: total time = 82696.57 ms / 679 tokens