Performance decreased between tag b1500 and b2581 on Windows ARM64 PC #6417

Closed
Billzhong2022 opened this issue Apr 1, 2024 · 54 comments
Labels
enhancement New feature or request stale

Comments

@Billzhong2022

Hi LLAMA team,

I am using llama.cpp tag b2581 on a Windows ARM64 PC, and the performance is much lower than with the earlier tag b1500. Please refer to the detailed information below. What is the reason? Please help with this issue.

Thanks a lot!

[Detailed information]

Command:
main.exe -m llama-2-7b-chat.ggufv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 10

Prompt: I have 3 years of experience as a software developer. Now I got bored with coding and want to transition to another career. My education qualifications are B. Tech in computer science, and I am well-versed in understanding the business side of software as well. Suggest a list of career options that are easy for me to transition.

system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

Tag b1500 results:
llama_print_timings: load time = 723.53 ms
llama_print_timings: sample time = 925.29 ms / 624 runs ( 1.48 ms per token, 674.38 tokens per second)
llama_print_timings: prompt eval time = 2583.12 ms / 91 tokens ( 28.39 ms per token, 35.23 tokens per second)
llama_print_timings: eval time = 31693.17 ms / 625 runs ( 50.71 ms per token, 19.72 tokens per second)
llama_print_timings: total time = 51797.58 ms

Tag b2581 results:
llama_print_timings: load time = 963.25 ms
llama_print_timings: sample time = 416.14 ms / 586 runs ( 0.71 ms per token, 1408.17 tokens per second)
llama_print_timings: prompt eval time = 11847.94 ms / 94 tokens ( 126.04 ms per token, 7.93 tokens per second)
llama_print_timings: eval time = 68542.50 ms / 585 runs ( 117.17 ms per token, 8.53 tokens per second)
llama_print_timings: total time = 82696.57 ms / 679 tokens
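
The tokens-per-second figures in the two logs above follow directly from the elapsed time and the run count, so the size of the regression can be checked by hand. Below is a minimal sketch in C; the helper names are hypothetical and are not part of llama.cpp.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of llama.cpp): throughput in tokens per
 * second, given an elapsed time in milliseconds and a token count. */
static double tokens_per_second(double elapsed_ms, int n_tokens) {
    assert(elapsed_ms > 0.0);
    return (double) n_tokens / (elapsed_ms / 1000.0);
}

/* Hypothetical helper: how many times slower the newer build is. */
static double slowdown(double old_tps, double new_tps) {
    assert(new_tps > 0.0);
    return old_tps / new_tps;
}

/* Prints the eval throughput of both tags using the timings quoted above. */
static void print_eval_comparison(void) {
    const double b1500 = tokens_per_second(31693.17, 625);
    const double b2581 = tokens_per_second(68542.50, 585);
    printf("b1500: %.2f t/s, b2581: %.2f t/s, slowdown: %.2fx\n",
           b1500, b2581, slowdown(b1500, b2581));
}
```

With the eval timings above, this works out to token generation being about 2.3 times slower on b2581.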

@Billzhong2022 Billzhong2022 added the enhancement New feature or request label Apr 1, 2024
@woachk
Contributor

woachk commented Apr 1, 2024

Which platform is that on? Snapdragon 8cx Gen 3 I presume?

@Billzhong2022
Author

@JohannesGaessler
Collaborator

If you want any chance of getting this fixed, do a git bisect to identify the exact commit that caused performance regression and notify the corresponding dev.

@Billzhong2022
Author

Hi LLAMA team,

I tried "git bisect" to find the root cause, but a huge number of patches were added between tags b1500 and b2581.

Can you please help check and analyze this issue?

Thanks!

@JohannesGaessler
Collaborator

> I tried "git bisect" to find the root cause, but a huge number of patches were added between tags b1500 and b2581.

Download the model as the original weights and convert it from that at each git bisect iteration.

> Can you please help check and analyze this issue?

As I said, if you cannot identify the commit that is causing the performance regression this has basically no chance of getting fixed. This issue seems to be hardware-specific so without the corresponding hardware it is impossible to find the bad commit. If there was a performance regression for the hardware of any of the devs they would have already reported it.

@woachk
Contributor

woachk commented Apr 3, 2024

I get significantly lower perf than expected on 8cx Gen 3 too, and can also get access to X Elite hardware.

@Billzhong2022
Author

Hi LLAMA team,

Is it related to the FP16 or NEON features?

Thanks!

@Billzhong2022
Author

Hi LLAMA team,

Have you found any more useful information on this issue?

Thanks!

@JohannesGaessler
Collaborator

JohannesGaessler commented Apr 7, 2024

As I've said two times already: without a git bisect from one of the affected people nothing is going to happen.

@Billzhong2022
Author

Billzhong2022 commented Apr 8, 2024

Hi LLAMA team,

I've found that commit "780e24a", between tags b1951 and b1952, caused the performance decrease on the Windows ARM64 PC.

Please continue to help with this issue.

Thanks!

[Commit]

commit 780e24a
Author: Reinforce-II <fate@eastal.com>
Date: Mon Jan 22 21:15:08 2024 +0800

ggml : parallelize FP32 conversion when using BLAS (#5045)

* make GGML_TASK_INIT phase can be run in multithread

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

==============================================================================

Please refer to the test results below.

[Test results]

Command:
main.exe -m llama-2-7b-chat.ggufv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 10

Prompt: I have 3 years of experience as a software developer. Now I got bored with coding and want to transition to another career. My education qualifications are B. Tech in computer science, and I am well-versed in understanding the business side of software as well. Suggest a list of career options that are easy for me to transition.

system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

Tag b1951 results:
llama_print_timings: load time = 839.94 ms
llama_print_timings: sample time = 895.91 ms / 568 runs ( 1.58 ms per token, 633.99 tokens per second)
llama_print_timings: prompt eval time = 2770.04 ms / 93 tokens ( 29.79 ms per token, 33.57 tokens per second)
llama_print_timings: eval time = 28958.15 ms / 568 runs ( 50.98 ms per token, 19.61 tokens per second)
llama_print_timings: total time = 43121.95 ms / 661 tokens

Tag b1952 results:
llama_print_timings: load time = 1381.51 ms
llama_print_timings: sample time = 1033.81 ms / 552 runs ( 1.87 ms per token, 533.95 tokens per second)
llama_print_timings: prompt eval time = 9733.86 ms / 93 tokens ( 104.67 ms per token, 9.55 tokens per second)
llama_print_timings: eval time = 69562.43 ms / 552 runs ( 126.02 ms per token, 7.94 tokens per second)
llama_print_timings: total time = 83113.79 ms / 645 tokens

Tag b2581 default results:
llama_print_timings: load time = 963.25 ms
llama_print_timings: sample time = 416.14 ms / 586 runs ( 0.71 ms per token, 1408.17 tokens per second)
llama_print_timings: prompt eval time = 11847.94 ms / 94 tokens ( 126.04 ms per token, 7.93 tokens per second)
llama_print_timings: eval time = 68542.50 ms / 585 runs ( 117.17 ms per token, 8.53 tokens per second)
llama_print_timings: total time = 82696.57 ms / 679 tokens

Tag b2581 after removing commit 780e24a results:
llama_print_timings: load time = 1055.60 ms
llama_print_timings: sample time = 436.21 ms / 621 runs ( 0.70 ms per token, 1423.64 tokens per second)
llama_print_timings: prompt eval time = 18649.48 ms / 94 tokens ( 198.40 ms per token, 5.04 tokens per second)
llama_print_timings: eval time = 32446.21 ms / 620 runs ( 52.33 ms per token, 19.11 tokens per second)
llama_print_timings: total time = 53483.55 ms / 714 tokens

@JohannesGaessler
Collaborator

I am seeing virtually no difference with 6.6.19-1-MANJARO and a Ryzen 5950X.

@woachk
Contributor

woachk commented Apr 8, 2024

@Billzhong2022 what if you limit the cores used through -C to the two "big" complexes?

@ReinForce-II
Contributor

ReinForce-II commented Apr 8, 2024

@Billzhong2022 could you provide test results for the following:

  1. running on a single cluster, e.g. taskset -c 0-3 <...>, or something equivalent on Windows
  2. using -t 4, and setting an environment variable such as OPENBLAS_NUM_THREADS=4 if you are using the BLAS backend

@Billzhong2022
Author

Hi LLAMA team,

Do you have a patch to fix this issue?

Thanks!

@Billzhong2022
Author

Billzhong2022 commented Apr 14, 2024

Hi LLAMA team,

Any update? Did you guard all of the code you modified in commit 780e24a with the macro "GGML_USE_OPENBLAS"?

@ReinForce-II
With -t 4, the performance is not good.

Thanks!

@ReinForce-II
Contributor

> Hi LLAMA team,
>
> Any update? Did you guard all of the code you modified in commit 780e24a with the macro "GGML_USE_OPENBLAS"?
>
> @ReinForce-II With -t 4, the performance is not good.
>
> Thanks!

How about test 1?
You can run cmd /c start /b /affinity 0xf main.exe -t 4 ... in PowerShell. Even better if you can additionally run cmd /c start /b /affinity 0x1e main.exe -t 4 ...

thanks.
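
The /affinity values in the commands above are hexadecimal bitmasks in which bit i selects logical core i. A minimal sketch of how such a mask can be built for a run of consecutive cores follows; the helper name is hypothetical, and which core numbers belong to which cluster on a given device is an assumption that needs to be checked on the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build an affinity bitmask that selects `count`
 * consecutive logical cores starting at `first_core` (bit i = core i). */
static uint64_t affinity_mask(unsigned first_core, unsigned count) {
    assert(count >= 1 && first_core + count <= 64);
    /* Special-case 64 because shifting a 64-bit value by 64 is undefined. */
    const uint64_t low_bits = (count == 64) ? UINT64_MAX
                                            : ((UINT64_C(1) << count) - 1);
    return low_bits << first_core;
}
```

Here affinity_mask(0, 4) yields 0xf (cores 0 to 3) and affinity_mask(1, 4) yields 0x1e (cores 1 to 4), matching the two masks suggested above.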

@Billzhong2022
Author

Hi @ReinForce-II ,

> How about test 1?
> You can run cmd /c start /b /affinity 0xf main.exe -t 4 ... in PowerShell.

Answer: The performance is very, very bad.

> Even better if you can additionally run cmd /c start /b /affinity 0x1e main.exe -t 4 ...

Answer: The performance is bad.

Please help with this issue with high priority.

Thanks!

@ReinForce-II
Contributor

> Hi @ReinForce-II ,
>
> How about test 1? You can run cmd /c start /b /affinity 0xf main.exe -t 4 ... in PowerShell. Answer: The performance is very, very bad.
>
> Even better if you can additionally run cmd /c start /b /affinity 0x1e main.exe -t 4 ... Answer: The performance is bad.
>
> Please help with this issue with high priority.
>
> Thanks!

This issue might be specific to Qualcomm laptop platforms. I haven't been able to reproduce it on several other Arm-based platforms such as Ampere, Rockchip, and Graviton. Sorry, but I currently have no access to either the 8cx Gen 3 or the X Elite, so I have no further ideas for now.

@woachk
Contributor

woachk commented Apr 16, 2024

For reference: the X Elite has 3 4-core clusters.

The cores across the 3 clusters are identical but the "lower power" cluster has 1/2 the link to fabric as the two other ones, and as such can only use half as much memory bandwidth.

For applications that expect somewhat uniform throughput between the cores, that can cause a breakdown in expected versus realised performance.

@Billzhong2022
Author

Hi @ReinForce-II ,

But after reverting commit 780e24a, the performance is much better on the platform https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite. How do you explain that?

Thanks!

@ReinForce-II
Contributor

> Hi @ReinForce-II ,
>
> But after reverting commit 780e24a, the performance is much better on the platform https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite. How do you explain that?
>
> Thanks!

It might have something to do with the additional synchronization operations in the commit.

@Billzhong2022
Author

Hi @ReinForce-II ,

Ok. Is it fixed?

Thanks!

@Billzhong2022
Author

Hi @ReinForce-II ,

Please help debug and fix this issue.

Thank you very much!

@ReinForce-II
Contributor

> Hi @ReinForce-II ,
>
> Please help debug and fix this issue.
>
> Thank you very much!

It would be a great help if you could provide some sampling results from Snapdragon Profiler.
Otherwise, it is quite likely that progress will stall here until X Elite products become publicly available.

@ReinForce-II
Contributor

Hi, @Billzhong2022

Please take a look at 4ae60ad8.
The commit is not specific to Snapdragon devices, but it might also alleviate your problem.

I look forward to your feedback.

@Billzhong2022
Author

Hi @ReinForce-II ,

After applying patch 4ae60ad8, the llama2 performance is still not good on the Snapdragon X Elite device.

Please investigate this issue further.

Thanks!

Logs:
llama_print_timings: load time = 3453.65 ms
llama_print_timings: sample time = 372.97 ms / 658 runs ( 0.57 ms per token, 1764.23 tokens per second)
llama_print_timings: prompt eval time = 15051.02 ms / 94 tokens ( 160.12 ms per token, 6.25 tokens per second)
llama_print_timings: eval time = 115842.64 ms / 657 runs ( 176.32 ms per token, 5.67 tokens per second)
llama_print_timings: total time = 143545.05 ms / 751 tokens

@Billzhong2022
Author

Hi @ReinForce-II ,

Can you please refine commit 780e24a directly?

Thank you very much!

@quic-zhanweiw

@ReinForce-II
It seems the performance issue occurs because the function 'ggml_compute_forward()' is called twice in the function 'ggml_graph_compute_thread()'. It was called only once in the previous code. Can we avoid this? Thanks in advance!
By the way, I haven't enabled the BLAS feature.

@ReinForce-II
Contributor

> @ReinForce-II It seems the performance issue occurs because the function 'ggml_compute_forward()' is called twice in the function 'ggml_graph_compute_thread()'. It was called only once in the previous code. Can we avoid this? Thanks in advance! By the way, I haven't enabled the BLAS feature.

In the previous code, ggml_compute_forward is called once in the GGML_TASK_INIT state by only one of the threads, while the remaining threads spin-wait until it returns. It is then called once in the GGML_TASK_COMPUTE state by all threads.
After the change, both GGML_TASK_INIT and GGML_TASK_COMPUTE are called by all threads. In the GGML_TASK_INIT state, if you are running without the BLAS feature, one of the threads does the work, while the remaining threads return immediately and spin-wait for it.
I think there is no difference in this part.
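
The hand-off described above, where threads with nothing left to do spin until another thread publishes the next phase, can be sketched with C11 atomics. This is a simplified illustration under stated assumptions, not the ggml implementation: the struct and function names are hypothetical, and only the countdown and spin-wait pattern is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: a countdown of threads still working on the
 * current phase, plus the phase that every thread should be in. */
struct shared_state {
    atomic_int n_active;
    atomic_int phase;
};

/* Called by each thread when it finishes its part of a phase. The last
 * thread to arrive resets the countdown and publishes the next phase.
 * Returns true if the caller was that last thread. */
static bool finish_phase(struct shared_state *s, int n_threads, int next_phase) {
    assert(n_threads > 0);
    if (atomic_fetch_sub(&s->n_active, 1) == 1) {
        atomic_store(&s->n_active, n_threads);
        atomic_store(&s->phase, next_phase);
        return true;
    }
    return false;
}

/* Called by every other thread: spin until the published phase differs
 * from the one this thread last saw, then return the new phase. */
static int wait_for_phase_change(struct shared_state *s, int last_phase) {
    int phase;
    do {
        phase = atomic_load(&s->phase);
    } while (phase == last_phase);
    return phase;
}
```

Every phase costs each thread at least one atomic read-modify-write plus a spin on a shared variable, so adding phases per graph node multiplies the number of atomic operations, and their cost depends on how the compiler and hardware implement them.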

@quic-zhanweiw

Thanks @ReinForce-II for your update!
In the latest llama.cpp code, after adding the patch below, the performance issue disappears. Could you help check the reason?

diff --git a/ggml.c b/ggml.c
index b96a82a4..f71369f6 100644
--- a/ggml.c
+++ b/ggml.c
@@ -19494,13 +19494,13 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
 
                 params.nth = n_tasks;
 
-                if (n_tasks == 1) {
-                    /* INIT */
-                    if (GGML_OP_HAS_INIT[node->op]) {
-                        params.type = GGML_TASK_TYPE_INIT;
-                        ggml_compute_forward(&params, node);
-                    }
+                /* INIT */
+                if (GGML_OP_HAS_INIT[node->op]) {
+                    params.type = GGML_TASK_TYPE_INIT;
+                    ggml_compute_forward(&params, node);
+                }
 
+                if (n_tasks == 1) {
                     // TODO: maybe push node_n to the atomic but if other threads see n_tasks is 1,
                     // they do something more efficient than spinning (?)
                     params.type = GGML_TASK_TYPE_COMPUTE;
@@ -19524,10 +19524,10 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
             task_phase = GGML_TASK_TYPE_INIT;
             atomic_store(&state->shared->n_active,  n_threads);
             atomic_store(&state->shared->node_n,    node_n);
-            atomic_store(&state->shared->node_task, task_phase);
+            // atomic_store(&state->shared->node_task, task_phase);
         } else {
             ggml_graph_compute_thread_sync_node(&node_n,     state, false);
-            ggml_graph_compute_thread_sync_task(&task_phase, state, false);
+            // ggml_graph_compute_thread_sync_task(&task_phase, state, false);
         }
 
         // check if we should stop
@@ -19538,13 +19538,17 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         const int n_tasks = ggml_get_n_tasks(node, n_threads, state->shared->n_threads);
 
         struct ggml_compute_params params = {
-            /*.type  =*/ GGML_TASK_TYPE_INIT,
+            /*.type  =*/ GGML_TASK_TYPE_COMPUTE,
             /*.ith   =*/ state->ith,
             /*.nth   =*/ n_tasks,
             /*.wsize =*/ cplan->work_size,
             /*.wdata =*/ cplan->work_data,
         };
 
+        if (state->ith < n_tasks) {
+            ggml_compute_forward(&params, node);
+        }
+/*
         if (state->ith < n_tasks) {
             if (GGML_OP_HAS_INIT[node->op]) {
                 ggml_compute_forward(&params, node);
@@ -19579,6 +19583,7 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         else {
             ggml_graph_compute_thread_sync_task(&task_phase, state, false);
         }
+*/
     }
 
     return 0;

@ReinForce-II
Contributor

ReinForce-II commented May 20, 2024

@quic-zhanweiw
Thanks for looking into the problem.

I have checked the changes. It looks like they eliminate the extra sync stages (4 -> 2 per node) and run the init stage single-threaded. But I can't determine from these changes why such a huge performance regression would happen.

Could you provide a flamegraph with symbols, or something equivalent?

Or consider examining the impact of applying only part of your changes.

From 48283bb99ffd21fcd1637b4116d81c801506fbcb Mon Sep 17 00:00:00 2001
From: Reinforce-II <fate@eastal.com>
Date: Mon, 20 May 2024 05:51:33 +0000
Subject: [PATCH] -

---
 ggml.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ggml.c b/ggml.c
index 53da231e..949d4e33 100644
--- a/ggml.c
+++ b/ggml.c
@@ -19834,6 +19834,7 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
             ggml_compute_forward(&params, node, state);
         }

+        /*
         if (atomic_fetch_sub(&state->shared->n_active, 1) == 1) {
             task_phase = GGML_TASK_TYPE_FINALIZE;
             atomic_store(&state->shared->n_active,  n_threads);
@@ -19842,6 +19843,7 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         else {
             ggml_graph_compute_thread_sync_task(&task_phase, state, false);
         }
+        */
     }

     return 0;
--
2.32.0.windows.2
From a21f0f623d2ee1460e892803c2d83fa14f7acd20 Mon Sep 17 00:00:00 2001
From: Reinforce-II <fate@eastal.com>
Date: Mon, 20 May 2024 05:52:12 +0000
Subject: [PATCH] -

---
 ggml.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/ggml.c b/ggml.c
index 53da231e..0642f57c 100644
--- a/ggml.c
+++ b/ggml.c
@@ -19820,13 +19820,7 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
             atomic_store(&state->shared->node_task, task_phase);
         }
         else {
-            // TODO: this sched_yield can have significant impact on the performance - either positive or negative
-            //       depending on the workload and the operating system.
-            //       since it is not clear what is the best approach, it should potentially become user-configurable
-            //       ref: https://github.com/ggerganov/ggml/issues/291
-            // UPD:  adding the do_yield flag seems to resolve the issue universally
-            const bool do_yield = node_n < 0 || cgraph->nodes[node_n]->op == GGML_OP_MUL_MAT;
-            ggml_graph_compute_thread_sync_task(&task_phase, state, do_yield);
+            ggml_graph_compute_thread_sync_task(&task_phase, state, false);
         }

         if (state->ith < n_tasks) {
--
2.32.0.windows.2

@zhanweiw

[Image: llama]

@ReinForce-II
Your patch doesn't fix this issue.
According to the picture, most of the time is spent in the 'atomic_load' function.

@ReinForce-II
Contributor

ReinForce-II commented May 20, 2024

@zhanweiw
Thank you, the picture is a great help!
It looks like the LSE atomics extension is not enabled or not used. Please check the compiler options used with MSVC.

link

The default is /arch:armv8.0
For example, /arch:armv8.1 allows the _Interlocked* intrinsic functions to use the appropriate atomic instruction that was introduced with the ARMv8.1 extension, FEAT_LSE, but compiler support requires Visual Studio 2022 version 17.2 or later.

@quic-zhanweiw

Thanks very much, @ReinForce-II!

The patch below fixes the performance issue on WoA devices. Could I get your support to mainline this change? Thanks in advance!

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9cc60039..a365a40b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -999,6 +999,7 @@ if (CMAKE_OSX_ARCHITECTURES STREQUAL "arm64" OR CMAKE_GENERATOR_PLATFORM_LWR STR
         add_compile_definitions(__aarch64__) # MSVC defines _M_ARM64 instead
         add_compile_definitions(__ARM_NEON)
         add_compile_definitions(__ARM_FEATURE_FMA)
+        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:armv8.1")

         set(CMAKE_REQUIRED_FLAGS_PREV ${CMAKE_REQUIRED_FLAGS})
         string(JOIN " " CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS} "/arch:armv8.2")

@woachk
Contributor

woachk commented May 21, 2024

hmm, this would break support for ARMv8.0 devices (Snapdragon 835 and crew) on Windows 10

IsProcessorFeaturePresent(PF_ARM_V81_ATOMIC_INSTRUCTIONS_AVAILABLE) is how to do a feature check if support for those doesn't matter
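
A minimal sketch of the runtime check mentioned above is shown below. IsProcessorFeaturePresent and PF_ARM_V81_ATOMIC_INSTRUCTIONS_AVAILABLE come from the Windows SDK; the wrapper name is hypothetical, and the non-Windows fallback that simply reports "not available" is an assumption made so the sketch compiles elsewhere.

```c
#include <stdbool.h>

#if defined(_WIN32)
#include <windows.h>
#endif

/* Hypothetical wrapper: returns true when Windows reports that the ARMv8.1
 * atomic (LSE) instructions are available. On other platforms, or with an
 * SDK that lacks the constant, this sketch has no detection path and
 * conservatively returns false (an assumption for illustration only). */
static bool cpu_has_lse_atomics(void) {
#if defined(_WIN32) && defined(PF_ARM_V81_ATOMIC_INSTRUCTIONS_AVAILABLE)
    return IsProcessorFeaturePresent(PF_ARM_V81_ATOMIC_INSTRUCTIONS_AVAILABLE) != 0;
#else
    return false;
#endif
}
```

A launcher could call this at startup to choose between a binary built with /arch:armv8.0 and one built with /arch:armv8.1. That selection step is a hypothetical design, not something this thread shows llama.cpp doing.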

@ReinForce-II
Contributor

However, IsProcessorFeaturePresent is a runtime check. Could we consider adding an option to build multiple variants of the executable, akin to the x64 approach?

@quic-zhanweiw

The command below works. Could we add it to the README, for compiling on Windows on ARM?

cmake .. -A ARM64 -D CMAKE_CXX_FLAGS="/arch:armv8.1"

@ReinForce-II
Contributor

That's a solution that adds no further complexity, though it does not appear to be a long-term one.
How about we ask the maintainers for their opinions?

@Billzhong2022
Author

Billzhong2022 commented May 22, 2024

Hi @ReinForce-II ,

Since commit "780e24a22eb595b705cbe8284771e9ceff1c4dd2" is only for BLAS, why didn't you guard the functions "ggml_graph_compute_thread_sync_node()" and "ggml_graph_compute_thread_sync_task()" with a BLAS macro? Both of them call atomic_load().

Thanks!

[Code snippets]
static void ggml_graph_compute_thread_sync_node(int * node_n, struct ggml_compute_state * state, const bool do_yield) {
    // wait for other threads to finish
    const int last_node_n = * node_n;

    while (true) {
        if (do_yield) {
            sched_yield();
        }

        * node_n = atomic_load(&state->shared->node_n);
        if (* node_n != last_node_n) break;
    }
}

static void ggml_graph_compute_thread_sync_task(int * task_phase, struct ggml_compute_state * state, const bool do_yield) {
    // wait for other threads to finish
    const int last_task_phase = * task_phase;

    while (true) {
        if (do_yield) {
            sched_yield();
        }

        * task_phase = atomic_load(&state->shared->node_task);
        if (* task_phase != last_task_phase) break;
    }
}

@ReinForce-II
Contributor

@Billzhong2022
Using a BLAS macro could lead to unnecessary discrepancies in code execution between builds with and without BLAS, since no observable performance difference was expected there.
It seems to make sense now.

@zhanweiw

@ReinForce-II

[Image: llama]

After enabling OpenBLAS, even with 'armv8.1' enabled, the program gets stuck while handling the user's input prompt, with no response after several minutes.

[Image: llama2]

The 'prompt eval time' is very poor, but the 'eval time' is good. Any idea why?

llama_print_timings:        load time =    1828.46 ms
llama_print_timings:      sample time =      10.72 ms /   113 runs   (    0.09 ms per token, 10538.10 tokens per second)
llama_print_timings: prompt eval time =  134070.93 ms /    93 tokens ( 1441.62 ms per token,     0.69 tokens per second)
llama_print_timings:        eval time =    5422.03 ms /   112 runs   (   48.41 ms per token,    20.66 tokens per second)
llama_print_timings:       total time =  141149.61 ms /   205 tokens

@ReinForce-II
Contributor

ReinForce-II commented May 22, 2024

@zhanweiw
Please check that your OpenBLAS is also built with LSE enabled. Otherwise it will lock the bus to implement synchronization operations.

The eval time is not affected, since llama.cpp does not invoke the BLAS library for a mat mul with size < 32 (you have size 1 here).
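
The rule stated above explains why only prompt processing slows down: the 93-token prompt is evaluated as one batch, while generation handles one token at a time. A minimal sketch of that decision follows; the function name is hypothetical, the threshold of 32 is taken from the comment above, and the real check in ggml may involve further conditions that this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold taken from the comment above ("size < 32"). */
enum { BLAS_MIN_BATCH = 32 };

/* Hypothetical helper: would a mat mul over `n_batch_tokens` tokens be
 * handed to the BLAS library under the rule described above? */
static bool would_use_blas(bool blas_enabled, int n_batch_tokens) {
    assert(n_batch_tokens >= 1);
    return blas_enabled && n_batch_tokens >= BLAS_MIN_BATCH;
}
```

Under this rule, would_use_blas(true, 93) is true for the prompt and would_use_blas(true, 1) is false for each generated token, which is consistent with the slow prompt eval time and the unaffected eval time in the log.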

@zhanweiw

After enabling LSE for both OpenBLAS and llama.cpp, the 'prompt eval time' is still poor.

LSE enabled in both OpenBLAS and llama.cpp:

llama_print_timings:        load time =    1879.53 ms
llama_print_timings:      sample time =      10.98 ms /   103 runs   (    0.11 ms per token,  9384.11 tokens per second)
llama_print_timings: prompt eval time =   21516.55 ms /    93 tokens (  231.36 ms per token,     4.32 tokens per second)
llama_print_timings:        eval time =    5072.23 ms /   102 runs   (   49.73 ms per token,    20.11 tokens per second)

LSE enabled in llama.cpp, without OpenBLAS:

llama_print_timings:        load time =    1816.11 ms
llama_print_timings:      sample time =      26.30 ms /   259 runs   (    0.10 ms per token,  9849.78 tokens per second)
llama_print_timings: prompt eval time =    2875.31 ms /    93 tokens (   30.92 ms per token,    32.34 tokens per second)
llama_print_timings:        eval time =   13077.06 ms /   258 runs   (   50.69 ms per token,    19.73 tokens per second)

@ReinForce-II
Contributor

ReinForce-II commented May 23, 2024

0.69 -> 4.32: it looks like the problem caused by the bus lock is gone.

Please configure the number of threads allocated by OpenBLAS (you can use environment variables) plus the number allocated by llama.cpp (the -t argument) so that the total is not greater than the number of cores you have. BLAS libraries typically manage their own threads and spin-waits, so if you don't configure the thread counts carefully, it can put heavy pressure on context switching.

P.S. Using a BLAS implementation is ultimately unlikely to perform well on Snapdragon platforms. It does the mat mul in single precision, which performs better only if a hardware SGEMM accelerator is present on your platform (or something equivalent, such as a huge number of CPU cores).

@zhanweiw

Thanks @ReinForce-II
I had already set the thread count to 10 for both OpenBLAS and llama.cpp in the previous test:

C:\llm\llama.cpp>Set OPENBLAS_NUM_THREADS=10
C:\llm\llama.cpp>main.exe -m models\llama-2-7b-chat.q4_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.0 --repeat_penalty 1.0 -t 10

There seems to be no benefit from enabling OpenBLAS for input prompt processing.

@ReinForce-II
Contributor

ReinForce-II commented May 23, 2024

@zhanweiw
Sorry for not making it clear. I meant that the threads allocated by OpenBLAS plus those allocated by llama.cpp should not exceed the number of cores you have, e.g. 6+6 or 8+4 rather than 10+10.
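
The budget rule above is simple arithmetic, sketched here for the 12-core device discussed in this thread. The helper name is hypothetical and not part of llama.cpp or OpenBLAS.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true when the OpenBLAS thread pool plus the
 * llama.cpp worker threads do not exceed the number of CPU cores. */
static bool thread_budget_fits(int blas_threads, int llama_threads, int n_cores) {
    assert(blas_threads >= 0 && llama_threads >= 0 && n_cores > 0);
    return blas_threads + llama_threads <= n_cores;
}
```

On 12 cores, thread_budget_fits(6, 6, 12) and thread_budget_fits(8, 4, 12) hold, while thread_budget_fits(10, 10, 12) does not, since 20 threads would compete for 12 cores.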

@zhanweiw

zhanweiw commented May 23, 2024

Thanks!

I just tried 6+6, 4+6, and 6+4 and got almost the same result (12-core device):

C:\llm\llama.cpp>Set OPENBLAS_NUM_THREADS=6
C:\llm\llama.cpp>main.exe -m models\llama-2-7b-chat.q4_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.0 --repeat_penalty 1.0 -t 6

@ReinForce-II
Contributor

ReinForce-II commented May 23, 2024

Well, the platform may not provide much FP32 arithmetic power. How about 8+4, with 8 for the BLAS library? We'd better use llama-bench.exe to get some more detailed results.

@zhanweiw

From checking the OpenBLAS source code, it looks like the environment variable 'OPENBLAS_NUM_THREADS' only works on Linux:
https://github.com/OpenMathLib/OpenBLAS/blob/700ea74a378cb5bf9073b4447a089a029131fb8b/driver/others/init.c#L820

@zhanweiw

My mistake, the environment variable is read here:
https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/openblas_env.c#L75

@zhanweiw

> Well, the platform may not provide much FP32 arithmetic power. How about 8+4, with 8 for the BLAS library? We'd better use llama-bench.exe to get some more detailed results.

But why do we get good performance (32 tokens/s) for 'prompt eval time' without OpenBLAS?

@ReinForce-II
Contributor

> > Well, the platform may not provide much FP32 arithmetic power. How about 8+4, with 8 for the BLAS library? We'd better use llama-bench.exe to get some more detailed results.
>
> But why do we get good performance (32 tokens/s) for 'prompt eval time' without OpenBLAS?

Without OpenBLAS, the dot products run as quantized operations, not in FP32.

@zhanweiw

Got it. Thanks so much!

Contributor

github-actions bot commented Jul 8, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Jul 8, 2024