-
Notifications
You must be signed in to change notification settings - Fork 12.3k
Optimize locking behavior #813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Port changes from whisper.cpp PR ggml-org#659
I haven't had the time to look in details yet and test this on my Mac. |
@@ -35,12 +35,21 @@ | |||
#include <windows.h> | |||
#endif | |||
|
|||
// if C11 or above use stdatomic.h | |||
#if __STDC_VERSION__ >= 201112L | |||
#include <stdatomic.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't work for me on VS2022. I got compile error:
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include\vcruntime_c11_stdatomic.h(36,24): error C2061: syntax error: identifier 'atomic_
bool'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting... I also use VS2022. I'll check this out.
@@ -9311,7 +9439,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph) | |||
const int n_threads = cgraph->n_threads; | |||
|
|||
struct ggml_compute_state_shared state_shared = { | |||
/*.spin =*/ GGML_LOCK_INITIALIZER, | |||
/*.lock =*/ {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{0}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm used to {}
doing zero-initialization in C++. Is {0}
needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, VS2002 complains about it.
@howard0su Thanks for doing the benchmark. The results seem interesting to me. It looks like there is a general improvement when using up to 10 threads and a degradation when going further. I think I know why this is the case. There are two explanations:
To determine if a piece of code can be sped up by HyperThreading, let's understand what it actually does. HyperThreading was introduced by Intel into their processors to optimize the usage of the CPU's execution engines. Most people think that the CPU just executes instructions one, by one. Some people know that it uses pipelining and out-of-order execution. But fewer people still know that the CPU core actually isn't a single "calculator". Inside the core there are disparate "execution units" or "ports" that are very specialized and are dedicated to specific tasks (see EU in this diagram). For instance, a different piece of hardware handles simple integer arithmetic, a different handles division, a different handles floats, etc. Let's keep in mind that Intel and Windows are designed to optimize for the "general case". So let's imagine taking a random program. What does it do? How does it use the CPU? Well, this program can be written in any language, including those interpreted at runtime. Usually, programs spend most of their time not on doing actual computations, but rather on fetching data from memory. Also, they usually use each of the execution units roughly equally, intermixing integer and floating point operations. As such, what did Intel come up with? To take advantage of this unused CPU power, they allowed the OS to run two threads on the same core at once. As it is improbable that either thread is completely using the core, allowing them to run in parallel will increase throughput - though, not latency. Now, in the case of Why is the performance worse for 15 threads compared to 20 threads? I guess that it is because 15 is harder for the scheduler. 20 threads is 2x cores, so the scheduler is probably optimized for this case and is somehow able to maintain roughly the performance of 10 threads. Meanwhile, 15 threads is some random number, so the scheduler probably schedules the threads in a, basically, random order and just arbitrarily interrupts the computations and shuffles the threads around. |
@ggerganov Should this be added to the improve threading implementation project? |
I have 10c20t E5 CPU. So, 10 is definitely a magic number. |
@howard0su But 10 is the number of cores and half the number of threads. 10c20t E5 CPU has 10 cores and 20 threads... |
Port changes from whisper.cpp PR #659