Optimize locking behavior #813


Open · wants to merge 2 commits into master
Conversation


@janekb04 janekb04 commented Apr 6, 2023

Port changes from whisper.cpp PR ggml-org#659
@ggerganov
Member

I haven't had the time to look into the details yet and test this on my Mac.
In the meantime, it would be nice to get some reports from other people about how the token generation time is affected by these changes.

@ggerganov ggerganov added the threading Parallel processing and thread management label Apr 14, 2023
@@ -35,12 +35,21 @@
#include <windows.h>
#endif

// if C11 or above use stdatomic.h
#if __STDC_VERSION__ >= 201112L
#include <stdatomic.h>
Collaborator

This doesn't work for me on VS2022. I got a compile error:
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include\vcruntime_c11_stdatomic.h(36,24): error C2061: syntax error: identifier 'atomic_bool'

Author

Interesting... I also use VS2022. I'll check this out.

@@ -9311,7 +9439,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
const int n_threads = cgraph->n_threads;

struct ggml_compute_state_shared state_shared = {
/*.spin =*/ GGML_LOCK_INITIALIZER,
/*.lock =*/ {},
Collaborator

{0}?

Author

I'm used to {} doing zero-initialization in C++. Is {0} needed here?

Collaborator

Yes, VS2022 complains about it.

@howard0su
Collaborator

I ran a perf test on a Windows 10, AVX, 10c20t box:
(Figure_1: benchmark chart)

@janekb04
Author

@howard0su Thanks for doing the benchmark. The results seem interesting to me.

It looks like there is a general improvement when using up to 10 threads and a degradation beyond that. I think I know why this is the case. There are two possible explanations:

  1. You don't have HyperThreading enabled (but you probably do)
  2. This code isn't suitable for HyperThreading.

To determine whether a piece of code can be sped up by HyperThreading, let's look at what HyperThreading actually does. Intel introduced it to improve the utilization of the CPU's execution engines. Most people think the CPU just executes instructions one by one. Some know that it uses pipelining and out-of-order execution. But fewer still know that a CPU core isn't a single "calculator": inside the core there are separate, highly specialized "execution units" or "ports" dedicated to specific tasks (the EU blocks in a typical core diagram). For instance, one piece of hardware handles simple integer arithmetic, another handles division, another handles floats, and so on. Keep in mind that both Intel and Windows are designed to optimize for the "general case".

So let's imagine taking a random program. What does it do? How does it use the CPU? It could be written in any language, including ones interpreted at runtime. Programs usually spend most of their time not doing actual computation but fetching data from memory, and they tend to use the execution units roughly equally, intermixing integer and floating-point operations. To take advantage of this unused CPU capacity, Intel allowed the OS to run two threads on the same core at once. Since it is improbable that either thread fully utilizes the core, running them in parallel increases throughput - though not latency.

Now, in the case of llama.cpp, the story is different. Each thread is constantly doing heavy floating-point calculations. When a given thread is running, it uses the floating-point execution units and SIMD at 100%. But that's OK - this is what HyperThreading is for: the OS will run a second thread in parallel that will use the other execution units... except it won't, because the other thread is most likely also a llama.cpp thread that uses the floating-point engines at 100%. So what happens in reality is that the two threads contend for the same execution units, causing stalls and waits at the microarchitectural level. Additionally, there will be many context switches (as there are more threads than cores), whereas if there were exactly as many threads as cores, each thread could run undisturbed for longer. A context switch is detrimental to performance because it effectively invalidates the L1 and L2 caches: the new thread works on its own matrix/tensor data and will most likely evict the old thread's.

Why is performance worse for 15 threads than for 20? My guess is that 15 is harder for the scheduler. 20 threads is exactly 2x the core count, a case the scheduler is probably optimized for, so it can roughly maintain the 10-thread performance. Meanwhile, 15 is an awkward number, so the scheduler likely places threads more or less arbitrarily, interrupting computations and shuffling threads between cores.

@janekb04
Author

@ggerganov Should this be added to the improve threading implementation project?

@howard0su
Collaborator

I have a 10c20t E5 CPU. So, 10 is definitely a magic number.

@janekb04
Author

@howard0su But 10 is the number of cores and half the number of hardware threads. A 10c20t E5 CPU has 10 cores and 20 threads...
