Optimize locking behavior #813


Open · wants to merge 2 commits into master
Conversation


@janekb04 janekb04 commented Apr 6, 2023

Port changes from whisper.cpp PR ggml-org#659
@ggerganov
Member

I haven't had the time to look into the details yet and test this on my Mac.
In the meantime, it would be nice to get some reports from other people about how the token generation time is affected by these changes.

@ggerganov ggerganov added the threading Parallel processing and thread management label Apr 14, 2023
@@ -35,12 +35,21 @@
#include <windows.h>
#endif

// if C11 or above use stdatomic.h
#if __STDC_VERSION__ >= 201112L
#include <stdatomic.h>
Collaborator

This doesn't work for me on VS2022. I got a compile error:
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include\vcruntime_c11_stdatomic.h(36,24): error C2061: syntax error: identifier 'atomic_bool'

Author

Interesting... I also use VS2022. I'll check this out.

@@ -9311,7 +9439,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
const int n_threads = cgraph->n_threads;

struct ggml_compute_state_shared state_shared = {
/*.spin =*/ GGML_LOCK_INITIALIZER,
/*.lock =*/ {},
Collaborator

{0}?

Author

I'm used to {} doing zero-initialization in C++. Is {0} needed here?

Collaborator

Yes, VS2022 complains about it.

@howard0su
Collaborator

I ran a perf test on a Windows 10, AVX, 10c20t box:
(Figure_1: benchmark chart)

@janekb04
Author

@howard0su Thanks for doing the benchmark. The results seem interesting to me.

It looks like there is a general improvement when using up to 10 threads and a degradation beyond that. I think I know why this is the case. There are two possible explanations:

  1. You don't have HyperThreading enabled (but you probably do)
  2. This code isn't suitable for HyperThreading.

To determine whether a piece of code can be sped up by HyperThreading, let's look at what HyperThreading actually does. Intel introduced it to improve the utilization of the CPU's execution engines. Most people think the CPU just executes instructions one by one. Some know that it uses pipelining and out-of-order execution. But fewer still know that a CPU core isn't a single "calculator": inside the core there are separate, highly specialized "execution units" or "ports" dedicated to specific tasks (the EU blocks in a typical core diagram). For instance, one piece of hardware handles simple integer arithmetic, another handles division, another handles floats, and so on. Keep in mind that both Intel and Windows are designed to optimize for the "general case".

So let's imagine taking a random program. What does it do? How does it use the CPU? It could be written in any language, including ones interpreted at runtime. Programs usually spend most of their time not doing actual computation but fetching data from memory, and they tend to use the execution units roughly equally, intermixing integer and floating-point operations. To take advantage of this unused CPU capacity, Intel allowed the OS to run two threads on the same core at once. Since it is improbable that either thread fully utilizes the core, running them in parallel increases throughput - though not latency.

Now, in the case of llama.cpp, the story is different. Each thread is constantly doing heavy floating-point calculations. When a given thread is running, it uses the floating-point execution units and SIMD at 100%. But that's OK - this is what HyperThreading is for: the OS will run a second thread in parallel that will use the other execution units... except it won't, because the other thread is most likely also a llama.cpp thread that uses the floating-point engines at 100%. So what happens in reality is that the two threads contend for the same execution units, causing stalls and waits at the microarchitectural level. Additionally, there will be many context switches (as there are more threads than cores), whereas if there were exactly as many threads as cores, each thread could run undisturbed for longer. A context switch is detrimental to performance because it effectively invalidates the L1 and L2 caches: the new thread works on its own matrix/tensor data and will most likely evict the old thread's.

Why is performance worse for 15 threads than for 20? My guess is that 15 is harder for the scheduler. 20 threads is exactly 2x the core count, a case the scheduler is probably optimized for, so it can roughly maintain the 10-thread performance. Meanwhile, 15 is an awkward number, so the scheduler likely places threads more or less arbitrarily, interrupting computations and shuffling threads between cores.

@janekb04
Author

@ggerganov Should this be added to the improve threading implementation project?

@howard0su
Collaborator

I have a 10c20t E5 CPU. So, 10 is definitely a magic number.

@janekb04
Author

@howard0su But 10 is the number of cores and half the number of hardware threads. A 10c20t E5 CPU has 10 cores and 20 threads...
