
threads: improve ggml_barrier scaling with large number of threads #9598

Merged

Conversation

@max-krasnyansky (Collaborator) commented Sep 23, 2024

Fixes: #9588 (Bug: false sharing in threadpool makes ggml_barrier() needlessly slow)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing. This optimization shows performance improvements even for n_threads <= 8.
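To illustrate, a minimal sketch of the layout change (assuming 64-byte cache lines; simplified, not the exact patch):

#include <stdalign.h>
#include <stdatomic.h>

struct ggml_threadpool {
    // ... other threadpool fields ...

    // give each counter its own cache line: threads spinning on
    // n_barrier_passed no longer bounce the line that entering threads
    // must write when incrementing n_barrier
    alignas(64) atomic_int n_barrier;
    alignas(64) atomic_int n_barrier_passed;
};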

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write in the normal case and just use a thread fence as originally intended.
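The pattern is roughly the following (a sketch of the intent; the macro name is illustrative):

// TSAN cannot see standalone fences, so under TSAN we substitute an
// equivalent read-modify-write it can track; everyone else gets the
// cheap fence.
#if defined(__SANITIZE_THREAD__)
#define GGML_TSAN_ENABLED 1
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define GGML_TSAN_ENABLED 1
#endif
#endif

// exit barrier: full seq-cst ordering
#ifdef GGML_TSAN_ENABLED
atomic_fetch_add_explicit(&tp->n_barrier_passed, 0, memory_order_seq_cst);
#else
atomic_thread_fence(memory_order_seq_cst);
#endif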


Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and n_barrier_passed that is amplified in ggml_barrier() by the fact that all threads need to increment n_barrier when entering, while all previous threads continue to read n_barrier_passed, waiting for the last one to release them all. The side effect is that all these readers are slowing down all new threads by making the cache line bounce back and forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost the performance by 21% on an 80-core ARM server compared to the no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure as well, but it doesn't seem that the size of this threadpool struct is critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all threads had to update the n_barrier_passed counter, though only one would add a non-zero value. This alone is responsible for half of the cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters could make things even faster on many-core systems, but it would likely complicate the logic needed to detect the last thread.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Sep 23, 2024
@max-krasnyansky (Collaborator, Author)

Here are some before/after numbers from our barrier test:

M2 Max

Before

~/src/llama.cpp-master$ ./build-metal-tp/bin/test-barrier 8 100
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 100
graph-compute took 381982 usec 
 3819.82 usec per-iter
 1909.91 nsec per-node

After

~/src/llama.cpp-threadpool$ ./build-metal-tp/bin/test-barrier 8 100
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 100
graph-compute took 247359 usec 
 2473.59 usec per-iter
 1236.8 nsec per-node

I'm going to test it some more on X-Elite and a 64-core EPYC, and remove the WIP tag once it looks good.

@wtarreau Can you please try this on your 80-core machine?

@wtarreau (Contributor)

Thanks! I'll test it, but I'm at a conference the whole week so it might be delayed.

Out of curiosity, why do you need the barrier at the end? Isn't it sufficient to turn the load in the loop into a memory_order_acquire instead?

Also regarding the use of ggml_barrier(), there can be plenty of valid use cases for requiring such a barrier, but sometimes you observe that the requirement spans a moderately large area of code (i.e. you don't want another thread to get here before all others got somewhere else). I don't know if that's the case here, but if so, it can be effective to split the barrier into two functions:

  • one that performs the increment, to count that the thread passed that point
  • one that waits for all threads to have completed their round (see the sketch below).
    When doing so, it can amortize a bunch of the loop cost through useful work outside, but it's more complex in this case as it requires recording the "last thread" status somewhere.
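For illustration, one possible shape of that split (a hypothetical C11 sketch; the function names and the generation counter are made up, this is not code from the PR):

#include <stdatomic.h>
#include <sched.h>

static atomic_int n_arrived;   // how many threads reached the sync point
static atomic_int generation;  // bumped once per completed round

// phase 1: cheap, just record that this thread reached the sync point;
// returns the generation to wait on later
static int barrier_arrive(int n_threads) {
    int gen = atomic_load_explicit(&generation, memory_order_relaxed);
    if (atomic_fetch_add(&n_arrived, 1) == n_threads - 1) {
        atomic_store_explicit(&n_arrived, 0, memory_order_relaxed);
        atomic_fetch_add(&generation, 1); // last thread releases everyone
    }
    return gen;
}

// phase 2: block until the round recorded by barrier_arrive() completes;
// independent work can be done between the two calls to hide the wait
static void barrier_wait(int gen) {
    while (atomic_load_explicit(&generation, memory_order_relaxed) == gen) {
        sched_yield();
    }
    atomic_thread_fence(memory_order_seq_cst);
}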

@wtarreau (Contributor)

OK, finally I figured it would be fast to try again, since the code and models are still there in my local tree. I just applied the patch, reran the test, and as expected the performance is exactly the same, so I'm obviously fine with your changes :-)

@max-krasnyansky (Collaborator, Author)

Numbers are looking good on the EPYC and Ryzen.
Looks like the scalability issues we discussed in #9461 with n_threads > 8 vs OpenMP have been resolved.
Here is a run from the EPYC.

llama.cpp-threadpool$ for n_threads in 4 8 16 32 64; do for i in omp tp; do echo $i ---; ./build-$i/bin/test-barrier $n_threads 100; done; done
omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 4                     n_threads: 4
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 500933 usec   graph-compute took 392548 usec
 5009.33 usec per-iter            3925.48 usec per-iter
 2504.67 nsec per-node            1962.74 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 8                     n_threads: 8
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 631755 usec   graph-compute took 607810 usec
 6317.55 usec per-iter            6078.1 usec per-iter
 3158.78 nsec per-node            3039.05 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 16                    n_threads: 16
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 1126331 usec  graph-compute took 1062131 usec
 11263.3 usec per-iter            10621.3 usec per-iter
 5631.66 nsec per-node            5310.66 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 32                    n_threads: 32
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 1949504 usec  graph-compute took 1902214 usec
 19495 usec per-iter              19022.1 usec per-iter
 9747.52 nsec per-node            9511.07 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 64                    n_threads: 64
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 2346060 usec  graph-compute took 2207780 usec
 23460.6 usec per-iter            22077.8 usec per-iter
 11730.3 nsec per-node            11038.9 nsec per-node

@max-krasnyansky changed the title from "WIP: threads: improve ggml_barrier scaling with large number of threads" to "threads: improve ggml_barrier scaling with large number of threads" Sep 23, 2024
@max-krasnyansky (Collaborator, Author) commented Sep 23, 2024

Numbers from X-Elite (MS Surface 7).

 ~/src/llama.cpp-master (master)
$ ./build-arm64-windows-llvm-release/bin/test-barrier.exe 12 100
graph-compute with
 n_threads: 12
   n_nodes: 2000
  n_rounds: 100
graph-compute took 622367 usec
 6223.67 usec per-iter
 3111.84 nsec per-node

~/src/llama.cpp-threadpool (improve-ggml-barrier-scaling)
$ ./build-arm64-windows-llvm-release/bin/test-barrier.exe 12 100
graph-compute with
 n_threads: 12
   n_nodes: 2000
  n_rounds: 100
graph-compute took 328799 usec
 3287.99 usec per-iter
 1644 nsec per-node

GCC and LLVM ThreadSanitizers are happy.

@max-krasnyansky (Collaborator, Author)

Thanks! I'll test it, but I'm at a conference the whole week so it might be delayed.

Out of curiosity, why do you need the barrier at the end? Isn't it sufficient to turn the load in the loop into a memory_order_acquire instead?

Yes, it would be sufficient, but that's more expensive while polling.
So we poll with relaxed order (just a normal load + yield) and then do one final barrier to make sure nothing gets reordered past that point.
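Roughly like this (a simplified sketch of the scheme, not the exact ggml code):

#include <stdatomic.h>
#include <sched.h>

static atomic_int n_barrier;        // on its own cache line in the real struct
static atomic_int n_barrier_passed; // likewise

static void barrier(int n_threads) {
    int passed_old = atomic_load_explicit(&n_barrier_passed, memory_order_relaxed);

    if (atomic_fetch_add(&n_barrier, 1) == n_threads - 1) {
        // last thread to arrive: reset the entry counter, release everyone
        atomic_store_explicit(&n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&n_barrier_passed, 1, memory_order_relaxed);
    } else {
        // cheap polling: relaxed load + yield, no RMW, no acquire
        while (atomic_load_explicit(&n_barrier_passed, memory_order_relaxed) == passed_old) {
            sched_yield();
        }
    }
    // one full fence so nothing is reordered past the barrier
    atomic_thread_fence(memory_order_seq_cst);
}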

Also regarding the use of ggml_barrier(), there can be plenty of valid use cases for requiring such a barrier, but sometimes you observe that the requirement spans a moderately large area of code (i.e. you don't want another thread to get here before all others got somewhere else). I don't know if that's the case here, but if so, it can be effective to split the barrier into two functions:

  • one that performs the increment, to count that the thread passed that point
  • one that waits for all threads to have completed their round.
    When doing so, it can amortize a bunch of the loop cost through useful work outside, but it's more complex in this case as it requires recording the "last thread" status somewhere.

In GGML we use the barrier to sync all threads after graph ops such as very large matrix multiplies, where each thread works on a different chunk. Before the next op can be processed, all the chunks from the previous op need to be completed.
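Schematically (hypothetical types and names, just to show where the barrier sits in the compute loop):

struct ggml_cgraph;      // the graph of ops
struct ggml_threadpool;  // owns n_barrier / n_barrier_passed

void ggml_barrier(struct ggml_threadpool * tp);
void compute_node_chunk(struct ggml_cgraph * g, int node, int ith, int n_threads);

static void graph_compute_thread(struct ggml_cgraph * g, struct ggml_threadpool * tp,
                                 int ith, int n_threads, int n_nodes) {
    for (int node = 0; node < n_nodes; node++) {
        // every thread computes its own chunk (e.g. a band of rows) of op `node`
        compute_node_chunk(g, node, ith, n_threads);

        // all chunks of this op must finish before any thread starts the
        // next op, which may read this op's output
        ggml_barrier(tp);
    }
}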

@wtarreau (Contributor)

Thanks! I'll test it, but I'm at a conference the whole week so it might be delayed.
Out of curiosity, why do you need the barrier at the end? Isn't it sufficient to turn the load in the loop into a memory_order_acquire instead?

Yes, it would be sufficient, but that's more expensive while polling. So we poll with relaxed order (just a normal load + yield) and then do one final barrier to make sure nothing gets reordered past that point.

I see. On x86 the code is the same (obviously), and on ARM (Neoverse-N1, Ampere Altra) I'm not seeing any difference. But I generally agree with your point anyway. There could be some archs with heavier atomics (or worse, some that have to fall back to seq_cst) where it could cost more.

Also regarding the use of ggml_barrier(), there can be plenty of valid use cases for requiring such a barrier, but sometimes you observe that the requirement spans a moderately large area of code (i.e. you don't want another thread to get here before all others got somewhere else). I don't know if that's the case here, but if so, it can be effective to split the barrier into two functions:

  • one that performs the increment, to count that the thread passed that point
  • one that waits for all threads to have completed their round.
    When doing so, it can amortize a bunch of the loop cost through useful work outside, but it's more complex in this case as it requires recording the "last thread" status somewhere.

In GGML we use the barrier to sync all threads after graph ops such as very large matrix multiplies, where each thread works on a different chunk. Before the next op can be processed, all the chunks from the previous op need to be completed.

OK. Thus I guess there's not even any preparatory work for the next op that a thread could start without waiting for the others (e.g. resetting vectors, pre-loading values, initiating memory prefetches, etc.)?

@max-krasnyansky (Collaborator, Author)

@ggerganov @slaren
Sorry for spamming you guys with the ggml_barrier updates.
Hopefully it's the last one before we move the threading stuff into C++.
This is a follow-up to our previous discussion on scaling with larger numbers of threads.
Let me know if you guys are OK with the TSAN (thread sanitizer) "hack". I'm basically following the recommendation from the TSAN author on how to work around its limitation.

@wtarreau (Contributor)

Don't be sorry when improving low-level code :-) FWIW I've retested it on the 80-core ARM and it still builds and runs as fast as the previous iteration (even if I had no doubt, I preferred to double-check). Thanks!

@slaren (Collaborator) commented Sep 23, 2024

Performance on 13900k also looks good.

13900k
omp - 1 threads
graph-compute with
 n_threads: 1
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1011686 usec
 1011.69 usec per-iter
 505.843 nsec per-node

tp - 1 threads
graph-compute with
 n_threads: 1
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1009844 usec
 1009.84 usec per-iter
 504.922 nsec per-node

==
omp - 2 threads
graph-compute with
 n_threads: 2
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1381833 usec
 1381.83 usec per-iter
 690.917 nsec per-node

tp - 2 threads
graph-compute with
 n_threads: 2
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1238383 usec
 1238.38 usec per-iter
 619.192 nsec per-node

==
omp - 4 threads
graph-compute with
 n_threads: 4
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1546122 usec
 1546.12 usec per-iter
 773.061 nsec per-node

tp - 4 threads
graph-compute with
 n_threads: 4
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1403747 usec
 1403.75 usec per-iter
 701.874 nsec per-node

==
omp - 8 threads
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 2115183 usec
 2115.18 usec per-iter
 1057.59 nsec per-node

tp - 8 threads
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1911930 usec
 1911.93 usec per-iter
 955.965 nsec per-node

==
omp - 16 threads
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3354951 usec
 3354.95 usec per-iter
 1677.48 nsec per-node

tp - 16 threads
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3460484 usec
 3460.48 usec per-iter
 1730.24 nsec per-node

==
omp - 32 threads
graph-compute with
 n_threads: 32
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 6886784 usec
 6886.78 usec per-iter
 3443.39 nsec per-node

tp - 32 threads
graph-compute with
 n_threads: 32
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 6628458 usec
 6628.46 usec per-iter
 3314.23 nsec per-node

@max-krasnyansky merged commit f0c7b5e into ggerganov:master Sep 23, 2024
53 checks passed
@wtarreau (Contributor)

Many thanks for handling this, @max-krasnyansky.

@giladgd (Contributor) commented Sep 23, 2024

@max-krasnyansky With this PR I get this error when compiling on Windows using CMake, where llama.cpp is a sub-library inside of another CMake project:

Auto build dll exports
     Creating library D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.lib and object D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.exp
LINK : warning LNK4199: /DELAYLOAD:NODE.EXE ignored; no imports found from NODE.EXE [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
ggml.obj : error LNK2019: unresolved external symbol atomic_thread_fence referenced in function ggml_barrier [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\bin\Release\ggml.dll : fatal error LNK1120: 1 unresolved externals [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]

I think there might be a missing #include somewhere in this PR that causes it only to work when compiling llama.cpp directly.

@max-krasnyansky (Collaborator, Author)

@max-krasnyansky With this PR I get this error when compiling on Windows using CMake, where llama.cpp is a sub-library inside of another CMake project:

Auto build dll exports
     Creating library D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.lib and object D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.exp
LINK : warning LNK4199: /DELAYLOAD:NODE.EXE ignored; no imports found from NODE.EXE [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
ggml.obj : error LNK2019: unresolved external symbol atomic_thread_fence referenced in function ggml_barrier [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\bin\Release\ggml.dll : fatal error LNK1120: 1 unresolved externals [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]

I think there might be a missing #include somewhere in this PR that causes it only to work when compiling llama.cpp directly.

Oh. Shoot. My bad. I completely forgot to add a stub for the function for native MSVC builds.
I wonder how come our main CI didn't catch it.

Will start another PR shortly with a 3-liner fix.

@max-krasnyansky (Collaborator, Author)

Oh. Shoot. My bad. I completely forgot to add a stub for the function for native MSVC builds. I wonder how come our main CI didn't catch it.

Will start another PR shortly with a 3-liner fix.

#9615
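For reference, the stub needed is along these lines (a hypothetical sketch, not the exact #9615 patch): MSVC's C compiler has no <stdatomic.h>, so ggml shims the C11 atomics, and the newly used atomic_thread_fence needs a matching stub.

#if defined(_MSC_VER) && !defined(__clang__)
#include <windows.h>

typedef int memory_order; // stand-in for the shim's memory_order type

static void atomic_thread_fence(memory_order mo) {
    (void) mo;        // ggml only requests seq_cst here
    MemoryBarrier();  // full compiler + CPU barrier
}
#endif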
