
threads: improve ggml_barrier scaling with large number of threads #9598

Merged

Conversation

@max-krasnyansky (Collaborator) commented Sep 23, 2024

Fixes: #9588 (Bug: false sharing in threadpool makes ggml_barrier() needlessly slow)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing. This optimization shows performance improvements even for n_threads <= 8.
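To illustrate, a minimal sketch of the layout change (assuming 64-byte cache lines; simplified, not the exact patch):

#include <stdalign.h>
#include <stdatomic.h>

struct ggml_threadpool {
    // ... other threadpool fields ...

    // give each counter its own cache line: threads spinning on
    // n_barrier_passed no longer bounce the line that entering threads
    // must write when incrementing n_barrier
    alignas(64) atomic_int n_barrier;
    alignas(64) atomic_int n_barrier_passed;
};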

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write in the normal case and just use a thread fence as originally intended.
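The pattern is roughly the following (a sketch of the intent; the macro name is illustrative):

// TSAN cannot see standalone fences, so under TSAN we substitute an
// equivalent read-modify-write it can track; everyone else gets the
// cheap fence.
#if defined(__SANITIZE_THREAD__)
#define GGML_TSAN_ENABLED 1
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define GGML_TSAN_ENABLED 1
#endif
#endif

// exit barrier: full seq-cst ordering
#ifdef GGML_TSAN_ENABLED
atomic_fetch_add_explicit(&tp->n_barrier_passed, 0, memory_order_seq_cst);
#else
atomic_thread_fence(memory_order_seq_cst);
#endif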


Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and n_barrier_passed that is amplified in ggml_barrier() by the fact that all threads need to increment n_barrier when entering, while all previous threads continue to read n_barrier_passed, waiting for the last one to release them all. The side effect is that all these readers are slowing down all new threads by making the cache line bounce back and forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost the performance by 21% on an 80-core ARM server compared to the no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure as well, but it doesn't seem that the size of this threadpool struct is critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all threads had to update the n_barrier_passed counter, though only one would add a non-zero value. This alone is responsible for half of the cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters could make things even faster on many-core systems, but it would likely complicate the logic needed to detect the last thread.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Sep 23, 2024
@max-krasnyansky (Collaborator, Author)

Here are some before/after numbers from our barrier test:

M2 Max

Before

~/src/llama.cpp-master$ ./build-metal-tp/bin/test-barrier 8 100
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 100
graph-compute took 381982 usec 
 3819.82 usec per-iter
 1909.91 nsec per-node

After

~/src/llama.cpp-threadpool$ ./build-metal-tp/bin/test-barrier 8 100
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 100
graph-compute took 247359 usec 
 2473.59 usec per-iter
 1236.8 nsec per-node

I'm going to test it some more on X-Elite and a 64-core EPYC, and remove the WIP tag once it looks good.

@wtarreau Can you please try this on your 80-core machine?

@wtarreau (Contributor)

Thanks! I'll test it, but I'm at a conference the whole week so it might be delayed.

Out of curiosity, why do you need the barrier at the end? Isn't it sufficient to turn the load in the loop into a memory_order_acquire instead?

Also regarding the use of ggml_barrier(), there can be plenty of valid use cases for requiring such a barrier, but sometimes you observe that the requirement spans a moderately large area of code (i.e. you don't want another thread to get here before all others got somewhere else). I don't know if that's the case here, but if so, it can be effective to split the barrier into two functions:

  • one that performs the increment, to count that the thread passed that point
  • one that waits for all threads to have completed their round (see the sketch below).
    When doing so, it can amortize a bunch of the loop cost through useful work outside, but it's more complex in this case as it requires recording the "last thread" status somewhere.
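For illustration, one possible shape of that split (a hypothetical C11 sketch; the function names and the generation counter are made up, this is not code from the PR):

#include <stdatomic.h>
#include <sched.h>

static atomic_int n_arrived;   // how many threads reached the sync point
static atomic_int generation;  // bumped once per completed round

// phase 1: cheap, just record that this thread reached the sync point;
// returns the generation to wait on later
static int barrier_arrive(int n_threads) {
    int gen = atomic_load_explicit(&generation, memory_order_relaxed);
    if (atomic_fetch_add(&n_arrived, 1) == n_threads - 1) {
        atomic_store_explicit(&n_arrived, 0, memory_order_relaxed);
        atomic_fetch_add(&generation, 1); // last thread releases everyone
    }
    return gen;
}

// phase 2: block until the round recorded by barrier_arrive() completes;
// independent work can be done between the two calls to hide the wait
static void barrier_wait(int gen) {
    while (atomic_load_explicit(&generation, memory_order_relaxed) == gen) {
        sched_yield();
    }
    atomic_thread_fence(memory_order_seq_cst);
}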

@wtarreau (Contributor)

OK, finally I figured it would be fast to try again, since the code and models are still there in my local tree. I just applied the patch, reran the test, and as expected the performance is exactly the same, so I'm obviously fine with your changes :-)

@max-krasnyansky (Collaborator, Author)

Numbers are looking good on the EPYC and Ryzen.
Looks like the scalability issues we discussed in #9461 with n_threads > 8 vs OpenMP have been resolved.
Here is a run from the EPYC.

llama.cpp-threadpool$ for n_threads in 4 8 16 32 64; do for i in omp tp; do echo $i ---; ./build-$i/bin/test-barrier $n_threads 100; done; done
omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 4                     n_threads: 4
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 500933 usec   graph-compute took 392548 usec
 5009.33 usec per-iter            3925.48 usec per-iter
 2504.67 nsec per-node            1962.74 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 8                     n_threads: 8
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 631755 usec   graph-compute took 607810 usec
 6317.55 usec per-iter            6078.1 usec per-iter
 3158.78 nsec per-node            3039.05 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 16                    n_threads: 16
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 1126331 usec  graph-compute took 1062131 usec
 11263.3 usec per-iter            10621.3 usec per-iter
 5631.66 nsec per-node            5310.66 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 32                    n_threads: 32
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 1949504 usec  graph-compute took 1902214 usec
 19495 usec per-iter              19022.1 usec per-iter
 9747.52 nsec per-node            9511.07 nsec per-node

omp ---                          tp ---
graph-compute with               graph-compute with
 n_threads: 64                    n_threads: 64
   n_nodes: 2000                    n_nodes: 2000
  n_rounds: 100                    n_rounds: 100
graph-compute took 2346060 usec  graph-compute took 2207780 usec
 23460.6 usec per-iter            22077.8 usec per-iter
 11730.3 nsec per-node            11038.9 nsec per-node

@max-krasnyansky changed the title from "WIP: threads: improve ggml_barrier scaling with large number of threads" to "threads: improve ggml_barrier scaling with large number of threads" Sep 23, 2024
@max-krasnyansky (Collaborator, Author) commented Sep 23, 2024

Numbers from X-Elite (MS Surface 7).

 ~/src/llama.cpp-master (master)
$ ./build-arm64-windows-llvm-release/bin/test-barrier.exe 12 100
graph-compute with
 n_threads: 12
   n_nodes: 2000
  n_rounds: 100
graph-compute took 622367 usec
 6223.67 usec per-iter
 3111.84 nsec per-node

~/src/llama.cpp-threadpool (improve-ggml-barrier-scaling)
$ ./build-arm64-windows-llvm-release/bin/test-barrier.exe 12 100
graph-compute with
 n_threads: 12
   n_nodes: 2000
  n_rounds: 100
graph-compute took 328799 usec
 3287.99 usec per-iter
 1644 nsec per-node

GCC and LLVM ThreadSanitizers are happy.

@max-krasnyansky (Collaborator, Author)

Thanks! I'll test it, but I'm at a conference the whole week so it might be delayed.

Out of curiosity, why do you need the barrier at the end? Isn't it sufficient to turn the load in the loop into a memory_order_acquire instead?

Yes, it would be sufficient, but that's more expensive while polling.
So we poll with relaxed order (just a normal load + yield) and then do one final barrier to make sure nothing gets reordered past that point.
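Roughly like this (a simplified sketch of the scheme, not the exact ggml code):

#include <stdatomic.h>
#include <sched.h>

static atomic_int n_barrier;        // on its own cache line in the real struct
static atomic_int n_barrier_passed; // likewise

static void barrier(int n_threads) {
    int passed_old = atomic_load_explicit(&n_barrier_passed, memory_order_relaxed);

    if (atomic_fetch_add(&n_barrier, 1) == n_threads - 1) {
        // last thread to arrive: reset the entry counter, release everyone
        atomic_store_explicit(&n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&n_barrier_passed, 1, memory_order_relaxed);
    } else {
        // cheap polling: relaxed load + yield, no RMW, no acquire
        while (atomic_load_explicit(&n_barrier_passed, memory_order_relaxed) == passed_old) {
            sched_yield();
        }
    }
    // one full fence so nothing is reordered past the barrier
    atomic_thread_fence(memory_order_seq_cst);
}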

Also regarding the use of ggml_barrier(), there can be plenty of valid use cases for requiring such a barrier, but sometimes you observe that the requirement spans a moderately large area of code (i.e. you don't want another thread to get here before all others got somewhere else). I don't know if that's the case here, but if so, it can be effective to split the barrier into two functions:

  • one that performs the increment, to count that the thread passed that point
  • one that waits for all threads to have completed their round.
    When doing so, it can amortize a bunch of the loop cost through useful work outside, but it's more complex in this case as it requires recording the "last thread" status somewhere.

In GGML we use the barrier to sync all threads after graph ops such as very large matrix multiplies, where each thread works on a different chunk. Before the next op can be processed, all the chunks from the previous op need to be completed.
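Schematically (hypothetical types and names, just to show where the barrier sits in the compute loop):

struct ggml_cgraph;      // the graph of ops
struct ggml_threadpool;  // owns n_barrier / n_barrier_passed

void ggml_barrier(struct ggml_threadpool * tp);
void compute_node_chunk(struct ggml_cgraph * g, int node, int ith, int n_threads);

static void graph_compute_thread(struct ggml_cgraph * g, struct ggml_threadpool * tp,
                                 int ith, int n_threads, int n_nodes) {
    for (int node = 0; node < n_nodes; node++) {
        // every thread computes its own chunk (e.g. a band of rows) of op `node`
        compute_node_chunk(g, node, ith, n_threads);

        // all chunks of this op must finish before any thread starts the
        // next op, which may read this op's output
        ggml_barrier(tp);
    }
}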

@wtarreau (Contributor)

Thanks! I'll test it, but I'm at a conference the whole week so it might be delayed.
Out of curiosity, why do you need the barrier at the end? Isn't it sufficient to turn the load in the loop into a memory_order_acquire instead?

Yes, it would be sufficient, but that's more expensive while polling. So we poll with relaxed order (just a normal load + yield) and then do one final barrier to make sure nothing gets reordered past that point.

I see. On x86 the code is the same (obviously), and on ARM (Neoverse-N1, Ampere Altra) I'm not seeing any difference. But I generally agree with your point anyway. There could be some archs with heavier atomics (or worse, some that have to fall back to seq_cst) where it could cost more.

Also regarding the use of ggml_barrier(), there can be plenty of valid use cases for requiring such a barrier, but sometimes you observe that the requirement spans a moderately large area of code (i.e. you don't want another thread to get here before all others got somewhere else). I don't know if that's the case here, but if so, it can be effective to split the barrier into two functions:

  • one that performs the increment, to count that the thread passed that point
  • one that waits for all threads to have completed their round.
    When doing so, it can amortize a bunch of the loop cost through useful work outside, but it's more complex in this case as it requires recording the "last thread" status somewhere.

In GGML we use the barrier to sync all threads after graph ops such as very large matrix multiplies, where each thread works on a different chunk. Before the next op can be processed, all the chunks from the previous op need to be completed.

OK. Thus I guess there's not even any preparatory work for the next op that a thread could start without waiting for the others (e.g. resetting vectors, pre-loading values, initiating memory prefetches, etc.)?

@max-krasnyansky (Collaborator, Author)

@ggerganov @slaren
Sorry for spamming you guys with the ggml_barrier updates.
Hopefully it's the last one before we move the threading stuff into C++.
This is a follow-up to our previous discussion on scaling with larger numbers of threads.
Let me know if you guys are OK with the TSAN (thread sanitizer) "hack". I'm basically following the recommendation from the TSAN author on how to work around its limitation.

@wtarreau (Contributor)

Don't be sorry when improving low-level code :-) FWIW I've retested it on the 80-core ARM and it still builds and runs as fast as the previous iteration (even if I had no doubt, I preferred to double-check). Thanks!

@slaren (Collaborator) commented Sep 23, 2024

Performance on 13900k also looks good.

13900k
omp - 1 threads
graph-compute with
 n_threads: 1
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1011686 usec
 1011.69 usec per-iter
 505.843 nsec per-node

tp - 1 threads
graph-compute with
 n_threads: 1
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1009844 usec
 1009.84 usec per-iter
 504.922 nsec per-node

==
omp - 2 threads
graph-compute with
 n_threads: 2
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1381833 usec
 1381.83 usec per-iter
 690.917 nsec per-node

tp - 2 threads
graph-compute with
 n_threads: 2
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1238383 usec
 1238.38 usec per-iter
 619.192 nsec per-node

==
omp - 4 threads
graph-compute with
 n_threads: 4
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1546122 usec
 1546.12 usec per-iter
 773.061 nsec per-node

tp - 4 threads
graph-compute with
 n_threads: 4
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1403747 usec
 1403.75 usec per-iter
 701.874 nsec per-node

==
omp - 8 threads
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 2115183 usec
 2115.18 usec per-iter
 1057.59 nsec per-node

tp - 8 threads
graph-compute with
 n_threads: 8
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1911930 usec
 1911.93 usec per-iter
 955.965 nsec per-node

==
omp - 16 threads
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3354951 usec
 3354.95 usec per-iter
 1677.48 nsec per-node

tp - 16 threads
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3460484 usec
 3460.48 usec per-iter
 1730.24 nsec per-node

==
omp - 32 threads
graph-compute with
 n_threads: 32
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 6886784 usec
 6886.78 usec per-iter
 3443.39 nsec per-node

tp - 32 threads
graph-compute with
 n_threads: 32
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 6628458 usec
 6628.46 usec per-iter
 3314.23 nsec per-node

@max-krasnyansky merged commit f0c7b5e into ggerganov:master Sep 23, 2024
53 checks passed
@wtarreau (Contributor)

Many thanks for handling this, @max-krasnyansky.

@giladgd (Contributor) commented Sep 23, 2024

@max-krasnyansky With this PR I get this error when compiling on Windows using CMake, where llama.cpp is a sub-library inside of another CMake project:

Auto build dll exports
     Creating library D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.lib and object D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.exp
LINK : warning LNK4199: /DELAYLOAD:NODE.EXE ignored; no imports found from NODE.EXE [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
ggml.obj : error LNK2019: unresolved external symbol atomic_thread_fence referenced in function ggml_barrier [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\bin\Release\ggml.dll : fatal error LNK1120: 1 unresolved externals [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]

I think there might be a missing #include somewhere in this PR that causes it only to work when compiling llama.cpp directly.

@max-krasnyansky (Collaborator, Author)

@max-krasnyansky With this PR I get this error when compiling on Windows using CMake, where llama.cpp is a sub-library inside of another CMake project:

Auto build dll exports
     Creating library D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.lib and object D:/a/node-llama-cpp/node-llama-cpp/llama/localBuilds/win-x64/llama.cpp/ggml/src/Release/ggml.exp
LINK : warning LNK4199: /DELAYLOAD:NODE.EXE ignored; no imports found from NODE.EXE [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
ggml.obj : error LNK2019: unresolved external symbol atomic_thread_fence referenced in function ggml_barrier [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]
D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\bin\Release\ggml.dll : fatal error LNK1120: 1 unresolved externals [D:\a\node-llama-cpp\node-llama-cpp\llama\localBuilds\win-x64\llama.cpp\ggml\src\ggml.vcxproj]

I think there might be a missing #include somewhere in this PR that causes it only to work when compiling llama.cpp directly.

Oh. Shoot. My bad. I completely forgot to add a stub for the function for native MSVC builds.
I wonder how come our main CI didn't catch it.

Will start another PR shortly with a 3-liner fix.

@max-krasnyansky (Collaborator, Author)

Oh. Shoot. My bad. I completely forgot to add a stub for the function for native MSVC builds. I wonder how come our main CI didn't catch it.

Will start another PR shortly with a 3-liner fix.

#9615
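For reference, the stub needed is along these lines (a hypothetical sketch, not the exact #9615 patch): MSVC's C compiler has no <stdatomic.h>, so ggml shims the C11 atomics, and the newly used atomic_thread_fence needs a matching stub.

#if defined(_MSC_VER) && !defined(__clang__)
#include <windows.h>

typedef int memory_order; // stand-in for the shim's memory_order type

static void atomic_thread_fence(memory_order mo) {
    (void) mo;        // ggml only requests seq_cst here
    MemoryBarrier();  // full compiler + CPU barrier
}
#endif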
