threads: improve ggml_barrier scaling with large number of threads #9598
Conversation
Here are some before/after numbers from our barrier test on M2 Max.
I'm going to test it some more on X-Elite and 64-core EPYC, and remove the WIP tag once it looks good. @wtarreau
Thanks! I'll test it, but I'm at a conference the whole week so it might be delayed. Out of curiosity, why do you need the barrier at the end? Isn't it sufficient to turn the load in the loop into a memory_order_acquire instead? Also, regarding the use of ggml_barrier(): there can be plenty of valid use cases for requiring such a barrier, but sometimes the requirement spans a moderately large area of code (i.e. you don't want another thread to get here before all others got somewhere else). I don't know if that's the case here, but if so, it can be effective to split the barrier into two functions:
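A minimal sketch of that split, using hypothetical names (barrier_arrive / barrier_wait, nothing from the actual ggml API): a thread calls barrier_arrive() once it reaches the first point, can do independent work, and only calls barrier_wait() before the code that must not start early.

```c
#include <stdatomic.h>

// Hypothetical one-shot barrier split into an "arrive" and a "wait" phase,
// so a thread can do independent work between the two calls.
typedef struct {
    atomic_int arrived;    // number of threads that reached the first point
    int        n_threads;  // total number of participating threads
} split_barrier;

static void barrier_arrive(split_barrier * b) {
    // publish "I got here"; release orders the work done before this point
    atomic_fetch_add_explicit(&b->arrived, 1, memory_order_release);
}

static void barrier_wait(split_barrier * b) {
    // block until every thread has called barrier_arrive()
    while (atomic_load_explicit(&b->arrived, memory_order_acquire) < b->n_threads) {
        // spin; a real implementation would add a cpu-relax/yield hint here
    }
}
```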
OK, finally I figured it would be quick to try again since the code and models are still there in my local tree. I just applied the patch, reran the test, and as expected the performance is exactly the same, so I'm obviously fine with your changes :-)
Numbers are looking good on the EPYC and Ryzen.
Numbers from X-Elite (MS Surface 7).
GCC and LLVM ThreadSanitizers are happy.
Yes, it would be sufficient, but that's more expensive while polling.
In ggml we use the barrier to sync all threads after graph ops such as very large matrix multiplies, where each thread works on a different chunk. Before the next op can be processed, all the chunks from the previous op need to be completed.
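A self-contained sketch of that pattern, with pthread_barrier_wait standing in for ggml_barrier and made-up op/row counts (requires a platform with POSIX barriers):

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_OPS     3
#define N_ROWS    1024

static pthread_barrier_t barrier;
static float data[N_ROWS];

// Each thread owns a contiguous chunk of rows for every "op"; the barrier
// guarantees no thread starts op+1 before all chunks of op are finished.
static void * worker(void * arg) {
    const int ith   = (int)(long) arg;
    const int begin =  ith      * N_ROWS / N_THREADS;
    const int end   = (ith + 1) * N_ROWS / N_THREADS;

    for (int op = 0; op < N_OPS; op++) {
        for (int r = begin; r < end; r++) {
            data[r] += 1.0f;             // this thread's chunk of the op
        }
        pthread_barrier_wait(&barrier);  // sync point between ops
    }
    return NULL;
}

int main(void) {
    pthread_t tid[N_THREADS];
    pthread_barrier_init(&barrier, NULL, N_THREADS);
    for (long i = 0; i < N_THREADS; i++) pthread_create(&tid[i], NULL, worker, (void *) i);
    for (int  i = 0; i < N_THREADS; i++) pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    printf("data[0] = %.1f\n", data[0]);  // expect 3.0: one increment per op
    return 0;
}
```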
I see. On x86 the code is the same (obviously), and on ARM (Neoverse-N1, Ampere Altra) I'm not seeing any difference. But I generally agree with your point anyway: there could be some archs with heavier atomics (or worse, some that have to fall back to seq_cst) where it could cost more.
OK. So I guess there's not even any preparatory work for the next op that a thread could start without waiting for the others (e.g. resetting vectors, pre-loading values, initiating memory prefetches, etc.)?
@ggerganov @slaren
Don't be sorry when improving low-level code :-) FWIW I've retested it on the 80-core ARM and it still builds and runs as fast as the previous iteration (even though I had no doubt, I preferred to double-check). Thanks!
Performance on 13900K also looks good.
Many thanks for handling this, @max-krasnyansky.
@max-krasnyansky With this PR I get this error when compiling on Windows using CMake, where
I think there might be a missing stub for that function.
Oh. Shoot. My bad. I completely forgot to add a stub for the function for native MSVC builds. Will start another PR shortly with a 3-liner fix.
Fixes: #9588
Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases.
Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write in the normal case and just use a thread fence as originally intended.
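The detection half of that check usually looks like the sketch below (macro and function names are made up, and the exact guard in ggml may differ); it covers Clang's __has_feature and GCC's __SANITIZE_THREAD__:

```c
#include <stdatomic.h>

// Sketch: detect ThreadSanitizer under both Clang and GCC.
#if defined(__has_feature)
#   if __has_feature(thread_sanitizer)
#       define SKETCH_TSAN_ENABLED 1
#   endif
#endif
#if defined(__SANITIZE_THREAD__) && !defined(SKETCH_TSAN_ENABLED)
#   define SKETCH_TSAN_ENABLED 1
#endif

// TSAN does not understand standalone fences, so under TSAN the barrier exit
// uses a dummy read-modify-write it can track; normal builds keep the cheaper
// plain fence.
static inline void full_fence(atomic_int * counter) {
#ifdef SKETCH_TSAN_ENABLED
    atomic_fetch_add_explicit(counter, 0, memory_order_seq_cst);
#else
    (void) counter;
    atomic_thread_fence(memory_order_seq_cst);
#endif
}
```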
Here is the original description and suggestions from Willy Tarreau:
There's currently some false sharing between n_barrier and n_barrier_passed that is amplified in ggml_barrier() by the fact that all threads need to increment n_barrier when entering, while all previous threads continue to read n_barrier_passed, waiting for the last one to release them all. The side effect is that all these readers are slowing down all new threads by making the cache line bounce back and forth between readers and writers.
Just placing them in two distinct cache lines is sufficient to boost the performance by 21% on an 80-core ARM server compared to the no-openmp version, and by 3% compared to the openmp version.
Note that the variables could have been spread apart in the structure as well, but it doesn't seem that the size of this threadpool struct is critical so here we're simply aligning them.
Finally, the same issue was present when leaving the barrier since all threads had to update the n_barrier_passed counter, though only one would add a non-zero value. This alone is responsible for half of the cost due to undesired serialization.
It might be possible that using a small array of n_barrier counters could make things even faster on many-core systems, but it would likely complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <w@1wt.eu>
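For illustration, a rough sketch of the layout and of a barrier built along these lines; the field names match the description above, but the code is illustrative rather than a copy of ggml_barrier:

```c
#include <stdatomic.h>

#define SKETCH_CACHE_LINE 64

// Keep the entry counter and the release counter on separate cache lines so
// that threads spinning on n_barrier_passed do not keep bouncing the line
// that new arrivals must write to increment n_barrier.
struct barrier_counters {
    _Alignas(SKETCH_CACHE_LINE) atomic_int n_barrier;         // incremented once per thread on entry
    _Alignas(SKETCH_CACHE_LINE) atomic_int n_barrier_passed;  // bumped once per pass, by the last thread only
};

static void barrier_sketch(struct barrier_counters * b, int n_threads) {
    const int passed_old = atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed);

    if (atomic_fetch_add_explicit(&b->n_barrier, 1, memory_order_seq_cst) == n_threads - 1) {
        // last thread: reset the entry counter, then release everyone;
        // nobody else writes n_barrier_passed, which avoids the serialized exit
        atomic_store_explicit(&b->n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->n_barrier_passed, 1, memory_order_seq_cst);
        return;
    }

    // everyone else: wait for the pass counter to move, then fence so later
    // loads observe all writes made before the barrier
    while (atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed) == passed_old) {
        // spin; a real implementation would add a cpu-relax/yield hint here
    }
    atomic_thread_fence(memory_order_seq_cst);
}
```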