
Releases: ngxson/llama.cpp

b3811

23 Sep 20:05
f0c7b5e
threads: improve ggml_barrier scaling with large number of threads (#9598)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing.
This optimization shows performance improvements even for cases with n_threads <= 8.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write
in the normal case and just use a thread fence as originally intended.

---
Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost
performance by 21% on an 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <w@1wt.eu>

b3808

23 Sep 16:03
1e7b929
ggml : AVX512 gemm for Q4_0_8_8 (#9532)

* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

b3807

23 Sep 10:06
37f8c7b
perplexity : remove extra new lines after chunks (#9596)

b3805

23 Sep 05:11
e62e978
Revert "[SYCL] fallback mmvq (#9088)" (#9579)

This reverts commit 50addec9a532a6518146ab837a85504850627316.

b3804

22 Sep 17:08
c35e586
musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526)

* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

b3803

22 Sep 15:10
912c331
Fix merge error in #9454 (#9589)

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

b3802

22 Sep 09:10
a5b57b0
CUDA: enable Gemma FA for HIP/Pascal (#9581)

b3801

22 Sep 04:15
ecd5d6b
llama: remove redundant loop when constructing ubatch (#9574)

b3799

21 Sep 14:07
d09770c
ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573)

b3798

21 Sep 02:05
41f4778
Update CUDA graph on scale change plus clear nodes/params (#9550)

* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes https://github.com/ggerganov/llama.cpp/issues/9451

* clear before resize