
sync : ggml (Metal F32 support + reduce ggml-alloc size) #3192

Merged 2 commits into master on Sep 15, 2023
Conversation

ggerganov (Owner)

No description provided.

Comment on lines +345 to +346
// 128GB for 64-bit, 1GB for 32-bit
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<37;
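For reference, a quick standalone check of what the two constants amount to (not part of the PR, just the arithmetic spelled out):

```c
#include <stdio.h>

int main(void) {
    unsigned long long cap32 = 1ULL << 30; // cap used when sizeof(void *) == 4
    unsigned long long cap64 = 1ULL << 37; // cap used on 64-bit builds
    printf("32-bit cap: %llu GiB\n", cap32 >> 30); // prints 1
    printf("64-bit cap: %llu GiB\n", cap64 >> 30); // prints 128
    return 0;
}
```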
Collaborator:

There is a possibility that it will cause issues with the training and finetuning examples with large models, since it will limit the size of the allocator buffers to 128GB. I don't think that's going to be a problem in any practical sense, but it's something to keep in mind.

ggerganov (Owner, Author):

Yup, will keep this in mind. Would it make sense to make this user-configurable?

Collaborator:

If we need a workaround, I think it would be OK to allow changing the value at compile time with a preprocessor definition, but I don't think it is worth changing the API because I expect that this will be fixed with the backends interface.
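A minimal sketch of what such a compile-time override could look like; the macro GGML_ALLOC_MEASURE_MAX and the helper function are invented here for illustration and are not existing ggml symbols:

```c
#include <stddef.h>

// Allow overriding the measure-allocator cap at build time, e.g.
//   cc -DGGML_ALLOC_MEASURE_MAX='(1ULL << 40)' ...
#ifndef GGML_ALLOC_MEASURE_MAX
// default: 1 GiB with 32-bit pointers, 128 GiB with 64-bit pointers
#define GGML_ALLOC_MEASURE_MAX (sizeof(void *) == 4 ? 1ULL << 30 : 1ULL << 37)
#endif

// hypothetical helper mirroring the snippet under review
static void alloc_measure_size(size_t * size) {
    *size = GGML_ALLOC_MEASURE_MAX;
}
```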

ggerganov merged commit 8c00b7a into master on Sep 15, 2023 (34 checks passed).
ggerganov deleted the sync branch on Sep 15, 2023 at 16:06.
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request on Sep 26, 2023, with the commit message:

* sync : ggml (Metal F32 support + reduce ggml-alloc size)

ggml-ci

* llama-bench : fix ggml_cpu_has_metal() duplicate function

ggml-ci
@@ -17294,10 +17294,18 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
} else {
// wait for other threads to finish
const int last = node_n;
do {
//sched_yield();
while (true) {
Contributor:

@ggerganov Have you tried running on the CPU lately? For me this change lowers Q4_0 TG performance from ~28 t/s to ~17 t/s when running on the CPU (M2 Max)

ggerganov (Owner, Author), Sep 28, 2023:

Does removing sched_yield() fix the performance?

On M2 Ultra, current master is:

./llama-bench -m ./models/llama-7b/ggml-model-q4_0.gguf -t 4,8,12,16 -n 128 -ngl 0 -p 0

| model | size | params | threads | test | t/s |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 19.25 ± 0.20 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 33.07 ± 0.06 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 12 | tg 128 | 43.24 ± 0.12 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 16 | tg 128 | 48.00 ± 0.12 |

build: 4aea3b8 (1281)

If I remove the sched_yield(), the performance slightly drops:

| model | size | params | threads | test | t/s |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 19.55 ± 0.17 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 32.69 ± 0.17 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 12 | tg 128 | 40.83 ± 0.12 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 16 | tg 128 | 46.91 ± 0.58 |

build: 4aea3b8 (1281) without sched_yield()


On M1 Pro, performance also drops a bit:

| model | size | params | threads | test | t/s |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 18.93 ± 0.10 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 26.92 ± 0.16 |

build: 4aea3b8 (1281)

| model | size | params | threads | test | t/s |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 18.52 ± 0.01 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 25.63 ± 0.74 |

build: 4aea3b8 (1281) without sched_yield()

Collaborator:

What about using `asm volatile ("yield" ::: "memory");` instead of `sched_yield`? See the discussion at #3219 (comment)
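Roughly, the alternative being suggested looks like the sketch below; the helper names are illustrative and this is not the actual ggml code:

```c
#include <sched.h>
#include <stdatomic.h>

// Hint to the CPU that this is a spin-wait loop. On AArch64 the "yield"
// instruction is a cheap in-core hint, while sched_yield() is a syscall
// that hands the core back to the OS scheduler.
static inline void spin_pause(void) {
#if defined(__aarch64__)
    __asm__ volatile ("yield" ::: "memory");
#else
    sched_yield();
#endif
}

// Illustrative busy-wait: spin until another thread advances node_n.
static void wait_for_node_change(atomic_int * node_n, int last) {
    while (atomic_load(node_n) == last) {
        spin_pause();
    }
}
```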

Contributor:

Here is what I get:

Master:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 4 | tg 128 | 17.11 ± 0.27 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 18.31 ± 1.64 |

With @cebtenzzre's suggestion to use assembler:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 4 | tg 128 | 18.15 ± 0.17 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 28.02 ± 0.60 |

Without yielding:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 4 | tg 128 | 18.30 ± 0.01 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 28.31 ± 0.12 |

One might think that `asm volatile ("yield" ::: "memory");` instead of sched_yield() solves it, but then I reran llama-bench a second time with that change, and here is what I got:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | tg 128 | 24.11 ± 5.15 |

So, I guess, it is great that you are getting 1-2% improvements on the Ultra and the Pro, but I'm getting a 40% regression, and that is annoying. Not to mention the time I lost trying to find the cause of the bad performance in my implementation of the dot product for a new quantization type, only to eventually figure out that I had bad performance for all quantization types, and to then bisect until hitting this commit.

ggerganov (Owner, Author):

I was thinking about implementing some logic to determine whether the sched_yield() should be used or not. Or alternatively, making it user-configurable (ggerganov/ggml#291).

The sched_yield() brings anywhere between a 20-40% speedup for the Whisper encoder; this was the main reason to add it back. It didn't cause any regression on the 3 Apple Silicon devices that I test with (M1, M1 Pro and M2 Ultra), so I didn't expect it to result in a regression on the M2 Max.
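One way the "make it configurable" idea could look at compile time; GGML_USE_SCHED_YIELD and the function name below are invented for illustration and are not existing ggml options:

```c
#include <sched.h>

// Choose the wait strategy at build time, e.g. with -DGGML_USE_SCHED_YIELD.
static inline void graph_compute_relax(void) {
#if defined(GGML_USE_SCHED_YIELD)
    sched_yield();                            // helps the Whisper encoder on some systems
#elif defined(__aarch64__)
    __asm__ volatile ("yield" ::: "memory");  // lightweight spin hint on ARM
#else
    // plain busy-wait: fastest on some machines, a large regression on others
#endif
}
```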
