sync : ggml (Metal F32 support + reduce ggml-alloc size) #3192
Conversation
```c
// 128GB for 64-bit, 1GB for 32-bit
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<37;
```
There is a possibility that it will cause issues with the training and finetuning examples with large models, since it will limit the size of the allocator buffers to 128GB. I don't think that's going to be a problem in any practical sense, but it's something to keep in mind.
Yup, will keep this in mind. Would it make sense to make this user-configurable?
If we need a workaround, I think it would be OK to allow changing the value at compile time with a preprocessor definition, but I don't think it is worth changing the API because I expect that this will be fixed with the backends interface.
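For illustration, a minimal sketch of what such a compile-time override could look like. The `GGML_ALLOC_MAX_SIZE` macro and the wrapper function are invented for this example and are not part of ggml:

```c
#include <stddef.h>

// Hypothetical override point: define GGML_ALLOC_MAX_SIZE at build time,
// e.g. cc -DGGML_ALLOC_MAX_SIZE='(1ULL << 38)' ... to raise the cap.
#ifndef GGML_ALLOC_MAX_SIZE
// defaults from the diff above: 1 GiB on 32-bit, 128 GiB on 64-bit
#define GGML_ALLOC_MAX_SIZE (sizeof(void *) == 4 ? 1ULL << 30 : 1ULL << 37)
#endif

// sketch of how the assignment from the diff would use the macro
static void set_measure_buffer_size(size_t * size) {
    *size = GGML_ALLOC_MAX_SIZE;
}
```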
```diff
@@ -17294,10 +17294,18 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         } else {
             // wait for other threads to finish
             const int last = node_n;
-            do {
-                //sched_yield();
+            while (true) {
```
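For context, a hedged reconstruction of the spin-wait this hunk introduces (simplified; the `atomic_load` on `state->shared->node_n` follows the naming used elsewhere in ggml.c, and the exact loop body may differ):

```c
} else {
    // wait for other threads to finish
    const int last = node_n;
    while (true) {
        // yield to the scheduler while spinning; whether this helps or
        // hurts performance is exactly what the discussion below is about
        sched_yield();

        node_n = atomic_load(&state->shared->node_n);
        if (node_n != last) {
            break;
        }
    }
}
```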
@ggerganov Have you tried running on the CPU lately? For me this change lowers Q4_0 TG performance from ~28 t/s to ~17 t/s when running on the CPU (M2 Max).
Does removing `sched_yield()` fix the performance?

On M2 Ultra, current `master` is:

```sh
./llama-bench -m ./models/llama-7b/ggml-model-q4_0.gguf -t 4,8,12,16 -n 128 -ngl 0 -p 0
```
model | size | params | threads | test | t/s |
---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 19.25 ± 0.20 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 33.07 ± 0.06 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 12 | tg 128 | 43.24 ± 0.12 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 16 | tg 128 | 48.00 ± 0.12 |
build: 4aea3b8 (1281)
If I remove the `sched_yield()`, the performance drops slightly:
model | size | params | threads | test | t/s |
---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 19.55 ± 0.17 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 32.69 ± 0.17 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 12 | tg 128 | 40.83 ± 0.12 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 16 | tg 128 | 46.91 ± 0.58 |
build: 4aea3b8 (1281) without sched_yield()
On M1 Pro, performance also drops a bit:
model | size | params | threads | test | t/s |
---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 18.93 ± 0.10 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 26.92 ± 0.16 |
build: 4aea3b8 (1281)
model | size | params | threads | test | t/s |
---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 4 | tg 128 | 18.52 ± 0.01 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | 8 | tg 128 | 25.63 ± 0.74 |
build: 4aea3b8 (1281) without sched_yield()
What about using `asm volatile ("yield" ::: "memory");` instead of `sched_yield()`? See the discussion at #3219 (comment).
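For reference, a minimal sketch of the three spin-wait variants being benchmarked in this thread (variable names follow the hunk above; the `"yield"` instruction is AArch64-specific):

```c
// 1) yield to the OS scheduler on every spin iteration (a syscall)
while (atomic_load(&state->shared->node_n) == last) {
    sched_yield();
}

// 2) AArch64 "yield" hint: briefly pause the core, no syscall involved
while (atomic_load(&state->shared->node_n) == last) {
    asm volatile ("yield" ::: "memory");
}

// 3) pure busy-wait: spin on the atomic as fast as possible
while (atomic_load(&state->shared->node_n) == last) {
}
```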
Here is what I get:

Master:
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 4 | tg 128 | 17.11 ± 0.27 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 18.31 ± 1.64 |
With @cebtenzzre's suggestion to use assembler:
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 4 | tg 128 | 18.15 ± 0.17 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 28.02 ± 0.60 |
Without yielding:
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 4 | tg 128 | 18.30 ± 0.01 |
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 28.31 ± 0.12 |
One might think that `asm volatile ("yield" ::: "memory");` instead of `sched_yield()` solves it, but then I reran `llama-bench` a second time with that, and here is what I get:
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | tg 128 | 24.11 ± 5.15 |
So, I guess, it is great that you are getting 1-2% improvements on the Ultra and the Pro, but I'm getting a 40% regression, and this is annoying. Not to mention the time I lost trying to find the cause of the bad performance in my implementation of the dot product for a new quantization type, only to eventually figure out that I had bad performance for all quantization types, and to then bisect until hitting this commit.
I was thinking about implementing some logic to determine whether the `sched_yield()` should be used or not, or alternatively making it user-configurable (ggerganov/ggml#291).

The `sched_yield()` brings anywhere between a 20-40% speedup for the Whisper encoder - this was the main reason to add it back. It didn't cause any regression on the 3 Apple Silicon devices that I test with (M1, M1 Pro and M2 Ultra), so I didn't expect it to result in a regression on the M2 Max.
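For illustration, one way the user-configurable option could look at runtime; the `GGML_SCHED_YIELD` environment variable and the helper below are invented for this sketch (ggerganov/ggml#291 tracks the actual discussion):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Invented GGML_SCHED_YIELD env var: set it to "0" to disable yielding;
// unset or any other value keeps the current default behavior.
static bool ggml_thread_yield_enabled(void) {
    static int cached = -1; // -1 = not initialized yet
    if (cached < 0) {
        const char * env = getenv("GGML_SCHED_YIELD");
        cached = (env == NULL || strcmp(env, "0") != 0) ? 1 : 0;
    }
    return cached == 1;
}

// the spin loop would then guard the call:
//     if (ggml_thread_yield_enabled()) { sched_yield(); }
```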