Very slow IQ quant performance: expected performance of the llama.cpp IQ implementation on Apple Silicon? #5617
Replies: 6 comments · 11 replies
-
I downloaded a 120B IQ2_XS GGUF model from https://huggingface.co/dranger003/miquliz-120b-v2.0-iMat.GGUF/tree/main and ran it as a test on my M1 Max Mac Studio (8+2 CPU, 10 GPU, 64GB RAM).
It is quite slow, but that is the expected speed on an M1 Max. Your speed, though, is far slower than it should be.
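For anyone wanting to reproduce this kind of measurement, llama.cpp ships a benchmarking tool; a minimal invocation might look like the following (the model path is a placeholder, adjust to your local file):

```sh
# Report prompt-processing (pp) and token-generation (tg) speed in tokens/s.
./llama-bench -m ./miquliz-120b-v2.0.IQ2_XS.gguf -p 512 -n 128
```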
-
The numbers in your post look like they come from CPU-only inference. Did you offload the model to the GPU?
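As a sanity check, a minimal invocation that offloads all layers to Metal with the stock llama.cpp CLI could look like this (the model path is a placeholder; `-ngl 99` is a common "offload everything" value):

```sh
# Offload all layers to the GPU and generate a few tokens as a quick test.
./main -m ./miquliz-120b-v2.0.IQ2_XS.gguf -ngl 99 -p "Hello" -n 64
```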
-
On this note, I am wondering if more optimization can be done on Apple Silicon to run these models even faster. |
-
Apple Silicon is not very friendly to the IQ quants: their dequantization goes through codebook (lookup-table) loads rather than the simple bit arithmetic used by the K-quants, and those gather-style lookups are slow on this hardware. So, in short, I agree. If someone knows how to trick Apple into better performance for the IQ quants, that would be very welcome. |
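To make the distinction concrete, here is a simplified C sketch of the two dequantization styles. This is not the actual llama.cpp kernel code; the names and the tiny 4-entry codebook are made up for illustration (the real IQ grids hold hundreds of packed patterns):

```c
#include <stdint.h>

/* Illustrative 4-entry codebook of 8-value patterns. */
static const int8_t codebook[4][8] = {
    { 1,  1, -1,  1,  1, -1,  1,  1},
    {-1,  1,  1,  1, -1,  1,  1, -1},
    { 1, -1,  1, -1,  1,  1, -1,  1},
    {-1, -1,  1,  1,  1, -1, -1,  1},
};

/* IQ-style: every block index triggers a data-dependent table load
   (a gather), which vectorizes poorly on NEON/Metal. */
void dequant_lookup(const uint8_t *idx, float scale, float *out, int n_blocks) {
    for (int b = 0; b < n_blocks; ++b)
        for (int j = 0; j < 8; ++j)
            out[b * 8 + j] = scale * (float) codebook[idx[b] & 3][j];
}

/* K-quant-style: values are reconstructed with plain bit arithmetic,
   which maps directly onto SIMD instructions with no memory lookups. */
void dequant_bits(const uint8_t *q, float scale, float min, float *out, int n) {
    for (int i = 0; i < n; ++i) {
        uint8_t nib = (i & 1) ? (q[i / 2] >> 4) : (q[i / 2] & 0x0F);
        out[i] = scale * (float) nib + min;
    }
}
```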
-
IQ4_NL is as fast as Q4_K. I haven't tried the 2-bit or 3-bit IQ quants, but IMHO Apple Silicon is really slow on anything below 4 bits; even Q5_K_M is still faster than Q3_K_S. |
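One way to verify this kind of ranking on your own machine, assuming you have the corresponding quants locally (file names are placeholders), is to benchmark each file in turn:

```sh
# Compare token-generation speed across quant types, one run per file.
./llama-bench -m ./model.Q5_K_M.gguf -n 128
./llama-bench -m ./model.Q3_K_S.gguf -n 128
./llama-bench -m ./model.IQ4_NL.gguf -n 128
```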
-
It's slow on my ARM device too. |
-
Hey there,
I've been playing about with the IQ quantisation methods. I have an M1 Max MacBook Pro with 64GB of RAM, and I usually run Mixtral finetunes (8x7B with 2 experts) at Q5_K_M with reasonable performance (8-15 t/s); prompt evaluation normally takes 10-20 seconds even on very large prompts.
I downloaded an IQ2_XS quant of a 120B model, and it's taking close to 10 minutes to evaluate the prompt; I'm getting about 1 token every 4-5 seconds. Is this expected?
Prompt evaluation: 50%| | 2/4 [03:45<03:45, 112.99s/it]
This does eventually finish, but it takes 9 minutes, and then I get about 0.1 tokens per second.
The performance with an IQ2_XS quant of a 7B model is also pretty bad, but at least it finishes before the heat death of the universe:
Output generated in 26.30 seconds (0.95 tokens/s, 25 tokens, context 1523, seed 1885244309)
That's why I'm not sure whether I'm running into a bug, whether these quants just haven't been designed/optimised for Metal, or whether this is expected performance.
I understand the original QuIP# paper and implementation are CUDA-focused; is this just an area where Metal isn't optimised yet? If so, are there any plans to optimise the Metal implementation for these newer quants?

Also, not sure if this is relevant, but during this process my CPU usage doesn't max out the way it normally does when generating text with these models. Normally it pins all my cores; here only a few cores reach about 70%, and Python uses around 400% CPU instead of the usual ~2800%.
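One way to narrow down whether the slowdown is in the Python frontend or in the IQ kernels themselves, assuming a local llama.cpp checkout (paths and flags here are illustrative), is to run the same GGUF with the bare CLI:

```sh
# If the stock CLI is fast here, the bottleneck is in the frontend's settings
# (e.g. layers not offloaded to Metal) rather than in the IQ2_XS kernels.
./main -m ./model.IQ2_XS.gguf -ngl 99 -p "test prompt" -n 32
```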
I also wasn't sure whether this should be an issue or a discussion, so I decided to err on the safe side and make it a discussion.
Thanks!