Very slow IQ quant performance: expected performance of the llama.cpp IQ implementation on Apple Silicon? #5617
Replies: 6 comments · 11 replies
-
I downloaded a 120B IQ2_XS GGUF model from https://huggingface.co/dranger003/miquliz-120b-v2.0-iMat.GGUF/tree/main and ran it as a test on my M1 Max Mac Studio (8+2 CPU, 10 GPU, 64GB RAM).
It is quite slow, but that is the expected speed on an M1 Max. Your speed, though, is far slower than it should be.
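For anyone wanting to reproduce this kind of measurement, llama.cpp ships a benchmarking tool; a minimal invocation might look like the following (the model path is a placeholder, adjust to your local file):

```sh
# Report prompt-processing (pp) and token-generation (tg) speed in tokens/s.
./llama-bench -m ./miquliz-120b-v2.0.IQ2_XS.gguf -p 512 -n 128
```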
-
The numbers in your post look like they come from CPU-only inference. Did you offload the model to the GPU?
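As a sanity check, a minimal invocation that offloads all layers to Metal with the stock llama.cpp CLI could look like this (the model path is a placeholder; `-ngl 99` is a common "offload everything" value):

```sh
# Offload all layers to the GPU and generate a few tokens as a quick test.
./main -m ./miquliz-120b-v2.0.IQ2_XS.gguf -ngl 99 -p "Hello" -n 64
```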
-
On this note, I am wondering if more optimization can be done on Apple Silicon to run these models even faster. |
-
Apple Silicon is not very friendly to the IQ quants: their dequantization goes through codebook (lookup-table) loads rather than the simple bit arithmetic used by the K-quants, and those gather-style lookups are slow on this hardware. So, in short, I agree. If someone knows how to trick Apple into better performance for the IQ quants, that would be very welcome. |
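To make the distinction concrete, here is a simplified C sketch of the two dequantization styles. This is not the actual llama.cpp kernel code; the names and the tiny 4-entry codebook are made up for illustration (the real IQ grids hold hundreds of packed patterns):

```c
#include <stdint.h>

/* Illustrative 4-entry codebook of 8-value patterns. */
static const int8_t codebook[4][8] = {
    { 1,  1, -1,  1,  1, -1,  1,  1},
    {-1,  1,  1,  1, -1,  1,  1, -1},
    { 1, -1,  1, -1,  1,  1, -1,  1},
    {-1, -1,  1,  1,  1, -1, -1,  1},
};

/* IQ-style: every block index triggers a data-dependent table load
   (a gather), which vectorizes poorly on NEON/Metal. */
void dequant_lookup(const uint8_t *idx, float scale, float *out, int n_blocks) {
    for (int b = 0; b < n_blocks; ++b)
        for (int j = 0; j < 8; ++j)
            out[b * 8 + j] = scale * (float) codebook[idx[b] & 3][j];
}

/* K-quant-style: values are reconstructed with plain bit arithmetic,
   which maps directly onto SIMD instructions with no memory lookups. */
void dequant_bits(const uint8_t *q, float scale, float min, float *out, int n) {
    for (int i = 0; i < n; ++i) {
        uint8_t nib = (i & 1) ? (q[i / 2] >> 4) : (q[i / 2] & 0x0F);
        out[i] = scale * (float) nib + min;
    }
}
```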
-
IQ4_NL is as fast as Q4_K. I haven't tried the 2-bit or 3-bit IQ quants, but IMHO Apple Silicon is really slow on anything below 4 bits; even Q5_K_M is still faster than Q3_K_S. |
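One way to verify this kind of ranking on your own machine, assuming you have the corresponding quants locally (file names are placeholders), is to benchmark each file in turn:

```sh
# Compare token-generation speed across quant types, one run per file.
./llama-bench -m ./model.Q5_K_M.gguf -n 128
./llama-bench -m ./model.Q3_K_S.gguf -n 128
./llama-bench -m ./model.IQ4_NL.gguf -n 128
```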
-
It's slow on my ARM device too. |
-
Hey there,
I've been playing about with the IQ quantisation methods. I have an M1 Max MacBook Pro with 64GB of RAM, and I usually run Mixtral finetunes (8x7B with 2 experts) at Q5_K_M with reasonable performance (8-15 t/s); prompt evaluation normally takes 10-20 seconds even on very large prompts.
I downloaded an IQ2_XS quant of a 120B model, and it's taking close to 10 minutes to evaluate the prompt; I'm getting about 1 token every 4-5 seconds. Is this expected?
Prompt evaluation: 50%| | 2/4 [03:45<03:45, 112.99s/it]
This does eventually finish, but it takes 9 minutes, and then I get about 0.1 tokens per second.
The performance with an IQ2_XS quant of a 7B model is also pretty bad, but at least it finishes before the heat death of the universe:
Output generated in 26.30 seconds (0.95 tokens/s, 25 tokens, context 1523, seed 1885244309)
That's why I'm not sure whether I'm running into a bug, whether these quants just haven't been designed/optimised for Metal, or whether this is expected performance.
I understand the original QuIP# paper and implementation are CUDA-focused; is this just an area where Metal isn't optimised yet? If so, are there any plans to optimise the Metal implementation for these newer quants?

Also, not sure if this is relevant, but during this process my CPU usage doesn't max out the way it normally does when generating text with these models. Normally it pins all my cores; here only a few cores reach about 70%, and Python uses around 400% CPU instead of the usual ~2800%.
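One way to narrow down whether the slowdown is in the Python frontend or in the IQ kernels themselves, assuming a local llama.cpp checkout (paths and flags here are illustrative), is to run the same GGUF with the bare CLI:

```sh
# If the stock CLI is fast here, the bottleneck is in the frontend's settings
# (e.g. layers not offloaded to Metal) rather than in the IQ2_XS kernels.
./main -m ./model.IQ2_XS.gguf -ngl 99 -p "test prompt" -n 32
```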
I also wasn't sure whether this should be an issue or a discussion, so I decided to err on the safe side and make it a discussion.
Thanks!