Implementations for Q4_0_8_8 quantization based functions - AVX512 version of ggml_gemm_q4_0_8x8_q8_0 #9532
Conversation
The perplexity was measured for models quantized from the Meta Llama 2 7B model; the run calculated perplexity over 655 chunks. The perplexity readings were found to be the same before and after the changes.
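The exact command is not quoted above. For context, 655 chunks is consistent with the WikiText-2 test set at the default 512-token context; a minimal sketch of such an invocation, with the model path and dataset location as assumed placeholders, would be:

```sh
# Sketch only: model and dataset paths are placeholders, not the exact command from this PR.
./llama-perplexity -m ./models/llama-2-7b-Q4_0_8_8.gguf -f wikitext-2-raw/wiki.test.raw
```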
Ah. Sorry I didn't comment in time.
Is quantization of safetensors to GGUF in Q4_0_8_8, Q4_0_4_8, and Q4_0_4_4 implemented? It seems to me there are no corresponding classes in gguf-py/gguf/quants.py, so probably not? Then how did bartowski create those GGUFs?
i.e. just use the normal model conversion/quantization flow: convert the safetensors model to GGUF, then quantize it to the desired Q4_0_x_x type (see the sketch below).
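A minimal sketch of that flow, assuming the standard llama.cpp tools (convert_hf_to_gguf.py and llama-quantize) and placeholder paths/filenames:

```sh
# Sketch: convert the Hugging Face safetensors model to GGUF, then quantize it.
# Paths and output names are placeholders, not taken from this conversation.
python convert_hf_to_gguf.py /path/to/Llama-2-7b --outtype f16 --outfile llama-2-7b-f16.gguf
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q4_0_8_8.gguf Q4_0_8_8
```

The same two steps work for Q4_0_4_8 and Q4_0_4_4 by changing the type argument of llama-quantize.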
Thanks, max, for the reply. I can now create the GGUFs. Based on my understanding of what people posted at LocalLlama, Q4_0_8_8 is optimized for ARM devices with SVE support, Q4_0_4_8 is for ARM with I8MM support, and Q4_0_4_4 is for ARM with NEON support. Is that correct? Is Q4_0_8_8 the fastest implementation, followed by Q4_0_4_8 and then Q4_0_4_4?

I noticed that while Q4_0 has the same size as these three models, it has a different md5sum. Is it safe to say that it is optimized for NVIDIA, so it is likely to not work as well as Q4_0_4_4 on typical ARM smartphones? People also said these models are optimized for CPUs (presumably Intel/AMD), so which Q4_0 model is best for them? Does AVX512 support matter?

The new Apple M4 chip supports SME, which is a more advanced version of SVE. Does that mean it is best to use Q4_0_8_8? I am in the process of making small models for edge devices, so it would be great if I could make GGUFs that are optimized for different ARM smartphones/tablets. Thanks a lot in advance.
Hi @max-krasnyansky, I believe that something like ggml-interleave could be a more appropriate name for the file, instead of having these implementations in a separately named per-architecture file; for the other implementations (ggml-quants.c, sgemm.cpp, etc.) we keep all the variants for different architectures in the same file, separated by flags. Maybe the authors/maintainers of llama.cpp could take an appropriate call here if needed. Thanks.

Hi @ymcki, usage of Q4_0_8_8 models on AVX2/AVX512 SIMD machines is expected to be faster than normal Q4_0 models, as evidenced by these results and PR 8713. Use of AVX512 is expected to make it faster than on AVX2-only machines. Thanks.
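As a quick way to confirm whether a given x86 Linux machine exposes AVX-512, you can inspect the CPU feature flags (llama.cpp also prints a system_info line like the one at the end of this PR description):

```sh
# List the AVX-512 feature flags advertised by the CPU (Linux).
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
```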
In the future we should probably look into performing the conversion to Q4_0_x_x at runtime while loading, similar to the way backends like CANN and AMX (in development) convert some types to their preferred layout. The disadvantage is that it would require disabling mmap, however.
Thanks everyone for the input. I updated the README.md of my HF repository to describe the other Q4_0 models: https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF Please let me know if I made any factual mistakes.
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0
* Remove zero vector parameter passing
* Rename functions and rearrange order of macros
* Edit comments
* style : minor adjustments
* Update x to start from 0

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Test setup (GCC Linux, Q4_0_8_8 model):

GCC version = 12.3
The models were quantized and tested from the Meta Llama 2 7B model - https://huggingface.co/meta-llama/Llama-2-7b
The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|