
Implementations for Q4_0_8_8 quantization based functions - AVX512 version of ggml_gemm_q4_0_8x8_q8_0 #9532

Merged

Conversation

Srihari-mcw (Contributor)

  • The PR contains an AVX512 version of ggml_gemm_q4_0_8x8_q8_0, used for Q4_0_8_8 quantized models
  • Good gains were seen, especially in prompt processing, with the above changes compared to the existing AVX2 version
  • The PR introduces an AVX512 integer variant of the mul_sum_i8_pairs function for performing dot product operations, plus macros for converting from half precision to full precision based on F16C intrinsics support (a rough sketch of the pair-multiply-accumulate idea is shown below)

GCC Linux :

Q4_0_8_8 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 67.87 ± 0.15 | | faf67b3 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 103.70 ± 0.33 | 52.8% | a829583 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 15.13 ± 0.00 | | faf67b3 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 15.12 ± 0.00 | 0% | a829583 |

GCC Version = 12.3

The models were quantized from the Meta Llama 2 7B model (https://huggingface.co/meta-llama/Llama-2-7b) and tested.

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default :

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 18, 2024.
Srihari-mcw (Contributor, Author) commented Sep 18, 2024

The perplexity was measured for models quantized from the Meta Llama 2 7B model with the following command :
./llama-perplexity -m <model_name> -f wikitext-2-raw/wiki.test.raw

It calculated perplexity over 655 chunks :
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4

The perplexity results are tabulated as follows :

| model | perplexity (final estimate PPL) | commit id |
| --- | --- | --- |
| llama 7B Q4_0_8_8 | 5.9625 +/- 0.03348 | faf67b3 |
| llama 7B Q4_0_8_8 | 5.9625 +/- 0.03348 | a829583 |

The perplexity readings were found to be the same before and after the changes

Review comment on ggml/src/ggml-aarch64.c (outdated, resolved)
ggerganov merged commit 1e7b929 into ggerganov:master on Sep 23, 2024
53 checks passed
max-krasnyansky (Collaborator)

Ah. Sorry I didn't comment in time.
Ideally this should go into ggml/src/ggml-x64.c (aka x86-64 & amd64)

ymcki commented Oct 4, 2024

Is quantization of safetensors to GGUF in Q4_0_8_8, Q4_0_4_8, or Q4_0_4_4 implemented?

It seems to me there are no corresponding classes in gguf-py/gguf/quants.py, so probably not?

Then, how did bartowski create those ggufs?
https://huggingface.co/bartowski/magnum-v2-4b-GGUF

max-krasnyansky (Collaborator)


The llama-quantize tool can produce those types.

i.e. just use the normal model conversion/quantization flow (example commands below):

  • convert_hf_to_gguf.py to save an F16 GGUF
  • llama-quantize to convert the F16 GGUF to Q4_0_X_X
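For example, an invocation along these lines should work (the model directory and output filenames are placeholders, not paths from this thread):

python convert_hf_to_gguf.py ./my-model-dir --outtype f16 --outfile my-model-F16.gguf
./llama-quantize my-model-F16.gguf my-model-Q4_0_8_8.gguf Q4_0_8_8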

ymcki commented Oct 5, 2024

Thanks max for the reply. I can now create the ggufs.

Based on my understanding of what people posted at LocalLlama, Q4_0_8_8 is optimized for ARM devices with SVE support, Q4_0_4_8 is for ARM with I8MM support, and Q4_0_4_4 is for ARM with NEON support. Is that correct? Is Q4_0_8_8 the fastest implementation, followed by Q4_0_4_8 and then Q4_0_4_4?

I noticed that while Q4_0 has the same size as these three models, it has a different md5sum. Is it safe to say that it is optimized for NVIDIA, so it is likely not to work as well as Q4_0_4_4 on typical ARM smartphones?

People also said these models are optimized for CPUs (presumably Intel/AMD), so which Q4_0 model is best for them? Does AVX512 support matter?

The new Apple M4 chip supports SME, which is a more advanced version of SVE. Does that mean it is best to use Q4_0_8_8?

I am in the process of making small models for edge devices, so it would be great if I could make GGUFs that are optimized for different ARM smartphones/tablets.

Thanks a lot in advance.

Srihari-mcw (Contributor, Author) commented Oct 5, 2024

Hi @max-krasnyansky , I believe something like ggml-interleave could be a more appropriate name for the file, rather than splitting the implementations into separate per-architecture files, since for the other implementations (ggml-quants.c, sgemm.cpp, etc.) all the architecture-specific variants live in the same file, separated by flags. Maybe the authors/maintainers of llama.cpp could take the appropriate call here if needed. Thanks

Hi @ymcki , Q4_0_8_8 models are expected to run faster than regular Q4_0 models on machines with AVX2/AVX512 SIMD, as evidenced by these results and PR 8713. AVX512 is expected to make them faster still compared with AVX2-only machines. Thanks

slaren (Collaborator) commented Oct 5, 2024

In the future we should probably look into performing the conversion to Q4_0_x_x at runtime while loading, similar to how backends like CANN and AMX (in development) convert some types to their preferred layout. The disadvantage, however, is that it would require disabling mmap.
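As a purely conceptual illustration of this load-time repacking idea (the struct and function names below are hypothetical and do not reflect ggml's actual types, API, or on-disk layout), eight plain 4-bit blocks could be regrouped into one interleaved block roughly like this:

#include <stdint.h>
#include <string.h>

#define QK 32                                                   // elements per 4-bit block, as in Q4_0

typedef struct { uint16_t d;    uint8_t qs[QK / 2]; } blk_q4;   // one plain block: scale + 16 bytes of nibbles
typedef struct { uint16_t d[8]; uint8_t qs[QK * 4]; } blk_q4x8; // eight blocks with interleaved nibble data

// Repack eight source blocks (in practice, blocks taken from the same position
// in eight different rows) into one interleaved block by copying 8-byte chunks
// from each source in round-robin order, so a wide SIMD kernel can read data
// for all eight rows contiguously.
static void repack_q4x8(blk_q4x8 *dst, const blk_q4 *src) {
    for (int b = 0; b < 8; b++) {
        dst->d[b] = src[b].d;
    }
    for (int chunk = 0; chunk < (QK / 2) / 8; chunk++) {
        for (int b = 0; b < 8; b++) {
            memcpy(dst->qs + (chunk * 8 + b) * 8, src[b].qs + chunk * 8, 8);
        }
    }
}

Roughly speaking, the numbers in the Q4_0_X_X type names encode how many rows are interleaved and in what chunk width, matched to the SIMD instructions of the target kernel.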

ymcki commented Oct 7, 2024

Thanks everyone for the inputs. I updated the README.md for my hf repository to describe the other Q4_0 models:

https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF

Please let me know if I made any factual mistakes.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit commments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>