
Implementations for Q4_0_8_8 quantization based functions - AVX512 version of ggml_gemm_q4_0_8x8_q8_0 #9532

Merged

Conversation

Srihari-mcw (Contributor)

  • The PR contains an AVX512 version of ggml_gemm_q4_0_8x8_q8_0, used for Q4_0_8_8 quantized models
  • Good gains were seen, especially in prompt processing, with the above changes compared to the existing AVX2 version
  • The PR introduces an AVX512 integer variant of the mul_sum_i8_pairs function for performing dot product operations, plus macros for converting from half precision to full precision based on F16C intrinsics support (a rough sketch of the pair-multiply-accumulate idea is shown below)

GCC Linux :

Q4_0_8_8 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 67.87 ± 0.15 | | faf67b3 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 103.70 ± 0.33 | 52.8% | a829583 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 15.13 ± 0.00 | | faf67b3 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 15.12 ± 0.00 | 0% | a829583 |

GCC Version = 12.3

The models were quantized from the Meta Llama 2 7B model (https://huggingface.co/meta-llama/Llama-2-7b) and tested.

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default :

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 18, 2024.
Srihari-mcw (Contributor, Author) commented Sep 18, 2024

The perplexity was measured for models quantized from the Meta Llama 2 7B model with the following command :
./llama-perplexity -m <model_name> -f wikitext-2-raw/wiki.test.raw

It calculated perplexity over 655 chunks :
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4

The perplexity results are tabulated as follows :

| model | perplexity (final estimate PPL) | commit id |
| --- | --- | --- |
| llama 7B Q4_0_8_8 | 5.9625 +/- 0.03348 | faf67b3 |
| llama 7B Q4_0_8_8 | 5.9625 +/- 0.03348 | a829583 |

The perplexity readings were found to be the same before and after the changes

Review comment on ggml/src/ggml-aarch64.c (outdated, resolved)
ggerganov merged commit 1e7b929 into ggerganov:master on Sep 23, 2024
53 checks passed
max-krasnyansky (Collaborator)

Ah. Sorry I didn't comment in time.
Ideally this should go into ggml/src/ggml-x64.c (aka x86-64 & amd64)

ymcki commented Oct 4, 2024

Is quantization of safetensors to GGUF in Q4_0_8_8, Q4_0_4_8, or Q4_0_4_4 implemented?

It seems to me there are no corresponding classes in gguf-py/gguf/quants.py, so probably not?

Then, how did bartowski create those ggufs?
https://huggingface.co/bartowski/magnum-v2-4b-GGUF

max-krasnyansky (Collaborator)


The llama-quantize tool can produce those types.

i.e. just use the normal model conversion/quantization flow (example commands below):

  • convert_hf_to_gguf.py to save an F16 GGUF
  • llama-quantize to convert the F16 GGUF to Q4_0_X_X
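For example, an invocation along these lines should work (the model directory and output filenames are placeholders, not paths from this thread):

python convert_hf_to_gguf.py ./my-model-dir --outtype f16 --outfile my-model-F16.gguf
./llama-quantize my-model-F16.gguf my-model-Q4_0_8_8.gguf Q4_0_8_8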

ymcki commented Oct 5, 2024

Thanks max for the reply. I can now create the ggufs.

Based on my understanding of what people posted at LocalLlama, Q4_0_8_8 is optimized for ARM devices with SVE support, Q4_0_4_8 is for ARM with I8MM support, and Q4_0_4_4 is for ARM with NEON support. Is that correct? Is Q4_0_8_8 the fastest implementation, followed by Q4_0_4_8 and then Q4_0_4_4?

I noticed that while Q4_0 has the same size as these three models, it has a different md5sum. Is it safe to say that it is optimized for NVIDIA, so it is likely not to work as well as Q4_0_4_4 on typical ARM smartphones?

People also said these models are optimized for CPUs (presumably Intel/AMD), so which Q4_0 model is best for them? Does AVX512 support matter?

The new Apple M4 chip supports SME, which is a more advanced version of SVE. Does that mean it is best to use Q4_0_8_8?

I am in the process of making small models for edge devices, so it would be great if I could make GGUFs that are optimized for different ARM smartphones/tablets.

Thanks a lot in advance.

Srihari-mcw (Contributor, Author) commented Oct 5, 2024

Hi @max-krasnyansky , I believe something like ggml-interleave could be a more appropriate name for the file, rather than splitting the implementations into separate per-architecture files, since for the other implementations (ggml-quants.c, sgemm.cpp, etc.) all the architecture-specific variants live in the same file, separated by flags. Maybe the authors/maintainers of llama.cpp could take the appropriate call here if needed. Thanks

Hi @ymcki , Q4_0_8_8 models are expected to run faster than regular Q4_0 models on machines with AVX2/AVX512 SIMD, as evidenced by these results and PR 8713. AVX512 is expected to make them faster still compared with AVX2-only machines. Thanks

slaren (Collaborator) commented Oct 5, 2024

In the future we should probably look into performing the conversion to Q4_0_x_x at runtime while loading, similar to how backends like CANN and AMX (in development) convert some types to their preferred layout. The disadvantage, however, is that it would require disabling mmap.
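As a purely conceptual illustration of this load-time repacking idea (the struct and function names below are hypothetical and do not reflect ggml's actual types, API, or on-disk layout), eight plain 4-bit blocks could be regrouped into one interleaved block roughly like this:

#include <stdint.h>
#include <string.h>

#define QK 32                                                   // elements per 4-bit block, as in Q4_0

typedef struct { uint16_t d;    uint8_t qs[QK / 2]; } blk_q4;   // one plain block: scale + 16 bytes of nibbles
typedef struct { uint16_t d[8]; uint8_t qs[QK * 4]; } blk_q4x8; // eight blocks with interleaved nibble data

// Repack eight source blocks (in practice, blocks taken from the same position
// in eight different rows) into one interleaved block by copying 8-byte chunks
// from each source in round-robin order, so a wide SIMD kernel can read data
// for all eight rows contiguously.
static void repack_q4x8(blk_q4x8 *dst, const blk_q4 *src) {
    for (int b = 0; b < 8; b++) {
        dst->d[b] = src[b].d;
    }
    for (int chunk = 0; chunk < (QK / 2) / 8; chunk++) {
        for (int b = 0; b < 8; b++) {
            memcpy(dst->qs + (chunk * 8 + b) * 8, src[b].qs + chunk * 8, 8);
        }
    }
}

Roughly speaking, the numbers in the Q4_0_X_X type names encode how many rows are interleaved and in what chunk width, matched to the SIMD instructions of the target kernel.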

ymcki commented Oct 7, 2024

Thanks everyone for the inputs. I updated the README.md for my hf repository to describe the other Q4_0 models:

https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF

Please let me know if I made any factual mistakes.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit commments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>