
add hf2gguf conv format of q4_0 q4_1 q5_0 q5_1 #9022

Open
wants to merge 3 commits into master

Conversation

chentyjpm

@github-actions github-actions bot added the python python script changes label Aug 14, 2024
@ngxson ngxson requested a review from compilade August 14, 2024 08:17
@compilade
Collaborator

The main reason I'm hesitant to add this is that llama-quantize uses Q4_K and Q6_K for the token embeddings when quantizing to Q4_0, Q4_1, Q5_0, or Q5_1, and so unlike --outtype q8_0, this is not equivalent to using llama-quantize.

Although I did make an exception for this in #8151 for TQ1_0 and TQ2_0.

Maybe a temporary workaround could be a clear warning in the help text of --outtype, and/or at the end of conversion with these types.
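
A minimal sketch of that workaround, assuming illustrative option handling and a hypothetical `warn_if_legacy_outtype` helper rather than the actual convert_hf_to_gguf.py code:

```python
# Minimal sketch, not the actual convert_hf_to_gguf.py code: flag the legacy
# outtypes both in the --outtype help text and at the end of conversion.
import argparse
import logging

logger = logging.getLogger("hf-to-gguf")

LEGACY_OUTTYPES = {"q4_0", "q4_1", "q5_0", "q5_1"}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--outtype",
    choices=["f32", "f16", "q8_0", *sorted(LEGACY_OUTTYPES)],
    default="f16",
    help="output type; note: q4_0/q4_1/q5_0/q5_1 quantize every tensor directly "
         "and are NOT equivalent to llama-quantize, which keeps the token "
         "embeddings in Q4_K/Q6_K for these types",
)

def warn_if_legacy_outtype(outtype: str) -> None:
    # Emit the end-of-conversion warning suggested in the review comment.
    if outtype in LEGACY_OUTTYPES:
        logger.warning(
            "--outtype %s converts every tensor directly and differs from "
            "llama-quantize, which uses Q4_K/Q6_K for the token embeddings",
            outtype,
        )
```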

@ggerganov
Member

Yes, it will cause confusion having different mixtures called the same way. Better not to add this functionality in the Python scripts.

@chentyjpm
Author

> The main reason I'm hesitant to add this is that llama-quantize uses Q4_K and Q6_K for the token embeddings when quantizing to Q4_0, Q4_1, Q5_0, or Q5_1, and so unlike --outtype q8_0, this is not equivalent to using llama-quantize.
>
> Although I did make an exception for this in #8151 for TQ1_0 and TQ2_0.
>
> Maybe a temporary workaround could be a clear warning in the help text of --outtype, and/or at the end of conversion with these types.

Thanks for the review!

I read through the C++ code in the `llama_model_quantize_internal` function, but I could not find the place where the token embeddings are given a different type in the quantized model. Could I make the Python code match llama-quantize by adding Q4_K and Q6_K for the token embeddings in the converted model?
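
For illustration, a minimal sketch of what that could look like on the conversion side. `choose_quant_type` is a hypothetical helper, the Q6_K choice is a placeholder, and whether gguf-py can encode K-quants during conversion is not assumed here:

```python
# Hypothetical sketch, not existing convert_hf_to_gguf.py code: mirror the
# llama-quantize exception of keeping the token embeddings in a K-quant
# when the requested output type is a legacy quant.
import gguf

LEGACY_TARGETS = {
    gguf.GGMLQuantizationType.Q4_0,
    gguf.GGMLQuantizationType.Q4_1,
    gguf.GGMLQuantizationType.Q5_0,
    gguf.GGMLQuantizationType.Q5_1,
}

def choose_quant_type(tensor_name: str, target: gguf.GGMLQuantizationType) -> gguf.GGMLQuantizationType:
    # Q6_K is a placeholder here; the exact K-quant per target type should
    # follow the per-tensor decisions made inside llama_model_quantize_internal.
    if target in LEGACY_TARGETS and tensor_name.endswith("token_embd.weight"):
        return gguf.GGMLQuantizationType.Q6_K
    return target
```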

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request May 23, 2025
This is in order to make smaller conversions to generate an imatrix.

`Q4_0` and `Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0` and `Q5_1` here use q8_0 for those tensors.

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

Adapted from the following llama.cpp mainline PR: ggml-org/llama.cpp#9022
Original author @chentyjpm
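
For reference, a minimal sketch of the override scheme described in this commit message; the function name is hypothetical and the tensor-name suffixes follow the usual GGUF naming:

```python
# Illustrative sketch of the conversion scheme above, not the actual patch:
# embeddings, output, attn_k and attn_v get a higher-precision legacy quant,
# everything else keeps the requested target type.
import gguf

UPGRADED_TYPE = {
    gguf.GGMLQuantizationType.Q4_0: gguf.GGMLQuantizationType.Q5_0,
    gguf.GGMLQuantizationType.Q4_1: gguf.GGMLQuantizationType.Q5_0,
    gguf.GGMLQuantizationType.Q5_0: gguf.GGMLQuantizationType.Q8_0,
    gguf.GGMLQuantizationType.Q5_1: gguf.GGMLQuantizationType.Q8_0,
}

SENSITIVE_SUFFIXES = (
    "token_embd.weight",
    "output.weight",
    "attn_k.weight",
    "attn_v.weight",
)

def tensor_type_for(tensor_name: str, target: gguf.GGMLQuantizationType) -> gguf.GGMLQuantizationType:
    # e.g. "blk.0.attn_k.weight" -> Q5_0 when the requested target is Q4_0
    if target in UPGRADED_TYPE and tensor_name.endswith(SENSITIVE_SUFFIXES):
        return UPGRADED_TYPE[target]
    return target
```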
ikawrakow pushed a commit to ikawrakow/ik_llama.cpp that referenced this pull request May 24, 2025
* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions to generate an iMatrix file.

`Q4_0` and `Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0` and `Q5_1` here use q8_0 for those tensors.

Adapted from the following llama.cpp mainline PR: ggml-org/llama.cpp#9022
Original author @chentyjpm

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jun 1, 2025

* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions to generate an iMatrix file.

`Q4_0` and `Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0` and `Q5_1` here use q8_0 for those tensors.

Adapted from the following llama.cpp mainline PR: ggml-org#9022
Original author @chentyjpm

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention