
add hf2gguf conv format of q4_0 q4_1 q5_0 q5_1 #9022

Open
wants to merge 3 commits into master

Conversation

chentyjpm

@github-actions github-actions bot added the python python script changes label Aug 14, 2024
@ngxson ngxson requested a review from compilade August 14, 2024 08:17
@compilade
Collaborator

The main reason I'm hesitant to add this is that llama-quantize uses Q4_K and Q6_K for the token embeddings when quantizing to Q4_0, Q4_1, Q5_0, or Q5_1, and so unlike --outtype q8_0, this is not equivalent to using llama-quantize.

Although I did make an exception for this in #8151 for TQ1_0 and TQ2_0.

Maybe a temporary workaround could be a clear warning in the help text of --outtype, and/or at the end of conversion with these types.
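
A minimal sketch of that workaround, assuming illustrative option handling and a hypothetical `warn_if_legacy_outtype` helper rather than the actual convert_hf_to_gguf.py code:

```python
# Minimal sketch, not the actual convert_hf_to_gguf.py code: flag the legacy
# outtypes both in the --outtype help text and at the end of conversion.
import argparse
import logging

logger = logging.getLogger("hf-to-gguf")

LEGACY_OUTTYPES = {"q4_0", "q4_1", "q5_0", "q5_1"}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--outtype",
    choices=["f32", "f16", "q8_0", *sorted(LEGACY_OUTTYPES)],
    default="f16",
    help="output type; note: q4_0/q4_1/q5_0/q5_1 quantize every tensor directly "
         "and are NOT equivalent to llama-quantize, which keeps the token "
         "embeddings in Q4_K/Q6_K for these types",
)

def warn_if_legacy_outtype(outtype: str) -> None:
    # Emit the end-of-conversion warning suggested in the review comment.
    if outtype in LEGACY_OUTTYPES:
        logger.warning(
            "--outtype %s converts every tensor directly and differs from "
            "llama-quantize, which uses Q4_K/Q6_K for the token embeddings",
            outtype,
        )
```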

@ggerganov
Member

Yes, it will cause confusion having different mixtures called the same way. Better not to add this functionality in the Python scripts.

@chentyjpm
Author

> The main reason I'm hesitant to add this is that llama-quantize uses Q4_K and Q6_K for the token embeddings when quantizing to Q4_0, Q4_1, Q5_0, or Q5_1, and so unlike --outtype q8_0, this is not equivalent to using llama-quantize.
>
> Although I did make an exception for this in #8151 for TQ1_0 and TQ2_0.
>
> Maybe a temporary workaround could be a clear warning in the help text of --outtype, and/or at the end of conversion with these types.

Thanks for the review!

I read through the C++ code in the `llama_model_quantize_internal` function, but I could not find the place where the token embeddings are given a different type in the quantized model. Could I make the Python code match llama-quantize by adding Q4_K and Q6_K for the token embeddings in the converted model?
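
For illustration, a minimal sketch of what that could look like on the conversion side. `choose_quant_type` is a hypothetical helper, the Q6_K choice is a placeholder, and whether gguf-py can encode K-quants during conversion is not assumed here:

```python
# Hypothetical sketch, not existing convert_hf_to_gguf.py code: mirror the
# llama-quantize exception of keeping the token embeddings in a K-quant
# when the requested output type is a legacy quant.
import gguf

LEGACY_TARGETS = {
    gguf.GGMLQuantizationType.Q4_0,
    gguf.GGMLQuantizationType.Q4_1,
    gguf.GGMLQuantizationType.Q5_0,
    gguf.GGMLQuantizationType.Q5_1,
}

def choose_quant_type(tensor_name: str, target: gguf.GGMLQuantizationType) -> gguf.GGMLQuantizationType:
    # Q6_K is a placeholder here; the exact K-quant per target type should
    # follow the per-tensor decisions made inside llama_model_quantize_internal.
    if target in LEGACY_TARGETS and tensor_name.endswith("token_embd.weight"):
        return gguf.GGMLQuantizationType.Q6_K
    return target
```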

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request May 23, 2025
This is in order to make smaller conversions to generate an imatrix.

`Q4_0` and `Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0` and `Q5_1` here use q8_0 for those tensors.

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

Adapted from the following llama.cpp mainline PR: ggml-org/llama.cpp#9022
Original author @chentyjpm
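
For reference, a minimal sketch of the override scheme described in this commit message; the function name is hypothetical and the tensor-name suffixes follow the usual GGUF naming:

```python
# Illustrative sketch of the conversion scheme above, not the actual patch:
# embeddings, output, attn_k and attn_v get a higher-precision legacy quant,
# everything else keeps the requested target type.
import gguf

UPGRADED_TYPE = {
    gguf.GGMLQuantizationType.Q4_0: gguf.GGMLQuantizationType.Q5_0,
    gguf.GGMLQuantizationType.Q4_1: gguf.GGMLQuantizationType.Q5_0,
    gguf.GGMLQuantizationType.Q5_0: gguf.GGMLQuantizationType.Q8_0,
    gguf.GGMLQuantizationType.Q5_1: gguf.GGMLQuantizationType.Q8_0,
}

SENSITIVE_SUFFIXES = (
    "token_embd.weight",
    "output.weight",
    "attn_k.weight",
    "attn_v.weight",
)

def tensor_type_for(tensor_name: str, target: gguf.GGMLQuantizationType) -> gguf.GGMLQuantizationType:
    # e.g. "blk.0.attn_k.weight" -> Q5_0 when the requested target is Q4_0
    if target in UPGRADED_TYPE and tensor_name.endswith(SENSITIVE_SUFFIXES):
        return UPGRADED_TYPE[target]
    return target
```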
ikawrakow pushed a commit to ikawrakow/ik_llama.cpp that referenced this pull request May 24, 2025
* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions to generate an iMatrix file.

`Q4_0` and `Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0` and `Q5_1` here use q8_0 for those tensors.

Adapted from the following llama.cpp mainline PR: ggml-org/llama.cpp#9022
Original author @chentyjpm

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jun 1, 2025

* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions to generate an iMatrix file.

`Q4_0` and `Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0` and `Q5_1` here use q8_0 for those tensors.

Adapted from the following llama.cpp mainline PR: ggml-org#9022
Original author @chentyjpm

Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention