
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range #5721

Merged 2 commits into master on Feb 26, 2024

Conversation

@ikawrakow (Contributor) commented on Feb 26, 2024

This PR adds two new quantization types, IQ2_S and IQ2_M, to complete the coverage of the 2-3 bit quantization range.

Why? The reason for having all these new quantization types is best explained with the following graph, which shows the quantization error defined as PPL(Q)/PPL(fp16)-1 as a function of bits-per-weight (bpw). The bpw is for the complete model, including output.weight and token_embd.weight tensors. The data is for LLaMA-v2-13B, but other models show a very similar behavior.
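
To make the metric concrete, here is a minimal sketch of the quantization-error definition used in the graph; the PPL values in it are placeholders for illustration, not measurements from this PR.

# Quantization error as plotted: PPL(Q)/PPL(fp16) - 1.
# The example PPL values are made up, not taken from this PR.
def quant_error(ppl_q: float, ppl_fp16: float) -> float:
    return ppl_q / ppl_fp16 - 1.0

print(quant_error(5.10, 4.90))  # ~0.041, i.e. about 4.1% above the fp16 baseline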

The black/blue symbols show the results for k-/legacy quants using 668b31f, which is the last commit before I started adding i-quants and imatrix stuff. The red symbols represent the new i-quants and updated k-quants, including IQ2_S and IQ2_M added by this PR; magenta circles are for legacy quants (with all i-, k-, and legacy quants using imatrix from wiki.train.raw). So, in a nutshell:

  • We now have several quantization options in the sub-3-bit range. Why do we need several? Because the only reason to go to sub-3-bit quantization is to squeeze a large model into the limited RAM/VRAM available, and having several quantization types allows one to select the type with the lowest quantization error that still works on the available computing platform (fits in RAM/VRAM, has acceptable performance when partially offloading to the GPU, etc.).
  • We have a much lower quantization error in the 3-4 bpw range (note that the y-axis is logarithmic, so the reduction in quantization error is in the 50%-100% range). Alternatively, if we were satisfied with the generation quality of the former 3-bit quantization, we can now have the same with a ~10% smaller model.
  • We now have a lower quantization error in the 4+ bit range for k- and legacy quants (and Q4_1 behaves as expected instead of having a higher quantization error than Q4_0, as was often the case).
  • I think this graph makes it easy to see the rough quantization error correspondence between k- and i-quants: Q2_K -> IQ3_XXS, Q3_K_S -> IQ3_XS, Q3_K_M -> IQ3_S, Q3_K_L -> IQ3_M.

[Figure legacy_vs_iq_l2_13: quantization error, PPL(Q)/PPL(fp16)-1, vs. bits-per-weight for LLaMA-v2-13B]

Interestingly enough, the IQ2_XXS...IQ3_M quantization error can be described with a simple fit of the form a * exp(-b * bpw). The 1.5 bpw quantization IQ1_S (which I'm not showing here so as not to have too large a y-axis range) falls nearly onto the same fit. If we were able to keep this rate of quantization error reduction with bpw beyond 4 bpw, we would get Q6_K performance at about 5.3 bpw.
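
The fit itself is easy to reproduce; below is a minimal sketch of fitting a * exp(-b * bpw) with scipy, using made-up (bpw, error) points rather than the actual measurements behind the graph.

# Fit quantization error vs. bpw to the form a * exp(-b * bpw).
# The data points are illustrative placeholders, not this PR's measurements.
import numpy as np
from scipy.optimize import curve_fit

bpw = np.array([2.06, 2.20, 2.43, 2.70, 3.06, 3.44, 3.66])   # hypothetical bpw values
err = np.array([0.55, 0.42, 0.28, 0.18, 0.10, 0.06, 0.045])  # hypothetical PPL(Q)/PPL(fp16)-1

def model(x, a, b):
    return a * np.exp(-b * x)

(a, b), _ = curve_fit(model, bpw, err, p0=(10.0, 2.0))
print(f"a = {a:.3f}, b = {b:.3f}")

# If the trend continued beyond ~4 bpw, this is where a given error level would be reached:
print("predicted error at 5.3 bpw:", model(5.3, a, b))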

To me it looks like we need a quantization type with about 4 bpw to close the gap between IQ3_M and Q4_K.

@ggerganov (Owner) left a comment

To me it looks like we need a quantization type with about 4 bpw to close the gap between IQ3_M and Q4_K.

Yes, I agree

Review comment on examples/quantize/quantize.cpp (outdated, resolved)
@sorasoras commented

@ikawrakow

Q5_K_S:
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_1:   40 tensors
llama_model_loader: - type q5_K:  161 tensors
llama_model_loader: - type q6_K:    1 tensors

The question is: could we expect new NL quants on the way, like IQ4_NL? IQ5_NL and IQ6_NL in particular.
There is a significant improvement from replacing those 40 tensors from Q4_0 with IQ4_NL, without any difference in size, at least in Qwen1.
Anyway, thanks for the hard work.

@ikawrakow (Contributor, Author) commented

The question is: could we expect new NL quants on the way, like IQ4_NL? IQ5_NL and IQ6_NL in particular.

At 5 bits and above there isn't much gain from alternative quantization, at least not for the models I'm using for testing, where, once you use an imatrix, Q5_0 is basically as good as Q5_K.

@sorasoras commented on Feb 26, 2024

The question is: could we expect new NL quants on the way, like IQ4_NL? IQ5_NL and IQ6_NL in particular.

At 5 bits and above there isn't much gain from alternative quantization, at least not for the models I'm using for testing, where, once you use an imatrix, Q5_0 is basically as good as Q5_K.

Fun fact: Q5_K_S beats Q5_K_M for my use case with imatrix. The difference is Q6_K vs. Q8_0: Q5_K_M uses Q8_0 where Q5_K_S uses Q6_K, in my use case.

@dranger003 (Contributor) commented

@ikawrakow Thanks for the amazing work. While testing IQ3_S/IQ3_M from #5676 I'm getting a segfault when using more than 2 threads with quantize on some models. I'll test this PR later today to see if the same issue is present. All other quant types are working fine, so I'm not sure what is different with these (it could be thread related).

I added the output here #5676 (comment).

@ikawrakow (Contributor, Author) commented

@dranger003 Can you post a failing model somewhere I can download it? I have quantized many models with these quantization types without issue (and yes, I'm always using multi-threading), so I don't know what could be wrong without a test case.

@dranger003 (Contributor) commented on Feb 26, 2024

@dranger003 Can you post a failing model somewhere I can download it? I have quantized many models with these quantization types without issue (and yes, I'm always using multi-threading), so I don't know what could be wrong without a test case.

@ikawrakow Yes, although you might hate me quite a bit given its size. See here.

EDIT: Adding details here as I find out more; hopefully this helps. Another finding is that it crashes using 8 or 12 threads but it doesn't crash using 2 or 16 threads. I have devtools installed and can debug the code if you need me to look up something specific, but I don't know where to look otherwise without some guidance.

EDIT2: I think this may be a race condition and not directly tied to the thread count. For example, if I run quantize several times in a row with the same thread count, say 12, then after a number of failed attempts one of the runs goes through fine. Also, I just tested IQ2_S/IQ2_M and I get the same behavior. I have been quantizing several models and I only get this issue with the new IQ3/IQ2 quant types.
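
For anyone trying to reproduce this, a repetition loop along the lines below can expose the intermittent failure; the quantize binary location, the model paths, and the trailing thread-count argument are assumptions here and may need adjusting to your setup.

# Repeatedly run the quantize tool with a fixed thread count to catch an
# intermittent (race-condition-like) crash. Paths and arguments are placeholders.
import subprocess

for i in range(20):
    proc = subprocess.run(
        ["./quantize", "model-f16.gguf", "model-iq3_s.gguf", "IQ3_S", "12"],
        capture_output=True,
        text=True,
    )
    print(f"run {i}: exit code {proc.returncode}")
    if proc.returncode != 0:
        print(proc.stderr[-2000:])  # tail of stderr from the failing run
        break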

ikawrakow merged commit a33e6a0 into master on Feb 26, 2024
60 of 61 checks passed
ikawrakow deleted the ik/iq2_s_new2 branch on February 26, 2024 at 16:28
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request on Mar 13, 2024:

…on range (ggerganov#5721)

* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request on Apr 1, 2024, with the same commit message.
@mofosyne added labels on May 25, 2024: Tensor Encoding Scheme (https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes) and Review Complexity: High (generally requires in-depth knowledge of LLMs or GPUs).