Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range #5721
Conversation
To me it looks like we need a quantization type with about 4 bpw to close the gap between IQ3_M and Q4_K.
Yes, I agree
The question is: …
At 5 bits and above there isn't much gain from alternative quantization, at least not for the models that I'm using for testing where, once you use an imatrix, …
fun fact, …
@ikawrakow Thanks for the amazing work. While testing IQ3_S/IQ3_M from #5676 I'm getting a segfault when quantizing with more than 2 threads. I added the output here: #5676 (comment).
@dranger003 Can you post a failing model somewhere I can download it? I have quantized many models with these quantization types without issue (and yes, I'm always using multi-threading), so I don't know what could be wrong without a test case.
@ikawrakow Yes, although you might hate me quite a bit given its size. See here.

EDIT: Adding details here as I find out more, hopefully this can help. Another finding is that it crashes using 8 or 12 threads but it doesn't crash using 2 or 16 threads. I have devtools installed and can debug the code if you need me to look up something specific, but I just don't know where to look otherwise without some guidance.

EDIT2: I think this may be a race condition and not directly tied to the thread count. For example, if I run quantize several times in a row with the same thread count, say 12, then after a number of failed attempts one of the runs will go through fine. Also, I just tested IQ2_S/IQ2_M and I get the same behavior. I have been quantizing several models and I only get this issue with the new IQ3/IQ2 quant types.
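A minimal repro sketch (not from the thread) for checking whether the failure is intermittent: it calls `llama_model_quantize` repeatedly with a fixed thread count, assuming the standard `llama.h` quantization API; the file names, loop count, and the `LLAMA_FTYPE_MOSTLY_IQ2_S` choice are placeholders.

```cpp
// repro_quantize.cpp -- sketch of a repeated-quantization loop to surface an
// intermittent failure; paths, thread count, and ftype are placeholders.
#include "llama.h"
#include <cstdio>

int main() {
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.nthread = 12;                        // a thread count reported to trigger the crash
    params.ftype   = LLAMA_FTYPE_MOSTLY_IQ2_S;  // one of the new quant types (assumed enum name)

    for (int i = 0; i < 10; ++i) {
        // Quantize the same input repeatedly. A race condition would show up as an
        // occasional failure/crash rather than a deterministic one; a segfault will
        // terminate the process, so completing all runs suggests the race did not trigger.
        const uint32_t ret = llama_model_quantize("input-f16.gguf", "output-iq2_s.gguf", &params);
        printf("run %d -> %s\n", i, ret == 0 ? "ok" : "failed");
    }
    return 0;
}
```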
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (ggerganov#5721)

* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This PR adds two new quantization types, `IQ2_S` and `IQ2_M`, to complete the coverage of the 2-3 bit quantization range.

Why? The reason for having all these new quantization types is best explained with the following graph, which shows the quantization error, defined as `PPL(Q)/PPL(fp16)-1`, as a function of bits-per-weight (bpw). The bpw is for the complete model, including the `output.weight` and `token_embd.weight` tensors. The data is for LLaMA-v2-13B, but other models show very similar behavior.

The black/blue symbols show the results for k-/legacy quants using 668b31f, which is the last commit before I started adding i-quants and imatrix stuff. The red symbols represent the new i-quants and updated k-quants, including `IQ2_S` and `IQ2_M` added by this PR; magenta circles are for legacy quants (with all i-, k-, and legacy quants using an imatrix from `wiki.train.raw`). So, in a nutshell, `Q4_1` now behaves as expected instead of having a higher quantization error than `Q4_0` (as was often the case), and in the 2-3 bit range the previous k-quants compare against the new i-quants as `Q2_K -> IQ3_XXS`, `Q3_K_S -> IQ3_XS`, `Q3_K_M -> IQ3_S`, `Q3_K_L -> IQ3_M`.
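To make the two plotted quantities concrete, here is a small sketch (mine, not code from the PR) computing them; the perplexities, file size, and parameter count below are placeholder values, not measurements behind the graph.

```cpp
// Sketch: compute the quantization error and effective bpw plotted in the graph.
// All numeric values are hypothetical placeholders.
#include <cstdio>

int main() {
    const double ppl_fp16 = 5.00;   // hypothetical fp16 perplexity
    const double ppl_q    = 5.40;   // hypothetical perplexity of the quantized model

    // quantization error as defined above: PPL(Q)/PPL(fp16) - 1
    const double quant_error = ppl_q / ppl_fp16 - 1.0;   // 0.08 -> 8%

    // bpw for the whole model: total bits of the quantized file divided by parameter count
    const double model_bytes = 4.25e9;   // hypothetical quantized file size in bytes
    const double n_params    = 13.0e9;   // approximate LLaMA-v2-13B parameter count
    const double bpw = 8.0 * model_bytes / n_params;

    printf("quantization error: %.3f, bpw: %.2f\n", quant_error, bpw);
    return 0;
}
```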
Interestingly enough, the `IQ2_XXS...IQ3_M` quantization error can be described with a simple fit of the form `a * exp(-b * bpw)`. The 1.5 bpw quantization `IQ1_S` (which I'm not showing here to avoid too large a y-axis range) falls nearly onto the same fit. If we were able to keep this rate of quantization error reduction beyond 4 bpw, we would get `Q6_K` performance at about 5.3 bpw. To me it looks like we need a quantization type with about 4 bpw to close the gap between `IQ3_M` and `Q4_K`.
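For illustration, a sketch (mine, with placeholder numbers rather than the actual LLaMA-v2-13B measurements) of how such an extrapolation works: fit `a` and `b` through two (bpw, error) points, then invert the fit to find the bpw at which a target error would be reached.

```cpp
// Sketch: fit err = a * exp(-b * bpw) through two points and invert it.
// The sample points and the target error are hypothetical placeholders.
#include <cmath>
#include <cstdio>

int main() {
    // two hypothetical (bpw, quantization error) measurements
    const double bpw1 = 2.4, err1 = 0.20;
    const double bpw2 = 3.5, err2 = 0.05;

    // solve err = a * exp(-b * bpw) for b and a
    const double b = std::log(err1 / err2) / (bpw2 - bpw1);
    const double a = err1 * std::exp(b * bpw1);

    // invert the fit: bpw needed to reach a target (e.g. Q6_K-like) error
    const double err_target = 0.005;
    const double bpw_target = std::log(a / err_target) / b;

    printf("a = %.3f, b = %.3f, bpw for err %.3f: %.2f\n", a, b, err_target, bpw_target);
    return 0;
}
```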