1.5 bit quantization #5453
Conversation
I converted Miqu 70b and Kyllene 34b, and tested them on my KoboldCPP and ST |
Pushed a small improvement. The perplexities now are like this:
If you want to play with this quantization, it is worth experimenting with the
For the LLaMA models the best results (as in lowest perplexities) are obtained using
One does not gain by modifying
|
Superb job. I will test your improvements on 70b models as they come, starting with this update. I remember your tests about epsilon values a while ago, 5e-06 being better than 1e-05 in some cases for Llama 2 models, if I recall properly. |
You simply change it on the command line when you use the model, not when you quantize. You quantize once without worrying about
to the command arguments, and it will use the value you specified. |
On Miqu 70B IQ1_S (a quant made last night with commit 2ee4281), I get the following error when I test the perplexity with your b2128 merge including your IQ1_S improvement (9803f7a):
With 34b Yi models, no problem. |
Nothing has changed with the last commit that would make a difference in the ability to run the model. It is the exact same size. I think the issue is that you are using |
The PR #5452 merged a few hours ago should have fixed the out of space errors with some batch sizes. You shouldn't get this error anymore if you merge master into this PR. |
@ikawrakow : I tested that indeed with -b 512, and I was about to report that it works. @slaren : I'll test this fix, thanks! |
With the PR #5452 merge on the IQ1_S branch, I now get this 👍
There's still a problem, but it doesn't crash anymore. |
These messages are only printed in debug builds, and do not necessarily indicate a problem, it's just a trace of what's happening in the allocator. |
Ok! |
@Nexesenex For readability, please paste output into code blocks (```). |
OK, another update (apart from merging latest master to pick up #5452): using
Mixtral8x7B with 3 experts has |
The second version of IQ1_S already showed massive progress. I will test the third tonight. Noted for the code blocks, but obviously I'm failing to do it properly. :X |
Is all of this applicable to higher bpw quants? |
It does. Here are some results on my 6750XT.
For reference, the PPL for this model (Fish-8x7B-IQ1_S) over wiki.test.raw, at 512 context is 8.1976 +/- 0.05212. |
Here are some benchmarks, @ikawrakow:
On Miqu 70b, the progression is neat from one revision to another. On a Yi 34b model, it's a bit weirder. As for the output of the Miqu 70b in IQ1_S "v3", it's definitely better than v1. The formatting is respected more closely, and the model can give a relatively coherent and detailed answer to a question, with some development. Example obtained with my latest Frankenstein KoboldCPP (IQ1_V3):
But still, I think it needs a slight quality bump (or an intermediate IQ1 quant halfway to iq2_xxs) in order to be really usable, especially for the models below 70b, for which the results are weirdly incoherent between v2 and v3 of the IQ1_S PR. |
@ikawrakow et al. who are helping implement & test this -- bravo, this is superb to see! I think there are many pragmatic use cases for these kinds of "as high quality and speed as possible" quantizations in the 1-4 bit-per-parameter realm, simply to fit the next echelon of model sizes. Particularly above ~16-24 GB of VRAM, many people run out of practical ability / tolerance to keep adding hardware. I have no idea to what extent it may be useful for comparing some of these highly quantized techniques, but there is a newer model out there that could be an interesting larger-size test subject. It is also relatively unlikely to receive such contemporary SOTA quantizations any time soon otherwise (until this work reaches mainstream use), yet it can be compared against the older-heuristic Q2/Q4 options already available: https://huggingface.co/abacusai/TheProfessor-155b https://huggingface.co/abacusai/TheProfessor-155b-gguf/tree/main (there is a Q2 and a Q4 there, so it would perhaps make a good apples-to-apples comparison going to lower / better-calculated quants). |
I want to thank you @ikawrakow for the amazing job! I've rented a box with an A5000 and downloaded @Nexesenex's miqu q1_s_v2; here are my perplexity outputs running on
I hope this is useful. I have some saved outputs at low temperature with miqu q5_k_m; compared to the output of iq1_s_v2 (with default settings), it's of comparable quality. Looking forward to being able to run this on M1! EDIT: Did v3 too:
|
Thank you all for testing and feedback. So, what do we do with this? Leave it as a demo, or add the missing kernels and merge? @ggerganov |
It's fine to merge it. The bullet points, though, would have to remain for the future. |
@ikawrakow Just wondering: is this 1.5 bpw end to end (including the embedding + lm head), or just the decoder weights? Do you know if BiLLM quantizes the embedding + lm head or just the decoder weights? |
For the results reported in this PR:
We end up using effectively about 1.8 bpw for 7B models, or about 1.69 bpw for 70B, when everything is counted in. It would be less if the approach was implemented in a different framework that does not require blocks of predetermined size as `ggml` does.

I don't know what BiLLM does. But my experience with researchers putting papers up on arXiv is that they don't care about the token embedding and output tensors and never count them in the balance, so my assumption is that this applies to the BiLLM paper as well. The paper doesn't mention bits spent on block scales (they do have those, as per the paper), or any other metadata they may need. So, overall, it is hard to tell what the actual total balance will be when everything is said and done. I did try to test, but gave up after unsuccessfully fighting with their requirements.

One more thing: as I wrote in the PR description above, there is currently no way in |
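To make the bpw accounting above concrete, here is a toy calculation. The parameter split and the bpw numbers are invented for illustration (they do not reproduce the exact 1.8 / 1.69 figures, which depend on the real per-tensor breakdown); the point is simply that the tensors kept at a higher bpw (token embedding, output, etc.) are a larger fraction of a 7B model than of a 70B one, so the effective bpw drops as the model grows.

```cpp
#include <cstdio>

// Toy illustration with made-up numbers -- not the real tensor breakdown of IQ1_S.
static double effective_bpw(double low_bpw_params, double high_bpw_params,
                            double low_bpw, double high_bpw) {
    return (low_bpw_params * low_bpw + high_bpw_params * high_bpw)
         / (low_bpw_params + high_bpw_params);
}

int main() {
    // "low" = weights at ~1.56 bpw, "high" = embedding/output kept at a higher bpw
    printf("7B-like : %.2f bpw\n", effective_bpw( 6.5e9, 0.5e9, 1.5625, 6.0));  // ~1.88
    printf("70B-like: %.2f bpw\n", effective_bpw(68.4e9, 0.6e9, 1.5625, 6.0));  // ~1.60
    return 0;
}
```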
I can't get CPU (AVX2) to work. GPU (ROCm) works fine.
|
#if QK_K == 256
const int il = tid/8; // 0...3
const int ib = tid%8; // 0...7
dst_t * y = yy + i*QK_K + 32*ib + 8*il;
If this kernel is in a critical path for performance, it may help a bit to ensure that y is aligned on a 16-byte boundary (if that's the case). Right now all results are stored in y one by one, and for this relatively low-compute kernel memory I/O will likely be the bottleneck. An aligned pointer would allow the compiler to perform the store as a single 128-bit operation.
https://cuda.godbolt.org/z/a1Eczhv5a
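A minimal sketch of the idea (not the actual llama.cpp kernel; the function names and the use of float here are just for illustration): if both pointers are known to be 16-byte aligned, the eight values a thread writes can go out as two 128-bit stores instead of eight scalar ones.

```cuda
// Illustration only -- not the real dequantize kernel. Assumes dst and vals
// are both 16-byte aligned and hold the 8 values this thread is responsible for.
__device__ void store_scalar(float * dst, const float * vals) {
    for (int j = 0; j < 8; ++j) dst[j] = vals[j];   // eight separate 32-bit stores
}

__device__ void store_vectorized(float * dst, const float * vals) {
    // float4 is 16 bytes; with aligned pointers the compiler can emit two
    // 128-bit (st.global.v4.f32) stores instead of eight scalar ones.
    float4 * d4 = reinterpret_cast<float4 *>(dst);
    const float4 * v4 = reinterpret_cast<const float4 *>(vals);
    d4[0] = v4[0];
    d4[1] = v4[1];
}
```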
@Artefact2 For me, with CPU (AVX2), I'm able to get the quantized 8x7B instruct model to produce proper results on Ubuntu jammy. See if the HF-converted instruct model works. Tested on this calibration text: #5263 (comment). I did get the same gibberish as you on an ARM device. |
* iq1_s: WIP basics
* iq1_s: CUDA is working
* iq1_s: scalar CPU dot product
* iq1_s: WIP AVX2 dot product - something is not right
* Fix tests
* Fix shadow warnings
* Fix after merge with latest master
* iq1_s: AVX2 finally works
* iq1_s: ARM_NEON dot product. Works, but not very fast
* iq1_s: better grid
* iq1_s: use IQ2_XXS for attn_output. At a cost of 0.04 extra bpw this gives a big improvement in PPL.
* iq1_s: Metal basics. Dequantize works, but not dot product.
* iq1_s: Metal works, but quite slow. As usual, Apple Silicon does not like the code I write.
* iq1_s: Tests
* iq1_s: slightly faster dot product

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@ikawrakow take a look at that paper; they have some insane PPL numbers. Maybe you can use their techniques to improve your craft. |
@BadisG Normally the fp16 PPL for LLaMA-2 reported in papers is 5.12 (7B), 4.57 (13B), and 3.12 (70B). Go figure what these results are supposed to be, especially considering that their LLaMA-1 fp16 perplexities match the commonly reported values. Why is this important? Because if they just got confused and put the wrong fp16 values in the LLaMA-v2 table, the results are not particularly impressive (you can find better results for instance in this paper: https://arxiv.org/pdf/2402.04396.pdf). But if the results are for a smaller context than the 4096 commonly used to report LLaMA-2 PPL results in papers, then they may have achieved something noteworthy. I tend to assume the former is true based on their LLaMA-1 results being not particularly impressive (e.g., the 2-bit quantization IQ2_XXS in this repo has a PPL of 4.30 for LLaMA-v1-65B). |
3.31 fp16 is what you get if you eval 2-70b at ctx 2048 (table 2 in quip#). The DBLLM paper reports results with a group size of 64, which means they are actually using x + 16/64 bpw. For 2 bits, they're using 2.25, which is a significant difference.
Btw @ikawrakow you may be interested in knowing that quip# 2-70b 1 bit gets 5.9x ppl using the existing method in the paper. This model is only 9GB end to end with the embedding and lm head. I tweeted about it a few days ago but not sure if you saw.
|
@tsengalb99 No I haven't seen your tweet (I'm a relic of the past who does not hang around social networks such as Twitter, err, X). Congratulations! When you say "end-to-end", do you mean really end-to-end, as stored on disk and including the output tensor? But either way, 5.9 is higher than this PR even after factoring in the ~3% difference in LLaMA-v2 fp16 PPL between llama.cpp and the Python PPL calculation. Sure, your model is smaller, but if producing basically useless quantized LLM models became the priority of the day, I'm sure the apparatus necessary to shrink the IQ1_S model from currently 13.5 GB to something more in line with your 9 GB would get implemented here as well, so there wouldn't be much difference. Oh, wait, there will be one: 3 CPU minutes quantization time for LLaMA-v2-70B vs your 60 GPU hours.

Concerning the quoted paper: thanks for the clarification. If their quantization is 2.25 bpw, then we need to compare to IQ2_XS from this repo, which has a LLaMA-v1-65B PPL of 4.07, so miles ahead of their 4.84 (and I don't feel like running LLaMA-v2 at a context of 2048 to also compare this result). |
Size on disk is 9629941760 bytes as reported by du. Last I checked your PPL numbers are not comparable to any of the academic numbers since your FP16 is different. I am not sure if you changed how you measure PPL in the last few months, but IIRC from the Q2K comparison the “academic PPL” is higher than your PPL. You should strongly consider using the same PPL measurement as academic papers (or at least don’t compare two different PPL measurement methods against each other because that makes no sense) and testing on more than Wikitext2.
|
@tsengalb99 The ratio |
@ikawrakow : Indeed, it's worth it; one can decrease perplexity by almost 1% with the right value (9e-05 seems to be a pertinent value for the sub-2bpw quants). By the way, I reiterate my request to be able to set that parameter during quantization, so the correct

Also, do you have in store something smaller than the current GGML_TYPE_IQ1_S, which could be an IQ1_XS? Of course, it could not be efficiently used for all tensors, but it could most probably be used for attn_q.weight, and to some extent for attn_k.weight or even ffn_up / ffn_gate. I don't need it to be integrated in an existing or a new LLAMA_FTYPE_MOSTLY_XXX quant, just such a GGML_TYPE_IQ1_XS to be available for use in a quant strategy. |
This draft PR is a WIP that demonstrates 1.5 bits-per-weight (bpw) quantization. ~~Only CUDA works, there is no implementation for the other supported back-ends.~~ CUDA, AVX2 and ARM_NEON are implemented, Metal is missing. Given the keen interest in 1-bit quantization and the recent hype out there (e.g., the BiLLM and PB-LLM papers), I decided to show what I have so far and see if I should proceed. This PR adds the new 1.5-bit quantization as `IQ1_S`.

Don't expect literary or otherwise masterpieces. But it is not complete gibberish either.
The table shows a PPL comparison between this work and the two papers linked above. The PPL values for BiLLM and PB-LLM were taken from the BiLLM paper, so LLaMA-v1 and LLaMA-v2 only. (Note: I have edited the PR description to put the final version here.)
This is the PPL comparison as initially posted:
Here are the responses to the sample prompts from the BiLLM paper using LLaMA-v1-13B:
PB-LLM is 1.7 bpw, so this PR is massively better. BiLLM claims ~1.1 bpw (but we don't know the final balance after block scales and bits for non-repeating layers have been added), so it is not surprising to see a better result in this PR with 1.5 bpw.
CUDA performance is impressive: 212 t/s for a 7B model, 130 t/s for 13B, and 33.5 t/s for 70B running on an RTX-4080. Oh, LLaMA-v2-70B finally fits on my 16 GB GPU!
The BiLLM approach separates salient and non-salient weights. They use 2 bpw for the salient and 1 bpw for the non-salient weights (and so, if one declares about 10% of the model weights to be salient, one needs 1.1 bpw). The thing about separating salient and non-salient weights is that this separation itself already costs 1 bpw, unless one has a better idea. This is the key insight of the BiLLM paper: they basically make a per-tensor column separation. This could easily be done here too (one takes the imatrix, which is already per column, multiplies it with the sum of the model weights in the column squared, and uses this as a measure to pick the top-k percent of columns). Unfortunately `ggml` lacks the infrastructure to do that sort of thing. Hence, this PR uses the same quantization for all weights. Unlike the quoted papers, which have binary quants (-1, 1), I use 3 allowed values (-1, 0, 1), and squeeze this to 1.125 bpw by selecting 512 8D points out of the 3^8 = 6561 possibilities. This is similar to the `IQ2_XS` quants, but here it is no longer an E8 lattice, as I do not impose the condition that the sum of the co-ordinates be even. With an additional 3 bits for an unsigned scale per group of 8, we end up with 1.5 bpw (see the illustrative layout sketch below). If we wanted to further squeeze the model, the salient/non-salient separation would be essential. For this, I would need support from @ggerganov and @slaren to have

* a `ggml` op that takes a tensor holding activations and reorders the columns as per the salient/non-salient separation
* changes to the places in `ggml` where the assumption is being made that tensor rows are made up of a given number of block structs with a fixed size
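For readers trying to picture the format, here is a rough, purely illustrative sketch of a block layout consistent with the numbers above (groups of 8 weights, a 512-entry grid of {-1, 0, +1} points, a 3-bit unsigned scale per group). The field names and packing are mine, not the actual `ggml` `block_iq1_s` definition, and the fp16 super-block scale is an assumption borrowed from the other IQ quants; it would add 16/256 = 0.0625 bpw on top of the nominal 1.5.

```cpp
#include <stdint.h>

// Illustrative sketch only -- not the real ggml struct.
// One super-block covers 256 weights, split into 32 groups of 8.
typedef struct {
    uint16_t d;       // assumed fp16 super-block scale:                     16 bits
    uint8_t  qs[32];  // low 8 bits of each group's 9-bit grid index:   32*8 = 256 bits
    uint8_t  qh[16];  // per group: 1 high index bit + 3-bit unsigned scale,
                      // two groups packed per byte:                    16*8 = 128 bits
} block_iq1_s_sketch;

// Per group of 8 weights: 9-bit index into the 512-entry {-1,0,+1}^8 grid
// plus a 3-bit scale = 12 bits  ->  12/8 = 1.5 bpw nominal.
// Whole super-block: (16 + 256 + 128) bits / 256 weights = 1.5625 bpw.
```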