Adding IQ2_KL #602
Conversation
| Thanks again, IK, for the quant and the explanations! Already merged on my Croco, and it works like a charm on CUDA inference. For the anecdote, I quantized a Miqu 70b for my Mono-3090 back then, with mainline at the time: llama_model_loader: - type  f32:  161 tensors And now, almost a year and a half later: The recipe is a bit different, the size a bit higher, but Miqu's PPL being around 3.70 in q8_0, there's quite an overall jump in quality with all the work you did on IK_Llama, even if we account for recipe modulation, and even if you "deoptimized" some quants with respect to the legacy Llama model class to favor trickier weights like L3 and the like. Anyway, IQ2_KL is SOTA imo, quality- and speed-wise. Congratulations! As for popular demand, the "people" might now wonder if the difference between IQ2_K/IQ2_S and IQ2_KL, for which you used your IQ3_KS, might be reproducible between IQ3_K/IQ3_S and a hypothetical IQ3_KL at 3.6-3.8 bpw (with the help of IQ4_KS?). One might read such an easy transposition with horror and contempt, but now that the IQ2_S -> IQ3_KS gap has been quite well filled, there remains the IQ3_K -> IQ4_KS gap (the IQ4_KSS that you so kindly developed after a popular request back then being more of a side quant due to its complex packing, with respect to a CUDA MMQ kernel for example, from what I could understand). The 3.5 bpw quants have always been a bit tricky in my different tests, Q3_K now being obsolete, and IQ3_S / IQ3_K having somehow become subpar compared to the developments you made in the 4-4.5 bit and 2-2.75 bit range. Btw, I listened to your talk at FOSDEM. It was nice to learn a bit about your background and to hear you, Iwan. | 
| 
 Haha, I knew you would ask that. A similar approach does not work there because a pair of quants at 3.5 bpw is 7 bits, so 128 possibilities, so fast CPU shuffle instructions are not possible, and one would be back to slow lookup tables. Something else is needed for that gap. To expand a bit more on that, a Trellis quant at 3.5 bpw (plus block scale bits) looks pretty promising. But the problem with Trellis quants is their lower CPU TG performance. Considering that most people seem to be using  | 
| Well, I wondered if it would be that easy... I'm so predictable indeed! ^^ As for a Trellis 3.5 bpw, a 10% TG drop compared to what folks are using ain't too much of a hassle, but 20% is really felt, that's for sure, especially in the single-digit T/S range. At least, that's my perception. This being said, you already bumped the TG performance of Trellis on CPU, pushing the hard barrier towards memory bandwidth. Sometimes we gain for free, sometimes we trade off. And maybe you'll have another epiphany, says the layman! Even without yet another TG bump for Trellis, considering the recent improvements around selecting which tensors you offload and which you don't for those using NVidia GPUs (on which Trellis is very competitive), and considering that most FTypes, especially those cooked by us enthusiasts, are not pure, the 20% drop might not be hit often, because only some tensors and not others would be quantized in IQ3_KTL 3.5 bpw. Personally, I'd probably use an IQ3_KTL ggml_type for either the attn_k and attn_o, or the ffn_down, or the ffn_gate and up, or the attn_q, according to the overall quant quality I'm after for a given model size and desired context size. IQ2_KT is a no-brainer in its category, but IQ3_KS is quite competitive with IQ3_KT, and, with a bigger bpw delta, so is IQ4_KS with IQ4_KT, including in quantization time! It's all about making a good mix between quality, size, and speed (not to speak of quantization time) from the available ggml_types to make an adequate FType. As for the giant MoEs, they are an important niche considering all the work you accomplished on IKL, but the number of users able to run them is limited to well-off enthusiasts and devs, academics with access to powerful workstations, and corpos/gov. And these giant models are most probably quite rarely run on CPU only by those folks. ^^ That's my 2 cents. | 
| 
 Is there any chance you could reconsider posting them? I think there is never going to be consensus on the best measure of quantization quality, because that differs by user and use case, but the metrics you provided were useful for people to see roughly where quantization quality lies between quant types. The open PR for the readme update has the additional benefit of making it easy to find and get to the PR where a quant is discussed, as that is usually the PR with the non-row-interleaved CPU implementation, which is easy to reach from the table (it is the first link in each column). I do think it is quite useful for people who don't have strong opinions about PPL vs alternative metrics (which I believe is the majority). I'm guessing you changed your mind around  | 
| Great job with this  I had a bunch of old "pure" Qwen3-14B dense GGUFs sitting around from previous experiments so added a few of the new types including this sweet    And yeah while the  Looking forward to incorporating this in some future mixes. I'll wait until it's merged before releasing anything this time 💀 😂 | 
| 
 I forgot to comment on that very line, @ikawrakow, and I second the request of @saood06. Moreover, I do not understand the "perplexity tells us nothing" comment that was made to you by I-don't-know-whom among my betters. Notwithstanding the mathematical purity and/or sophistication of some other benchmark, perplexity, aka the "benchmark of the people", is a very clear indicator of the quality of a model's pretraining, of the damage done by its instruct training, and of the quantization of the weights compared to their f16/bf16/f32 originals. I could verify that on many model archs and finetunes/merges, both in use (including on long contexts) and with perplexity tests in several languages: English, French, but also Serbo-Croatian, which is a minor one in pre-training. The quantization deltas (almost always, to leave room for exceptions) show up in comparable proportions (same order of magnitude) among different languages, even if they are not identical from one language to another, and so both the baseline and the variation are relevant, if not the most relevant benchmark. Being one of the best in the field IKL's developments pertain to, and the one doing the work, I think you can trust your own judgement on what is an adequate benchmark for your quants! @ubergarm: thanks for your clean benches! | 
At least according to rmse, this is significantly better than q2_K, while using only 1/16 more bits per weight.
Also check the two neighbouring values for the block scale and use the one that minimizes RMSE.
Quite good: PP-512(L3-8B) = 8472 t/s.
We get PP-128(L3-8B) = 162 t/s, which means this is not quite as good as it should be, as the (almost) same-bpw q2_K is at 170 t/s.
Not particularly fast. I may need to think about rearranging the bits.
The compiler started crashing!!!
Had to work around a compiler crash when using vzip2q_u8 by using vqtbl2q_u8 instead (see the sketch after these notes).
PP-512 goes to 476 t/s up from 466 t/s.
PP-512 goes to 492 t/s up from 476 t/s.
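For readers curious what such a workaround can look like: vzip2q_u8(a, b) interleaves the upper halves of a and b, and the same result can be produced with a table lookup over {a, b}. The sketch below is only an illustration of that equivalence (my own, not the actual commit); it sidesteps the vzip2q_u8 code path that triggered the compiler crash.

```cpp
#include <arm_neon.h>

// Illustrative only: reproduce vzip2q_u8(a, b) with a single vqtbl2q_u8.
// Table indices 0..15 select bytes of a, 16..31 select bytes of b, so the
// index pattern {8, 24, 9, 25, ...} interleaves the two upper halves.
static inline uint8x16_t zip2_via_tbl(uint8x16_t a, uint8x16_t b) {
    const uint8x16x2_t tab = { { a, b } };
    const uint8x16_t   idx = {  8, 24,  9, 25, 10, 26, 11, 27,
                               12, 28, 13, 29, 14, 30, 15, 31 };
    return vqtbl2q_u8(tab, idx);   // same result as vzip2q_u8(a, b)
}
```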
| It is strange that IQ2_KS/L have a lower PP performance. They are supposed to be ~20% faster than Q4_0 | 
| 
 I was surprised when I saw the Q4_0 was faster on my Zen5 9950X. I just re-ran the benchmarks on a Thread Ripper Pro 24x core - same quants just using 24x cores now and more RAM bandwidth.
 👈 Details
 Q4_0   7.925 GiB (4.609 BPW)
 IQ2_KL 5.141 GiB (2.990 BPW)
 IQ2_KS 4.372 GiB (2.543 BPW)
 IQ2_KT 4.280 GiB (2.489 BPW)
 fwiw here are the cpu flags on both rigs: The 9950x has  | 
| For PP  Apart from this, yes, up to 100 GB/s or so memory bandwidth is fully saturated for TG. It still looks quite OK on the 795WX, where we are getting 160-180 GB/s. But beyond 200 GB/s something happens, and we cannot get anywhere close to the theoretical limit for the 400+ GB/s systems. | 
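As a back-of-the-envelope illustration of that bandwidth argument (my own numbers, reusing the IQ2_KL Qwen3-14B size quoted above): for TG every weight has to be streamed from memory once per token, so bandwidth divided by model size gives the theoretical ceiling the comment refers to.

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // IQ2_KL Qwen3-14B from the size list above: 5.141 GiB of weights.
    const double model_bytes = 5.141 * 1024.0 * 1024.0 * 1024.0;
    for (double bw_gb_s : {100.0, 180.0, 400.0}) {           // memory bandwidth in GB/s
        const double tg_ceiling = bw_gb_s * 1e9 / model_bytes;
        std::printf("%5.0f GB/s -> at most %.1f t/s\n", bw_gb_s, tg_ceiling);
    }
    return 0;
}
```

Measured TG landing well below this ceiling on the 400+ GB/s systems is exactly the "something happens beyond 200 GB/s" observation above.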
| 
 Interesting, yes, let's measure: I made two test quants:
 👈 Details
# Qwen3-14B-Q8_K_R8.gguf
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q8_k_r8:  281 tensors
# Qwen3-14B-Q8_0_R8.gguf
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q8_0_r8:  281 tensors
[benchmark tables: Q8_K_R8 7965WX 24x Core; Q8_K_R8 9950X 16x Core; Q8_0_R8 7965WX 24x Core; Q8_0_R8 9950X 16x Core]
 | 
| Wow, that's a bummer. Does a different compiler get used on the 9950X? If you have time to experiment: can you comment out line 2675 in  rebuild, and rerun the  | 
| Yeah the 9950x is bleeding edge ARCH box:
$ lscpu | grep name
Model name:                              AMD Ryzen 9 9950X 16-Core Processor
$ ./build/bin/llama-sweep-bench --version
version: 3798 (255c2204)
built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu
$ lscpu | grep name
Model name:                           AMD Ryzen Threadripper PRO 7965WX 24-Cores
$ ./build/bin/llama-sweep-bench --version
version: 3798 (255c2204)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Got it, commenting this out on the 9950x and trying again, adding one new plot to the above graph. So it is now much faster, albeit still a bit below the Q8_0_R8.
 👈 Details
[benchmark table: Q8_K_R8 9950X 16x Core, with mul_mat_q8_k_r8_q8_k<16> commented out]
 | 
| OK, this is much better, but not sure what to do with it. On my 7950X, commenting out line 2675 leads to ~5% lower performance; I suspect it will be similar on your 7965WX. My best guess is that the compiler is misbehaving. The  The  | 
| 
 When I get a chance I'll try to compile with an older version and test again. And yes this 9950X is one of the first to get that new 512-bit instruction, kinda cool to see it making a noticeable difference. | 
| OK, I think I'll merge this the way it is. I did try a few things but nothing resulted in an improvement (PPL and/or performance), so this is what it will be. | 


Motivation
- The gap between IQ2_K/IQ2_S (2.4375 bpw) and IQ3_XXS (3.0625 bpw) or IQ3_KT (3.125 bpw) is quite large. Q2_K (2.625 bpw), which should normally fill the gap, is a lower quality quantization type, so the gap remains unfilled. Hence, it would be useful to have a high quality quantization type that is about in the middle between IQ2_K and IQ3_XXS.
- IQ2_K, IQ2_S and Q2_K all use blocks of 16, so there isn't a high CUDA PP performance quantization type in that bpw range. IQ2_XXS, IQ2_KT and IQ2_KS all have good CUDA PP performance, but they use 2.0625/2.125/2.1875 bpw, so they are in a different quantization quality league, as quantization errors increase very rapidly with decreasing bpw in that range.
- UD_Q2_K_XL models have become very popular, as for many people the resulting size is pretty much the maximum they can do with their hardware, while the quantization quality is closer to being really useful than smaller variants. Hence, a higher quality alternative to Q2_K with approximately the same bpw could become the go-to quantization type for many users.

Based on these observations and popular demand (hahaha, @Nexesenex was the only one asking for it), I decided to add IQ2_KL, a 2.6875 bpw quantization type with much better quality than the 2.625 bpw Q2_K.

Some details
I wanted to have blocks of 32 for good CUDA PP performance (see above). Spending 5-6 bits per block scale leaves about 2.5 bpw for the quants if we want to be in the 2.6-2.7 bpw range, which rules out a direct int -> weight mapping. I did not want to use a full-fledged codebook as in the i-quants, as this kills CPU performance. But pairs of quants have 5 bits available, which corresponds to 32 distinct 2D points, which is still in the range that can be handled on the CPU via fast shuffle instructions (two vqtbl2q_s8 instructions on NEON, 4 _mm256_shuffle_epi8 instructions and two blends on AVX2). On CUDA this would need two lookups + shift/or to assemble a 32-bit integer that can be used in int8_t dot products, so also looking promising. So, then, 32 points in the 2D plane it is.
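As an illustration of the shuffle-based lookup described above, here is a minimal NEON sketch (my own, not the actual IQ2_KL kernel; the table contents are placeholders): with only 32 possible pairs, each coordinate table is 32 bytes, i.e. one int8x16x2_t, so a single vqtbl2q_s8 per coordinate unpacks 16 pairs at once.

```cpp
#include <arm_neon.h>

// Placeholder tables: low/high coordinate of each of the 32 grid points.
// The real values come from the grid fit described below.
static const int8_t kPairLo[32] = { 0 };
static const int8_t kPairHi[32] = { 0 };

// idx holds 16 pair indices in [0, 31]; out[0]/out[1] receive the first and
// second value of each pair, i.e. 32 dequantized int8 weights in total.
static inline void unpack_16_pairs(uint8x16_t idx, int8x16_t out[2]) {
    const int8x16x2_t lo = { { vld1q_s8(kPairLo), vld1q_s8(kPairLo + 16) } };
    const int8x16x2_t hi = { { vld1q_s8(kPairHi), vld1q_s8(kPairHi + 16) } };
    out[0] = vqtbl2q_s8(lo, idx);   // first  value of each pair
    out[1] = vqtbl2q_s8(hi, idx);   // second value of each pair
}
```

With 7-bit pairs (the 3.5 bpw case discussed earlier in the thread) the tables would need 128 entries, which no longer fits the byte-shuffle instructions, hence the "slow lookup tables" remark above.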
How do we get these 32 points? Here is what I did:
1. Start from IQ3_KS, which uses 3 bits for the quants, so 6 bits per pair, so 64 distinct possibilities.
2. Collect statistics for the resulting 2D points (using examples/quantize-stats/quantize-stats.cpp).
3. Find the 32 grid points such that $F = \sum_i d^2(x_i, G)$ is minimized. Here, $d^2(x_i, G)$ is the minimum distance between the point $x_i$ and any point on the grid $G = \{ g_i \}$.
Initially I wanted to have an elegant approach for finding the optimum solution, but in the end I just brute-forced it, so I'm not publishing this code.
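Since the fitting code is not published, here is a rough sketch of the objective in step 3 (my own, assuming a plain search over candidate grids): it evaluates $F$ for one candidate 32-point grid, and a search would keep whichever candidate gives the smallest value.

```cpp
#include <algorithm>
#include <array>
#include <limits>
#include <vector>

struct Point2D { float x, y; };

// F = sum_i d^2(x_i, G): for every observed pair x_i, add the squared
// distance to the nearest of the 32 grid points g_j.
static float objective_F(const std::vector<Point2D>& samples,        // observed IQ3_KS pairs
                         const std::array<Point2D, 32>& grid) {      // candidate grid G
    float F = 0.0f;
    for (const auto& p : samples) {
        float best = std::numeric_limits<float>::max();
        for (const auto& g : grid) {
            const float dx = p.x - g.x, dy = p.y - g.y;
            best = std::min(best, dx * dx + dy * dy);
        }
        F += best;   // d^2(x_i, G)
    }
    return F;
}
```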
The IQ3_KS values are non-uniformly distributed in [-63, 47], and the resulting grid of 32 points looks quite interesting:
[figure: the selected grid of 32 2D points]
In this solution the locations of the grid points coincide with the IQ3_KS non-linear values. I did experiment with a grid where the points can take arbitrary int8_t values, and this gives a lower value for $F$. However, when implemented in the quantization code, this alternative approach resulted in higher quantization errors than what we get from the grid in the above figure, so I did not use that. My hand-wavy explanation is that, when quantizing, we start with first finding an IQ3_KS solution, and then force the points not on the grid to a neighboring grid point, which kind of favors a grid where the grid points have the same co-ordinates as the IQ3_KS non-linear values.

Quantization quality
I have done a fair share of experiments with this new quantization type with pretty good results, totally obliterating a similarly sized Q2_K quantization. But to not be told that "perplexity tells us nothing", I'm not adding these results here, and leaving it up to "quant cookers" to evaluate quantization quality in their favorite way. @Nexesenex, who apparently has been following the commits while I was working on the PR, has a comment here.

Performance
I'll compare to Q2_K, the quantization type that IQ2_KL is looking to replace, and IQ2_S, an i-quant representative of slightly lower bpw. Using LlaMA-3.1-8B as an example with "pure" quantization (everything is Q2_K/IQ2_KL except for the output and token embedding tensors, which are Q8_0). The platforms used are:
- CUDA: RTX-4080
- Zen4: Ryzen-7950X
- AVX2: Ryzen-5975WX
- NEON: M2-Max CPU
- Metal: M2-Max 30-core GPU