Replies: 1 comment
Good point. See #5060
On a Llama 70b model, the old Q2_K saved about 600 MB compared to Q3_K_S for a minor bump in perplexity (< 1%), and that is precious for 36 GB VRAM users like me. I imagine that for smaller model sizes it can matter for users with less RAM as well.
Considering that it was working well, and that we now have XS & XXS quants, could we have it back in the form of a Q3_K_XS, @ikawrakow, and perhaps even an intermediate Q3_K_XXS to fill the gap with Q2_K and provide finer increments?
Some 70b models with 32k context capabilities are starting to appear, and they also exist in smaller sizes with various context lengths; such granularity would be a great way to exploit them.
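To put that 600 MB in perspective, here is a rough back-of-the-envelope sketch relating a file-size saving to an average bits-per-weight (bpw) difference. The 70e9 parameter count and 600 MB figure are round numbers taken from the discussion above, not exact GGUF measurements:

```python
# Back-of-the-envelope: relate a size saving in bytes to an average
# bits-per-weight (bpw) difference for a given parameter count.
# The parameter count and saving below are illustrative round numbers.

def bpw_delta(saving_bytes: float, n_params: float) -> float:
    """Average bpw difference corresponding to a given file-size saving."""
    return saving_bytes * 8 / n_params

def size_gib(n_params: float, bpw: float) -> float:
    """Approximate quantized size in GiB for a given average bpw."""
    return n_params * bpw / 8 / 2**30

N_PARAMS = 70e9   # roughly a Llama 70b model
SAVING = 600e6    # the ~600 MB gap mentioned above

print(f"600 MB on 70B is about {bpw_delta(SAVING, N_PARAMS):.3f} bpw")
print(f"Each extra 0.1 bpw on 70B costs about {size_gib(N_PARAMS, 0.1):.2f} GiB")
```

So even a few hundredths of a bpw per weight translate into hundreds of megabytes at 70b scale, which on a 36 GB card can decide whether the model plus a long-context KV cache fits at all; hence the appeal of finer steps between Q2_K and Q3_K_S.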