Replies: 4 comments 1 reply
-
I agree - this is a very interesting area for experiments. User @xaedes has laid the foundation for training with the baby-llama example and is also making very interesting progress at full-text training: ggerganov/ggml#8 (comment)

CPU-based training / fine-tuning with quantization support could be very useful, since it is much easier to afford a >128GB machine. We also have the mechanism to offload part of the computations to the GPU if necessary to get a bit of extra performance. In general, it looks like we have a good opportunity for demonstrating …
The author actually acknowledged that GPTQ quantization is superior to NF4: https://twitter.com/Tim_Dettmers/status/1661482614811918338
-
Yes, they did find the "super-block" or, as they call it, "double quantization" trick. But based on their numbers (model sizes after quantization), it does not look like they have found the best strategy. I haven't come around to putting this stuff into …

[graph: perplexity as a function of quantized model size]

The graph shows perplexity as a function of quantized model size. At the end of the day, it looks to me like what matters most is how many bits were spent in total and how these bits were distributed between the various tensors. Minimizing some measure of difference between the original and quantized model weights does help, but it is a second-order effect. I have added the current …

You mention 3-bit quantization. Here is another graph where my current 3-bit results are included in orange.

[graph: 3-bit results shown in orange]
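To put the "how many bits were spent in total" point in numbers, here is a rough back-of-the-envelope sketch. The block layouts below are illustrative, not the exact formats used in llama.cpp:

```python
# Rough bits-per-weight arithmetic for block-wise quantization schemes.
# The layouts are illustrative examples, not the exact llama.cpp formats.

def bits_per_weight(weight_bits, block_size, scale_bits, super_block=None, super_scale_bits=0):
    """Effective bits per weight for a block-quantized tensor.

    weight_bits      bits per quantized weight
    block_size       weights per block sharing one scale
    scale_bits       bits used to store each block scale
    super_block      number of block scales sharing one higher-level scale (optional)
    super_scale_bits bits of that higher-level ("super-block") scale
    """
    bpw = weight_bits + scale_bits / block_size
    if super_block:
        bpw += super_scale_bits / (block_size * super_block)
    return bpw

# Plain 4-bit blocks of 32 weights with an fp16 scale (similar in spirit to Q4_0):
print(bits_per_weight(4, 32, 16))                                        # 4.5 bits/weight

# QLoRA-style: 4-bit weights, an 8-bit scale per 64 weights,
# plus a 32-bit scale per 256 block scales ("double quantization"):
print(bits_per_weight(4, 64, 8, super_block=256, super_scale_bits=32))   # ~4.127 bits/weight
```

For a 7B-parameter model, the difference between 4.5 and 4.125 bits per weight already amounts to roughly 300 MB of model size.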
-
I had the same thought too, so I implemented outliers per super-block some time ago. Here is a graph comparing 3-bit quantization with and without outliers:

[graph: 3-bit quantization with and without outliers]

The black line is without using outliers. The point furthest to the left is at about …

As we can see from the graph, separating outliers does improve accuracy, but it does so at a less efficient rate compared to using more bits for some of the tensors. On the other hand, this implementation is far from perfect. My observation is that "outliers" tend to cluster in some portions of the tensors, so most of the time we are basically wasting bits in super-blocks that don't have real "outliers", while not being able to encode all real outliers in super-blocks that have more than …
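For readers who have not seen the technique, here is a simplified toy version of per-super-block outlier separation (my own sketch, not the implementation behind the graph above): the few largest-magnitude weights in a super-block are kept in full precision, and only the remaining weights go through the 3-bit grid.

```python
import numpy as np

def quantize_superblock_with_outliers(w, n_outliers=2, bits=3):
    """Toy quantization of one super-block, keeping the largest-magnitude weights exact.

    The n_outliers weights with the largest magnitude are stored as-is (their
    indices and values would be encoded separately); the rest are quantized to a
    symmetric grid with the given number of bits (3 bits -> integers in [-4, 3]).
    """
    w = np.asarray(w, dtype=np.float32)
    outlier_idx = np.argsort(np.abs(w))[-n_outliers:]   # positions of the outliers
    mask = np.ones(w.shape, dtype=bool)
    mask[outlier_idx] = False

    inliers = w[mask]
    max_level = 2 ** (bits - 1) - 1
    absmax = np.abs(inliers).max() if inliers.size else 0.0
    scale = absmax / max_level if absmax > 0 else 1.0
    q = np.clip(np.round(inliers / scale), -max_level - 1, max_level)

    reconstructed = np.empty_like(w)
    reconstructed[mask] = q * scale                      # quantized inliers
    reconstructed[outlier_idx] = w[outlier_idx]          # outliers kept exact
    return reconstructed, outlier_idx, scale
```

The inefficiency described above shows up directly here: n_outliers is fixed per super-block, whether or not the block actually contains real outliers.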
-
I am getting the error "No GPU found. A GPU is needed for quantization." with the following code snippet, trying to run it on an M2 macOS machine with a 12-core CPU and a 38-core GPU. How will QLoRA / quantization work on M2 macOS systems that have "mps"?

```python
import torch
# imports implied by the snippet:
from transformers import AutoTokenizer, AutoModelForCausalLM

# bnb_config is assumed to be a bitsandbytes quantization config defined earlier
model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)  # , device_map={"":0})
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device=torch.device('cpu'))
```
-
It would be interesting to compare this approach to the quantization in llama.cpp:
https://huggingface.co/blog/4bit-transformers-bitsandbytes
As I understand it, the main idea is to fine-tune the model with a LoRA on each layer after 4-bit quantization, to restore performance to pre-quantization levels.
This could probably be applied to a GGML quantized model as well - either by doing the actual fine tuning in GGML or by training in Python and exporting the LoRA.
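As a rough sketch of what that looks like on the Python side (transformers + peft + bitsandbytes, so it needs a CUDA GPU; the model and LoRA hyperparameters here are just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "EleutherAI/gpt-neox-20b"  # any causal LM; used only as an example

# 4-bit NF4 base weights with double quantization, as described in the QLoRA paper
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA adapters are trained in 16-bit on top of the frozen 4-bit base weights
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```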
Some additional techniques they claim help generation quality:
NormalFloat quantization
This sounds similar to what @MarcioPais experimented with in #397 (comment), where they said:

[quote omitted]

It is interesting that the paper calls this out as a clear improvement. Some possibilities I can think of: …
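For reference, here is a minimal sketch of what NormalFloat (NF4) quantization does, based on the paper's description rather than the bitsandbytes implementation: the 16 levels sit at quantiles of a standard normal distribution, and each block of weights is normalized by its absmax before being mapped to the nearest level.

```python
from statistics import NormalDist
import numpy as np

def nf4_levels():
    """16 levels at (approximately) equally spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    norm = NormalDist()
    # The offset avoids the infinite quantiles at p=0 and p=1; the paper's exact
    # construction is slightly more involved (it pins 0 as one of the levels).
    probs = np.linspace(0.02, 0.98, 16)
    q = np.array([norm.inv_cdf(p) for p in probs])
    return q / np.abs(q).max()

def nf4_quantize_block(w):
    """Quantize one block of weights to the NF4 codebook; returns level indices and the absmax scale."""
    w = np.asarray(w, dtype=np.float32)
    scale = np.abs(w).max() or 1.0
    levels = nf4_levels()
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def nf4_dequantize_block(idx, scale):
    return nf4_levels()[idx] * scale
```

The intent, as the paper describes it, is that the levels are matched to normally distributed weights, so the expected rounding error is lower than with a uniform 4-bit grid.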
Double Quantization
Very similar to the super blocks @ikawrakow uses in #1256. The paper uses an 8-bit scale value for every 64 4-bit weights, and a 32-bit scale for every 256 8-bit scales.
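In code, the second-level quantization step is roughly the following (a simplified sketch of the bookkeeping; the paper's exact scheme differs in details such as how the scales themselves are encoded):

```python
import numpy as np

def double_quantize_scales(block_scales, group=256):
    """Second-level ("double") quantization of the per-block scales.

    block_scales: float32 absmax scales, one per 64-weight block.
    Each group of 256 scales is itself absmax-quantized to 8 bits, so only one
    float32 "super" scale remains per group of 256 blocks.
    """
    block_scales = np.asarray(block_scales, dtype=np.float32)
    q_scales = np.empty(block_scales.shape, dtype=np.uint8)
    super_scales = []
    for start in range(0, len(block_scales), group):
        chunk = block_scales[start:start + group]
        s = chunk.max() / 255.0                 # absmax scales are non-negative
        s = s if s > 0 else 1.0
        super_scales.append(s)
        q_scales[start:start + group] = np.round(chunk / s).astype(np.uint8)
    return q_scales, np.array(super_scales, dtype=np.float32)

def dequantize_scales(q_scales, super_scales, group=256):
    out = q_scales.astype(np.float32)
    for i, s in enumerate(super_scales):
        out[i * group:(i + 1) * group] *= s
    return out
```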
Other notes
They don't show any results for 3-bit quantization, which seems like an obvious next step.