Add LoRA fine-tuning to AWQ #85
I would love to add LoRA and make AutoAWQ compatible with PEFT. This is something that I have thought about, but currently it's more important for me to see what I can do about a high-throughput quantized model.
Ok cool, I think supporting QLoRA merging is underappreciated though. I don't know of any way to do this, and it means there isn't a good open-source way to serve QLoRA-tuned open-source models. BTW, when you say high throughput, do you mean batch sizes larger than 8, so a bf16 implementation?
I could probably look into it next week. Maybe the
In general, I think we should integrate with PEFT. From my understanding, this requires our WQLinear modules to generate gradients during a backward pass, so you would have to implement that functionality. It may turn out to be easy enough since autograd works pretty well; maybe look at AutoGPTQ to see how they integrated with PEFT.
@casper-hansen AutoGPTQ implements QLinear with various underlying QGEMM implementations (cuda, exllama, qigen, openai/triton), and most of them did not implement the backward kernel, except for triton. The triton kernel is currently the only one that can be used for training a quantized model in AutoGPTQ, though it is not the most optimal. FYI, the autograd_4bit mentioned above simply unpacks the weights into fp and calls
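For illustration, here is a minimal, self-contained sketch of that "unpack to floating point and matmul" approach, expressed as a custom torch.autograd.Function. The packing layout (4-bit values stored in int8, per-output-channel scale/zero) is deliberately simplified and is not AutoAWQ's or AutoGPTQ's real format; it only shows where a backward pass would plug in.

```python
import torch

def dequantize(qweight, scales, zeros):
    # Simplified layout: qweight holds 4-bit values 0..15 in int8;
    # scales/zeros are per-output-channel quantization parameters.
    return (qweight.to(scales.dtype) - zeros) * scales

class QuantLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, qweight, scales, zeros):
        w = dequantize(qweight, scales, zeros)        # [in_features, out_features]
        ctx.save_for_backward(qweight, scales, zeros)
        return x @ w

    @staticmethod
    def backward(ctx, grad_out):
        qweight, scales, zeros = ctx.saved_tensors
        w = dequantize(qweight, scales, zeros)
        grad_x = grad_out @ w.t()                     # gradient flows to the activations only;
        return grad_x, None, None, None               # the quantized weights stay frozen (LoRA-style)

# Toy usage (fp32 so it runs on CPU; a real kernel would use fp16 on GPU).
x = torch.randn(2, 64, requires_grad=True)
qweight = torch.randint(0, 16, (64, 128), dtype=torch.int8)
scales = torch.rand(128)
zeros = torch.full((128,), 8.0)
QuantLinearFn.apply(x, qweight, scales, zeros).sum().backward()
print(x.grad.shape)  # torch.Size([2, 64])
```

Because only the activation gradient is needed for LoRA, the quantized weight never has to receive a gradient; the cost is one extra dequantize per backward call.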
I welcome any work on a backward pass function for AWQ. There are many ways to go about it. Just keep in mind that the AWQ kernel does not scale well with larger batch sizes; above batch size 16 it becomes slower than FP16. I found some code where someone did the backward pass:
@casper-hansen FYI, the above one still unpacks and GEMMs everything in fp...
I see the changes
Yes, I see that; they dequantize to run FP16. I'm pretty sure this is normal for training? I created v2 based on their new GEMM kernel, but it's way slower and only compatible with GEMV, where it processes the context. GEMV is 20% faster at small prompts but not great for high throughput or deployments.
IME, Triton was never faster for anything: exclusionary high compute requirements and slower speed, oh my. The only one who has pulled off merging adapters into quantized models is GGUF. With that alpaca_lora_4bit repo + extensions I can merge LoRAs together, but not into the model.
AFAIK you can merge the LoRA weights and the unquantised base model (even if you fine-tuned in 4/8 bit) using merge_and_unload, as sketched below. I guess this only really applies if you don't have the VRAM to train the model without PEFT, though.
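A minimal sketch of that merge with PEFT, assuming the adapter was trained against this base model; the model id and adapter path are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in fp16 rather than 4/8-bit.
base = AutoModelForCausalLM.from_pretrained("base-model-id", torch_dtype=torch.float16)

# Attach the trained adapter and fold its weights into the base model.
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("merged-fp16")
```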
@cassianlewis yeah, in bnb, but not GPTQ AFAIK. Not ideal to merge to unquantized either.
Hi, I'm trying to do it with Mixtral, but I get the following output/error:
Could anyone please help me out with this?
If you merge a quantized (transformers) model, then it will become a 4- or 8-bit model, which you can't then run AWQ on. Instead, you would need to reload a base model in 16-bit and merge your LoRA into that (using merge_and_unload). Then you can AWQ that merged model, as in the sketch below. More info in this vid.
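Continuing the merge sketch above: once the LoRA is folded into a 16-bit checkpoint, that checkpoint can be quantized with AutoAWQ as usual. The quant_config values shown are the commonly used defaults and the paths are placeholders.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the merged fp16 checkpoint and quantize it with AWQ.
model = AutoAWQForCausalLM.from_pretrained("merged-fp16")
tokenizer = AutoTokenizer.from_pretrained("merged-fp16")

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("merged-awq")
tokenizer.save_pretrained("merged-awq")
```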
Having to keep the base model around becomes unmanageable at 70B+; that's part of the issue. They're 160 GB+.
Hi! Any progress? Is training LoRA modules with AWQ available now?
Hi, I'm also interested to know whether LoRA + AWQ is already available now. Thanks!
@RicardoHalak see this, it is runnable: huggingface/transformers#28987
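For reference, a rough sketch of what training LoRA on top of an AWQ checkpoint looks like with the transformers/PEFT integration referenced above. The model id and LoRA hyperparameters are placeholder assumptions, not values taken from that PR.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-quantized AWQ checkpoint through transformers.
model = AutoModelForCausalLM.from_pretrained("some-org/some-model-AWQ", device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters train; the AWQ weights stay frozen
# ...then train with Trainer/SFTTrainer as usual.
```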
It would be fantastic if we could add the ability to do LoRA fine-tuning and merging of adapters.
Background on QLoRA
The two common libraries I use are:
Why add LoRA to AWQ