[Feature][Kernel] Support bitsandbytes quantization and QLoRA #4776
Conversation
Force-pushed from 05dbfb7 to 5e868f7
ping @Yard1
This completely bypasses the existing LoRA logic and implements its own. I don't think this is a good design and it clashes with already existing code. We should instead modify the LoRA support already present in vLLM to support QLoRA - it should also allow us to reuse a lot of existing code.
Thanks for your reply. You are not the first one to raise this concern. Actually, I asked myself the same question. :-) I did consider reusing LoRA in the first place. I had to start a new set of code because:
QLoRA, though it carries a very similar name, works for a totally different scenario, and thus cannot reuse the existing LoRA code in vLLM.
How about I add some comments somewhere to address your concern?
Is it theoretically possible for the QLoRA adapter to be loaded and unloaded at will?
I am not sure what you mean by "at will". Do you mean loading/unloading during runtime? In this implementation, the user can load an adapter by specifying "qlora_adapter_name_or_path" as a parameter when starting inference. The user can also run without an adapter by leaving that parameter empty. However, the user cannot switch the adapter at runtime. Switching adapters is not a scenario supported by the QLoRA design. The main goal of QLoRA is to use the LoRA weights to compensate for the loss caused by the 4-bit quantization of the base model, so it is a quantization technique. Switching LoRA adapters to support different fine-tuning scenarios, as in punica, is not among its design goals.
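To make the two modes concrete, here is a minimal usage sketch based on the parameter named above; the model and adapter names and the exact keyword arguments are illustrative assumptions, not the final interface of this PR:

```python
from vllm import LLM, SamplingParams

# Hypothetical usage sketch; model/adapter names are placeholders and the
# exact keyword arguments may differ from what this PR finally exposes.
llm = LLM(
    model="huggyllama/llama-7b",                          # base model, quantized to 4-bit via bitsandbytes
    quantization="bitsandbytes",
    qlora_adapter_name_or_path="path/to/qlora-adapter",   # leave unset to run without an adapter
)

outputs = llm.generate(["What is QLoRA?"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```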
Ok, that's what I wanted to confirm. Thanks for clearing it up. In that case:
Thanks for the suggestion. I will make the changes as suggested. Cheers!
Thank you for your excellent work. Here are some personal opinions:
If I am wrong, please correct me directly. Thanks again. Cheers!
I re-read the LoRA code carefully and saw that quantization is now supported in LoRA. It was not supported when I started my design and coding; sorry for missing that. I will rethink my design based on this change, as well as Yard1's suggestions. Thanks & Happy Coding!
I just updated the PR for QLoRA/BitsAndBytes with the suggested changes. Could you please take another look? Thanks for the great advice; I learned a lot and improved a lot. :-) BTW, I hit a lot of yapf errors in CI/CD. I found that the yapf errors are not from me. Should I just ignore them?
@chenqianfzh We cannot ignore format errors; you can run the project's format script locally to fix them.
Thanks, this is looking much cleaner! Left some comments, hope they will be useful.
We should also add a test for this - it's ok if it's just an end-to-end one (load a small model from the Hugging Face Hub and see if it works and gives good outputs)
Thanks for the feedback. Working on the changes now. |
The newly added file examples/qlora_inference.py is created for this purpose. In this file, bitsandbytes quantization both with and without LoRA adapters is tested. Here is the output I got in my local test (of the four runs, the last is without a LoRA adapter; the other three are with adapters):
@chenqianfzh example is fine, but we need an automated pytest test to run in CI to prevent regressions. |
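For reference, an automated end-to-end test along these lines might look like the following sketch; the model name, adapter path, and keyword arguments are placeholders, not the test that was eventually added:

```python
import pytest
from vllm import LLM, SamplingParams

# Sketch of an end-to-end CI check; names and parameters are illustrative.
@pytest.mark.parametrize("adapter", [None, "path/to/qlora-adapter"])
def test_bitsandbytes_qlora_e2e(adapter):
    llm = LLM(
        model="huggyllama/llama-7b",
        quantization="bitsandbytes",
        qlora_adapter_name_or_path=adapter,
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(temperature=0.0, max_tokens=16))
    # Sanity check: the model should produce non-empty text in both modes.
    assert outputs[0].outputs[0].text.strip()
```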
@chenqianfzh Can we add more quantization type examples in qlora_example.py, such as GPTQ+LoRA, so that users can refer to this script to learn how to utilize LoRA on quantized models? Thanks!
Force-pushed from 523c053 to 0ab5879
@mgoin Thanks for reviewing the PR! I updated the code per your comments. Could you take another look?
Hey, why do you make sure 'lm_head' is not quantized in your tests, while peft accepts 'lm_head' among the target_modules? I was trying to run inference for a model fine-tuned with QLoRA and I get the following error:
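For context, peft does allow 'lm_head' to be listed among the target modules; a minimal sketch of such a config (the values here are purely illustrative) is:

```python
from peft import LoraConfig

# Illustrative peft config; the point is only that peft accepts "lm_head"
# among target_modules, which the tests in this PR deliberately leave unquantized.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "lm_head"],
    task_type="CAUSAL_LM",
)
```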
Hi @chenqianfzh, it seems the bitsandbytes test is not working with Llama 3. Can you retest?
Hello, I appreciate your excellent work. However, I noticed the note that no TP or PP with QLoRA is supported yet and that it will be considered as the immediate next effort. May I ask when this support is expected to land?
Actually, TP on bnb was partially done in #5813, but the community decided to provide support for FP4 and 8-bit first. I will rebase PR #5813 after PR #7445 is merged. Hopefully it will be quick. :-) And thanks for checking in. I hope I can share my update with you soon.
Thank you for your reply. Looking forward to your future work! |
QLoRA (https://arxiv.org/abs/2305.14314) cuts memory consumption in LLM weight loading without degrading performance. The weights of the base model, which are quantized to 4-bit using bitsandbytes, are paired with low-rank but higher-precision LoRA weight matrices to generate the output.
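Conceptually, a QLoRA linear layer combines the dequantized 4-bit base weight with a low-rank correction; the sketch below only illustrates that idea and is not the vLLM kernel added in this PR:

```python
import torch

def qlora_linear(x, w_4bit, dequantize, lora_A, lora_B, scaling):
    """Conceptual QLoRA forward pass (illustration only, not the vLLM kernel).

    w_4bit holds the frozen base weight in 4-bit (e.g. bitsandbytes NF4) and is
    dequantized on the fly; lora_A/lora_B are the small higher-precision LoRA
    matrices that add a low-rank correction to the output.
    """
    base_out = x @ dequantize(w_4bit).t()        # dequantized 4-bit base weight
    lora_out = (x @ lora_A.t()) @ lora_B.t()     # low-rank LoRA path in fp16/bf16
    return base_out + scaling * lora_out
```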
This PR is the first step toward supporting QLoRA in vLLM. With this PR, the QLoRA authors' open models on Hugging Face are supported, such as:
Users can run with or without a QLoRA adapter.
So far, only Llama is supported as a base model; more will come in the future. As explained below, special consideration was given to extensibility for future changes and other models.
Also, no TP or PP with QLoRA is supported yet; it will be considered as the immediate next effort.
Explanation of Changes
The modified files mainly include:
The newly added files are: