[Feature][Kernel] Support bitsandbytes quantization and QLoRA #4776
Conversation
Force-pushed from 05dbfb7 to 5e868f7
ping @Yard1
This completely bypasses the existing LoRA logic and implements its own. I don't think this is a good design and it clashes with already existing code. We should instead modify the LoRA support already present in vLLM to support QLoRA - it should also allow us to reuse a lot of existing code.
Thanks for your reply. You are not the first one to raise this concern. Actually, I asked myself the same question. :-) I did consider reusing LoRA in the first place. I had to start a new set of code because:
QLoRA, though it carries a very similar name, works for a totally different scenario, and thus cannot reuse the existing LoRA code in vLLM.
How about I add some comments somewhere to address your concern?
Is it theoretically possible for the QLoRA adapter to be loaded and unloaded at will?
I am not sure what you mean by "at will". Do you mean loading/unloading during runtime? In this implementation, the user can load an adapter by specifying "qlora_adapter_name_or_path" as a parameter when starting inference. The user can also run without an adapter by leaving that parameter empty. However, the user cannot switch the adapter at runtime. Switching adapters is not a scenario supported by the QLoRA design. The main goal of QLoRA is to use the LoRA weights to compensate for the loss caused by the 4-bit quantization of the base model, so it is a quantization technique. Switching LoRA adapters to support different fine-tuning scenarios, as in punica, is not among its design goals.
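To make the two modes concrete, here is a minimal usage sketch based on the parameter named above; the model and adapter names and the exact keyword arguments are illustrative assumptions, not the final interface of this PR:

```python
from vllm import LLM, SamplingParams

# Hypothetical usage sketch; model/adapter names are placeholders and the
# exact keyword arguments may differ from what this PR finally exposes.
llm = LLM(
    model="huggyllama/llama-7b",                          # base model, quantized to 4-bit via bitsandbytes
    quantization="bitsandbytes",
    qlora_adapter_name_or_path="path/to/qlora-adapter",   # leave unset to run without an adapter
)

outputs = llm.generate(["What is QLoRA?"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```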
Ok, that's what I wanted to confirm. Thanks for clearing it up. In that case:
Thanks for the suggestion. I will make the changes as suggested. Cheers!
Thank you for your excellent work. Here are some personal opinions:
If I am wrong, please correct me directly. Thanks again. Cheers!
I re-read the LoRA code carefully and saw that quantization is now supported in LoRA. It was not supported when I started my design and coding; sorry for missing that. I will rethink my design based on this change, as well as Yard1's suggestions. Thanks & Happy Coding!
I just updated the PR for QLoRA/BitsAndBytes with the suggested changes. Could you please take another look? Thanks for the great advice; I learned a lot and improved a lot. :-) BTW, I hit a lot of yapf errors in CI/CD. I found that the yapf errors are not from me. Should I just ignore them?
@chenqianfzh We cannot ignore format errors; you can run the project's format script locally to fix them.
Thanks, this is looking much cleaner! Left some comments, hope they will be useful.
We should also add a test for this - it's ok if it's just an end-to-end one (load a small model from the Hugging Face Hub and see if it works and gives good outputs)
Thanks for the feedback. Working on the changes now. |
The newly added file examples/qlora_inference.py is created for this purpose. In this file, bitsandbytes quantization both with and without LoRA adapters is tested. Here is the output I got in my local test (of the four runs, the last is without a LoRA adapter; the other three are with adapters):
@chenqianfzh example is fine, but we need an automated pytest test to run in CI to prevent regressions. |
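For reference, an automated end-to-end test along these lines might look like the following sketch; the model name, adapter path, and keyword arguments are placeholders, not the test that was eventually added:

```python
import pytest
from vllm import LLM, SamplingParams

# Sketch of an end-to-end CI check; names and parameters are illustrative.
@pytest.mark.parametrize("adapter", [None, "path/to/qlora-adapter"])
def test_bitsandbytes_qlora_e2e(adapter):
    llm = LLM(
        model="huggyllama/llama-7b",
        quantization="bitsandbytes",
        qlora_adapter_name_or_path=adapter,
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(temperature=0.0, max_tokens=16))
    # Sanity check: the model should produce non-empty text in both modes.
    assert outputs[0].outputs[0].text.strip()
```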
@chenqianfzh Can we add more quantization type examples in qlora_example.py, such as GPTQ+LoRA, so that users can refer to this script to learn how to utilize LoRA on quantized models? Thanks!
Force-pushed from 523c053 to 0ab5879
@mgoin Thanks for reviewing the PR! I updated the code per your comments. Could you take another look?
Hey, why do you make sure 'lm_head' is not quantized in your tests, while peft accepts 'lm_head' among the target_modules? I was trying to run inference for a model fine-tuned with QLoRA and I get the following error:
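For context, peft does allow 'lm_head' to be listed among the target modules; a minimal sketch of such a config (the values here are purely illustrative) is:

```python
from peft import LoraConfig

# Illustrative peft config; the point is only that peft accepts "lm_head"
# among target_modules, which the tests in this PR deliberately leave unquantized.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "lm_head"],
    task_type="CAUSAL_LM",
)
```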
Hi @chenqianfzh, it seems the bitsandbytes test is not working with Llama 3. Can you retest?
Hello, I appreciate your excellent work. However, I noticed the note that no TP or PP with QLoRA is supported yet and that it will be considered as the immediate next effort. May I ask when this support is expected to land?
Actually, TP on bnb was partially done in #5813, but the community decided to provide support for FP4 and 8-bit first. I will rebase PR #5813 after PR #7445 is merged. Hopefully it will be quick. :-) And thanks for checking in. I hope I can share my update with you soon.
Thank you for your reply. Looking forward to your future work! |
QLoRA (https://arxiv.org/abs/2305.14314) cuts memory consumption in LLM weight loading without degrading performance. The weights of the base model, which are quantized to 4-bit using bitsandbytes, are paired with low-rank but higher-precision LoRA weight matrices to generate the output.
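Conceptually, a QLoRA linear layer combines the dequantized 4-bit base weight with a low-rank correction; the sketch below only illustrates that idea and is not the vLLM kernel added in this PR:

```python
import torch

def qlora_linear(x, w_4bit, dequantize, lora_A, lora_B, scaling):
    """Conceptual QLoRA forward pass (illustration only, not the vLLM kernel).

    w_4bit holds the frozen base weight in 4-bit (e.g. bitsandbytes NF4) and is
    dequantized on the fly; lora_A/lora_B are the small higher-precision LoRA
    matrices that add a low-rank correction to the output.
    """
    base_out = x @ dequantize(w_4bit).t()        # dequantized 4-bit base weight
    lora_out = (x @ lora_A.t()) @ lora_B.t()     # low-rank LoRA path in fp16/bf16
    return base_out + scaling * lora_out
```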
This PR is the first step toward supporting QLoRA in vLLM. With this PR, the QLoRA authors' open models on Hugging Face are supported, such as:
Users can run with or without a QLoRA adapter.
So far, only Llama is supported as a base model; more will come in the future. As explained below, special consideration was given to extensibility for future changes and other models.
Also, no TP or PP with QLoRA is supported yet; it will be considered as the immediate next effort.
Explanation of Changes
The modified files mainly include:
The newly added files are: