Add 4-bit quantized inference to run BLOOM-176B on 2 A100 GPUs #2526
base: master
Conversation
```diff
@@ -128,7 +128,7 @@ def forward(
         input = input[0]
         input_type = input.dtype

-        if (self.config.fp16 or self.config.q_int8) \
+        if (self.config.fp16 or self.config.qunatize) \
```
Suggested change:
```diff
-        if (self.config.fp16 or self.config.qunatize) \
+        if (self.config.fp16 or self.config.quantize) \
```
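For context, here is a minimal sketch of how a guard like this is typically used in the inference forward path: when FP16 or quantized execution is enabled, float32 activations are cast down to half precision before the fused kernels run. The surrounding class and names other than the config flags are assumptions for illustration, not the actual DeepSpeed source.

```python
import torch

class InferenceLayerSketch:
    """Illustrative only: mirrors the guard in the diff above."""

    def __init__(self, config):
        self.config = config  # expected to expose .fp16 and .quantize booleans

    def forward(self, input):
        input_type = input.dtype
        # FP16 and quantized kernels expect half-precision activations,
        # so cast float32 inputs down before the fused ops run.
        if (self.config.fp16 or self.config.quantize) \
                and input.dtype == torch.float:
            input = input.half()
        # ... fused attention / MLP kernels would run here ...
        return input.to(input_type)
```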
Unrelated, but you might be interested in borrowing ideas from SmoothQuant, which seems to enable more performant quantisation, especially faster inference.
While the concept was applied to INT8, I don't see why it couldn't be applied to INT4.
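For reference, the core SmoothQuant trick is a per-channel rescaling that migrates activation outliers into the weights before quantization, so the matmul output is unchanged but the activations become much easier to quantize (to INT8 or, in principle, INT4). A rough sketch, where the alpha value, tensor layout, and function names are assumptions rather than anything tied to DeepSpeed or the SmoothQuant codebase:

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor,
                       weight: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel smoothing factors s_j = max|X_j|^a / max|W_j|^(1-a)."""
    w_absmax = weight.abs().amax(dim=0)  # weight shaped (out_features, in_features)
    return (act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)).clamp(min=1e-5)

def apply_smoothing(x: torch.Tensor, weight: torch.Tensor, s: torch.Tensor):
    # x @ weight.T == (x / s) @ (weight * s).T, so the result is identical,
    # but x / s has far smaller outliers and quantizes with less error.
    return x / s, weight * s
```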
@RezaYazdaniAminabadi Is there an updated version of this PR? I'm having some issues running it out of the box on CUDA using the following related branch in transformers:
This PR adds support for 4-bit quantization to DeepSpeed-Inference, making it possible to run large-scale models such as BLOOM-176B on 2x/4x fewer GPUs than the INT8/FP16 inference pipelines require.
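To make the idea concrete, here is a minimal sketch of group-wise 4-bit weight quantization and dequantization; this is not the actual DeepSpeed kernel or checkpoint format, and choices such as symmetric quantization and a group size of 64 are assumptions:

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 64):
    """Symmetric group-wise 4-bit quantization: integers in [-8, 7] plus FP16 scales."""
    groups = w.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale.half()  # in a real kernel, q would be packed two values per byte

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    """Recover an FP16 approximation of the original weight tensor."""
    return (q.float() * scale.float()).half().reshape(shape)
```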
As a first accuracy evaluation on 2 A100 GPUs, we see good-quality text generated using the prompt below:
How to run inference
You can find the running scripts here. To run the model with 4 bits, you can start from an FP16 or INT8 checkpoint; the DeepSpeed-Inference pipeline generates the 4-bit checkpoint on the fly and runs inference on as few as 2 A100-80G GPUs. Here is the command used to generate the above text:
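As a rough illustration only (the exact command lives in the linked scripts; the model name, `mp_size`, and `dtype` below are generic `deepspeed.init_inference` arguments, not this PR's 4-bit-specific flags), the Python-side setup looks roughly like this when launched with the deepspeed launcher:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Sketch: wrap an FP16 BLOOM checkpoint with DeepSpeed-Inference kernel
# injection across 2 tensor-parallel GPUs. The on-the-fly 4-bit checkpoint
# generation described in this PR happens inside DeepSpeed and is not shown.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom",
                                             torch_dtype=torch.float16)
model = deepspeed.init_inference(model,
                                 mp_size=2,                  # tensor-parallel degree
                                 dtype=torch.int8,           # quantized kernel path
                                 replace_with_kernel_inject=True)
```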
Here are the performance stats for running on 2 A100 GPUs:
Compared to INT8 performance on 4 GPUs, batch-1 latency increases from 160.5 ms to 283.4 ms, but the number of GPUs is halved, which reduces the inference cost by about 13%.
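As a back-of-the-envelope check, assuming cost scales with GPU-count times latency (a simplification that ignores throughput and pricing details):

```python
# Rough cost model: cost per batch ~ number of GPUs * per-batch latency.
int8_cost = 4 * 160.5   # 4 GPUs x 160.5 ms = 642.0 GPU-ms
int4_cost = 2 * 283.4   # 2 GPUs x 283.4 ms = 566.8 GPU-ms
print(f"{1 - int4_cost / int8_cost:.1%}")  # ~11.7%, roughly in line with the ~13% above
```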
We will add more performance results to evaluate the throughput improvement.
More to come
cc: @jeffra @yaozhewei @cmikeh2