GGUF support #1002
FYI, high throughput is hard with quantized models in general, regardless of framework. But if you can manage to run with a batch size (data parallelism) of less than 8, or ideally less than 4, then quantization will still increase throughput. At batch size 16, there is no gain with quantized models because of how quantized inference works.
Does the AWQ implementation support higher than 4 bits per weight, for example 8 bits?
Not yet. It’s 4-bit only at the moment.
Thank you. Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) and quantization (unpacking quantized data on demand) at the same time? These seem pretty orthogonal to me. I'm digging into LLM (transformer) implementations and have past coding experience, so I'm really interested in knowing the details.
Yes. The way quantization works is that weights are stored in 4 bits. At inference time, you dequantize them to FP16 to be able to perform the matrix multiplication. This dequantization is the essence of any quantized model, and it is why quantized models generally struggle with large batch sizes: you become compute-bound doing dequantization on top of the actual matrix multiplication. At batch size 1, you are memory-bound instead, which is where quantization speeds up inference by a great deal.
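To make that concrete, here is a minimal PyTorch sketch of the idea (not vLLM's actual kernel, which is fused CUDA under csrc; the packing layout and group size are simplified assumptions): the 4-bit weights are unpacked and rescaled to FP16, and only then fed into a normal FP16 matmul, so this extra work is paid on every forward pass.

```python
import torch

def dequant_matmul(x_fp16, qweight_int32, scales, zeros, group_size=128):
    """Illustrative 4-bit weight dequantization followed by an FP16 matmul.

    x_fp16:        (batch, in_features) activations in FP16
    qweight_int32: (in_features // 8, out_features) weights, 8 x 4-bit per int32
    scales, zeros: (in_features // group_size, out_features) per-group params
    """
    shifts = torch.arange(0, 32, 4, device=qweight_int32.device)
    # Unpack eight 4-bit values out of every int32 -> (in_features, out_features)
    w_int4 = (qweight_int32.unsqueeze(1) >> shifts[None, :, None]) & 0xF
    w_int4 = w_int4.reshape(-1, qweight_int32.shape[-1])
    # Apply the per-group zero point and scale to recover FP16 weights
    g = torch.arange(w_int4.shape[0], device=w_int4.device) // group_size
    w_fp16 = (w_int4.half() - zeros[g]) * scales[g]
    # The GEMM itself then runs in FP16, exactly as for an unquantized model
    return x_fp16 @ w_fp16
```

At batch size 1 the unpack/rescale cost hides behind the weight loads, but as the batch grows the matmul starts to saturate the GPU and the extra dequantization work becomes visible.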
On your documentation page there is an excellent high-level summary of how to add support for a new model. Could you please write down a few bullet points on where to look in the code (high level) if I want to add a new input format (GGUF)? I would start by searching for all the code involved in reading model configuration and parameters. I'm also aware that, at least for 8-bit operation, the dequantization code will need to be extended as well. Also, what do you think about AWQ? Should we expect 8-bit support to be added to AWQ in the near future? That would make any work on GGUF pointless. If I were to work on GGUF support, it would certainly be based on the AWQ branch. Which is the correct branch to look at? (There seem to be multiple and I'm a bit confused.) Thank you!
8-bit GGUF could be a perfect input format for the upcoming W8A8 inference mode, see #1112
Could you please elaborate more on this or point to some code?
The amount of work scales linearly. The problem is when you increase the batch size too much, because then your GPU will be 100% utilized just doing matrix multiplication. Once that happens, the dequantization overhead will start to show in how fast you can run inference compared to FP16. This is also why finding algorithms that can quantize to W4A4 (very hard) or W8A8 (hard) is essential for higher throughput: you remove the need for dequantization since you can run natively on tensor cores.
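For contrast, here is a rough W8A8 sketch (emulated with plain integer tensors; real kernels run the INT8×INT8→INT32 product on tensor cores, and production schemes quantize activations per-token rather than per-tensor): both operands are already integers, so there is no weight dequantization inside the matmul, only a single rescale of the output.

```python
import torch

def quantize_per_tensor(t):
    # Symmetric per-tensor quantization to int8 with one floating-point scale
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_matmul(x_fp16, w_fp16):
    xq, sx = quantize_per_tensor(x_fp16.float())
    wq, sw = quantize_per_tensor(w_fp16.float())
    # INT8 x INT8 with integer accumulation; upcast here only so the
    # emulation runs on CPU (real hardware accumulates in INT32 on tensor cores)
    acc = xq.to(torch.int64) @ wq.to(torch.int64)
    # One output rescale replaces dequantizing the whole weight matrix
    return (acc.to(torch.float32) * (sx * sw)).half()
```

The hard part is finding quantization algorithms that keep accuracy once the activations are quantized too, which is why the comment above calls W8A8 "hard" and W4A4 "very hard".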
I understand the motivation of W4A4 and W8A8, as everything can be done solely in INT4/INT8.
But if all the above blocks scale linearly (performance-wise) with batch size, the second scenario must always be faster for any batch size if it's faster for batch size 1.
This is the difference between memory-bound and compute-bound. At small batch sizes, you are memory-bound, meaning you are limited by how fast you can move the model's weights through memory. This makes quantized models faster. At large batch sizes, however, we move away from being memory-bound: it is no longer a question of how fast we can transport the weights, but of how much time we spend doing computations. You have to think of it as being stuck computing matmuls rather than waiting for weights to be transported through memory.
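A back-of-the-envelope roofline estimate makes the crossover visible. The numbers below are illustrative assumptions (7B parameters, roughly A100-class bandwidth and FP16 throughput), not measurements, and the model ignores the dequantization work itself, which pushes the practical crossover to even smaller batch sizes:

```python
# Per-decoding-step time estimate: either limited by streaming the weights
# from HBM (memory-bound) or by the matmul FLOPs (compute-bound).
params = 7e9                      # 7B-parameter model (assumption)
mem_bw = 2.0e12                   # ~2 TB/s HBM bandwidth (assumption)
compute = 312e12                  # ~312 TFLOPS FP16 (assumption)
bytes_fp16, bytes_int4 = 2.0, 0.5

for batch in (1, 4, 16, 64, 256):
    t_compute = 2 * params * batch / compute      # ~2 FLOPs per weight per token
    t_mem_fp16 = params * bytes_fp16 / mem_bw     # weights streamed once per step
    t_mem_int4 = params * bytes_int4 / mem_bw
    print(f"batch={batch:3d}  compute={t_compute*1e3:6.2f} ms  "
          f"weights FP16={t_mem_fp16*1e3:5.2f} ms  INT4={t_mem_int4*1e3:5.2f} ms")
```

With these assumptions the FP16 weights take about 7 ms to stream versus 1.75 ms for INT4, while the matmul time grows from well under a millisecond at batch 1 to several milliseconds at batch 64; once the compute term dominates, shrinking the weights no longer buys anything.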
@casper-hansen thanks for the clarification. I'm still trying to connect the dots. Is
Weight loading happens at startup time, and after that the weights are transported through registers. This process is not really transparent, but it all happens in the quantization kernel that you can find in csrc.
no GGUF support?
+1 GGUF
+1 for gguf please
Slowly we should go for EXL2 instead :)
Over the past two weeks, while I was learning the llama.cpp code and simultaneously writing a small repo for running GGUF inference in PyTorch, I thought it would be better to make vLLM work too. Through a process of trial and error, I've managed to develop a preliminary draft of GGUF support, which you can find in the gguf branch. As of now, it only works for llama and mixtral. First convert the gguf to a torch state dict and tokenizer file using the code in the
Then start up the vllm server as usual
Note:
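(The conversion code referenced above is not reproduced in this thread. Purely as an illustration of the first step, here is a heavily simplified sketch that assumes the `gguf` Python package shipped with llama.cpp; tensor-name remapping to vLLM's parameter names, tokenizer extraction, and dequantization of the K-quant tensor types are all omitted.)

```python
import torch
from gguf import GGUFReader  # pip install gguf

def gguf_to_state_dict(gguf_path):
    """Dump a GGUF file's tensors into a torch state dict (simplified sketch)."""
    reader = GGUFReader(gguf_path)
    state_dict = {}
    for tensor in reader.tensors:
        # tensor.data is a numpy view over the memory-mapped file; for F16/F32
        # tensors it is directly usable, while quantized types (Q4_K, Q5_K, ...)
        # come out as raw packed bytes and would still need dequantizing or a
        # matching quantized linear layer on the vLLM side.
        state_dict[tensor.name] = torch.from_numpy(tensor.data.copy())
    return state_dict

state_dict = gguf_to_state_dict("llama-2-7b.Q4_K_M.gguf")  # illustrative path
torch.save(state_dict, "converted/pytorch_model.bin")
```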
Great job, I will definitely try it as time allows. My primary use case is running DeepSeek Coder 33B (alternatively CodeLlama 34B). The A16 variant likely has better quality (correctness) anyway, so that's a good first choice. If this approach works well with GGUF, then supporting the EXL2 format may work as well.
+1
I made a few updates and moved it to the default branch. Quantized embedding layers and output layers are added, as well as the QxW8 kernels. However, the performance improvement over QxA16 seems marginal. I also made the gguf-to-torch conversion implicit, so it's easier to use now:
The single-request latency is slightly lower than llama.cpp's. The packing of GGUF is very unfriendly for GPU memory access, which makes it slower than other quant methods. I haven't found a way to measure throughput using the llama.cpp server, so there is no throughput comparison yet. I'll try to turn this into a formal PR later.
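If it helps for the comparison, one way to get a throughput number out of the branch is to time a batch of requests with vLLM's offline Python API and divide generated tokens by wall-clock time. The model path, prompt set, and sampling settings below are arbitrary placeholders, and passing the .gguf file directly assumes the implicit conversion described above:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="llama-2-7b.Q4_K_M.gguf")   # placeholder GGUF path
prompts = [f"Write a short poem about the number {i}." for i in range(256)]
params = SamplingParams(temperature=0.8, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s across {len(prompts)} requests")
```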
Can you please check whether vLLM can infer miqu-1-70b-sf-gptq correctly? It reports OOM on my machine. (Other 70B GPTQ models like Qwen are just fine.)
I have been trying the command below with vLLM version 0.3.0 on a Linux Ubuntu CPU machine:

python3 -m vllm.entrypoints.api_server --root-path models/ --model llama-2-7b.Q5_K_M.gguf --host 0.0.0.0 --port 8080

I am facing the error below, any help please?

File "/home/ubuntu/ragas/lib/python3.10/site-packages/transformers/utils/hub.py", line 406, in cached_file
Hi, is Mistral GGUF supported? I just tried but got an error.
Yes, I tested Mistral GGUF with no problem. Please be aware that you have to install the custom branch from source instead of the official build. As an alternative, you can also use aphrodite-engine, which also integrates GGUF support and is easier to install.
+1 gguf support please! For those of us who have downloaded a large archive of GGUF models, it would be a great benefit to use the vLLM project with the artifacts we already have downloaded and available, rather than downloading FP16 or AWQ versions and consuming more disk space.
+1 gguf support please
+1 please
+1 gguf support please
+1 gguf support please!!
+1 :)
+1
gguf support please!!
+1 for support. Thanks for working on this @Isotr0py.
+1
+1
+1
@mgoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 8: invalid continuation byte
Do you think it's better to re-open this feature request? Or have you ever seen this error?
@vbiral I can run the example code in a new conda environment with the latest nightly wheel.
@Isotr0py The strange thing is that I tried both the model you used in the example and bullerwins/Meta-Llama-3.1-70B-Instruct-GGUF. That's why I thought this might not be a problem with the file. At the end of the traceback I also got this error:

OSError: It looks like the config file at '/home/victor.placido/.cache/huggingface/hub/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf' is not a valid JSON file.

I'm running in a Linux environment; I will try on a Mac to check whether the error continues.
@vbiral Are you using the released 0.5.4 version wheel?

export VLLM_VERSION=0.5.4 # vLLM's main branch version is currently set to latest released tag
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
# You can also access a specific commit
# export VLLM_COMMIT=...
# pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
@Isotr0py
Motivation
AWQ is nice, but if you want more control over the bit depth (and thus VRAM usage), then GGUF may be a better option. A wide range of models is available from TheBloke at various bit depths, so everybody can use the biggest one that fits into their GPU.
I cannot find a high-throughput batch inference engine that can load GGUF; maybe there is none. (vLLM cannot load it either.)
Related resources
https://github.com/ggerganov/llama.cpp
https://huggingface.co/TheBloke