
GGUF support #1002

Closed · viktor-ferenczi opened this issue Sep 10, 2023 · 49 comments · Fixed by #5191

Comments

@viktor-ferenczi
Contributor

Motivation

AWQ is nice, but if you want more control over the bit depth (and thus VRAM usage), then GGUF may be a better option. A wide range of models is available from TheBloke at various bit depths, so everyone can use the largest one that fits into their GPU.

I cannot find a high-throughput batch inference engine that can load GGUF; maybe there is none. (vLLM cannot load it either.)

Related resources

https://github.com/ggerganov/llama.cpp

https://huggingface.co/TheBloke

@casper-hansen
Contributor

FYI, high throughput is hard with quantized models in general, regardless of the framework. But if you can manage to run with a batch size (data parallelism) of less than 8, or ideally less than 4, it will increase throughput. At 16, there is no gain with quantized models because of how quantized inference works.

@viktor-ferenczi
Contributor Author

Does the AWQ implementation support higher than 4 bits per weight, for example 8 bits?

@casper-hansen
Contributor

Does the AWQ implementation support higher than 4 bits per weight, for example 8 bits?

Not yet. It’s 4-bit only at the moment.

@viktor-ferenczi
Contributor Author

Thank you.

Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) and quantization (unpacking quantized data on demand) at the same time? These seem pretty orthogonal to me.

I'm digging into LLM (transformer) implementations and have prior coding experience, so I'm really interested in the details.

@casper-hansen
Contributor

Thank you.

Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) and quantization (unpacking quantized data on demand) at the same time? These seem pretty orthogonal to me.

I'm digging into LLM (transformer) implementations and have prior coding experience, so I'm really interested in the details.

Yes. The way quantization works is that the weights are quantized to 4 bits. Then, at inference time, you run dequantization back to FP16 to be able to perform the matrix multiplication. This dequantization step is the essence of any quantized model and is why quantized models generally struggle with large batch sizes: the workload becomes compute-bound on dequantization rather than on the actual matrix multiplication. At batch size 1, you are memory-bound, which speeds up inference a great deal.
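
For illustration, the two steps can be sketched in PyTorch roughly like this (shapes, group size, and tensor names are made up; real kernels keep the weights packed and fuse both steps in a single GPU kernel):

import torch

# Hypothetical 4096x4096 linear layer with 4-bit weights, stored unpacked in uint8
# for clarity (real kernels pack two 4-bit values per byte). Assumes a CUDA device.
out_features, in_features, group_size = 4096, 4096, 128
qweight = torch.randint(0, 16, (out_features, in_features), dtype=torch.uint8, device="cuda")
scales = torch.rand(out_features, in_features // group_size, dtype=torch.float16, device="cuda")
zeros = torch.full_like(scales, 8.0)

x = torch.randn(1, in_features, dtype=torch.float16, device="cuda")  # batch size 1

# Step 1: dequantize INT4 -> FP16 (the extra work every quantized model pays).
w = (qweight.to(torch.float16) - zeros.repeat_interleave(group_size, dim=1)) \
    * scales.repeat_interleave(group_size, dim=1)

# Step 2: the actual FP16 matrix multiplication.
y = x @ w.t()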

@viktor-ferenczi
Contributor Author

On your documentation page there is an excellent high-level summary of how to add support for a new model.

Could you please write down a few bullet points on where to look in the code (high level) if I want to add a new input format (GGUF)?

I would start by searching for all the code involved in reading model configuration and parameters.

I'm also aware that, for 8-bit operation, at least the dequantization code will need to be extended as well.

Also, what do you think about AWQ? Should we expect 8-bit support to be added to AWQ in the near future? That would make any work on GGUF pointless.

If I were to work on GGUF support, it would certainly be based on the AWQ branch. Which is the correct branch to look at? (There seem to be multiple, and I'm a bit confused.)

Thank you!

@viktor-ferenczi
Contributor Author

viktor-ferenczi commented Sep 21, 2023

8-bit GGUF could be a perfect input format for the upcoming W8A8 inference mode, see #1112

@sh1ng
Contributor

sh1ng commented Nov 1, 2023

Thank you.
Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) and quantization (unpacking quantized data on demand) at the same time? These seem pretty orthogonal to me.
I'm digging into LLM (transformer) implementations and have prior coding experience, so I'm really interested in the details.

Yes. The way quantization works is that the weights are quantized to 4 bits. Then, at inference time, you run dequantization back to FP16 to be able to perform the matrix multiplication. This dequantization step is the essence of any quantized model and is why quantized models generally struggle with large batch sizes: the workload becomes compute-bound on dequantization rather than on the actual matrix multiplication. At batch size 1, you are memory-bound, which speeds up inference a great deal.

Could you please elaborate on this or point to some code?
At first glance, the amount of work is proportional to the batch size in both cases (dequantization, matrix multiplication). Or is matrix multiplication implemented more efficiently at larger batches (with some CUDA tricks) while dequantization stays the same?

@casper-hansen
Contributor

casper-hansen commented Nov 1, 2023

The amount of work scales linearly. The problem is when you increase the batch size too much, because then your GPU is 100% utilized just doing matrix multiplication. Once that happens, the dequantization overhead starts showing up in how fast you can run inference compared to FP16.

This is also why finding algorithms that can quantize to W4A4 (very hard) or W8A8 (hard) is essential for higher throughput: it removes the need for dequantization, since you can run natively on tensor cores.

@sh1ng
Contributor

sh1ng commented Nov 1, 2023

I understand the motivation for W4A4 and W8A8, as everything can be done solely in INT4/INT8.
But what I don't fully understand is the following:

load weights in fp16  -> matrix multiplication in fp16
load weights in int4 or int8 -> dequantization to fp16 -> matrix multiplication in fp16

If all the above blocks scale linearly (performance-wise) with batch size, the second scenario must always be faster for any batch size whenever it's faster at batch_size=1. So I expect some non-linearity in one of the steps.

@casper-hansen
Contributor

casper-hansen commented Nov 1, 2023

This is the difference between memory-bound and compute-bound. At small batch sizes, you are memory-bound, meaning you are limited by how fast you can move the model's weights through memory. This makes quantized models faster.

However, when we talk about large batch sizes, we move away from being memory-bound. It is no longer a question of how fast we can transport the weights but rather of how much time we spend doing computations. You have to think of it as being stuck computing matmuls rather than waiting for weights to arrive from memory.
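
A back-of-the-envelope model of this crossover, with purely illustrative hardware numbers, also answers the linearity question above: the non-linearity comes from taking the maximum of a memory bound and a compute bound, not from either step alone.

# Toy roofline estimate for one linear layer; all numbers are illustrative only.
bandwidth = 2e12      # ~2 TB/s of memory bandwidth
peak_flops = 312e12   # ~312 TFLOPS of FP16 tensor-core compute

in_features = out_features = 8192
weight_bytes_fp16 = in_features * out_features * 2   # 2 bytes per FP16 weight
weight_bytes_int4 = in_features * out_features // 2  # 0.5 bytes per INT4 weight

def step_time(batch, weight_bytes, dequant_flops_per_weight=0):
    matmul_flops = 2 * batch * in_features * out_features
    dequant_flops = dequant_flops_per_weight * in_features * out_features
    # Runtime is roughly whichever bound dominates.
    return max(weight_bytes / bandwidth, (matmul_flops + dequant_flops) / peak_flops)

for batch in (1, 64, 512):
    t_fp16 = step_time(batch, weight_bytes_fp16)
    t_int4 = step_time(batch, weight_bytes_int4, dequant_flops_per_weight=2)
    print(f"batch={batch:4d}  fp16={t_fp16 * 1e6:6.1f} us  int4={t_int4 * 1e6:6.1f} us")

At batch 1 the INT4 layer wins because it moves a quarter of the bytes; once both are compute-bound the advantage is gone and the dequantization work is pure overhead. Real quantized kernels hit that wall earlier than this toy model suggests, because the dequantization is typically repeated for every output tile rather than done once.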

@sh1ng
Contributor

sh1ng commented Nov 1, 2023

@casper-hansen thanks for the clarification. I'm still trying to connect the dots.

Is loading -> dequantization -> multiplication fused into a single kernel? Could you point me to some source code?

@casper-hansen
Contributor

Weight loading happens at startup time, and then the weights are transported through registers. This process is not really transparent, but it all happens in the quantization kernel that you can find in csrc.
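
For intuition, here is a rough PyTorch emulation of what such a fused kernel does logically: unpack two 4-bit values per byte, apply a scale and zero-point, and feed the result into the matmul. The low-nibble-first packing below is chosen for readability and is not the exact layout used by the kernels in csrc, which also do this per tile in registers instead of materializing the FP16 weights.

import torch

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    # Unpack a uint8 tensor holding two 4-bit values per byte (low nibble first).
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack([low, high], dim=-1).flatten(-2)

# Toy example: 8 weights packed into 4 bytes, one scale and zero-point for the row.
packed = torch.tensor([[0x21, 0x43, 0x65, 0x87]], dtype=torch.uint8)
scale, zero = 0.1, 8

w = (unpack_int4(packed).float() - zero) * scale  # ~[[-0.7, -0.6, ..., 0.0]]
x = torch.randn(1, 8)
y = x @ w.t()                                     # matmul on the dequantized weights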

@ogcatt

ogcatt commented Nov 26, 2023

no GGUF support?

@Namec999

+1 GGUF

@delta-whiplash

+1 for gguf please

@viktor-ferenczi
Contributor Author

Slowly we should go for EXL2 instead :)

@chu-tianxiang
Contributor

Over the past two weeks, while learning the llama.cpp code and writing a small repo that runs GGUF inference in PyTorch, I thought it would be better to make vLLM work too. Through trial and error, I've managed to develop a preliminary draft of GGUF support, which you can find in the gguf branch. As of now, it only works for Llama and Mixtral.

First, convert the GGUF file to a torch state dict and tokenizer file using the code in the examples folder:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python convert_gguf_to_torch.py --input mixtral-8x7b-instruct-v0.1.Q2_K.gguf --output mixtral-q2k

Then start the vLLM server as usual:

python -m vllm.entrypoints.api_server --model mixtral-q2k --quantization gguf

Note:

  1. llama.cpp quantizes both the embedding layers and the output layers. For simplicity, I currently dequantize them before loading them into the model (see the sketch below), but a cleaner solution that loads them natively is definitely needed.
  2. llama.cpp implements two sets of kernels, WxA8 (see #2067 and #2160) and WxA16. I haven't read the WxA8 CUDA code yet and have only ported the WxA16 part so far, so the latency may be inferior.
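
For reference, this is roughly what dequantizing the simplest GGUF type, Q8_0, involves: each block is one little-endian fp16 scale followed by 32 int8 values. The K-quants such as Q2_K are considerably more involved. A minimal NumPy sketch:

import numpy as np

BLOCK_SIZE = 32  # Q8_0: one fp16 scale followed by 32 int8 quants per block

def dequantize_q8_0(raw: bytes) -> np.ndarray:
    # Dequantize a buffer of consecutive Q8_0 blocks into float32 weights.
    block_bytes = 2 + BLOCK_SIZE  # 2-byte scale + 32 int8 values
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, block_bytes)
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n_blocks, 1)
    quants = blocks[:, 2:].copy().view(np.int8).astype(np.float32)     # (n_blocks, 32)
    return (scales * quants).reshape(-1)

# Toy round trip: quantize one block of random weights, then dequantize it again.
w = np.random.randn(BLOCK_SIZE).astype(np.float32)
d = np.float16(np.abs(w).max() / 127.0)
q = np.round(w / np.float32(d)).astype(np.int8)
print(np.abs(dequantize_q8_0(d.tobytes() + q.tobytes()) - w).max())  # small error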

@viktor-ferenczi
Contributor Author

viktor-ferenczi commented Jan 19, 2024

Great job, I will definitely try it as time allows.

My primary use case is running DeepSeek Coder 33B (alternatively CodeLlama 34B) with --tensor-parallel=2.

The A16 likely has better quality (correctness) anyway, so that's a good first choice.

If this approach works well with GGUF, then supporting the EXL2 format may work as well.

@theobjectivedad

+1

1 similar comment
@SODAsoo07

+1

@chu-tianxiang
Contributor

I made a few updates and moved it to the default branch. Quantized embedding layers and output layers are now supported, as well as the QxW8 kernels. However, the performance improvement over QxA16 seems marginal. I also made the gguf-to-torch conversion implicit, so it's easier to use now:

python -m vllm.entrypoints.api_server --model miqu-1-70b.q2_K.gguf

The single-request latency is slightly lower than llama.cpp's. The packing of GGUF is very unfriendly to GPU memory access, making it slower than other quant methods. I haven't found a way to measure throughput using the llama.cpp server, so there is no throughput comparison yet. I'll try making it into a formal PR later.
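
Once the server is up, it can be queried like the stock demo api_server; a minimal client sketch (the /generate endpoint and field names follow vLLM's example api_server, adjust if the branch differs):

import json
import requests  # third-party HTTP client

# Assumes the server started above is listening on the default localhost:8000.
payload = {
    "prompt": "Write a haiku about quantization:",
    "max_tokens": 64,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/generate", json=payload)
print(json.dumps(resp.json(), indent=2))  # the demo server returns {"text": [...]}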

@tutu329

tutu329 commented Feb 13, 2024

I made a few updates and moved it to the default branch. Quantized embedding layers and output layers are now supported, as well as the QxW8 kernels. However, the performance improvement over QxA16 seems marginal. I also made the gguf-to-torch conversion implicit, so it's easier to use now:

python -m vllm.entrypoints.api_server --model miqu-1-70b.q2_K.gguf

The single-request latency is slightly lower than llama.cpp's. The packing of GGUF is very unfriendly to GPU memory access, making it slower than other quant methods. I haven't found a way to measure throughput using the llama.cpp server, so there is no throughput comparison yet. I'll try making it into a formal PR later.

Can you please check whether vLLM can run inference on miqu-1-70b-sf-gptq correctly? It reports OOM on my machine (other 70B GPTQ models like Qwen are fine).

@paramssk

I have been trying the command below with vLLM version 0.3.0 on a Linux Ubuntu CPU machine.

python3 -m vllm.entrypoints.api_server --root-path models/ --model llama-2-7b.Q5_K_M.gguf --host 0.0.0.0 --port 8080

I'm getting the error below; any help would be appreciated.

File "/home/ubuntu/ragas/lib/python3.10/site-packages/transformers/utils/hub.py", line 406, in cached_file
raise EnvironmentError(
OSError: llama-2-7b.Q5_K_M.gguf is not a local folder and is not a valid model identifier listed on 'https://hugg
ingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in wi
th huggingface-cli login or by passing token=<your_token>

@Kev1ntan

Kev1ntan commented Mar 4, 2024

I made a few updates and moved it to the default branch. Quantized embedding layers and output layers are now supported, as well as the QxW8 kernels. However, the performance improvement over QxA16 seems marginal. I also made the gguf-to-torch conversion implicit, so it's easier to use now:

python -m vllm.entrypoints.api_server --model miqu-1-70b.q2_K.gguf

The single-request latency is slightly lower than llama.cpp's. The packing of GGUF is very unfriendly to GPU memory access, making it slower than other quant methods. I haven't found a way to measure throughput using the llama.cpp server, so there is no throughput comparison yet. I'll try making it into a formal PR later.

Hi, does it support Mistral GGUF? I just tried but got an error:
OSError: It looks like the config file at '../models/FL1-base-7B-Q8_0.gguf' is not a valid JSON file.

@chu-tianxiang
Contributor

I made a few updates and moved it to the default branch. Quantized embedding layers and output layers are now supported, as well as the QxW8 kernels. However, the performance improvement over QxA16 seems marginal. I also made the gguf-to-torch conversion implicit, so it's easier to use now:

python -m vllm.entrypoints.api_server --model miqu-1-70b.q2_K.gguf

The single-request latency is slightly lower than llama.cpp's. The packing of GGUF is very unfriendly to GPU memory access, making it slower than other quant methods. I haven't found a way to measure throughput using the llama.cpp server, so there is no throughput comparison yet. I'll try making it into a formal PR later.

Hi, does it support Mistral GGUF? I just tried but got an error: OSError: It looks like the config file at '../models/FL1-base-7B-Q8_0.gguf' is not a valid JSON file.

Yes, I tested Mistral GGUF with no problem. Please be aware that you have to install the custom branch from source instead of the official build. As an alternative, you can use aphrodite-engine, which also integrates GGUF support and is easier to install.

@lutemartin

+1 gguf support please!

For those of us that have downloaded a large archive of gguf models, it would be a great benefit to use the vLLM project with the artifacts we have already downloaded and available, rather than downloading fp16 or awq and consuming more disk resources.

1 similar comment
@micsama

micsama commented Apr 26, 2024

+1 gguf support please!

For those of us that have downloaded a large archive of gguf models, it would be a great benefit to use the vLLM project with the artifacts we have already downloaded and available, rather than downloading fp16 or awq and consuming more disk resources.

@FarisHijazi

+1 gguf support please

1 similar comment
@Searcherr

+1 gguf support please

@fablerq

fablerq commented Apr 30, 2024

+1 please

1 similar comment
@derekhsu

derekhsu commented May 2, 2024

+1 please

@dibu28

dibu28 commented May 4, 2024

+1 gguf support please

@Jaka-Kocjanc

+1 gguf support please!!

@DavidPeleg6

+1 :)

@DenisDiachkov

+1

1 similar comment
@dibu28

dibu28 commented May 22, 2024

+1

@PRAVEENKUMAR2003

gguf support please!!

@gittb

gittb commented Jun 8, 2024

+1 for support. Thanks for working on this @Isotr0py.

@FarisHijazi

+1

1 similar comment
@hruday-markonda

+1

@ichrnkv

ichrnkv commented Jul 20, 2024

+1
Thanks!

@mahiatlinux

+1
Rooting for GGUF too.

@vbiral

vbiral commented Aug 6, 2024

@mgoin
I'm having some problems running the example provided in the dev docs (https://docs.vllm.ai/en/latest/getting_started/examples/gguf_inference.html)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 8: invalid continuation byte

Do you think it's better to re-open this feature request? Or have you ever seen this error?

@Isotr0py
Collaborator

Isotr0py commented Aug 6, 2024

@vbiral I can run the example code in a new conda environment with the latest nightly wheel.
It seems the GGUF model file is corrupted, according to the error. Can you check the GGUF file's integrity?
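
One quick integrity check is the header: every GGUF file starts with the 4-byte ASCII magic GGUF followed by a uint32 version, so a truncated or corrupted download usually fails this. A minimal sketch (the path is just an example):

import struct

def check_gguf_header(path: str) -> None:
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))
        print(f"GGUF version {version}, header looks OK")

check_gguf_header("tinyllama-1.1b-chat-v1.0.Q4_0.gguf")  # example local path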

@vbiral

vbiral commented Aug 6, 2024

@Isotr0py
I used a new venv and installed the 0.5.4 version.

The strange thing is that I tried both the model you used in the example and bullerwins/Meta-Llama-3.1-70B-Instruct-GGUF. That's why I thought this might not be a problem with the file.

At the end of the traceback I got this error too:

OSError: It looks like the config file at '/home/victor.placido/.cache/huggingface/hub/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf' is not a valid JSON file.

I'm running in a Linux environment; I'll try on a Mac to check whether the error persists.

@Isotr0py
Collaborator

Isotr0py commented Aug 6, 2024

@vbiral Are you using the released 0.5.4 wheel?
The released 0.5.4 wheel doesn't include GGUF support yet; you should use the nightly wheel instead:

export VLLM_VERSION=0.5.4 # vLLM's main branch version is currently set to latest released tag
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
# You can also access a specific commit
# export VLLM_COMMIT=...
# pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl

@vbiral

vbiral commented Aug 6, 2024

@Isotr0py
Thanks! It worked!
Congratulations on the great work!
Sorry to bother you.
