
What is the correct way to use quantized versions of vicuna or guanco? #210

Closed
armsp opened this issue Jun 22, 2023 · 6 comments

Comments

@armsp

armsp commented Jun 22, 2023

I have been trying to use quantized versions of models on my GPU, which has at most 6 GB of VRAM. However, nothing seems to work. How would I go about using 5-bit versions that stay under 6 GB of memory?

@zhuohan123
Member

We haven't tested quantized models ourselves, but supporting them is on our roadmap. Can you share your model, code, and the error message you get?

@WoosukKwon
Collaborator

@armsp vLLM does not support quantization at the moment. However, could you let us know the data type and quantization method you use for your models? That would definitely help us decide what to develop next.

@dillonroach

For what it's worth, https://github.com/qwopqwop200/GPTQ-for-LLaMa and https://github.com/PanQiWei/AutoGPTQ seem to be the most common mentions from folks posting quantized models on Hugging Face lately, the latter more for general use. https://github.com/SqueezeAILab/SqueezeLLM seems to do a much better job at 3-bit in particular, but I haven't seen much general use of it yet; maybe down the road?
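
As a rough sketch of what using one of those GPTQ checkpoints looks like today (outside vLLM, via AutoGPTQ) — the model name is one of the uploads mentioned later in this thread, and the exact kwargs may differ across AutoGPTQ versions:

```python
# Sketch: load a 4-bit GPTQ checkpoint with AutoGPTQ directly (not vLLM).
# Assumes `pip install auto-gptq transformers`; kwargs vary a bit by version,
# and older uploads may additionally need model_basename=...
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"  # example checkpoint from this thread

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",        # 4-bit 7B weights should fit under 6 GB
    use_safetensors=True,   # most recent uploads ship safetensors
)

inputs = tokenizer("What is quantization?", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```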

@armsp
Author

armsp commented Jun 23, 2023

@WoosukKwon @zhuohan123 Sure. I wasn't trying anything fancy, nor was I trying to quantize my own models. I was just trying to use the publicly available 4-bit and 5_1-bit quantized models, and I did that by simply changing the model path/name. I'm not quite sure how to use local models yet, but maybe that works? For example, there is a consumer-hardware version of JosephusCheung/Guanaco -> https://huggingface.co/JosephusCheung/GuanacoOnConsumerHardware, which is supposed to take only 5+ GB of VRAM.
The error I get for that is:

Downloading (…)lve/main/config.json: 100%|████████████████| 568/568 [00:00<00:00, 307kB/s]
INFO 06-23 10:59:16 llm_engine.py:59] Initializing an LLM engine with config: model='JosephusCheung/GuanacoOnConsumerHardware', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-23 10:59:16 tokenizer_utils.py:30] Using the LLaMA fast tokenizer in 'hf-internal-testing/llama-tokenizer' to avoid potential protobuf errors.
Traceback (most recent call last):
  File "demo.py", line 20, in <module>
    llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware")  # Name or path of your model
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 151, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 93, in __init__
    worker = worker_cls(
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/worker/worker.py", line 45, in __init__
    self.model = get_model(model_config)
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 39, in get_model
    model = model_class(model_config.hf_config)
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 215, in __init__
    self.model = LlamaModel(config)
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in __init__
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in <listcomp>
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 136, in __init__
    self.mlp = LlamaMLP(
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 59, in __init__
    self.gate_up_proj = ColumnParallelLinear(hidden_size, 2 * intermediate_size,
  File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
    self.weight = Parameter(torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 6.00 GiB total capacity; 5.27 GiB already allocated; 0 bytes free; 5.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried these quantized versions of Vicuna: https://huggingface.co/vicuna/ggml-vicuna-7b-1.1, https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized, https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g
For other models, the errors I get are either OOM or that I need login access to the Hugging Face Hub.
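
For reference, a quick back-of-the-envelope on the weight memory alone (rough numbers, ignoring the KV cache and activations) shows why the fp16 path can't fit on this card no matter which checkpoint name is passed in:

```python
# Rough weight-memory estimate for a 7B-parameter model (weights only;
# KV cache and activations make the real footprint larger).
params = 7e9

for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")

# fp16:  ~13.0 GiB -> far over a 6 GiB card, hence the CUDA OOM above
# 4-bit:  ~3.3 GiB -> would fit, but vLLM loads fp16 weights at this point
```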

@thistleknot

Also, support for GGUF would be nice.
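
In the meantime, GGUF files are typically run through llama.cpp's Python bindings rather than vLLM; a minimal sketch, assuming `llama-cpp-python` is installed with GPU support and the file path below is just illustrative:

```python
# Sketch: running a GGUF checkpoint with llama-cpp-python (not vLLM).
# The model path is hypothetical; any 4/5-bit GGUF under ~6 GB should fit.
from llama_cpp import Llama

llm = Llama(
    model_path="./vicuna-7b-v1.5.Q4_K_M.gguf",  # illustrative local file
    n_gpu_layers=-1,   # offload all layers to the GPU if it has room
    n_ctx=2048,
)

out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```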

@hmellor
Collaborator

hmellor commented Mar 8, 2024

Closing because quantization is now supported via GPTQ, AWQ, and SqueezeLLM.
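
For anyone finding this later, a minimal sketch of what that looks like (the model name and memory settings are illustrative, not from this thread; check your vLLM version's docs for the supported `quantization` values):

```python
# Sketch: serving a pre-quantized checkpoint with vLLM's quantization support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/vicuna-7B-v1.5-AWQ",   # an AWQ-quantized upload (example)
    quantization="awq",                    # or "gptq" / "squeezellm"
    gpu_memory_utilization=0.90,
    max_model_len=2048,                    # keep the KV cache small on a 6 GB card
)

outputs = llm.generate(["What is quantization?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```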

hmellor closed this as completed Mar 8, 2024
yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
jikunshang pushed a commit to jikunshang/vllm that referenced this issue Sep 6, 2024
…ct#210)

Fixes an assert seen when "prompt_logprobs is not None" and batch size > 1. The assert was due to the shape of the paddings added to match the sampling_metadata.selected_token_indices shape for the case where prompt_logprobs is configured.