What is the correct way to use quantized versions of Vicuna or Guanaco? #210
Comments
We haven't tested quantized models ourselves, but support for them is on our roadmap. Can you share your model, code, and the error message you get?
@armsp vLLM does not support quantization at the moment. However, could you let us know the data type and quantization method you use for these models? That will definitely help inform our development decisions.
For what it's worth, https://github.com/qwopqwop200/GPTQ-for-LLaMa and https://github.com/PanQiWei/AutoGPTQ seem to be the most common mentions from folks posting quantized models on Hugging Face lately, the latter more for general use. https://github.com/SqueezeAILab/SqueezeLLM seems to do a much better job at 3-bit in particular, but I haven't seen much general use of it yet; maybe down the road?
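For reference, a minimal AutoGPTQ-style quantization sketch, loosely following that library's quickstart; the model ID, calibration sentence, output directory, and 4-bit/group-size-128 settings are illustrative assumptions, not something taken from this thread:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "lmsys/vicuna-7b-v1.3"  # illustrative base checkpoint
quantized_model_dir = "vicuna-7b-4bit-128g"    # hypothetical output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# GPTQ needs a small calibration set; a single short example keeps the sketch minimal.
examples = [tokenizer("Vicuna is a chat model fine-tuned from LLaMA.")]

# 4-bit weights with group size 128 is the configuration most commonly seen on the Hub.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)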
@WoosukKwon @zhuohan123 Sure. I wasn't trying anything fancy, nor was I trying to quantize my own models. I was just trying to use the 4-bit and 5_1-bit quantized models that are already available, simply by changing the model path/name. I'm not quite sure how to use local models yet, but maybe that would work? For example, there is a consumer-hardware version of Guanaco. Here is what I get:
Downloading (…)lve/main/config.json: 100%|████████████████| 568/568 [00:00<00:00, 307kB/s]
INFO 06-23 10:59:16 llm_engine.py:59] Initializing an LLM engine with config: model='JosephusCheung/GuanacoOnConsumerHardware', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-23 10:59:16 tokenizer_utils.py:30] Using the LLaMA fast tokenizer in 'hf-internal-testing/llama-tokenizer' to avoid potential protobuf errors.
Traceback (most recent call last):
File "demo.py", line 20, in <module>
llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware") # Name or path of your model
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 55, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 151, in from_engine_args
engine = cls(*engine_configs, distributed_init_method, devices,
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 93, in __init__
worker = worker_cls(
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/worker/worker.py", line 45, in __init__
self.model = get_model(model_config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 39, in get_model
model = model_class(model_config.hf_config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 215, in __init__
self.model = LlamaModel(config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in __init__
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in <listcomp>
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 136, in __init__
self.mlp = LlamaMLP(
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 59, in __init__
self.gate_up_proj = ColumnParallelLinear(hidden_size, 2 * intermediate_size,
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
self.weight = Parameter(torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 6.00 GiB total capacity; 5.27 GiB already allocated; 0 bytes free; 5.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried the quantized versions of Vicuna: https://huggingface.co/vicuna/ggml-vicuna-7b-1.1, https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized, https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g
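For context, the script behind this trace is essentially the standard vLLM offline-inference pattern; a minimal sketch (the prompt and sampling settings are placeholders). Note that the log above shows dtype=torch.float16, so the weights are loaded in half precision regardless of the checkpoint, which is roughly 14 GB for a 7B model and cannot fit in 6 GB of VRAM:

from vllm import LLM, SamplingParams

# The engine loads the checkpoint as fp16 (see dtype=torch.float16 in the log),
# so a 7B model needs far more than 6 GiB and fails while allocating weights.
llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["What is quantization?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)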
Also, support for GGUF would be nice.
Closing, because quantization is now supported via GPTQ, AWQ, and SqueezeLLM.
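With that support in place, a pre-quantized checkpoint can be loaded by passing the quantization method explicitly; a minimal sketch, assuming a recent vLLM release (the exact set of accepted quantization values depends on the installed version, and whether this particular repo loads depends on the config files it ships):

from vllm import LLM

llm = LLM(
    model="TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g",  # GPTQ checkpoint mentioned earlier in the thread
    quantization="gptq",                            # recent releases also accept "awq" and "squeezellm"
)
print(llm.generate(["Hello"])[0].outputs[0].text)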
I have been trying to use quantized versions of models so that I can use my GPU, which has at most 6 GB of VRAM. However, nothing seems to work. How would I go about using 5-bit versions that stay under 6 GB of memory?
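For a sense of why 5-bit is the target, a back-of-envelope sketch of weight memory for a 7B-parameter model (activations, KV cache, and framework overhead excluded):

# Weight memory only: parameters * bits-per-weight / 8 bytes, shown in GiB.
params = 7e9
for bits in (16, 5, 4):
    gib = params * bits / 8 / 2**30
    print(f"{bits}-bit weights: ~{gib:.1f} GiB")
# ~13.0 GiB at fp16, ~4.1 GiB at 5-bit, ~3.3 GiB at 4-bit; only the
# quantized variants have any chance of fitting in 6 GiB of VRAM.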