vLLM: v0.5.4
from vllm import LLM
llm = LLM(model="unsloth/Qwen2-7B-Instruct-bnb-4bit", dtype="bfloat16", gpu_memory_utilization=0.95, quantization="bitsandbytes", load_format="bitsandbytes", enforce_eager=True)
I am trying to run inference with vLLM on models from the unsloth 4-bit collection: https://huggingface.co/collections/unsloth/4bit-instruct-models-6624b1c17fd76cbcf4d435c8
Some of the models work, such as unsloth/Mistral-Nemo-Base-2407-bnb-4bit and unsloth/Phi-3-mini-4k-instruct-bnb-4bit, but others fail, such as unsloth/Qwen2-7B-Instruct-bnb-4bit and unsloth/gemma-2-9b-it-bnb-4bit.
May I know the reason?
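For reference, the inference step that follows the constructor above would look roughly like this (the prompt and sampling parameters are illustrative, not from the original script); the failure below already happens at construction time, before any generation runs:

from vllm import SamplingParams

# Illustrative prompt and sampling settings, not taken from the original report.
sampling = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling)
print(outputs[0].outputs[0].text)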
WARNING 08-17 07:20:49 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-17 07:20:49 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='unsloth/Qwen2-7B-Instruct-bnb-4bit', speculative_config=None, tokenizer='unsloth/Qwen2-7B-Instruct-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=unsloth/Qwen2-7B-Instruct-bnb-4bit, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-17 07:20:50 model_runner.py:720] Starting to load model unsloth/Qwen2-7B-Instruct-bnb-4bit...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/moa/vllminfer.py", line 18, in <module>
[rank0]:     llm = LLM(model= args.llm_name , dtype='bfloat16',
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 158, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 445, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 946, in load_model
[rank0]:     self._load_weights(model_config, model)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 867, in _load_weights
[rank0]:     raise AttributeError(
[rank0]: AttributeError: Model BitsAndBytesModelLoader does not support BitsAndBytes quantization yet.
There's a bug in this error message; I will fix it. It should report something like:
AttributeError: Model Qwen2ForCausalLM does not support BitsAndBytes quantization yet.
BTW, some models currently do not support BNB yet; see: https://github.com/vllm-project/vllm/blob/v0.5.4/vllm/model_executor/model_loader/loader.py#L866
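As a quick way to tell whether a given checkpoint falls on the supported side, you can read its architecture from the Hub config and compare it against the list in the loader source linked above. The sketch below is illustrative only: the BNB_SUPPORTED set is a placeholder to be copied from loader.py for your installed vLLM version, not an authoritative list.

from transformers import AutoConfig

# Placeholder: copy the real set of architectures from
# vllm/model_executor/model_loader/loader.py for your vLLM version.
BNB_SUPPORTED = {"LlamaForCausalLM", "MistralForCausalLM", "Phi3ForCausalLM"}  # illustrative only

def check_bnb_support(model_id: str) -> None:
    # Read the model's config.json from the Hugging Face Hub to get its architecture name.
    cfg = AutoConfig.from_pretrained(model_id)
    for arch in getattr(cfg, "architectures", None) or []:
        status = "supported" if arch in BNB_SUPPORTED else "NOT supported"
        print(f"{model_id}: {arch} -> {status} for bitsandbytes loading")

check_bnb_support("unsloth/Qwen2-7B-Instruct-bnb-4bit")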
I get the same error with:
vllm serve /ai/qwen2-7b --host 0.0.0.0 --port 10860 --max-model-len 4096 --trust-remote-code --tensor-parallel-size 1 --dtype=half --quantization bitsandbytes --load-format bitsandbytes --enforce-eager
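Once the serve command starts cleanly (on a vLLM version whose loader supports the model for BNB), the server can be queried through its OpenAI-compatible API. A minimal sketch, assuming the host/port from the command above and the model path as the served model name:

from openai import OpenAI

# Points at the server started by `vllm serve`; the api_key can be any placeholder string.
client = OpenAI(base_url="http://localhost:10860/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/ai/qwen2-7b",  # same path passed to `vllm serve`
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)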
Check #9574; it fixes this issue.