[Bug]: Phi3 still not supported #4375
Comments
Release version 0.4.1 does not yet support Phi-3. You can build from source on the main branch, which does support it.
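For reference, a minimal sketch of building vLLM from source on the main branch, assuming a working CUDA toolchain; an editable pip install compiles the CUDA kernels, which is why the first build can take a long time.

# Build and install vLLM from the main branch
# (compiles CUDA kernels, so the first build is slow).
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .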
When do you plan to release 0.4.1.post with official Phi-3 support? Also, thanks a lot for the amazing work you've been doing!
+1
Phi-3 is still not supported on the main branch even though #4298 is merged. Is there an estimate for the official release date? Thanks.
+1
+1
+1
+1, Phi-3 is still not supported on the main branch even though #4298 is merged.
+1, Phi-3 still has bugs: generation does not stop.
I cannot run Phi-3 with the following command:

docker run -it --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
  -p 8081:8000 \
  --ipc=host \
  --name vllm-openai-phi3 \
  vllm/vllm-openai:v0.4.2 \
  --model microsoft/Phi-3-mini-128k-instruct \
  --max-model-len 128000 \
  --dtype float16

I have tried to build the image from source (main branch), but it takes a long time (45 minutes and still not finished).
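A hedged sketch of building the OpenAI-server image from a checkout of the main branch; the --target name is an assumption about the multi-stage Dockerfile at the repository root, so adjust it if your checkout differs.

# From a checkout of the vLLM main branch, build the OpenAI-server image
# locally instead of pulling the released one. The first build is slow
# because the CUDA kernels are compiled; BuildKit caching helps on rebuilds.
cd vllm
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    -t vllm/vllm-openai:dev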
Are you still encountering this issue in the latest version?
In my setup, nvidia-smi shows:
Fri Jun 14 07:06:10 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:06:00.0 Off | N/A |
| 54% 23C P8 12W / 350W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:07:00.0 Off | N/A |
| 54% 22C P8 13W / 350W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

Logs

docker compose up
WARN[0000] Found orphan containers ([backend-vllm-openai-phi3-1]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
[+] Running 2/2
✔ Network backend_default Created 0.1s
✔ Container backend-vllm-openai-1 Created 0.1s
Attaching to backend-vllm-openai-1
backend-vllm-openai-1 | /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
backend-vllm-openai-1 | warnings.warn(
backend-vllm-openai-1 | WARNING 06-14 07:04:02 config.py:1155] Casting torch.bfloat16 to torch.float16.
backend-vllm-openai-1 | 2024-06-14 07:04:06,104 INFO worker.py:1749 -- Started a local Ray instance.
backend-vllm-openai-1 | INFO 06-14 07:04:07 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='cognitivecomputations/dolphin-2.9-llama3-8b', speculative_config=None, tokenizer='cognitivecomputations/dolphin-2.9-llama3-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=cognitivecomputations/dolphin-2.9-llama3-8b)
backend-vllm-openai-1 | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] Traceback (most recent call last):
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] return executor(*args, **kwargs)
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] self.worker = worker_class(*args, **kwargs)
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] self.model_runner = ModelRunnerClass(
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] self.attn_backend = get_attn_backend(
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] if torch.cuda.get_device_capability()[0] < 8:
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] prop = get_device_properties(device)
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] _lazy_init() # will define _get_device_properties
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] torch._C._cuda_init()
backend-vllm-openai-1 | ERROR 06-14 07:04:12 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
backend-vllm-openai-1 | Traceback (most recent call last):
backend-vllm-openai-1 | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
backend-vllm-openai-1 | return _run_code(code, main_globals, None,
backend-vllm-openai-1  |   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
backend-vllm-openai-1 | exec(code, run_globals)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
backend-vllm-openai-1 | engine = AsyncLLMEngine.from_engine_args(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
backend-vllm-openai-1 | engine = cls(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
backend-vllm-openai-1 | self.engine = self._init_engine(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
backend-vllm-openai-1 | return engine_class(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 222, in __init__
backend-vllm-openai-1 | self.model_executor = executor_class(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
backend-vllm-openai-1 | super().__init__(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
backend-vllm-openai-1 | super().__init__(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
backend-vllm-openai-1 | self._init_executor()
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
backend-vllm-openai-1 | self._init_workers_ray(placement_group)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 169, in _init_workers_ray
backend-vllm-openai-1 | self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
backend-vllm-openai-1 | driver_worker_output = self.driver_worker.execute_method(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
backend-vllm-openai-1 | raise e
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
backend-vllm-openai-1 | return executor(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
backend-vllm-openai-1 | self.worker = worker_class(*args, **kwargs)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
backend-vllm-openai-1 | self.model_runner = ModelRunnerClass(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
backend-vllm-openai-1 | self.attn_backend = get_attn_backend(
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
backend-vllm-openai-1 | backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
backend-vllm-openai-1 | if torch.cuda.get_device_capability()[0] < 8:
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
backend-vllm-openai-1 | prop = get_device_properties(device)
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
backend-vllm-openai-1 | _lazy_init() # will define _get_device_properties
backend-vllm-openai-1 | File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
backend-vllm-openai-1 | torch._C._cuda_init()
backend-vllm-openai-1 | RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] Traceback (most recent call last):
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] return executor(*args, **kwargs)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] return method(self, *_args, **_kwargs)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] self.worker = worker_class(*args, **kwargs)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] self.model_runner = ModelRunnerClass(
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
backend-vllm-openai-1  | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148]     self.attn_backend = get_attn_backend(
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] if torch.cuda.get_device_capability()[0] < 8:
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] prop = get_device_properties(device)
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] _lazy_init() # will define _get_device_properties
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] torch._C._cuda_init()
backend-vllm-openai-1 | (RayWorkerWrapper pid=643) ERROR 06-14 07:04:12 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
backend-vllm-openai-1 exited with code 1
Can you try installing the latest vLLM version?
@RasoulNik this is a driver issue; see #4940 (comment).
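A minimal sketch for checking whether the container can initialize CUDA at all; Error 804 in the traceback above usually points at a host driver / CUDA forward-compatibility mismatch rather than a vLLM bug. The --entrypoint override is an assumption that the vllm/vllm-openai image's default entrypoint is the API server itself.

# On the host: check the driver and CUDA versions reported by the driver.
nvidia-smi

# Inside the same image the server uses: bypass the default entrypoint
# and ask PyTorch whether it can see the GPUs. If this prints False or
# raises Error 804, the host driver needs upgrading, not vLLM.
docker run --rm --runtime nvidia --gpus all \
    --entrypoint python3 \
    vllm/vllm-openai:v0.5.0.post1 \
    -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"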
Version 0.5.0.post1 does not work for me either. I am going to try @youkaichao's suggestion.
Update Confirmation

I have updated my drivers, and now everything works properly. nvidia-smi now shows:
Fri Jun 14 09:00:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:06:00.0 Off | N/A |
| 53% 47C P2 149W / 350W | 21756MiB / 24576MiB | 29% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:07:00.0 Off | N/A |
| 54% 23C P8 13W / 350W | 4MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 11138 C python3 21742MiB |
+-----------------------------------------------------------------------------------------+

Docker Compose Configuration

services:
  vllm-openai-phi3:
    image: vllm/vllm-openai:v0.5.0.post1
    environment:
      - HUGGING_FACE_HUB_TOKEN=<>
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - 8081:8000
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      - "--model"
      - "microsoft/Phi-3-mini-128k-instruct"
      - "--max-model-len"
      - "20000"
      - "--dtype"
      - "float16"
Your current environment
🐛 Describe the bug
Phi-3 still does not seem to be supported after installing the latest vLLM.
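For reference, a minimal reproduction sketch using a plain pip install rather than Docker; the model name and flags are taken from the thread above, and the server module path matches the traceback in the logs.

# Install the latest released vLLM and try to serve Phi-3 directly.
pip install --upgrade vllm

python3 -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-mini-128k-instruct \
    --max-model-len 20000 \
    --dtype float16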