Closed
Description
Your current environment
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35
Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1041-nvidia-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 525.147.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7742 64-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (32 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] torchaudio==2.1.2
[pip3] torchvision==0.16.2
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] nvidia-nccl-cu12 2.19.3 pypi_0 pypi
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.2.1 pypi_0 pypi
[conda] torchaudio 2.1.2 py310_cu121 pytorch
[conda] torchvision 0.16.2 py310_cu121 pytorch
[conda] triton 2.2.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS
NIC7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS
NIC8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
🐛 Describe the bug
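Phi-3 still does not appear to be supported after installing the latest vLLM (0.4.1): loading the model fails with a ValueError for the Phi3ForCausalLM architecture.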
from vllm import LLM

model_id = "microsoft/Phi-3-mini-4k-instruct"
llm = LLM(model=model_id, trust_remote_code=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 11
---> 11 llm = LLM(model=model_id, trust_remote_code=True)
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py:118, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, disable_custom_all_reduce, **kwargs)
98 kwargs["disable_log_stats"] = True
99 engine_args = EngineArgs(
100 model=model,
101 tokenizer=tokenizer,
(...)
116 **kwargs,
117 )
--> 118 self.llm_engine = LLMEngine.from_engine_args(
119 engine_args, usage_context=UsageContext.LLM_CLASS)
120 self.request_counter = Counter()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py:277, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
274 executor_class = GPUExecutor
276 # Create the LLM engine.
--> 277 engine = cls(
278 **engine_config.to_dict(),
279 executor_class=executor_class,
280 log_stats=not engine_args.disable_log_stats,
281 usage_context=usage_context,
282 )
283 return engine
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py:148, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
144 self.seq_counter = Counter()
145 self.generation_config_fields = _load_generation_config_dict(
146 model_config)
--> 148 self.model_executor = executor_class(
149 model_config=model_config,
150 cache_config=cache_config,
151 parallel_config=parallel_config,
152 scheduler_config=scheduler_config,
153 device_config=device_config,
154 lora_config=lora_config,
155 vision_language_config=vision_language_config,
156 speculative_config=speculative_config,
157 load_config=load_config,
158 )
160 self._initialize_kv_caches()
162 # If usage stat is enabled, collect relevant info.
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py:41, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config)
38 self.vision_language_config = vision_language_config
39 self.speculative_config = speculative_config
---> 41 self._init_executor()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:22, in GPUExecutor._init_executor(self)
16 """Initialize the worker and load the model.
17
18 If speculative decoding is enabled, we instead create the speculative
19 worker.
20 """
21 if self.speculative_config is None:
---> 22 self._init_non_spec_worker()
23 else:
24 self._init_spec_worker()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:51, in GPUExecutor._init_non_spec_worker(self)
36 self.driver_worker = Worker(
37 model_config=self.model_config,
38 parallel_config=self.parallel_config,
(...)
48 is_driver_worker=True,
49 )
50 self.driver_worker.init_device()
---> 51 self.driver_worker.load_model()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py:117, in Worker.load_model(self)
116 def load_model(self):
--> 117 self.model_runner.load_model()
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py:162, in ModelRunner.load_model(self)
160 def load_model(self) -> None:
161 with CudaMemoryProfiler() as m:
--> 162 self.model = get_model(
163 model_config=self.model_config,
164 device_config=self.device_config,
165 load_config=self.load_config,
166 lora_config=self.lora_config,
167 vision_language_config=self.vision_language_config,
168 parallel_config=self.parallel_config,
169 scheduler_config=self.scheduler_config,
170 )
172 self.model_memory_usage = m.consumed_memory
173 logger.info(f"Loading model weights took "
174 f"{self.model_memory_usage / float(2**30):.4f} GB")
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, vision_language_config)
13 def get_model(
14 *, model_config: ModelConfig, load_config: LoadConfig,
15 device_config: DeviceConfig, parallel_config: ParallelConfig,
16 scheduler_config: SchedulerConfig, lora_config: Optional[LoRAConfig],
17 vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
18 loader = get_model_loader(load_config)
---> 19 return loader.load_model(model_config=model_config,
20 device_config=device_config,
21 lora_config=lora_config,
22 vision_language_config=vision_language_config,
23 parallel_config=parallel_config,
24 scheduler_config=scheduler_config)
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:222, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, vision_language_config, parallel_config, scheduler_config)
220 with set_default_torch_dtype(model_config.dtype):
221 with torch.device(device_config.device):
--> 222 model = _initialize_model(model_config, self.load_config,
223 lora_config, vision_language_config)
224 model.load_weights(
225 self._get_weights_iterator(model_config.model,
226 model_config.revision,
(...)
229 "fall_back_to_pt_during_load",
230 True)), )
231 for _, module in model.named_modules():
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:87, in _initialize_model(model_config, load_config, lora_config, vision_language_config)
82 def _initialize_model(
83 model_config: ModelConfig, load_config: LoadConfig,
84 lora_config: Optional[LoRAConfig],
85 vision_language_config: Optional[VisionLanguageConfig]) -> nn.Module:
86 """Initialize a model with the given configurations."""
---> 87 model_class = get_model_architecture(model_config)[0]
88 linear_method = _get_linear_method(model_config, load_config)
90 return model_class(config=model_config.hf_config,
91 linear_method=linear_method,
92 **_get_model_initialization_kwargs(
93 model_class, lora_config, vision_language_config))
File ~/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py:35, in get_model_architecture(model_config)
33 if model_cls is not None:
34 return (model_cls, arch)
---> 35 raise ValueError(
36 f"Model architectures {architectures} are not supported for now. "
37 f"Supported architectures: {ModelRegistry.get_supported_archs()}")
ValueError: Model architectures ['Phi3ForCausalLM'] are not supported for now.
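For anyone hitting the same error, the check that seems to fail at the end of the traceback can be sketched roughly as below. This is an illustration only, not vLLM's actual code: the helper names `first_supported_arch` and `check_architectures` are hypothetical, and the supported list in the example is made up. In a real install the supported list would presumably come from `ModelRegistry.get_supported_archs()`, the call visible in the traceback, and the architecture names from the model's config.

```python
from typing import Iterable, List, Optional


def first_supported_arch(architectures: Iterable[str],
                         supported: Iterable[str]) -> Optional[str]:
    """Return the first architecture name found in `supported`, else None.

    Hypothetical helper: it only imitates the lookup suggested by the
    traceback and does not import or call vLLM.
    """
    supported_set = set(supported)
    for arch in architectures:
        if arch in supported_set:
            return arch
    return None


def check_architectures(architectures: List[str],
                        supported: Iterable[str]) -> str:
    """Return a supported architecture or raise ValueError if none match."""
    arch = first_supported_arch(architectures, supported)
    if arch is None:
        raise ValueError(
            f"Model architectures {architectures} are not supported for now.")
    return arch


if __name__ == "__main__":
    # Made-up supported list for illustration; a real list would come from
    # the installed vLLM build (assumption: via
    # ModelRegistry.get_supported_archs(), as named in the traceback).
    example_supported = ["LlamaForCausalLM", "PhiForCausalLM"]
    try:
        check_architectures(["Phi3ForCausalLM"], example_supported)
    except ValueError as err:
        print(f"would fail before loading weights: {err}")
```

Running a check like this before constructing `LLM(...)` would show whether the installed build knows the architecture at all, which separates a missing-model-support problem from an environment or GPU problem.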