Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
`distributed_executor_backend="mp"` is now enabled by default in vLLM. However, this backend is currently incompatible with GPTQ quantization when `tensor_parallel_size > 1`, because of the order in which torch initializes CUDA: we get the classic `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`.

Setting `distributed_executor_backend="ray"` works for GPTQ.
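For context, a minimal standalone sketch of the underlying limitation (not vLLM code; it only assumes a machine with a CUDA GPU): once the parent process has initialized CUDA, a forked child cannot use CUDA again, while a spawned child can.

```python
# Minimal sketch of the fork/CUDA limitation (assumes a CUDA-capable GPU).
# Not vLLM code; only torch and the standard library are used.
import multiprocessing as mp

import torch


def worker():
    # In a forked child this raises:
    #   RuntimeError: Cannot re-initialize CUDA in forked subprocess. ...
    # In a spawned child it works, because the child starts with a fresh
    # interpreter and no inherited CUDA state.
    print(torch.cuda.get_device_capability())


if __name__ == "__main__":
    torch.cuda.init()  # parent touches CUDA first, as the gptq_marlin check
                       # appears to do before the workers are created
    ctx = mp.get_context("fork")  # change to "spawn" and the child succeeds
    p = ctx.Process(target=worker)
    p.start()
    p.join()
```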
The following fails:

```python
from vllm import LLM

MODEL_NAME = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ"
TENSOR_PARALLEL_SIZE = 2

model = LLM(MODEL_NAME,
            enforce_eager=True,
            tensor_parallel_size=TENSOR_PARALLEL_SIZE,
            distributed_executor_backend="mp")
print(model.generate("The best thing about the internet is")[0].outputs[0].text)
```
with:
```
(vllm-upstream) rshaw@beaker:~/vllm$ python3 run.py
INFO 06-30 14:26:18 gptq_marlin.py:140] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-30 14:26:18 llm_engine.py:169] Initializing an LLM engine (v0.5.0.post1) with config: model='TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ', speculative_config=None, tokenizer='TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ)
(VllmWorkerProcess pid=3438507) Process VllmWorkerProcess:
(VllmWorkerProcess pid=3438507) Traceback (most recent call last):
(VllmWorkerProcess pid=3438507) File "/home/rshaw/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=3438507) self.run()
(VllmWorkerProcess pid=3438507) File "/home/rshaw/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=3438507) self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
(VllmWorkerProcess pid=3438507) worker = worker_factory()
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm/vllm/executor/gpu_executor.py", line 67, in _create_worker
(VllmWorkerProcess pid=3438507) wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm/vllm/worker/worker_base.py", line 311, in init_worker
(VllmWorkerProcess pid=3438507) self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm/vllm/worker/worker.py", line 86, in __init__
(VllmWorkerProcess pid=3438507) self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm/vllm/worker/model_runner.py", line 196, in __init__
(VllmWorkerProcess pid=3438507) self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm/vllm/attention/selector.py", line 45, in get_attn_backend
(VllmWorkerProcess pid=3438507) backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm/vllm/attention/selector.py", line 151, in which_attn_to_use
(VllmWorkerProcess pid=3438507) if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm-upstream/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(VllmWorkerProcess pid=3438507) prop = get_device_properties(device)
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm-upstream/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(VllmWorkerProcess pid=3438507) _lazy_init() # will define _get_device_properties
(VllmWorkerProcess pid=3438507) File "/home/rshaw/vllm-upstream/lib/python3.10/site-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=3438507) raise RuntimeError(
(VllmWorkerProcess pid=3438507) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
ERROR 06-30 14:26:19 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3438507 died, exit code: 1
INFO 06-30 14:26:19 multiproc_worker_utils.py:123] Killing local vLLM worker processes
```
Workarounds for now:

- Set `distributed_executor_backend="ray"`
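For reference, a sketch of that workaround applied to the repro script above; only the executor backend changes.

```python
from vllm import LLM

MODEL_NAME = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ"
TENSOR_PARALLEL_SIZE = 2

# Same script as the failing one above, but with the Ray executor, which
# starts workers through Ray instead of forking the engine process and so
# avoids the CUDA re-initialization error.
model = LLM(MODEL_NAME,
            enforce_eager=True,
            tensor_parallel_size=TENSOR_PARALLEL_SIZE,
            distributed_executor_backend="ray")
print(model.generate("The best thing about the internet is")[0].outputs[0].text)
```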