Description
I successfully installed vLLM in WSL2, but when I tried to run the sample code below, I got a CUDA out-of-memory error:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
The error output:

```
INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
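From the traceback, the allocation that fails is the CPU KV cache (`allocate_cpu_cache` → `torch.empty`), which I believe requests pinned (page-locked) host memory, and WSL2 seems to limit how much pinned memory a process can allocate. A minimal standalone check, assuming the failure really is the pinned allocation, might look like the sketch below; only the CPU block count (7281) comes from the log above, the per-block shape is just a placeholder:

```python
# Standalone check: can a pinned (page-locked) host buffer of roughly this
# size be allocated under WSL2? Only num_cpu_blocks comes from the log above;
# the per-block shape is a hypothetical placeholder.
import torch

num_cpu_blocks = 7281        # from "# CPU blocks: 7281" in the log
block_shape = (16, 12, 64)   # hypothetical per-block KV shape

try:
    buf = torch.empty(
        (num_cpu_blocks, *block_shape),
        dtype=torch.float16,
        pin_memory=True,     # pinned allocation, which appears to be what fails
    )
    print(f"pinned allocation ok: {buf.numel() * 2 / 1e6:.1f} MB")
except RuntimeError as e:
    print(f"pinned allocation failed: {e}")
```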
Environment:
- Python: 3.10.11
- GPU: RTX 3090 (24 GB)
- OS: WSL2, Ubuntu 20.04.6 LTS
Can anyone help with this?