Description
I successfully installed vLLM in WSL2, but when I tried to run the sample code below, I got a CUDA out-of-memory error:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
The error output:

```
INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
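From the traceback, the allocation that fails is the CPU KV cache (`allocate_cpu_cache` → `torch.empty`), which I believe requests pinned (page-locked) host memory, and WSL2 seems to limit how much pinned memory a process can allocate. A minimal standalone check, assuming the failure really is the pinned allocation, might look like the sketch below; only the CPU block count (7281) comes from the log above, the per-block shape is just a placeholder:

```python
# Standalone check: can a pinned (page-locked) host buffer of roughly this
# size be allocated under WSL2? Only num_cpu_blocks comes from the log above;
# the per-block shape is a hypothetical placeholder.
import torch

num_cpu_blocks = 7281        # from "# CPU blocks: 7281" in the log
block_shape = (16, 12, 64)   # hypothetical per-block KV shape

try:
    buf = torch.empty(
        (num_cpu_blocks, *block_shape),
        dtype=torch.float16,
        pin_memory=True,     # pinned allocation, which appears to be what fails
    )
    print(f"pinned allocation ok: {buf.numel() * 2 / 1e6:.1f} MB")
except RuntimeError as e:
    print(f"pinned allocation failed: {e}")
```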
Environment:
- Python: 3.10.11
- GPU: RTX 3090 (24 GB)
- OS: WSL2, Ubuntu 20.04.6 LTS
Can anyone help with this?