
CUDA error: out of memory #188

Closed

Description

@SunixLiu

I successfully installed vLLM in WSL2, but when I tried to run the sample code, I got the following error:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM from a local copy of facebook/opt-125m.
llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
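
Note that the failing call is in allocate_cpu_cache, i.e. the pinned (page-locked) host allocation for the CPU swap cache, not a GPU allocation: 7281 CPU blocks at roughly 576 KiB each for opt-125m (assuming the default block size of 16) comes out to about 4 GiB, which matches vLLM's default swap_space. WSL2 caps CUDA pinned host memory well below system RAM, so this can fail even with plenty of VRAM free. A minimal sketch to check whether the pinned-memory limit is the culprit (the 4 GiB figure assumes the default swap_space):

import torch

# Try to allocate the same amount of pinned (page-locked) host memory that
# vLLM's CacheEngine requests for its CPU swap cache. 4 GiB matches the
# default swap_space; adjust if you changed it. On WSL2 the pinned-memory
# pool is capped, so this can raise "CUDA error: out of memory" even though
# plenty of GPU memory is free.
gib = 4
x = torch.empty(gib * 1024**3 // 2, dtype=torch.float16, pin_memory=True)
print(f"Allocated {x.numel() * x.element_size() / 1024**3:.1f} GiB of pinned memory")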

Python: 3.10.11
GPU: RTX 3090 (24 GB)
OS: WSL2, Ubuntu 20.04.6 LTS
Can anyone help with this?
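
If the pinned allocation is indeed the limit, shrinking the CPU swap cache might be enough of a workaround. A sketch of what I plan to try, assuming LLM() forwards swap_space (in GiB) to the engine arguments, which I have not verified on this version:

from vllm import LLM

# Hypothetical workaround: request a smaller CPU swap cache so the pinned
# host allocation fits under WSL2's pinned-memory cap. swap_space is in GiB
# (vLLM's default is 4); whether LLM() accepts it here is an assumption.
llm = LLM(
    model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m",
    swap_space=1,
)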
