
[Bug] Memory leak on latest release 0.2.7 #2624

@alimoezzi

Description


I'm able to run TheBloke/dolphin-2.6-mixtral-8x7b-AWQ on 2x 4090s at git hash 1db83e3, but on the new release (0.2.7) I receive the following CUDA out-of-memory error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/projects/repos/vllm/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/projects/repos/vllm/vllm/engine/async_llm_engine.py", line 617, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/projects/repos/vllm/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/projects/repos/vllm/vllm/engine/async_llm_engine.py", line 366, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/projects/repos/vllm/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/home/user/projects/repos/vllm/vllm/engine/llm_engine.py", line 339, in _init_cache
    self._run_workers("warm_up_model")
  File "/home/user/projects/repos/vllm/vllm/engine/llm_engine.py", line 977, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/projects/repos/vllm/vllm/worker/worker.py", line 143, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/home/user/projects/repos/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/projects/repos/vllm/vllm/worker/model_runner.py", line 676, in capture_model
    graph_runner.capture(
  File "/home/user/projects/repos/vllm/vllm/worker/model_runner.py", line 722, in capture
    with torch.cuda.graph(self.graph, pool=memory_pool):
  File "/home/user/projects/repos/vllm/.venv/lib/python3.11/site-packages/torch/cuda/graphs.py", line 197, in __exit__
    self.cuda_graph.capture_end()
  File "/home/user/projects/repos/vllm/.venv/lib/python3.11/site-packages/torch/cuda/graphs.py", line 88, in capture_end
    super().capture_end()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Command:

RAY_memory_usage_threshold=1 python3 -m vllm.entrypoints.openai.api_server --model=TheBloke/dolphin-2.6-mixtral-8x7b-AWQ  --gpu-memory-utilization 1.0 --tensor-parallel-size 2 --max-model-len 16096
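For reference, the traceback shows the OOM occurring while CUDA graphs are being captured during model warm-up (capture_model → graph_runner.capture), which is new in this release. As a possible, untested workaround, the same command could be run with graph capture disabled via --enforce-eager, or with --gpu-memory-utilization lowered slightly to leave headroom for the graph memory pool (the 0.95 below is only an example value):

RAY_memory_usage_threshold=1 python3 -m vllm.entrypoints.openai.api_server --model=TheBloke/dolphin-2.6-mixtral-8x7b-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16096 --enforce-eager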
