I'm able to run TheBloke/dolphin-2.6-mixtral-8x7b-AWQ on 2x 4090s at git hash 1db83e3, but with the new release I get the following CUDA out-of-memory error:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/user/projects/repos/vllm/vllm/entrypoints/openai/api_server.py", line 217, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/projects/repos/vllm/vllm/engine/async_llm_engine.py", line 617, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/projects/repos/vllm/vllm/engine/async_llm_engine.py", line 321, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/projects/repos/vllm/vllm/engine/async_llm_engine.py", line 366, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/projects/repos/vllm/vllm/engine/llm_engine.py", line 112, in __init__
self._init_cache()
File "/home/user/projects/repos/vllm/vllm/engine/llm_engine.py", line 339, in _init_cache
self._run_workers("warm_up_model")
File "/home/user/projects/repos/vllm/vllm/engine/llm_engine.py", line 977, in _run_workers
driver_worker_output = getattr(self.driver_worker,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/projects/repos/vllm/vllm/worker/worker.py", line 143, in warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/home/user/projects/repos/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/projects/repos/vllm/vllm/worker/model_runner.py", line 676, in capture_model
graph_runner.capture(
File "/home/user/projects/repos/vllm/vllm/worker/model_runner.py", line 722, in capture
with torch.cuda.graph(self.graph, pool=memory_pool):
File "/home/user/projects/repos/vllm/.venv/lib/python3.11/site-packages/torch/cuda/graphs.py", line 197, in __exit__
self.cuda_graph.capture_end()
File "/home/user/projects/repos/vllm/.venv/lib/python3.11/site-packages/torch/cuda/graphs.py", line 88, in capture_end
super().capture_end()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Command:
RAY_memory_usage_threshold=1 python3 -m vllm.entrypoints.openai.api_server --model=TheBloke/dolphin-2.6-mixtral-8x7b-AWQ --gpu-memory-utilization 1.0 --tensor-parallel-size 2 --max-model-len 16096
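For reference, the traceback shows the OOM happening inside model_runner.capture_model, i.e. during CUDA graph capture in warm_up_model. Assuming the flags are unchanged in the new release, one way to check whether graph capture is the culprit is to skip it with --enforce-eager and leave a little headroom in --gpu-memory-utilization, e.g.:
RAY_memory_usage_threshold=1 python3 -m vllm.entrypoints.openai.api_server --model=TheBloke/dolphin-2.6-mixtral-8x7b-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16096 --enforce-eager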