I have two questions:
- I attempted multi-GPU inference with Llama-13B (8-GPU inference on A100s). I followed the steps described in #188 (CUDA error: out of memory), first running

  ```
  $ ray start --head
  ```

  and then

  ```python
  llm = LLM(model=<your model>, tensor_parallel_size=8)
  ```

  However, I got the following error:

  ```
  (Worker pid=1027546) AssertionError: 32001 is not divisible by 8 [repeated 7x across cluster]
  ```

  Is there any way to resolve this issue? (A minimal sketch of what I ran is included after this list.)

- Additionally, is there a way to specify which GPUs are used during inference? I tried setting

  ```python
  os.environ["CUDA_VISIBLE_DEVICES"] = "2"
  ```

  but it doesn't seem to have any effect: inference still runs on the first GPU. (See the second sketch below.)
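For reference, here is a minimal sketch of roughly what I ran for the first question; the model path, prompt, and sampling parameters are placeholders:

```python
# Minimal sketch of the 8-way tensor-parallel run (after `ray start --head`).
# Model path, prompt, and sampling parameters are placeholders.
from vllm import LLM, SamplingParams

# The AssertionError (32001 is not divisible by 8) is raised here.
llm = LLM(model="/path/to/llama-13b", tensor_parallel_size=8)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```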
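And here is roughly how I tried to pin inference to a single GPU for the second question; the environment variable is set at the top of the script before vLLM is imported (my understanding is that it needs to be in place before any CUDA context is created), and the model path is again a placeholder:

```python
# Sketch of the single-GPU pinning attempt.
# CUDA_VISIBLE_DEVICES is set before importing vLLM so it is already in place
# when CUDA is initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

from vllm import LLM

llm = LLM(model="/path/to/llama-13b")  # still appears to run on GPU 0
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```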
Thanks!