Hi vLLM dev team,

Is vLLM supposed to work with MPT-30B? I tried loading it on AWS SageMaker using a ml.g5.12xlarge and even a ml.g5.48xlarge instance:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="mosaicml/mpt-30b")
```
However, in both cases I run into this error:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 294.00 MiB (GPU 0; 22.19 GiB total capacity; 21.35 GiB already allocated; 46.50 MiB free; 21.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
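
In case it is relevant: MPT-30B is roughly 60 GB of fp16 weights, so it cannot fit on a single 24 GB A10G, and by default vLLM loads the whole model onto GPU 0. Below is a minimal sketch of what I would try next, sharding the model across the four A10Gs of the ml.g5.12xlarge via vLLM's `tensor_parallel_size` option (the prompt and sampling settings are purely illustrative, and I have not verified this on my end):

```python
from vllm import LLM, SamplingParams

# Shard the ~60 GB of fp16 weights across all four 24 GB A10G GPUs of a
# ml.g5.12xlarge instead of loading everything onto GPU 0.
llm = LLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,       # one weight shard per A10G
    gpu_memory_utilization=0.90,  # fraction of each GPU vLLM may claim
)

# Illustrative generation call to confirm the model actually serves requests.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```

On a ml.g5.48xlarge the same idea with `tensor_parallel_size=8` should apply, since that instance has eight A10Gs.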