[MPT-30B] OutOfMemoryError: CUDA out of memory #372

Closed
mspronesti opened this issue Jul 5, 2023 · 9 comments
Labels: bug (Something isn't working)

@mspronesti (Contributor) commented Jul 5, 2023

Hi vLLM dev team,
is vLLM supposed to work with MPT-30B? I tried loading it on AWS SageMaker using an ml.g5.12xlarge and even an ml.g5.48xlarge instance.

from vllm import LLM, SamplingParams

llm = LLM(model="mosaicml/mpt-30b")

However, in both cases I run into this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 294.00 MiB (GPU 0; 22.19 GiB total capacity; 21.35 GiB already allocated; 46.50 MiB free; 21.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@WilliamTambellini

Does vLLM support PyTorch DataParallel (https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html)?

@zhuohan123 (Member)

@mspronesti Can you try distributed inference following this guide? https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html
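
For reference, a minimal sketch of what that guide amounts to for this model (a sketch, not the guide's exact example; it assumes 4 visible GPUs, e.g. the A10Gs of a g5.12xlarge, and that ray is installed for multi-GPU serving):

from vllm import LLM, SamplingParams

# Shard the model weights across 4 GPUs via tensor parallelism
# (assumption: 4 GPUs are visible on the instance).
llm = LLM(model="mosaicml/mpt-30b", tensor_parallel_size=4)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)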

@Joejoequ commented Jul 6, 2023

I ran into the same problem and fixed it by using LLM(model="", tokenizer_mode="slow").
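
For this issue's model, that would look roughly like the sketch below (the model name is filled in here only for illustration; tokenizer_mode="slow" falls back to the non-fast Hugging Face tokenizer):

from vllm import LLM

# Fall back to the slow tokenizer instead of the fast one.
llm = LLM(model="mosaicml/mpt-30b", tokenizer_mode="slow")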

@mspronesti (Contributor, Author) commented Jul 6, 2023

@zhuohan123 thanks for your quick reply. Installing ray and setting tensor_parallel_size=4 (or =8 on bigger instances) yields

RayActorError: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=22291, ip=172.16.94.76, actor_id=9773ec3febd4775b0cb0bed101000000, repr=<vllm.worker.worker.Worker object at 0x7f4511af7e80>)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py", line 40, in __init__
    _init_distributed_environment(parallel_config, rank,
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py", line 307, in _init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
    work = default_pg.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'

@Joejoequ setting tokenizer_mode='slow' gives:

ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.

@nearmax-p

Same for me. Actually, setting tensor_parallel_size works for me on 2 A100 GPUs. However, after the LLM engine says it is starting, it never finishes the setup process.

@BEpresent commented Jul 11, 2023

Also got the same PYTORCH_CUDA_ALLOC_CONF error on an A100 40GB GPU for several WizardLM 33B models (both quantized and non-quantized). Should I open a new issue for this since it's not an MPT model? The model runs on that GPU, e.g. using Exllama.

@FarziBuilder

Same here. I am able to load LLaMA 65B locally using this notebook: https://twitter.com/m_ryabinin/status/1679217067310960645?s=20

But I am unable to run it with vLLM.

@zhuohan123 added the bug label on Jul 18, 2023
@Gourang97

@mspronesti, setting os.environ["NCCL_IGNORE_DISABLED_P2P"] = '1' should resolve the issue.
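
A minimal sketch of that workaround (assuming the variable needs to be set before vLLM spawns its distributed workers, and reusing the tensor-parallel setup suggested above):

import os

# Ask NCCL to ignore the missing peer-to-peer support between these GPUs;
# set this before vLLM initializes the distributed workers.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM

llm = LLM(model="mosaicml/mpt-30b", tensor_parallel_size=4)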

@hmellor (Collaborator) commented Mar 6, 2024

Closing as stale.

The original issue was due to insufficient GPU memory on a single GPU; it should be solvable using tensor parallelism, as mentioned by @zhuohan123.

@hmellor closed this as completed on Mar 6, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this issue Oct 15, 2024