Out of Memory error while doing Inference on GPU #9

@np-n

Description

Hello there,

Thank you for publishing the model on physionet.org. I downloaded the Me-LLaMA-13b-chat model to my machine, which has a 24 GB GPU, and tried to run inference on it, but I could not even load the model. The documentation says the model can be fine-tuned on as little as a 24 GB GPU, which confuses me: if the model cannot even be loaded on a 24 GB GPU, how is fine-tuning possible? I also tried loading it in 16-bit precision, but it still did not fit in 24 GB. Is there any other way to load the model on a 24 GB GPU without quantizing it?

I tried to run inference on the model with the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM 
import torch
torch.cuda.empty_cache()

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"
prompt = "I am suffering from flu, give me home remedies?"

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model from your local model directory.
tokenizer = AutoTokenizer.from_pretrained(model_file)
model = AutoModelForCausalLM.from_pretrained(model_file).to(device)

It throws an out-of-memory error while loading the model, and my GPU memory is fully utilized.

OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 81.69 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The following image shows my GPU memory usage statistics:
[image]

Note: on CPU, the model takes around 49 GB of RAM on my PC.
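
For reference, my 16-bit attempt looked roughly like this (a sketch; I am assuming torch_dtype=torch.float16 is the intended way to request half precision in from_pretrained):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_file)
# Load the weights in half precision; a 13B model is still roughly 26 GB in fp16,
# so it does not fit into 24 GB of VRAM either.
model = AutoModelForCausalLM.from_pretrained(
    model_file,
    torch_dtype=torch.float16,
).to(device)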

Please help me if you have any idea how to run the model efficiently on the 24 GB GPU, for example whether something like the CPU-offloading sketch below would be a reasonable approach. Thank you.
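
A rough sketch of what I have in mind, just to clarify the question (the device_map, max_memory, and offload_folder values are my own assumptions, not something from the Me-LLaMA documentation):

# requires: pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"
prompt = "I am suffering from flu, give me home remedies?"

tokenizer = AutoTokenizer.from_pretrained(model_file)
# Let accelerate place as many layers as fit on the GPU and keep the rest on CPU,
# instead of quantizing the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_file,
    torch_dtype=torch.float16,                 # half-precision weights
    device_map="auto",                         # split layers across GPU and CPU
    max_memory={0: "22GiB", "cpu": "48GiB"},   # leave headroom on the 24 GB card
    offload_folder="offload",                  # spill to disk if CPU RAM runs out
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))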
