Out of Memory error while doing Inference on GPU #9

@np-n

Description

Hello there,

Thank you for publishing the model on physionet.org. I downloaded the Me-LLaMA-13b-chat model to my machine, which has a 24 GB GPU, and tried to run inference on it, but I could not even load the model. The documentation says the model can be fine-tuned on as little as a 24 GB GPU, which confuses me: if the model cannot even be loaded on a 24 GB GPU, how is fine-tuning possible? I also tried loading it in 16-bit precision, but it still did not fit in 24 GB. Is there any other way to load the model on a 24 GB GPU without quantizing it?

I tried to run inference on the model with the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM 
import torch
torch.cuda.empty_cache()

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"
prompt = "I am suffering from flu, give me home remedies?"

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model from your local model directory.
tokenizer = AutoTokenizer.from_pretrained(model_file)
model = AutoModelForCausalLM.from_pretrained(model_file).to(device)

It throws an out-of-memory error while loading the model, and my GPU memory is fully utilized.

OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 81.69 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The following image shows my GPU memory usage statistics:
[image]

Note: on CPU, the model takes around 49 GB of RAM on my PC.
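
For reference, my 16-bit attempt looked roughly like this (a sketch; I am assuming torch_dtype=torch.float16 is the intended way to request half precision in from_pretrained):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_file)
# Load the weights in half precision; a 13B model is still roughly 26 GB in fp16,
# so it does not fit into 24 GB of VRAM either.
model = AutoModelForCausalLM.from_pretrained(
    model_file,
    torch_dtype=torch.float16,
).to(device)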

Please help me if you have any idea how to run the model efficiently on the 24 GB GPU, for example whether something like the CPU-offloading sketch below would be a reasonable approach. Thank you.
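
A rough sketch of what I have in mind, just to clarify the question (the device_map, max_memory, and offload_folder values are my own assumptions, not something from the Me-LLaMA documentation):

# requires: pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"
prompt = "I am suffering from flu, give me home remedies?"

tokenizer = AutoTokenizer.from_pretrained(model_file)
# Let accelerate place as many layers as fit on the GPU and keep the rest on CPU,
# instead of quantizing the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_file,
    torch_dtype=torch.float16,                 # half-precision weights
    device_map="auto",                         # split layers across GPU and CPU
    max_memory={0: "22GiB", "cpu": "48GiB"},   # leave headroom on the 24 GB card
    offload_folder="offload",                  # spill to disk if CPU RAM runs out
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))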
