Add documentation for running inference on multiple GPUs #20
I've seen this issue when running out of GPU RAM. Unfortunately, the model requires an A100 80GB right now. Are you using an A100 40GB? |
Yeah! It's 40 GB, but I have 8 of them. Can I use them together to avoid this issue? The problem occurs after loading both the model and the retrieval index, when I type out the prompt. |
For inference, I saw that some folks on Discord were able to run on multiple cards in this thread. I haven't had a chance to try it myself. For the retrieval index, you can control which GPU the index is loaded on by modifying this line, I believe. @LorrinWWW, any other advice? |
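To illustrate controlling where the index lands: a minimal sketch, assuming the retrieval index is a FAISS index loaded from disk (the file path and GPU id below are placeholders, not the repo's actual code):
import faiss
# Load the index on CPU first, then move it to the GPU of your choice,
# e.g. a card that is not already holding part of the language model.
cpu_index = faiss.read_index("wikipedia.index")  # placeholder path
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 1, cpu_index)  # 1 = target GPU id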
Can you share an invite link to the Discord server? I can't access the thread. |
Of course! https://discord.gg/9Rk6sSeWEG |
@satpalsr I can use multiple GPUs with the following snippet:
import torch
from transformers import AutoConfig, AutoTokenizer
from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory
tokenizer = AutoTokenizer.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B')
model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B')
max_memory = get_balanced_memory(
model,
max_memory=None,
no_split_module_classes=["GPTNeoXLayer"],
dtype='float16',
low_zero=False,
)
device_map = infer_auto_device_map(
model,
max_memory=max_memory,
no_split_module_classes=["GPTNeoXLayer"],
dtype='float16'
)
model = dispatch_model(model, device_map=device_map)
But I recommend just using two A100 40G cards, because adding more doesn't provide additional acceleration. |
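As a usage sketch for the dispatched model above (the prompt text and sampling settings are illustrative, not from the original comment):
inputs = tokenizer("<human>: Hello, how are you?\n<bot>:", return_tensors="pt").to("cuda:0")
# The inputs go to the first GPU in the device map (typically where the
# embeddings land); activations are moved between GPUs automatically.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))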
I'm re-purposing this issue to track adding multi-GPU inference documentation to the repo. |
Can I use 4 T4 GPUs? Is the memory of 4 T4s enough? Thanks. |
@LorrinWWW I use the same code to run inference on the model, but I still get the error. Does that mean I need to use more GPUs, like 4 A100 40G?
|
Hey @trouble-maker007, I changed my torch version to solve it.
I also ran into a bunch of other issues later; I'm listing them here in case you face the same ones.
Solution:
Issue:
Solution:
Issue:
Solution:
Issue:
Solution:
Issue:
Solution:
|
You can run inference with multiple GPUs using
The environment is
I have 2 3090 cards available, and it costs about 42G of CUDA memory with the above script. The launch cmd is But it seems that the QA ability is poor and inference takes a long time (30-40 secs with max_new_tokens=256).
Also, part of the answer repeats near the end. |
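The script and environment details above were elided in this thread. As a rough sketch of two-card inference, assuming Accelerate-style automatic placement (not necessarily what @better629 actually ran):
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-NeoXT-Chat-Base-20B")
# device_map="auto" shards the fp16 weights across all visible GPUs,
# e.g. roughly 42 GB split over two 3090s as reported above.
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    torch_dtype=torch.float16,
    device_map="auto",
)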
@better629 My inference is also slow, though I only use a single RTX-8000 GPU. I even load the model using |
@jasontian6666 The RTX-8000 has 48G of GPU memory, which is enough to load the model even without |
@better629 The original model is 48GB, so for a single GPU I think I would need something like an A100-80GB. Loading it in |
@better629 This works for me. My environment is
|
Did you fine-tune the model successfully on a single RTX-8000 GPU? I have 4 RTX 6000 GPUs, but I get a CUDA out-of-memory error when running bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh. Could you please tell me what I should do? Thanks! |
Any updates? |
@csris any pointers on where I can find an 8x 80GB A100 instance type in the cloud? I checked Lambda Labs and AWS and can't seem to find it. What do you use? |
Similar to what @Zaoyee found, I am observing that GPU memory accumulates with every batch of inference. I tried |
@satpalsr We've added new documentation and options for running inference on multiple GPUs, specific GPUs, and consumer hardware! To run a 40 GB model (GPT-NeoXT-Chat-Base-20B) on 8 GPUs, I would recommend adding If you're running this on fewer than 8 GPUs, make sure that the Total VRAM > size of the model.
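A quick sanity check for the "total VRAM > model size" rule of thumb (a small helper sketch, not something from the repo):
import torch
total_gib = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
) / 1024**3
# GPT-NeoXT-Chat-Base-20B needs roughly 40 GB of weights in fp16, so this
# sum should comfortably exceed that before attempting a pure-GPU load.
print(f"Total VRAM across visible GPUs: {total_gib:.1f} GiB")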
@jasontian6666 @Zaoyee @jimmychou0704 If you find yourselves running out of VRAM, read the updated docs and add Note that this method can be slow, but it works. |
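The exact bot.py flag isn't quoted above, but the general CPU-offload idea can be sketched with Transformers/Accelerate directly; the memory caps below are illustrative assumptions, not recommended values:
import torch
from transformers import AutoModelForCausalLM
# Cap how much of the model may live on GPU 0 and spill the rest to CPU RAM.
# This makes the model fit on smaller cards at the cost of generation speed.
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "60GiB"},
)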
I have noticed that VRAM goes up by 100-200 MiB per prompt. I will look into what can be done, but for now you should be able to offload parts of the model to CPU RAM to make room and keep running. P.S. Upgrading |
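Until the accumulation is tracked down, a common general-purpose mitigation (not a confirmed fix for this specific leak) is to generate under no_grad and clear the CUDA cache between prompts:
import torch
# model and tokenizer as loaded in the snippets above
def answer(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    with torch.no_grad():  # don't keep autograd state alive between prompts
        outputs = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    torch.cuda.empty_cache()  # release cached blocks between prompts
    return text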
@Zaoyee I noticed the GPU memory accumulation too; I'll check it out. |
Should we close this issue and open a new one for the GPU memory accumulation? I believe this one is solved. |
While trying out
python inference/bot.py --retrieval --model togethercomputer/GPT-NeoXT-Chat-Base-20B
I got this error on an A100 GPU: