TGI crash on multi GPUs #2207

Closed
2 of 4 tasks
RohanSohani30 opened this issue Jul 9, 2024 · 11 comments
Labels
bug (Something isn't working), Stale

Comments

@RohanSohani30

System Info

I am trying to run TGI in Docker using 8 GPUs with 16 GB each (in-house server). Docker works fine when using a single GPU.
My server crashes when using all GPUs. Is there any other way to do this?
PS: I need to use all GPUs so I can load big models. With a single GPU I can only use small models with a smaller max-input-length.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. docker run --gpus all --name tgi --shm-size 1g --cpus="5.0" --rm --runtime=nvidia \
       -e HUGGING_FACE_HUB_TOKEN=******* -p 8060:80 -v '$PATH':/data \
       ghcr.io/huggingface/text-generation-inference \
       --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
       --max-input-length 14000 --max-batch-prefill-tokens 14000 --max-total-tokens 16000

Expected behavior

INFO text_generation_router: router/src/main.rs:242: Using the Hugging Face API to retrieve tokenizer config
INFO text_generation_router: router/src/main.rs:291: Warming up model
WARN text_generation_router: router/src/main.rs:306: Model does not support automatic max batch total tokens
INFO text_generation_router: router/src/main.rs:328: Setting max batch total tokens to 16000
INFO text_generation_router: router/src/main.rs:329: Connected

@bwhartlove

Seeing a similar issue on my end.

@Hugoch
Member

Hugoch commented Jul 10, 2024

@RohanSohani30 Can you share the output of TGI when it errors?

@Hugoch added the question (Further information is requested) label on Jul 10, 2024
@HoKim98

HoKim98 commented Jul 11, 2024

I had a similar problem to #2192, and was able to work around it by passing --cuda-graphs 0 as in #2099. This obviously caused a major performance hit, but it was at least better than being broken.
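
For reference, a minimal sketch of that workaround applied to the reproduction command above (only --cuda-graphs 0 is new; the token and volume path are placeholders):

```shell
docker run --gpus all --name tgi --shm-size 1g --rm --runtime=nvidia \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
  --cuda-graphs 0  # disable CUDA graph capture, trading some throughput for stability
```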

@RohanSohani30
Author

@RohanSohani30 Can you share the output of TGI when it errors?

There are no errors, but the system crashes while warming up the model.

@Hugoch
Member

Hugoch commented Jul 13, 2024

Yeah seems related to CUDA graphs and a bug introduced in NCCL 2.20.5. Can you retry with the latest docker image as #2099 was merged?

@Hugoch added the bug (Something isn't working) label and removed the question (Further information is requested) label on Jul 13, 2024
@RohanSohani30
Author

Yeah seems related to CUDA graphs and a bug introduced in NCCL 2.20.5. Can you retry with the latest docker image as #2099 was merged?

I am using the latest Docker image and still facing the same issue.
I found one quick fix: with 2 or 4 GPUs, the TGI image runs when I pass --runtime=nvidia --env BUILD_EXTENSIONS=False --env NCCL_SHM_DISABLE=1 (sketched below).
But here is a catch: let's say a model takes 11 GB of GPU memory to load on a single GPU. When I use 2 GPUs it takes more than 11 GB per GPU, so the total goes above 25 GB. I used --cuda-memory-fraction to limit GPU usage per GPU.
I want to load a model across multiple GPUs so I can load big models.
Am I missing something?
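
For reference, the multi-GPU workaround described above would look roughly like the sketch below. It is illustrative only: the token, port, and volume path are placeholders, and 0.8 is an example value for --cuda-memory-fraction rather than a recommendation from this thread.

```shell
docker run --gpus '"device=0,1"' --name tgi --shm-size 1g --rm --runtime=nvidia \
  --env BUILD_EXTENSIONS=False --env NCCL_SHM_DISABLE=1 \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 2 \
  --cuda-memory-fraction 0.8  # cap per-GPU memory usage; tune to your cards
```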

@Hugoch
Member

Hugoch commented Jul 18, 2024

If disabling SHM solves the issue, it means there is a problem in the way your system handles SHM. How much RAM do you have on the machine?
If I understand correctly, loading the model on 1 GPU takes 11 GB of VRAM, while you OOM when using 2 GPUs?
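
(One hedged suggestion, not a confirmed fix: NCCL uses /dev/shm inside the container, so before disabling SHM outright it can be worth giving the container more shared memory than the 1g in the reproduction command, e.g.:)

```shell
# Illustrative only: the reproduction command with a larger /dev/shm for NCCL; 4g is an example value.
docker run --gpus all --name tgi --shm-size 4g --rm --runtime=nvidia \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 8
```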

@HoKim98 did the latest Docker image make it work?

@HoKim98

HoKim98 commented Jul 18, 2024

@Hugoch It seems to be working! I ran a 10-minute stress test and no errors were found.

@RohanSohani30
Author

If disabling SHM solves the issue, it means there is a problem in the way your system handles SHM. How much RAM do you have on the machine? If I understand correctly, loading the model on 1 GPU takes 11 GB of VRAM, while you OOM when using 2 GPUs?

@HoKim98 did the latest Docker image make it work?

1 TB RAM with 8 × 16 GB VRAM.
While using 2 GPUs I am getting OOM if the model is big (above 22B).
There is another scenario: when I load a model using the TGI CLI, I can load big models on all 8 GPUs without OOM, but tokens per second are very low, less than 1. Using the CLI, the model is distributed across all GPUs.
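
For clarity, the CLI path mentioned above is sketched below, assuming the launcher binary (text-generation-launcher) is installed locally; the flags simply mirror the Docker reproduction command:

```shell
text-generation-launcher \
  --model-id meta-llama/Meta-Llama-3-8B \
  --num-shard 8 \
  --max-input-length 14000 --max-batch-prefill-tokens 14000 --max-total-tokens 16000
```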

@Hugoch
Member

Hugoch commented Jul 19, 2024

Llama3-8B has a context of 8k, so you probably want to reduce max-total-tokens and max-input-length. Try setting a low max-batch-total-tokens to check whether you can load the model. If that works, you can incrementally increase it until you hit OOM.
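
Concretely, that suggestion would look something like the sketch below; the 8000/8192 values are illustrative, sized to Llama3-8B's 8k context rather than taken from this thread:

```shell
docker run --gpus all --name tgi --shm-size 1g --rm --runtime=nvidia \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
  --max-input-length 8000 --max-total-tokens 8192 \
  --max-batch-prefill-tokens 8000 --max-batch-total-tokens 8192
```

If the model loads, increase --max-batch-total-tokens step by step until you hit OOM, then back off.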

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions bot added the Stale label on Aug 19, 2024
@github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Aug 25, 2024