TGI crash on multi GPUs #2207

Closed
2 of 4 tasks
RohanSohani30 opened this issue Jul 9, 2024 · 11 comments
Labels
bug (Something isn't working), Stale

Comments

@RohanSohani30

System Info

I am trying to run TGI in Docker using 8 GPUs with 16 GB each (in-house server). Docker works fine when using a single GPU.
My server crashes when using all GPUs. Is there any other way to do this?
PS: I need to use all GPUs so I can load big models. With a single GPU I can only use small models with a smaller max-input-length.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. docker run --gpus all --name tgi --shm-size 1g --cpus="5.0" --rm --runtime=nvidia \
       -e HUGGING_FACE_HUB_TOKEN=******* -p 8060:80 -v '$PATH':/data \
       ghcr.io/huggingface/text-generation-inference \
       --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
       --max-input-length 14000 --max-batch-prefill-tokens 14000 --max-total-tokens 16000

Expected behavior

INFO text_generation_router: router/src/main.rs:242: Using the Hugging Face API to retrieve tokenizer config
INFO text_generation_router: router/src/main.rs:291: Warming up model
WARN text_generation_router: router/src/main.rs:306: Model does not support automatic max batch total tokens
INFO text_generation_router: router/src/main.rs:328: Setting max batch total tokens to 16000
INFO text_generation_router: router/src/main.rs:329: Connected

@bwhartlove

Seeing a similar issue on my end.

@Hugoch
Member

Hugoch commented Jul 10, 2024

@RohanSohani30 Can you share the output of TGI when it errors?

@Hugoch added the question (Further information is requested) label on Jul 10, 2024
@HoKim98

HoKim98 commented Jul 11, 2024

I had a similar problem to #2192, and was able to work around it by passing --cuda-graphs 0 as in #2099. This obviously caused a major performance hit, but it was at least better than being broken.
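
For reference, a minimal sketch of that workaround applied to the reproduction command above (only --cuda-graphs 0 is new; the token and volume path are placeholders):

```shell
docker run --gpus all --name tgi --shm-size 1g --rm --runtime=nvidia \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
  --cuda-graphs 0  # disable CUDA graph capture, trading some throughput for stability
```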

@RohanSohani30
Author

@RohanSohani30 Can you share the output of TGI when it errors?

There are no errors, but the system crashes while warming up the model.

@Hugoch
Member

Hugoch commented Jul 13, 2024

Yeah seems related to CUDA graphs and a bug introduced in NCCL 2.20.5. Can you retry with the latest docker image as #2099 was merged?

@Hugoch added the bug (Something isn't working) label and removed the question (Further information is requested) label on Jul 13, 2024
@RohanSohani30
Author

Yeah seems related to CUDA graphs and a bug introduced in NCCL 2.20.5. Can you retry with the latest docker image as #2099 was merged?

I am using the latest Docker image and still facing the same issue.
I found one quick fix: with 2 or 4 GPUs, the TGI image runs when I pass --runtime=nvidia --env BUILD_EXTENSIONS=False --env NCCL_SHM_DISABLE=1 (sketched below).
But here is a catch: let's say a model takes 11 GB of GPU memory to load on a single GPU. When I use 2 GPUs it takes more than 11 GB per GPU, so the total goes above 25 GB. I used --cuda-memory-fraction to limit GPU usage per GPU.
I want to load a model across multiple GPUs so I can load big models.
Am I missing something?
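
For reference, the multi-GPU workaround described above would look roughly like the sketch below. It is illustrative only: the token, port, and volume path are placeholders, and 0.8 is an example value for --cuda-memory-fraction rather than a recommendation from this thread.

```shell
docker run --gpus '"device=0,1"' --name tgi --shm-size 1g --rm --runtime=nvidia \
  --env BUILD_EXTENSIONS=False --env NCCL_SHM_DISABLE=1 \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 2 \
  --cuda-memory-fraction 0.8  # cap per-GPU memory usage; tune to your cards
```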

@Hugoch
Member

Hugoch commented Jul 18, 2024

If disabling SHM solves the issue, it means there is a problem in the way your system handles SHM. How much RAM do you have on the machine?
If I understand correctly, loading the model on 1 GPU takes 11 GB of VRAM, while you OOM when using 2 GPUs?
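
(One hedged suggestion, not a confirmed fix: NCCL uses /dev/shm inside the container, so before disabling SHM outright it can be worth giving the container more shared memory than the 1g in the reproduction command, e.g.:)

```shell
# Illustrative only: the reproduction command with a larger /dev/shm for NCCL; 4g is an example value.
docker run --gpus all --name tgi --shm-size 4g --rm --runtime=nvidia \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 8
```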

@HoKim98 did the latest Docker image make it work?

@HoKim98

HoKim98 commented Jul 18, 2024

@Hugoch It seems to be working! I ran a 10-minute stress test and no errors were found.

@RohanSohani30
Author

If disabling SHM solves the issue, it means there is a problem in the way your system handles SHM. How much RAM do you have on the machine? If I understand correctly, loading the model on 1 GPU takes 11 GB of VRAM, while you OOM when using 2 GPUs?

@HoKim98 did the latest Docker image make it work?

1 TB RAM with 8 × 16 GB VRAM.
While using 2 GPUs I am getting OOM if the model is big (above 22B).
There is another scenario: when I load a model using the TGI CLI, I can load big models on all 8 GPUs without OOM, but tokens per second are very low, less than 1. Using the CLI, the model is distributed across all GPUs.
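
For clarity, the CLI path mentioned above is sketched below, assuming the launcher binary (text-generation-launcher) is installed locally; the flags simply mirror the Docker reproduction command:

```shell
text-generation-launcher \
  --model-id meta-llama/Meta-Llama-3-8B \
  --num-shard 8 \
  --max-input-length 14000 --max-batch-prefill-tokens 14000 --max-total-tokens 16000
```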

@Hugoch
Member

Hugoch commented Jul 19, 2024

Llama3-8B has a context of 8k, so you probably want to reduce max-total-tokens and max-input-length. Try setting a low max-batch-total-tokens to check whether you can load the model. If that works, you can incrementally increase it until you hit OOM.
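
Concretely, that suggestion would look something like the sketch below; the 8000/8192 values are illustrative, sized to Llama3-8B's 8k context rather than taken from this thread:

```shell
docker run --gpus all --name tgi --shm-size 1g --rm --runtime=nvidia \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> -p 8060:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B --num-shard 8 \
  --max-input-length 8000 --max-total-tokens 8192 \
  --max-batch-prefill-tokens 8000 --max-batch-total-tokens 8192
```

If the model loads, increase --max-batch-total-tokens step by step until you hit OOM, then back off.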

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions bot added the Stale label on Aug 19, 2024
@github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Aug 25, 2024