Running two different models on one GPU #1046
Unanswered
daytonturner asked this question in Q&A
I have an A6000 48GB, and I'd like to serve both a quantized Llama-2 and WizardCoder, which can easily fit together inside the 48GB available, but I'm unsure of the best way to go about this, or whether it's a bad idea for some reason.

Initially, I thought simply running two TGI instances, each pointing to the respective model, would be a reasonable approach, but are my assumptions correct? Any thoughts?

Replies: 1 comment · 1 reply

-

This is the correct way to go about it. Use |

1 reply
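For context, here is a minimal sketch of what that two-instance setup might look like, assuming TGI's `text-generation-launcher` is on the PATH and that your TGI version supports the `--cuda-memory-fraction` flag; the model IDs, ports, and memory fractions below are placeholders, not values from this thread:

```python
import subprocess

# Placeholder quantized checkpoints; substitute the models you actually serve.
INSTANCES = [
    # (model id, port, fraction of GPU memory the instance may claim)
    ("TheBloke/Llama-2-13B-chat-GPTQ", 8080, 0.5),
    ("TheBloke/WizardCoder-Python-13B-V1.0-GPTQ", 8081, 0.5),
]

procs = []
for model_id, port, mem_fraction in INSTANCES:
    procs.append(subprocess.Popen([
        "text-generation-launcher",
        "--model-id", model_id,
        "--port", str(port),
        "--quantize", "gptq",
        # Cap each server's share of VRAM so the two instances
        # don't both try to pre-allocate the whole 48GB.
        "--cuda-memory-fraction", str(mem_fraction),
    ]))

# Block until the servers exit.
for p in procs:
    p.wait()
```

The same effect can be had by running the two launcher commands in separate shells or Docker containers; the key points are distinct ports and a per-instance memory cap.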
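Once both instances are up, clients simply target different ports. A sketch using TGI's `/generate` REST endpoint, with ports matching the launcher sketch above:

```python
import requests

def generate(port: int, prompt: str, max_new_tokens: int = 128) -> str:
    """Call the /generate endpoint of a local TGI instance."""
    resp = requests.post(
        f"http://localhost:{port}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Route chat prompts to the Llama-2 instance, code prompts to WizardCoder.
print(generate(8080, "Summarize what an RTX A6000 is in one sentence."))
print(generate(8081, "Write a Python function that reverses a string."))
```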