I deployed the Qwen2.5-1.5B-Instruct model on an A100 with tensorrtllm_backend using the following parameter values:
TRITON_MAX_BATCH_SIZE=1024
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=32
MAX_QUEUE_SIZE=0
DECOUPLED_MODE=true
LOGITS_DATATYPE=TYPE_FP32
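For reference, in the standard tensorrtllm_backend workflow these values are typically filled into the model repository's config.pbtxt with tools/fill_template.py; a rough sketch of what I ran (paths and the exact parameter names may differ by backend version, and note the template takes the queue delay in microseconds, not milliseconds):

```shell
# Sketch only -- adjust paths/parameter names to your tensorrtllm_backend version.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:1024,\
decoupled_mode:true,\
max_queue_delay_microseconds:32000,\
max_queue_size:0,\
logits_datatype:TYPE_FP32
```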
The model deploys successfully, but I cannot get more than 228 requests through, regardless of the number of users and hatch rate, over a 60-second run in Locust with constant_throughput(1).
For example, with num users = 100 and hatch rate = 100, the timings are:
p50 = 26000 ms
p90 = 26000 ms
p99 = 27000 ms
p100 = 27000 ms
reqs = 223
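A quick sanity check on these numbers (my own back-of-envelope arithmetic, not from any tool): with constant_throughput(1), each Locust user targets at most 1 req/s but also cannot send a new request until the previous response arrives, so a ~26 s end-to-end latency caps each user at roughly 60/26 ≈ 2.3 requests per run. That alone reproduces the ~228-request ceiling:

```python
# Back-of-envelope: why ~228 requests appears to be the ceiling for this run.
# Assumes constant_throughput(1) semantics: a user waits for the previous
# response before pacing its next request at 1 req/s.

num_users = 100
run_time_s = 60
p50_latency_s = 26.0          # observed p50 from the run above
target_rate_per_user = 1.0    # constant_throughput(1)

# Effective per-user rate is limited by whichever is slower:
# the wait-time target (1 req/s) or the response latency (1/26 req/s).
effective_rate = min(target_rate_per_user, 1.0 / p50_latency_s)

total_requests = int(num_users * effective_rate * run_time_s)
print(total_requests)  # ~230, close to the observed 223-228
```

If this reading is right, the request count is latency-bound rather than limited by Locust settings, so the fix would be reducing per-request latency (or raising concurrency the server can actually sustain), not tuning the load generator.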
What changes do I need to make to increase throughput?