I deployed the Qwen2.5-1.5B-Instruct model on an A100 with tensorrtllm_backend using the following parameter values:
TRITON_MAX_BATCH_SIZE=1024
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=32
MAX_QUEUE_SIZE=0
DECOUPLED_MODE=true
LOGITS_DATATYPE=TYPE_FP32
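For reference, in the standard tensorrtllm_backend workflow these values are typically filled into the model repository's config.pbtxt with tools/fill_template.py; a rough sketch of what I ran (paths and the exact parameter names may differ by backend version, and note the template takes the queue delay in microseconds, not milliseconds):

```shell
# Sketch only -- adjust paths/parameter names to your tensorrtllm_backend version.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:1024,\
decoupled_mode:true,\
max_queue_delay_microseconds:32000,\
max_queue_size:0,\
logits_datatype:TYPE_FP32
```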
The model deploys successfully, but I cannot get more than 228 requests through, regardless of the number of users and hatch rate, over a 60-second run in Locust with constant_throughput(1).
For example, with num users = 100 and hatch rate = 100, the timings are:
p50 = 26000 ms
p90 = 26000 ms
p99 = 27000 ms
p100 = 27000 ms
reqs = 223
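A quick sanity check on these numbers (my own back-of-envelope arithmetic, not from any tool): with constant_throughput(1), each Locust user targets at most 1 req/s but also cannot send a new request until the previous response arrives, so a ~26 s end-to-end latency caps each user at roughly 60/26 ≈ 2.3 requests per run. That alone reproduces the ~228-request ceiling:

```python
# Back-of-envelope: why ~228 requests appears to be the ceiling for this run.
# Assumes constant_throughput(1) semantics: a user waits for the previous
# response before pacing its next request at 1 req/s.

num_users = 100
run_time_s = 60
p50_latency_s = 26.0          # observed p50 from the run above
target_rate_per_user = 1.0    # constant_throughput(1)

# Effective per-user rate is limited by whichever is slower:
# the wait-time target (1 req/s) or the response latency (1/26 req/s).
effective_rate = min(target_rate_per_user, 1.0 / p50_latency_s)

total_requests = int(num_users * effective_rate * run_time_s)
print(total_requests)  # ~230, close to the observed 223-228
```

If this reading is right, the request count is latency-bound rather than limited by Locust settings, so the fix would be reducing per-request latency (or raising concurrency the server can actually sustain), not tuning the load generator.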
What changes do I need to make to increase throughput?