System Info
OS version: Linux
Model being used (curl 127.0.0.1:8080/info | jq): TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
Hardware used (GPUs, how many, on which cloud) (nvidia-smi): 1xL40S
The current version being used: 2.0.4
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Launch TGI with --max-total-tokens 16384, --max-batch-prefill-tokens 16384, --max-input-length 16383, and --quantize awq.
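For reference, a launch command along these lines reproduces the configuration (a sketch: the volume path and port mapping are assumptions, not from the original report):

    docker run --gpus all -p 8080:80 -v $PWD/data:/data \
      ghcr.io/huggingface/text-generation-inference:2.0.4 \
      --model-id TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ \
      --quantize awq \
      --max-input-length 16383 \
      --max-total-tokens 16384 \
      --max-batch-prefill-tokens 16384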
After a few hundred requests, the pod returns empty packets, and only does so a few seconds after a request has been made.
Monitoring reveals that tgi_queue_size increases steadily but never goes down.
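For anyone reproducing this: tgi_queue_size is exposed on TGI's Prometheus metrics endpoint, so assuming the port mapping above it can be watched with:

    watch -n 1 'curl -s 127.0.0.1:8080/metrics | grep tgi_queue_size'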
Expected behavior
No stutters.
I had the same problem and was able to work around it by launching with --cuda-graphs 0. This obviously costs significant performance, but it was at least a better option than a broken server.
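For anyone trying the same workaround, it is just an extra flag on the launcher invocation, e.g. (same assumed arguments as the sketch above):

    text-generation-launcher --model-id TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ \
      --quantize awq --cuda-graphs 0 \
      --max-input-length 16383 --max-total-tokens 16384 --max-batch-prefill-tokens 16384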