Outdated benchmarks #381
Hello!

Paged Attention was added to text-generation-inference in 0.9, and all the benchmarks you display in your README are now outdated.

Any chance you could update them?

Cheers

Comments
For example, running your throughput benchmark on Llama-7B on a single A10, we get a throughput of 112 req/min with TGI:

```
docker run --gpus all -p 3000:80 -v /data:/data ghcr.io/huggingface/text-generation-inference:0.9.1 --model-id /data/llama-7b --num-shard 1 --max-batch-total-tokens 17664 --max-batch-prefill-tokens 2048 --max-waiting-tokens 0
```
@OlivierDehaene Thanks for your support on PagedAttention! We will test the performance of the latest TGI and update the figure accordingly.
It would be awesome to include some latency numbers too, in addition to just throughput!
Yeah, the benchmark will be very interesting and useful! One side question: how did you get `--max-batch-total-tokens 17664` in TGI?
What do you mean? |
I guess the question is how did you determine this specific limit for max batch total tokens? |
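(For context, a back-of-the-envelope sketch of how a token budget in this ballpark could be derived: the KV cache can use roughly whatever GPU memory is left after the model weights, divided by the per-token cache size. All numbers below are illustrative assumptions, not how TGI actually computes the limit.)

```python
# Rough KV-cache token budget for Llama-7B (fp16) on an A10.
# Every value here is an assumption for illustration only.

layers = 32          # Llama-7B decoder layers
hidden = 4096        # hidden size
bytes_fp16 = 2

# KV cache stores one key and one value vector per layer per token.
kv_bytes_per_token = 2 * layers * hidden * bytes_fp16   # 512 KiB/token

gpu_mem = 24e9       # A10: 24 GB
weights = 2 * 7e9    # ~14 GB for 7B params in fp16
overhead = 2e9       # assumed headroom for activations/fragmentation

budget_tokens = (gpu_mem - weights - overhead) / kv_bytes_per_token
print(int(budget_tokens))  # ~15k tokens, the same order as 17664
```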
Closing because README no longer contains benchmark results. |