Is there a way to benchmark models at long context sizes? I am trying to use `llama-bench`, but I cannot figure out whether it is even possible. I have tried:

What I want to do is compare how fast my model can generate, say, 128 tokens with a short 128-token prompt, versus how long it takes to generate those same 128 tokens with 15k tokens in the context. I have not been able to figure out how to do that. Does anyone know if this is even possible? Thanks!
You can get these numbers with the `llama-batched-bench` tool, although it does not compute uncertainties like the `llama-bench` tool does.
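
For the comparison described in the question, an invocation along these lines should work (a sketch based on the batched-bench README; the model path is a placeholder, and exact flag names may vary between llama.cpp versions):

```sh
# Measure 128-token generation after a 128-token prompt vs. a ~15k-token prompt.
# -npp: prompt lengths to test, -ntg: tokens to generate per sequence,
# -npl: number of parallel sequences (1 = single stream, like llama-bench).
# -c must be large enough to hold the longest prompt plus the generated tokens.
./llama-batched-bench -m ./model.gguf -c 16384 \
    -npp 128,15360 -ntg 128 -npl 1
```

The tool prints one result row per (prompt length, generation length, parallel count) combination, with separate prompt-processing and text-generation speeds in tokens per second, so comparing the text-generation speed of the two rows shows how much generation slows down with the long context.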