
Prompt processing speed is bottlenecked by hardcoded upper batch size limit of 128 #2605

Open
cebtenzzre opened this issue Jul 8, 2024 · 0 comments
Labels
backend (gpt4all-backend issues), chat (gpt4all-chat issues), python-bindings (gpt4all-bindings Python specific issues)

@cebtenzzre (Member)

Hardcoded here:

#define LLMODEL_MAX_PROMPT_BATCH 128

With llama.cpp's main example on my Tesla P40 with the CUDA backend, I can measure a 3x increase in prompt processing speed with a batch size of 512 compared to 128.
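
For illustration, here is a hedged sketch (not the actual gpt4all source) of the pattern such a cap implies: whatever batch size the caller configures, the prompt ends up being decoded in chunks no larger than `LLMODEL_MAX_PROMPT_BATCH`. The `processPrompt`/`decodeChunk` names are hypothetical stand-ins.

```cpp
// Hypothetical sketch (not the actual gpt4all source): how a hardcoded cap like
// LLMODEL_MAX_PROMPT_BATCH ends up limiting prompt processing throughput.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#define LLMODEL_MAX_PROMPT_BATCH 128

// Stand-in for the per-chunk decode call (llama_decode in the real backend).
static void decodeChunk(const int32_t * /*tokens*/, size_t /*n*/) {}

static void processPrompt(const std::vector<int32_t> &promptTokens, size_t n_batch) {
    // Whatever the caller configures, the effective batch size is clamped to 128,
    // so hardware that is fastest at e.g. 512 tokens per batch never gets there.
    n_batch = std::min<size_t>(n_batch, LLMODEL_MAX_PROMPT_BATCH);

    for (size_t i = 0; i < promptTokens.size(); i += n_batch) {
        size_t end = std::min(i + n_batch, promptTokens.size());
        decodeChunk(promptTokens.data() + i, end - i);
    }
}
```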

The Python bindings default to a batch size of 9, which is even worse.



A related problem is that the llama.cpp change that introduced n_ubatch made it important that the batch size does not increase beyond some known threshold (e.g., a new hardcoded upper limit of 512) after load time. Slack discussion:

cebtenzzre: This llama.cpp commit, which came with GPT4All v2.8.0, conflicts with the way we currently manage the batch size: ours can change between messages, whereas n_ubatch is a constant property of the llama_context, which we only create once per model load. Currently the effective batch size will be capped to 512, since we don't change n_ubatch from the default for text completion models.

manyoso: So n_ubatch has a different semantic?

cebtenzzre:
Old behavior:

  • n_batch is a constant property of the context that defaults to 512; you are supposed to set it and pass at most n_batch tokens to llama_decode. We never set it, but given the way we call llama_decode it is more or less a hint for memory allocation, and it being incorrect doesn't really hurt AFAICT.

New behavior:

  • n_batch now defaults to 2048 and still represents the maximum number of tokens you are promising to send to llama_decode.
  • n_ubatch is added to support parallel inference of a model by multiple users in the llama.cpp server example. It defaults to 512, and llama_decode breaks its input into n_ubatch-sized chunks. Software that uses llama.cpp can now use one llama_decode call to do inference on multiple n_ubatch-sized batches in parallel. This doesn't affect users of the llama.cpp main example, because for them n_ubatch defaults to n_batch, which is a constant passed via CLI args.
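
To make the distinction above concrete, here is a minimal sketch of the relevant llama.cpp API as of the version discussed here (the model path is a placeholder): both n_batch and n_ubatch are fields of llama_context_params, consumed once when the context is created, which is why neither can change between messages without recreating the context.

```cpp
// Minimal sketch of why n_ubatch is "a constant property of the llama_context":
// both sizes are set via llama_context_params and consumed once, at context
// creation (i.e., once per model load). The model path is a placeholder.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model *model = llama_load_model_from_file("model.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 2048; // logical limit: max tokens per llama_decode() call
    cparams.n_ubatch = 512;  // physical micro-batch: llama_decode() splits its
                             // input into n_ubatch-sized chunks internally

    llama_context *ctx = llama_new_context_with_model(model, cparams);

    // ... prompt processing: up to n_batch tokens may be submitted per
    // llama_decode() call, but the effective compute batch is n_ubatch, and
    // neither value can change without recreating the context.

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

This is why simply raising the hardcoded cap is not enough on its own: the new value also has to be reflected in n_ubatch at context creation time, otherwise the effective batch size stays capped at the 512 default.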

cebtenzzre: The batch size defaults to 128 for the chat UI and 9 for the python bindings (for whatever reason), so this only really affects users who change it

apage43: IIRC 9 was from very early on, when processing even 128 tokens of prompt context took a while on CPU; 9 was chosen to be small enough that if you canceled prompt processing, the GUI would be able to respond within a second or two. The bindings would have just picked it up from the GUI code when they were added, and it never got changed even though the UI has changed.

apage43: there is, as far as I know, not any reason for it to stay so small

cebtenzzre: they diverged in #840
