Prompt processing speed is bottlenecked by hardcoded upper batch size limit of 128 #2605
Labels: backend (gpt4all-backend issues), chat (gpt4all-chat issues), python-bindings (gpt4all-bindings Python specific issues)
Hardcoded here: `gpt4all/gpt4all-backend/llmodel.h`, line 19 (commit c11e0f4).
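The embedded snippet does not render in this copy; the limit in question is presumably a compile-time constant along these lines (a hypothetical reconstruction, not the verbatim source, with only the value 128 taken from the issue):

```cpp
// Hypothetical reconstruction of the hardcoded cap in llmodel.h;
// the macro name is an assumption, the value 128 is from the issue.
#define LLMODEL_MAX_PROMPT_BATCH 128
```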
With llama.cpp's `main` example on my Tesla P40 with the CUDA backend, I can measure a 3x increase in prompt processing speed with a batch size of 512 compared to 128. The Python bindings default to a batch size of 9, which is even worse.
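For reference, llama.cpp lets the caller pick these sizes when the context is created. A minimal sketch, assuming llama.cpp's C API after the n_ubatch change (field and function names are from that API, the 512 values are the settings benchmarked above):

```cpp
#include "llama.h"

// Sketch: create a context whose logical and physical batch sizes are 512
// rather than 128; the larger physical batch is where the ~3x prompt
// processing speedup on the P40 comes from.
llama_context *make_context(llama_model *model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_batch  = 512; // logical batch: max tokens per llama_decode call
    cparams.n_ubatch = 512; // physical micro-batch actually run on the device
    return llama_new_context_with_model(model, cparams);
}
```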
A related problem is that the llama.cpp change that introduced n_ubatch made it important that the batch size not increase beyond some known threshold (e.g., a new hardcoded upper limit of 512) after load time (see the sketch below). Slack discussion:
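Concretely, whatever batch size the caller requests at prompt time has to be clamped to what the context was configured with at load time. A minimal sketch of such a clamp (names and structure are illustrative assumptions, not the actual gpt4all code):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative new upper limit taken from the issue text.
static constexpr int32_t kMaxPromptBatch = 512;

// Hypothetical clamp: once n_ubatch is in play, the batch size used at
// prompt time must never exceed the n_batch the llama context was
// created with, or decoding the prompt will fail.
int32_t clampPromptBatch(int32_t requested, int32_t nBatchAtLoad) {
    return std::min({requested, nBatchAtLoad, kMaxPromptBatch});
}
```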