Support llama.cpp "Multi GPU support, CUDA refactor, CUDA scratch buffer" #344
Comments
Pushed to v0.1.59, let me know if that works.
Can I use this with the high-level API, or is it only available through the low-level one?
Working for me with the high level API.
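For reference, here is a minimal sketch of what this looks like through the high-level API, assuming a CUDA build of llama-cpp-python (>= 0.1.59) that exposes the `n_gpu_layers`, `tensor_split`, and `main_gpu` parameters; the model path, layer count, and split ratios below are placeholders, not values from this thread:

```python
from llama_cpp import Llama

# Sketch only: path, layer count, and split ratios are placeholders.
llm = Llama(
    model_path="./models/llama-13b-q8_0.bin",  # hypothetical local GGML file
    n_gpu_layers=40,           # number of layers to offload to the GPUs
    tensor_split=[0.6, 0.4],   # fraction of the model assigned to each GPU
    main_gpu=0,                # GPU used for scratch buffers and small tensors
)

out = llm("Q: What is multi-GPU inference? A:", max_tokens=64)
print(out["choices"][0]["text"])
```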
Check
Here is a brief experimental result. I used ziya, a large-scale pre-trained model based on LLaMA with 13 billion parameters, quantized to q8_0. The model was successfully split across two GPUs. However, the same prompts seem slower than on a single 3090, so perhaps more experiments are needed.
The inference time was slower than before, probably because I was running the evaluation code on another card at the same time 😄
Yeah, I also got worse performance using two GPUs than using one. However, my second GPU is a 1080 Ti, so I assumed it was my environment.
@wyzhangyuhan I think that is correct behavior? If it's like the CPU/GPU split, the GPUs will run sequentially, not in parallel: it will run some layers on one 3090 and then the rest on the other. But I think you can fit a Q4 65B model on 2x 3090s with very minimal quality loss 👍
Multi-GPU inference is essential for small-VRAM GPUs: a 13B LLaMA model cannot fit on a single 3090 without quantization (a rough memory estimate is sketched below).
llama.cpp merged its multi-GPU branch yesterday, which lets us deploy LLMs on small-VRAM GPUs:
ggerganov/llama.cpp#1703
I hope llama-cpp-python can support multi-GPU inference in the future.
Many thanks!!!
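As a rough illustration of why quantization (or a multi-GPU split) matters on a 24 GiB card, here is a back-of-the-envelope estimate of weight memory alone. It is a sketch: it ignores the KV cache and scratch buffers, and assumes the nominal bits-per-weight of the GGML quantization formats (which already include per-block scales).

```python
# Rough weight-memory estimate for a 13B-parameter LLaMA-style model.
PARAMS = 13e9
GiB = 1024 ** 3

# Approximate nominal bits per weight for each format.
formats = {
    "fp16": 16.0,
    "q8_0": 8.5,   # 8-bit weights + fp16 scale per 32-weight block
    "q4_0": 4.5,   # 4-bit weights + fp16 scale per 32-weight block
}

for name, bits in formats.items():
    size_gib = PARAMS * bits / 8 / GiB
    fits = "fits" if size_gib < 24 else "does not fit"
    print(f"{name}: ~{size_gib:.1f} GiB of weights -> {fits} on a 24 GiB RTX 3090")
```

At fp16 the weights alone are roughly 24 GiB, so with the KV cache and buffers the model does not fit on one 3090, while q8_0 and q4_0 do; splitting across GPUs is the alternative when quantization is not acceptable.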