Speculative sampling #675
Comments
@andriyanthon good idea, I'll take a look into this. I think an API similar to Hugging Face's Assisted Generation would work well.
+1. Would probably double performance in my setup.
+1. It would be very useful.
Any updates on this? |
+1. On my hardware, the acceleration was noticeable with phind-codellama-34b-v2.Q4_K_M.gguf. An example of the full llama.cpp CLI invocation and the results for me are below:
I have made some speculative decoding tests with the following models on my RTX 3090:
With speculative decoding I get 3.41 tokens/second, while without it I get 2.08 tokens/second. That's a +64% increase. This is the command that I used:

```shell
./speculative \
    -m ../models/wizardlm-70b-v1.0.Q4_K_S.gguf \
    -md ../models/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf \
    -e \
    -t 6 \
    -tb 12 \
    -n 256 \
    -c 4096 \
    --draft 15 \
    -ngld 128 \
    -ngl 42 \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Give me an example of Python script.\nASSISTANT:"
```

Having this feature available in llama-cpp-python would be amazing.
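For readers unfamiliar with the technique being benchmarked here: speculative decoding has a small draft model cheaply propose several tokens, which the large target model then verifies, keeping the longest accepted prefix. Below is a minimal greedy sketch in plain Python; the `target_next`/`draft_next` callables are toy stand-ins for real models, not the llama.cpp or llama-cpp-python API.

```python
def speculative_decode(target_next, draft_next, prompt, max_new=20, k=4):
    """Greedy speculative decoding sketch.

    target_next(seq) / draft_next(seq) each return the greedy next
    token for `seq`. The draft proposes k tokens; the target verifies
    them in order and keeps the longest matching prefix.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model cheaply proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Target model verifies the proposals in order.
        accepted = 0
        for tok in draft:
            if target_next(seq) == tok:
                seq.append(tok)
                accepted += 1
            else:
                break
        # 3. On a mismatch the target emits one token itself, so every
        #    round makes progress even if the draft is always wrong.
        if accepted < k:
            seq.append(target_next(seq))
    return seq[len(prompt):len(prompt) + max_new]
```

With greedy acceptance the output is identical to plain greedy decoding from the target alone; the speed-up in a real implementation comes from verifying the whole draft in one batched target forward pass, which this per-token sketch does not model.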
This feature looks so cool :) Looking forward to it!
#1120 is almost ready, need to do some more testing and perf benchmarks but it works now with prompt lookup decoding. |
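Prompt lookup decoding, mentioned above, is a model-free way to produce draft tokens: it matches the trailing n-gram of the generated sequence against earlier occurrences in the context and proposes the tokens that followed that earlier occurrence. A rough sketch of the idea, with hypothetical function and parameter names:

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_pred=5):
    """Propose draft tokens via n-gram lookup in the context.

    If the last `ngram_size` tokens also occurred earlier in `tokens`,
    return up to `num_pred` tokens that followed the most recent
    earlier occurrence; otherwise return an empty draft.
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan earlier start positions, most recent first, skipping the
    # position of the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            cont = tokens[start + ngram_size:start + ngram_size + num_pred]
            if cont:
                return cont
    return []
```

This works well on repetitive inputs (code, summarization, retrieval contexts), where the continuation of a repeated n-gram is often the same as before; the target model still verifies every proposed token, so a bad draft only costs speed, never correctness.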
This feature looks so cool! How can we make it support more speculative decoding methods, not just prompt lookup decoding?
llama.cpp added a feature for speculative inference:
ggerganov/llama.cpp#2926
but when running llama_cpp.server, it says it does not recognize the new parameters.
There are two new parameters:
Can this new feature please be supported?