Description
What happened?
Hello, llama.cpp experts! Thank you for creating such an amazing LLM inference system. 😁
However, while using this system, I encountered an unusual result when checking the speculative decoding output.
I believe the observed behavior is a bug, so I am reporting it as a bug issue on this GitHub project.
First, here is the configuration of my system:
- OS: Ubuntu 22.04
- CUDA: 12.4
- GPU: A100 80GB
Next, I will explain the steps I took to download and run the models up to the point where the bug occurred.
It was somewhat challenging for me to get everything set up with llama.cpp, so I am listing the commands in full.
```sh
# download draft model
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir=./llama-1.1b
./venv/bin/python3 convert_hf_to_gguf.py ./llama-1.1b

# download target model
huggingface-cli download NousResearch/Llama-2-7b-hf --local-dir=./llama-7b
./venv/bin/python3 convert_hf_to_gguf.py ./llama-7b

# run llama-speculative
./build/bin/llama-speculative -m ./llama-7b/ggml-model-f16.gguf -md ./llama-1.1b/ggml-model-f16.gguf -p "Making cake is like" -e -ngl 100 -ngld 100 -t 4 --temp 1.0 -n 128 -c 4096 -s 20 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5
```
And the printed result is as follows:
```
draft:
llama_print_timings: load time = 4430.64 ms
llama_print_timings: sample time = 897.28 ms / 555 runs ( 1.62 ms per token, 618.54 tokens per second)
llama_print_timings: prompt eval time = 9531.68 ms / 228 tokens ( 41.81 ms per token, 23.92 tokens per second)
llama_print_timings: eval time = 1968.11 ms / 444 runs ( 4.43 ms per token, 225.60 tokens per second)
llama_print_timings: total time = 19874.43 ms / 672 tokens

target:
llama_print_timings: load time = 26494.43 ms
llama_print_timings: sample time = 1337.68 ms / 112 runs ( 11.94 ms per token, 83.73 tokens per second)
llama_print_timings: prompt eval time = 1840.43 ms / 673 tokens ( 2.73 ms per token, 365.68 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 24380.18 ms / 674 tokens
```
Here, unlike #3649, I got an `inf` eval time for the target model.
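If I read the timing line correctly, the `inf` seems to fall straight out of the arithmetic of the printed fields (this is only my interpretation of the report, not a confirmed reading of the llama.cpp timing code):

$$
\text{tokens per second} = \frac{n_\text{runs}}{t_\text{eval}} = \frac{1}{0.00\ \text{ms}} = \infty
$$

so the underlying question may be why the eval time of the target model is measured as 0.00 ms over 1 run in the first place.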
I am currently comparing the generation-phase latency of the draft model and the target model in speculative decoding. So far, I have used `llama-bench` and `llama-cli` to measure tokens per second for each model (roughly as sketched below), and the results have been different (e.g. the latency ratio measured with `llama-bench` was significantly larger than that measured with `llama-cli`). Therefore I attempted an additional measurement with `llama-speculative`, but I obtained the unusual value `inf`, as shown above. I would like to ask for confirmation on whether this measurement result is a bug or expected behavior of llama.cpp. 🙏
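For reference, this is roughly the form of the standalone per-model measurements I am comparing against, using the converted models from above (the exact flag values here are illustrative rather than the precise commands I ran):

```sh
# generation speed of each model on its own, as reported by llama-bench
./build/bin/llama-bench -m ./llama-7b/ggml-model-f16.gguf -n 128 -ngl 100
./build/bin/llama-bench -m ./llama-1.1b/ggml-model-f16.gguf -n 128 -ngl 100

# the same comparison with llama-cli, using the same prompt as the speculative run
./build/bin/llama-cli -m ./llama-7b/ggml-model-f16.gguf -p "Making cake is like" -n 128 -ngl 100 --temp 1.0
./build/bin/llama-cli -m ./llama-1.1b/ggml-model-f16.gguf -p "Making cake is like" -n 128 -ngl 100 --temp 1.0
```

The ratio of the tokens-per-second numbers from these runs is what I would like to cross-check against the draft/target timings printed by `llama-speculative`.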
Name and Version
```
./build/bin/llama-cli --version
version: 3392 (bda62d79) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```
What operating system are you seeing the problem on?
Linux
Relevant log output
No response