Description
Thanks for the great project! I am benchmarking the performance of llama.cpp with speculative decoding.
- Model setting: draft model llama-160m, target model llama-7b.
When I benchmark on a Mac M1 chip, the results look great: speculative decoding increases the speed from ~12 tokens/s to ~16 tokens/s.
However, the performance is not very good on an A100. Concretely, the timings of the draft and target models are:
draft:

```
llama_print_timings: load time = 65.11 ms
llama_print_timings: sample time = 524.95 ms / 1 runs ( 524.95 ms per token, 1.90 tokens per second)
llama_print_timings: prompt eval time = 8.59 ms / 94 tokens ( 0.09 ms per token, 10946.78 tokens per second)
llama_print_timings: eval time = 322.80 ms / 216 runs ( 1.49 ms per token, 669.15 tokens per second)
llama_print_timings: total time = 2924.72 ms
```
target:

```
llama_print_timings: load time = 1144.77 ms
llama_print_timings: sample time = 4.02 ms / 259 runs ( 0.02 ms per token, 64411.84 tokens per second)
llama_print_timings: prompt eval time = 1939.02 ms / 351 tokens ( 5.52 ms per token, 181.02 tokens per second)
llama_print_timings: eval time = 13.19 ms / 1 runs ( 13.19 ms per token, 75.82 tokens per second)
llama_print_timings: total time = 2999.59 ms
```
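For clarity, here is a minimal Python sketch of how I read the draft-to-target cost ratio `c` (used in the calculation further below) off these logs; the per-token times are copied from the two `eval time` lines above:

```python
# Per-token eval times copied from the "eval time" lines in the timings above.
draft_ms_per_token = 1.49    # draft model: 669.15 tokens per second
target_ms_per_token = 13.19  # target model: 75.82 tokens per second

# c = ratio between the per-token execution times of the draft and target models
c = draft_ms_per_token / target_ms_per_token
print(f"c = {c:.3f}")  # ~0.11, matching the 76/669 estimate below
```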
I am using greedy decoding and have disabled all the heuristics (fixed `n_draft`, always propose `n_draft` tokens, and no early stopping). My execution command is:
```bash
./build/bin/speculative \
    -ngl 1000 \
    -ngld 100 \
    -m /data/model/llama-7b/ggml-model-f16.gguf \
    -md /data/model/lama-160m/ggml-model-f16.gguf \
    -p "${prompt}" \
    -e --temp "-1" -n 256 -s 1 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5
```
When the token acceptance rate is 0.44, speculative decoding is actually slower than running the target model alone (notice 50 tokens/s < 75 tokens/s):
```
encoded 94 tokens in 0.076 seconds, speed: 1231.914 t/s
decoded 108 tokens in 2.145 seconds, speed: 50.341 t/s

n_draft = 5
n_predict = 108
n_drafted = 165
n_accept = 74
accept = 44.848%
```
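To make the comparison explicit, a small sketch of the acceptance rate and the observed speedup, with the numbers copied from the output above and from the target's eval speed:

```python
# Numbers copied from the speculative run output and the target timings above.
n_drafted = 165
n_accept = 74
speculative_speed = 50.341  # t/s, from the "decoded 108 tokens ..." line
target_only_speed = 75.82   # t/s, from the target's "eval time" line

accept_rate = n_accept / n_drafted
observed_speedup = speculative_speed / target_only_speed
print(f"acceptance rate  = {accept_rate:.1%}")        # ~44.8%
print(f"observed speedup = {observed_speedup:.2f}x")  # ~0.66x, i.e. a slowdown
```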
However, based on the original speculative decoding paper, the expected speedup should be (1 - alpha^(gamma+1)) / [(1 - alpha) * (gamma*c + 1)], where alpha is the token acceptance rate, gamma is the number of tokens proposed each step, and c is the ratio between the per-token execution times of the draft and target models. In the example above, c is roughly 76/669 = 0.11.
Plugging in the numbers above (alpha = 0.44, gamma = 5, c = 0.11), the expected speedup should be (1 - 0.44^6) / [(1 - 0.44) * (5*0.11 + 1)] = 1.14x.
However, the benchmarking results show that it is actually 50/76 = 0.66x.
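For reproducibility, here is a small Python sketch of that calculation (the `expected_speedup` helper name is just for illustration; alpha is the measured acceptance rate, gamma comes from `--draft 5`, and c from the timings above):

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected walltime improvement factor from the speculative decoding paper:
    (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma*c + 1))."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# alpha = measured acceptance rate, gamma = --draft 5, c = cost ratio from the logs
print(f"expected: {expected_speedup(alpha=0.44, gamma=5, c=0.11):.2f}x")  # ~1.14x
print(f"observed: {50 / 76:.2f}x")                                        # ~0.66x
```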
To debug this, I set the token acceptance rate to 100% by removing the `id==draft_id[i_dft]` check here. After doing this, I observe that the speed is ~90 tokens/s, which is a 90/76 = 1.18x speedup. However, this is still much smaller than the prediction from the formula above (using 0.99 as the token acceptance rate instead of 1): (1 - 0.99^6) / [(1 - 0.99) * (5*0.11 + 1)] = 3.78x.
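The same formula sketch as above, plugging in alpha = 0.99 for this case:

```python
# Same formula as in the earlier sketch, with a near-perfect acceptance rate
alpha, gamma, c = 0.99, 5, 0.11
expected = (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))
print(f"expected: {expected:.2f}x")  # ~3.78x
print(f"observed: {90 / 76:.2f}x")   # ~1.18x
```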
I wonder which part of the speculative decoding implementation might be causing this large overhead; any comments are highly appreciated! Thanks!