Description
Thanks for the great project! I am benchmarking the performance of llama.cpp with speculative decoding.
- Model setting: draft model llama-160m, target model llama-7b.
When I benchmark on a Mac M1 chip, the results look great: speculative decoding increases the speed from ~12 tokens/s to ~16 tokens/s.
However, the performance is not very good on an A100. Concretely, the timings of the draft and target models are:
draft:

```
llama_print_timings: load time = 65.11 ms
llama_print_timings: sample time = 524.95 ms / 1 runs ( 524.95 ms per token, 1.90 tokens per second)
llama_print_timings: prompt eval time = 8.59 ms / 94 tokens ( 0.09 ms per token, 10946.78 tokens per second)
llama_print_timings: eval time = 322.80 ms / 216 runs ( 1.49 ms per token, 669.15 tokens per second)
llama_print_timings: total time = 2924.72 ms
```
target:

```
llama_print_timings: load time = 1144.77 ms
llama_print_timings: sample time = 4.02 ms / 259 runs ( 0.02 ms per token, 64411.84 tokens per second)
llama_print_timings: prompt eval time = 1939.02 ms / 351 tokens ( 5.52 ms per token, 181.02 tokens per second)
llama_print_timings: eval time = 13.19 ms / 1 runs ( 13.19 ms per token, 75.82 tokens per second)
llama_print_timings: total time = 2999.59 ms
```
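For clarity, here is a minimal Python sketch of how I read the draft-to-target cost ratio `c` (used in the calculation further below) off these logs; the per-token times are copied from the two `eval time` lines above:

```python
# Per-token eval times copied from the "eval time" lines in the timings above.
draft_ms_per_token = 1.49    # draft model: 669.15 tokens per second
target_ms_per_token = 13.19  # target model: 75.82 tokens per second

# c = ratio between the per-token execution times of the draft and target models
c = draft_ms_per_token / target_ms_per_token
print(f"c = {c:.3f}")  # ~0.11, matching the 76/669 estimate below
```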
I am using greedy decoding and have disabled all the heuristics (fixed `n_draft`, always propose `n_draft` tokens, and no early stopping). My execution command is:
```bash
./build/bin/speculative \
    -ngl 1000 \
    -ngld 100 \
    -m /data/model/llama-7b/ggml-model-f16.gguf \
    -md /data/model/lama-160m/ggml-model-f16.gguf \
    -p "${prompt}" \
    -e --temp "-1" -n 256 -s 1 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5
```
When the token acceptance rate is 0.44, speculative decoding is actually slower than running the target model alone (notice 50 tokens/s < 75 tokens/s):
```
encoded 94 tokens in 0.076 seconds, speed: 1231.914 t/s
decoded 108 tokens in 2.145 seconds, speed: 50.341 t/s

n_draft = 5
n_predict = 108
n_drafted = 165
n_accept = 74
accept = 44.848%
```
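To make the comparison explicit, a small sketch of the acceptance rate and the observed speedup, with the numbers copied from the output above and from the target's eval speed:

```python
# Numbers copied from the speculative run output and the target timings above.
n_drafted = 165
n_accept = 74
speculative_speed = 50.341  # t/s, from the "decoded 108 tokens ..." line
target_only_speed = 75.82   # t/s, from the target's "eval time" line

accept_rate = n_accept / n_drafted
observed_speedup = speculative_speed / target_only_speed
print(f"acceptance rate  = {accept_rate:.1%}")        # ~44.8%
print(f"observed speedup = {observed_speedup:.2f}x")  # ~0.66x, i.e. a slowdown
```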
However, based on the original speculative decoding paper, the expected speedup should be (1 - alpha^(gamma+1)) / [(1 - alpha) * (gamma*c + 1)], where alpha is the token acceptance rate, gamma is the number of tokens proposed each step, and c is the ratio between the per-token execution times of the draft and target models. In the example above, c is roughly 76/669 = 0.11.
Plugging in the numbers above (alpha = 0.44, gamma = 5, c = 0.11), the expected speedup should be (1 - 0.44^6) / [(1 - 0.44) * (5*0.11 + 1)] = 1.14x.
However, the benchmarking results show that it is actually 50/76 = 0.66x.
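For reproducibility, here is a small Python sketch of that calculation (the `expected_speedup` helper name is just for illustration; alpha is the measured acceptance rate, gamma comes from `--draft 5`, and c from the timings above):

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected walltime improvement factor from the speculative decoding paper:
    (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma*c + 1))."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# alpha = measured acceptance rate, gamma = --draft 5, c = cost ratio from the logs
print(f"expected: {expected_speedup(alpha=0.44, gamma=5, c=0.11):.2f}x")  # ~1.14x
print(f"observed: {50 / 76:.2f}x")                                        # ~0.66x
```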
To debug this, I set the token acceptance rate to 100% by removing the `id==draft_id[i_dft]` check here. After doing this, I observe that the speed is ~90 tokens/s, which is a 90/76 = 1.18x speedup. However, this is still much smaller than the prediction from the formula above (using 0.99 as the token acceptance rate instead of 1): (1 - 0.99^6) / [(1 - 0.99) * (5*0.11 + 1)] = 3.78x.
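The same formula sketch as above, plugging in alpha = 0.99 for this case:

```python
# Same formula as in the earlier sketch, with a near-perfect acceptance rate
alpha, gamma, c = 0.99, 5, 0.11
expected = (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))
print(f"expected: {expected:.2f}x")  # ~3.78x
print(f"observed: {90 / 76:.2f}x")   # ~1.18x
```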
I wonder which part of the speculative decoding implementation might be causing this large overhead; any comments are highly appreciated! Thanks!