
Speculative Decoding is slower than expected on A100 #3649

Closed
@LiuXiaoxuanPKU

Description


Thanks for the great project! I am benchmarking the performance of llama.cpp with speculative decoding.

  • Model setting: draft model llama-160m, target model llama-7b.
    When I benchmark on a Mac M1 chip, the results look great: speculative decoding increases the speed from ~12 tokens/s to ~16 tokens/s.
    However, the performance is not very good on an A100. Concretely, the speeds of the draft and target models are:
draft:

llama_print_timings:        load time =      65.11 ms
llama_print_timings:      sample time =     524.95 ms /     1 runs   (  524.95 ms per token,     1.90 tokens per second)
llama_print_timings: prompt eval time =       8.59 ms /    94 tokens (    0.09 ms per token, 10946.78 tokens per second)
llama_print_timings:        eval time =     322.80 ms /   216 runs   (    1.49 ms per token,   669.15 tokens per second)
llama_print_timings:       total time =    2924.72 ms

target:

llama_print_timings:        load time =    1144.77 ms
llama_print_timings:      sample time =       4.02 ms /   259 runs   (    0.02 ms per token, 64411.84 tokens per second)
llama_print_timings: prompt eval time =    1939.02 ms /   351 tokens (    5.52 ms per token,   181.02 tokens per second)
llama_print_timings:        eval time =      13.19 ms /     1 runs   (   13.19 ms per token,    75.82 tokens per second)
llama_print_timings:       total time =    2999.59 ms

I am using greedy decoding and disabling all the heuristics (fixing n_draft, always proposing n_draft tokens, and avoiding early stopping). My execution command is:

./build/bin/speculative \
-ngl 1000 \
-ngld 100 \
-m /data/model/llama-7b/ggml-model-f16.gguf \
-md /data/model/lama-160m/ggml-model-f16.gguf \
-p "${prompt}" \
-e --temp "-1" -n 256 -s 1 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5

When the token acceptance rate is 0.44, speculative decoding is actually slower (note 50 tokens/s < 75 tokens/s):

encoded   94 tokens in    0.076 seconds, speed: 1231.914 t/s
decoded  108 tokens in    2.145 seconds, speed:   50.341 t/s

n_draft   = 5
n_predict = 108
n_drafted = 165
n_accept  = 74
accept    = 44.848%
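
As a quick sanity check, the reported acceptance rate and the observed speed ratio can be reproduced directly from the numbers above (a standalone Python sketch, not part of llama.cpp; values copied from the run):

# Sanity check of the numbers reported above.
n_drafted = 165
n_accept  = 74
print(f"acceptance rate: {n_accept / n_drafted:.3%}")        # ~44.848%

spec_speed   = 50.341   # t/s with speculative decoding
target_speed = 75.82    # t/s for the target model alone (eval time above)
print(f"observed ratio: {spec_speed / target_speed:.2f}x")   # ~0.66x, i.e. a slowdown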

However, based on the original speculative decoding paper, the expected speedup is

(1 - alpha^(gamma+1)) / [(1 - alpha) * (c*gamma + 1)]

where alpha is the token acceptance rate, gamma is the number of tokens proposed each step, and c is the ratio between the per-token execution times of the draft and target models. In the example above, c is roughly 76/669 = 0.11.
Plugging in the numbers above (alpha = 0.44, gamma = 5, c = 0.11), the expected speedup should be:
(1-0.44^6)/[(1-0.44)*(0.11*5+1)] ≈ 1.14x.
However, the benchmarking results show that it is actually 50/76 ≈ 0.66x.

To debug this, I set the token acceptance rate to 100% by removing the id == draft_id[i_dft] check here. After doing this, I observe that the speed is ~90 tokens/s, which gives a 90/76 ≈ 1.18x speedup. However, this is still much smaller than what the formula above predicts (using 0.99 as the token acceptance rate instead of 1):
(1-0.99^6)/[(1-0.99)*(0.11*5+1)] ≈ 3.78x.
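
For reference, here is a minimal sketch of that formula (standalone Python; gamma = 5 and c ≈ 0.11 are assumed from the measurements above), evaluated for both acceptance rates:

def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    # Expected improvement factor from the speculative decoding paper
    # (Leviathan et al., 2023): (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma*c + 1))
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

gamma = 5    # --draft 5
c = 0.11     # ~76 t/s (target) vs ~669 t/s (draft)

print(f"alpha = 0.44: {expected_speedup(0.44, gamma, c):.2f}x")  # ~1.14x
print(f"alpha = 0.99: {expected_speedup(0.99, gamma, c):.2f}x")  # ~3.78x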

I wonder which part of the speculative decoding implementation might be causing the large overhead; any comments are highly appreciated! Thanks!
